DRP 2-D Pipeline / PIPE2D-1145

Make fitPfsFluxReference.py faster

    Details

    • Type: Task
    • Status: Done
    • Priority: Normal
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None

      Description

      The current fitPfsFluxReference.py is slow.
      It takes a few hours to process a single visit.

      We will aim for 1 hour per visit for now,
      though we are not sure whether that is acceptable.

      Below is the profiler output from running the integration test.

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           1    0.000    0.000 1981.366 1981.366 cmdLineTask.py:621(parseAndRun)
       12088    1.538    0.000 1360.010    0.113 fitPfsFluxReference.py:427(computeContinuum)
       48326    0.515    0.000  318.018    0.007 fitPfsFluxReference.py:981(convolveLsf)
      

      (The integration test contains two visits, so these times must be divided by 2
      to get per-visit values. They must then be multiplied by 10, because the integration
      test uses the "6k" model set whereas actual data processing uses the "60k" model set.)
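
The scaling described above can be worked through as a short sketch. The cumulative times are copied from the profiler output; the assumption that every cost scales linearly with the number of models (including the total) is the ticket's simplification, and the names `per_visit_seconds` and `estimates` are made up for illustration.

```python
# Cumulative times (seconds) copied from the profiler output above.
profiled = {
    "parseAndRun (total)": 1981.366,
    "computeContinuum": 1360.010,
    "convolveLsf": 318.018,
}

VISITS_IN_TEST = 2    # the integration test processes two visits
MODEL_SET_RATIO = 10  # "60k" production model set vs. "6k" test model set


def per_visit_seconds(cumtime):
    """Scale an integration-test cumulative time to one production visit.

    Assumes the cost grows linearly with the number of models.
    """
    return cumtime / VISITS_IN_TEST * MODEL_SET_RATIO


estimates = {name: per_visit_seconds(t) for name, t in profiled.items()}

for name, seconds in estimates.items():
    print(f"{name}: {seconds:.0f} s = {seconds / 3600:.2f} h")
```

Under these assumptions the total comes to roughly 2.75 hours per visit, of which computeContinuum accounts for about 1.9 hours, consistent with the "few hours" quoted in the description.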

      The two hot spots are:

      • computeContinuum (Fit a continuum to a model spectrum)
      • convolveLsf (Convolve a model spectrum with an LSF)

      We already have a mechanism to skip these two calls when they are unnecessary:
      "If prior[model] / max(prior) <= th, then skip computing likelihood[model]." (th = 1e-8)
      The prior probability distribution is computed from broad-band fluxes.
      Because the integration test contains only a single broad-band flux (i2_hsc),
      the prior there is the uniform distribution, so this mechanism skips nothing
      in the integration test. It does, however, appear to save time
      in actual data processing.
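
A minimal sketch of the skip rule quoted above, written in plain Python rather than taken from fitPfsFluxReference.py. The function name `models_to_evaluate` and the example priors are hypothetical; only the rule itself (skip when prior[model] / max(prior) <= th) comes from the ticket.

```python
def models_to_evaluate(prior, th=1e-8):
    """Return indices of models whose likelihood still has to be computed.

    A model is skipped when prior[model] / max(prior) <= th.
    """
    peak = max(prior)
    return [i for i, p in enumerate(prior) if p / peak > th]


# Uniform prior, as in the integration test: nothing is skipped.
uniform_prior = [0.2, 0.2, 0.2, 0.2, 0.2]
uniform_kept = models_to_evaluate(uniform_prior)

# Made-up peaked prior: the two negligible models are skipped.
peaked_prior = [0.9, 0.0999, 1e-4, 1e-10, 0.0]
peaked_kept = models_to_evaluate(peaked_prior)

print(uniform_kept)
print(peaked_kept)
```

With a uniform prior every ratio equals 1, so no threshold below 1 can skip anything, which is why the integration test cannot exercise this mechanism.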

      We have 85 prior probability distributions, obtained as --debug by-products
      of processing visit=82596. Using these distributions,
      we can count how many calls to the two functions would be made
      for various values of the threshold th.
      Combining these counts with the profiler output, we can then estimate
      the execution time of fitPfsFluxReference as a function of the threshold.
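
The counting step might look like the following sketch. The priors here are invented for illustration and are not the 85 distributions from visit=82596; `kept_fraction` and `FULL_COST_S` are hypothetical names. The cost estimate further assumes that the cumulative times of the two hot spots do not overlap and that their cost is proportional to the number of models evaluated.

```python
def kept_fraction(priors, th):
    """Fraction of (fiber, model) pairs that survive the threshold th."""
    kept = 0
    total = 0
    for prior in priors:
        peak = max(prior)
        kept += sum(1 for p in prior if p / peak > th)
        total += len(prior)
    return kept / total


# Made-up priors for three fibers over four models (NOT real data).
priors = [
    [0.70, 0.25, 0.04, 0.01],
    [0.50, 0.49, 0.008, 0.002],
    [0.97, 0.02, 0.009, 0.001],
]

# Per-visit cost of the two hot spots with no skipping, scaled from the
# profiler output (divide by 2 visits, multiply by 10 for the 60k set).
FULL_COST_S = (1360.010 + 318.018) / 2 * 10

for th in (1e-8, 1e-3, 1e-2, 1e-1):
    frac = kept_fraction(priors, th)
    print(f"th={th:g}: {frac:.2f} of models kept, "
          f"hot spots ~{frac * FULL_COST_S / 3600:.2f} h")
```

Running the same loop over the real prior distributions would give the curve of estimated execution time against threshold that the ticket refers to.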

      We can see from this plot that we have to set th=0.01 or above
      if we want the per-visit execution time to be less than an hour.

      One concern is that, with such a large threshold, the model that
      would otherwise have been chosen may be discarded prematurely.
      We can examine whether a FLUXSTD fiber will be affected by the threshold
      by looking at prior[argmax(posterior)] / max(prior).
      (The posterior distributions are also --debug by-products obtained from
      processing visit=82596.)
      If prior[argmax(posterior)] / max(prior) is less than a given threshold value,
      the best model (argmax(posterior)) will be skipped, and so cannot be selected,
      when th is set to that value.
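
The check described above can be sketched as follows. The function names and the example distributions are hypothetical. The comparison uses `<=` so that it matches the skip rule quoted earlier (skip when the ratio is `<= th`); this differs from "less than" only in the boundary case.

```python
def prior_ratio_at_best(prior, posterior):
    """Return prior[argmax(posterior)] / max(prior) for one fiber."""
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return prior[best] / max(prior)


def is_affected(prior, posterior, th):
    """True if the best-posterior model would be skipped at threshold th."""
    return prior_ratio_at_best(prior, posterior) <= th


# Made-up fiber whose best model has a very low prior.
prior = [0.6, 0.395, 0.005]
posterior = [0.2, 0.1, 0.7]

ratio = prior_ratio_at_best(prior, posterior)
print(f"ratio = {ratio:.5f}")
print("affected at th=1e-8:", is_affected(prior, posterior, 1e-8))
print("affected at th=0.01:", is_affected(prior, posterior, 0.01))
```

In this invented case the best model survives the old threshold of 1e-8 but would be skipped at 0.01, which is the kind of fiber the ticket classifies as an outlier.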

      The samples below 0.01 appear to be outliers, which can be neglected.
      (More than 40% of FLUXSTD fibers are currently outliers, but that is a separate problem.)

      In conclusion, we will:

      • Modify the program to make the threshold a config parameter.
      • Change the default threshold from 1e-8 to 0.01.

              People

              • Assignee: sogo.mineo
              • Reporter: sogo.mineo
              • Reviewers: price