[PIPE2D-920] How to put the large set of the AMBRE model templates in github for flux calibration Created: 25/Oct/21  Updated: 22/Dec/21  Resolved: 22/Dec/21

Status: Done
Project: DRP 2-D Pipeline
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Story Priority: Normal
Reporter: Takuji Yamashita Assignee: sogo.mineo
Resolution: Done Votes: 0
Labels: flux-calibration, model-templates
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Blocks
blocks PIPE2D-921 Write a code to read the model templa... Done
Epic Link: flux calibration
Reviewers: price

 Description   

Our initial plan is to store the 6k templates (2.4 GB) on GitHub for flux calibration. Because this is rather large, Mineo-san and Yamashita are discussing how to reduce the size.
One option is to upload only a small subset of them. This would let us run test code for the blue part of flux calibration, although the output would be inaccurate; the full data would be uploaded sometime in the future.
Another option is to compress the 6k templates. We are now discussing whether saving them as fixed-point numbers works well.
 
We have ~6k templates. In the current plan, a user generates, on the user side, a large template set (5.6k) on smaller parameter grids for stellar typing, by applying RBF interpolation to the 6k templates in the f_star repository. Stellar typing for flux calibration then refers to these 5.6k templates. One template (one FITS file) is ~400 kB, so the first template set of 6k is 6k × 400 kB = 2.4 GB in total.
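
The user-side RBF step might look roughly like the following. This is a minimal sketch assuming scipy is available; the file names, array shapes, and parameter axes (Teff, logg, [Fe/H]) are illustrative assumptions, not the actual f_star implementation.

{code:python}
import numpy as np
from scipy.interpolate import RBFInterpolator

# Stellar parameters of the ~6k stored templates and their fluxes on a
# common wavelength grid (hypothetical files; shapes are illustrative).
params = np.load("ambre_params.npy")   # shape (6000, 3): Teff, logg, [Fe/H]
fluxes = np.load("ambre_fluxes.npy")   # shape (6000, nWavelength)

# Rescale the parameter axes so that no single axis dominates the RBF.
scale = params.std(axis=0)
interp = RBFInterpolator(params / scale, fluxes)

# Evaluate on the new parameter grid to build the ~5.6k typing set.
newParams = np.array([[5800.0, 4.4, 0.0],
                      [6000.0, 4.3, -0.5]])   # ... ~5.6k rows in practice
typingTemplates = interp(newParams / scale)   # (len(newParams), nWavelength)
{code}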
 



 Comments   
Comment by Takuji Yamashita [ 26/Oct/21 ]

We lean toward saving the 6k templates as fixed-point numbers to reduce the file size. The size would be roughly halved, to ~1-2 GB. We will convert the templates to log and then convert them to fixed-point numbers. We need to test the accuracy.
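
A minimal sketch of the log + fixed-point idea; the function names and the 16-bit choice are assumptions, not the actual conversion code:

{code:python}
import numpy as np

def toFixedPoint(flux):
    """Quantize log(flux) into unsigned 16-bit integers.

    `flux` must be strictly positive. Returns the integer array plus the
    (offset, step) needed to invert. Going from float32 to uint16 roughly
    halves the file size.
    """
    logFlux = np.log(flux)               # compress the dynamic range first
    offset = logFlux.min()
    step = (logFlux.max() - offset) / (2**16 - 1)
    quantized = np.round((logFlux - offset) / step).astype(np.uint16)
    return quantized, offset, step

def fromFixedPoint(quantized, offset, step):
    """Invert toFixedPoint."""
    return np.exp(quantized * step + offset)

# Accuracy check: the worst-case error in log space is step/2, so the
# worst-case relative flux error is exp(step/2) - 1, approximately step/2.
{code}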

Comment by price [ 26/Oct/21 ]

We should also use WCS to encode the wavelengths if we can.

Comment by Takuji Yamashita [ 26/Oct/21 ]

The spectra are again saved as FITS, so we can use WCS for the wavelengths.

Comment by sogo.mineo [ 26/Oct/21 ]

The size estimate in the description does not include a wavelength column: the FITS files contain flux only, and the wavelengths are computed by means of WCS.
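
Reading one model back might then look like this (a sketch with a hypothetical file name; it assumes the flux is in the primary HDU with a linear 1-D WCS):

{code:python}
import numpy as np
from astropy.io import fits
from astropy.wcs import WCS

with fits.open("model.fits") as hdus:
    flux = hdus[0].data              # the file stores flux only
    wcs = WCS(hdus[0].header)        # CRVAL1/CDELT1/CRPIX1, etc.

# Reconstruct the wavelength array from the WCS instead of storing it.
wavelength = wcs.pixel_to_world_values(np.arange(len(flux)))
{code}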

Comment by hassan [ 27/Oct/21 ]

Missing important comment from @rhl:

As we may need to do something cleverer someday, I'd hide it behind an API

That way, the access and processing software are decoupled from the stored data format.
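
Something like the following interface would achieve that decoupling (a sketch; getModelSpectrum is the accessor mentioned in a later comment, while the other names are hypothetical):

{code:python}
from abc import ABC, abstractmethod

class ModelSpectrumStore(ABC):
    """Callers ask for a spectrum and never see the stored data format."""

    @abstractmethod
    def getModelSpectrum(self, teff, logg, metallicity):
        """Return (wavelength, flux) arrays for the requested model."""

class FitsDirectoryStore(ModelSpectrumStore):
    """One possible backend: a directory of flux-only FITS files."""

    def __init__(self, directory):
        self.directory = directory

    def getModelSpectrum(self, teff, logg, metallicity):
        # Open the matching FITS file and rebuild wavelengths from its WCS.
        ...
{code}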

Comment by hassan [ 27/Oct/21 ]

Discussed this a little further with rhl and price. Is it possible to store the data on a server somewhere? That would be easier to manage than storing the data under git-lfs.

Comment by sogo.mineo [ 27/Oct/21 ]

We can indeed put the heavy files on hscdata.mtk.nao.ac.jp, for example. The problem is how to let the test process see them. I could hook the first call to getModelSpectrum() and download all the spectra into ... some directory. I don't want to use /tmp: astropy does this, fills up the limited capacity of /tmp all too soon, gets killed, and leaves the system unstable. Another solution might be to accept a path starting with https:// and download the models one by one every time they are requested, just as we currently open and read the model files one by one every time they are requested.
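
The last option could be as simple as the following helper (a hypothetical sketch using only the standard library; the cache location is an assumption, chosen precisely to avoid /tmp):

{code:python}
import shutil
import urllib.request
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "fluxmodeldata"

def localPath(path):
    """Return a local file path, downloading `path` first if it is a URL."""
    if not path.startswith("https://"):
        return path
    local = CACHE_DIR / path.rsplit("/", 1)[-1]
    if not local.exists():
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(path) as response, \
                open(local, "wb") as out:
            shutil.copyfileobj(response, out)
    return str(local)
{code}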

Comment by sogo.mineo [ 27/Oct/21 ]

I would like to take the last route ("Another solution might be to accept a path...") because it is the easiest thing to do. If I take this route, we may no longer have to reduce the model size, and we could use all of the 60k models, which was Yamashita-san's original plan. One problem is that the task of making calibration references takes a few hours to process a single fiber (even when the models are in local storage), so a unit test would take at least a few hours to pass.

Comment by rhl [ 28/Oct/21 ]

I was assuming that the test data would just be a dependency, so it would be installed once (with curl, as it were) and then used whenever you run the tests.

Comment by sogo.mineo [ 28/Oct/21 ]

Then what I have to do is:

  1. Upload the 60k (or the smaller 6k) spectra to some HTTP server. (The package might need a ups/ directory; see the sketch after this comment.)
  2. Ask people (I don't know whom) to install it, as a dependency, on the server where the automatic tests are run.
  3. Push branches that require the package.

Do I understand correctly?
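
For step 1, the ups/ directory would hold an eups table file. A minimal sketch, assuming standard eups conventions (eups defines FLUXMODELDATA_DIR itself when the product is set up, so a pure data package needs little or nothing in its table file):

{code}
# fluxmodeldata/ups/fluxmodeldata.table -- may even be empty for pure data.

# drp_stella/ups/drp_stella.table (hypothetical excerpt): declare the
# dependency so that `setup drp_stella` also sets up the data package.
setupRequired(fluxmodeldata)
{code}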

Comment by price [ 01/Nov/21 ]

Yes, that's great. Please be sure to include a README file that explains what the data are and where they came from, and include a version string (usually the date) in the directory name.

You're welcome to put it on the tiger cluster at Princeton (e.g., /projects/HSC/PFS/fluxCal/fluxCal-20211101), and we can serve it via http from there.

Comment by sogo.mineo [ 02/Nov/21 ]

I tentatively created a package just now, but I am not sure whether the synthetic spectra are redistributable. I am checking now.
The package name will be fluxmodeldata-ambre-20210512-full, in which ambre-20210512-full is the version name.

Comment by sogo.mineo [ 22/Nov/21 ]

Tanaka-san said the author has given us permission to redistribute the synthetic spectra. Yamashita-san found some flaws in the conversion of the original spectra to the format he uses, and he is now recreating the data files.

Comment by sogo.mineo [ 06/Dec/21 ]

I have uploaded the smaller dataset here: https://hscdata.mtk.nao.ac.jp/hsc_bin_dist/pfs/fluxmodeldata-ambre-20190419-small.tar.xz
I intend this package to be used by the unit tests, because Yamashita-san's algorithm for making the flux reference takes several hours per fiber when the full dataset is used.

The full dataset has not been completed yet. We found that we had to add more spectra to the dataset, and those spectra have yet to be made. We must also examine whether the tremendous size of the full dataset and the eon-long execution time really contribute to the accuracy of the calibration task.

Comment by sogo.mineo [ 07/Dec/21 ]

If the above dataset is approved and installed on the server where the tests run, I would like to push changes to drp_stella and drp_pfs_data on branches named after this issue. With these changes, the broadband photometry table referred to by FitBroadbandSEDTask moves from drp_pfs_data to the above package. The photometry table must reside close to the spectrum set because the two must match each other.
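
A sketch of how code might then locate the table inside the data package, assuming the usual lsst.utils.getPackageDir lookup (the file name is a hypothetical placeholder):

{code:python}
import os
from lsst.utils import getPackageDir

def broadbandTablePath(filename="broadband.fits"):
    """Resolve the photometry table inside the fluxmodeldata package."""
    return os.path.join(getPackageDir("fluxmodeldata"), filename)
{code}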

Comment by Takuji Yamashita [ 14/Dec/21 ]

We can close this ticket because we have discussed this issue and Mineo-san has uploaded the model template dataset. I will file two new tickets for the work Mineo-san described above:
1. Check and approve the dataset that Mineo-san has uploaded, and then install it on the server.
2. Move the broad-band photometry table into the dataset package.
May I assign the first one to price and the second one to sogo.mineo?

Comment by price [ 18/Dec/21 ]

I've retrieved the tarball listed above and placed it in /projects/HSC/PFS/fluxCal on our Tiger cluster. It looks good to me.

Comment by sogo.mineo [ 20/Dec/21 ]

I have made two PRs: one makes drp_stella depend on the fluxmodeldata package, and the other removes the photometry table from drp_pfs_data. The former change should be merged before the latter. Could you review them?

Comment by price [ 21/Dec/21 ]

I don't think we can require that every installation of the pipeline contain a 2.4 GB data package. You should make the data package setupOptional in the table file, and protect the tests with checks, e.g., here.
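
A sketch of the kind of change being requested here (the test guard is an illustration, not the actual drp_stella test code):

{code}
# drp_stella/ups/drp_stella.table: the data package is no longer required.
setupOptional(fluxmodeldata)
{code}

{code:python}
import unittest
import lsst.utils

# Skip the model-dependent tests when the optional package is absent.
try:
    dataDir = lsst.utils.getPackageDir("fluxmodeldata")
except LookupError:
    dataDir = None

@unittest.skipIf(dataDir is None, "fluxmodeldata is not set up")
class FluxModelTestCase(unittest.TestCase):
    ...
{code}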

Comment by sogo.mineo [ 21/Dec/21 ]

I changed fluxmodeldata from required to optional. Could you review the newly pushed patch?

Comment by price [ 22/Dec/21 ]

Awesome, thanks!

Comment by sogo.mineo [ 22/Dec/21 ]

Thanks for the review. I merged my two pull requests to master.
