[DAMD-145] Add facility to hold history/metrics of spectra Created: 21/Jan/23  Updated: 17/Feb/23  Resolved: 17/Feb/23

Status: Done
Project: Data Model
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Story Priority: Normal
Reporter: price Assignee: price
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Blocks
blocks PIPE2D-1143 sky subtraction is bad for a certain ... Done
Sprint: 2DDRP-2023 A

 Description   

We need a facility to allow us to track corrections to and problems with individual spectra. We want to be able to store values (e.g., blackspot correction of 10%) in addition to flags (e.g., aperture correction failed). This would be a good place to put QA metrics measured during pipeline operation too.

The desire to have key-value pairs where the values are of varying types (bool, int, float, string) makes them difficult/inconvenient to represent in FITS using conventional means. However, we would like to avoid using something Python-specific like pickle, so that the data are easily accessible from any programming language. I think that the proposed representation, while perhaps unconventional, is simple and meets these design requirements.

I propose to add a new HDU to spectral products (pfsArm, pfsMerged, pfsSingle, pfsObject). The reduction notes (NOTES) HDU is a FITS image HDU containing a UTF-8-encoded JSON representation of key-value pairs that record operations performed and measurements made during reduction. For multi-spectra products, the JSON is an array of NFIBER such collections of key-value pairs. The HDU may be compressed using standard FITS compression conventions.
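
A minimal sketch of the encoding described above, using only the Python standard library. The function names and the note keys (blackspotCorrection, apertureCorrectionFailed) are hypothetical, chosen to match the examples in this description; this is not the proposed implementation, and writing the bytes into an actual FITS image HDU is left out.

```python
import json


def encode_notes(notes):
    """Serialize notes to UTF-8 bytes.

    `notes` is a dict for a single spectrum, or a list of NFIBER dicts
    for a multi-spectra product.
    """
    # allow_nan=False refuses NaN/Infinity, which are not valid JSON
    text = json.dumps(notes, sort_keys=True, allow_nan=False)
    return text.encode("utf-8")


def decode_notes(payload):
    """Inverse of encode_notes; accepts bytes or any sequence of 0..255 ints."""
    return json.loads(bytes(payload).decode("utf-8"))


# Hypothetical notes for a product with NFIBER = 3
notes = [
    {"blackspotCorrection": 0.10, "apertureCorrectionFailed": False},
    {"apertureCorrectionFailed": True},
    {},
]
payload = encode_notes(notes)
pixels = list(payload)  # what a 1-D unsigned 8-bit image would hold
restored = decode_notes(pixels)
```

Any language with a JSON parser can read the payload back, which is the point of avoiding pickle.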

I have a proposed implementation created in the course of working on PIPE2D-1143, and welcome comments and suggestions.



 Comments   
Comment by vlebrun [ 24/Jan/23 ]

Sounds good to me. Are such comments readable/usable by the subsequent modules of the pipeline (e.g., PIPE1D), so that we can avoid errors that would be hard to detect in the spectrum itself but easy to anticipate from those comments, or will they be redundant with the flags anyway?

Comment by price [ 24/Jan/23 ]

Here's an alternative proposal that avoids JSON:

The reduction notes (NOTES) HDU is a FITS table HDU consisting of the following columns:

  • name (string): the keyword name
  • blank (variable-length array of int): indices of spectra for which the keyword is undefined.
  • bool (variable-length array of bool): values for spectra, if the field is of boolean type; otherwise, empty.
  • int (variable-length array of int): values for spectra, if the field is of integer type; otherwise, empty.
  • float (variable-length array of float): values for spectra, if the field is of floating-point type; otherwise, empty.
  • string (variable-length array of string): values for spectra (NUL-separated), if the field is of string type; otherwise, empty.

It's a fair bit more complicated, but it might save space by not having to write the keyword name many times or write numbers as strings.
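
To make the column layout concrete, here is a rough sketch that turns per-spectrum dicts into one row per keyword. The function name, the in-memory row format (a dict of lists) and the note keys are assumptions for illustration only; strings are kept as a list here rather than NUL-separated, and no FITS table is actually written.

```python
KINDS = (("bool", bool), ("int", int), ("float", float), ("string", str))


def pack_notes(per_spectrum):
    """Convert a list of per-spectrum dicts into one row per keyword."""
    names = sorted({name for notes in per_spectrum for name in notes})
    rows = []
    for name in names:
        row = {"name": name, "blank": [], "bool": [], "int": [], "float": [], "string": []}
        for index, notes in enumerate(per_spectrum):
            if name not in notes:
                row["blank"].append(index)  # keyword undefined for this spectrum
                continue
            value = notes[name]
            # type() rather than isinstance(): bool is a subclass of int in Python
            for column, kind in KINDS:
                if type(value) is kind:
                    row[column].append(value)
                    break
            else:
                raise TypeError(f"unsupported type for {name!r}: {type(value).__name__}")
        used = [column for column, _ in KINDS if row[column]]
        if len(used) > 1:
            raise ValueError(f"{name!r} has values of mixed types: {used}")
        rows.append(row)
    return rows


# Hypothetical notes for three spectra
per_spectrum = [
    {"blackspotCorrection": 0.10, "apertureCorrectionFailed": False},
    {"apertureCorrectionFailed": True},
    {"blackspotCorrection": 0.25},
]
rows = pack_notes(per_spectrum)
```

Each keyword name is stored once, and the numbers stay as numbers, which is where the space saving would come from.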

Comment by price [ 24/Jan/23 ]

vlebrun: the notes will be implemented in the Python PfsObject class that 1D operates on. They describe the spectrum as a whole, whereas the mask ("flags") describes individual pixels within the spectrum. The notes may contain indications that the spectrum is not of good quality, but 1D should probably process it anyway and allow the user to choose whether to include it in their analysis or not.

Comment by Pierre-Yves CHABAUD [ 24/Jan/23 ]

price: If I understand your alternative proposal correctly, for a given file, the size of the "variable-length" arrays is NFIBER, isn't it?

Comment by price [ 25/Jan/23 ]

The number of entries in blank plus the number of entries in one of bool, int, float or string should be NFIBER.
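
That invariant can be checked when reading a row back. The sketch below assumes (this is not stated in the proposal) that the values are stored in spectrum order with the blank spectra skipped; the function name and row format are hypothetical.

```python
def expand_row(row, nfiber):
    """Expand one table row into a list of nfiber values, None where blank."""
    columns = [row[kind] for kind in ("bool", "int", "float", "string") if len(row[kind]) > 0]
    if len(columns) > 1:
        raise ValueError("more than one value column is populated")
    values = list(columns[0]) if columns else []

    blank = set(row["blank"])
    if len(blank) != len(row["blank"]) or any(not 0 <= i < nfiber for i in blank):
        raise ValueError("blank indices must be unique and within range")
    # The invariant: blank entries plus value entries account for every spectrum
    if len(blank) + len(values) != nfiber:
        raise ValueError(
            f"{len(blank)} blank + {len(values)} values does not equal NFIBER={nfiber}"
        )

    remaining = iter(values)
    return [None if i in blank else next(remaining) for i in range(nfiber)]


# Hypothetical row for NFIBER = 3, with the keyword undefined for spectrum 1
row = {"name": "blackspotCorrection", "blank": [1],
       "bool": [], "int": [], "float": [0.10, 0.25], "string": []}
expanded = expand_row(row, 3)
```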

Comment by Kiyoto Yabe [ 25/Jan/23 ]

I think that will be very useful for QA as well. I like JSON indeed, if permitted. I'm not sure if there is any restriction or policy on FITS when NAOJ delivers the output to users. And also, this is somewhat related to QA, so I take the liberty of inviting Masayuki Tanaka to this discussion.

Comment by price [ 26/Jan/23 ]

I think I prefer the alternative proposal (FITS table) to the original (JSON). It's a bit more complicated, but it's probably more compact and doesn't require parsing text.

Comment by rhl [ 26/Jan/23 ]

Adding an HDU for this sort of thing is a good idea, which I endorse. I also agree that pickle is a non-starter.

However, I'm not sure that I like a free-form non-datamodel controlled HDU. How will we use this? I'd rather provide a light-weight way to add more keys to this HDU, or more precisely the python object which it maps to. Do you have a sketch of this object?

As part of this, how would we manage schema evolution? – I think we only need to worry about adding more fields.

Comment by Masayuki Tanaka [ 27/Jan/23 ]

Not sure what sort of QA we are thinking of here, but I had a bit of discussion with Hamano-san yesterday. Hamano-san made a good point: it makes a lot of sense to include QA metrics from the pipeline in the FITS header (e.g., number/area of CR), but as we agreed before, the QA that NAOJ/IPMU are working on is a post-processing task, and we would prefer not to update the FITS header.

Comment by price [ 27/Jan/23 ]

The FITS header doesn't make sense for QA about individual spectra, because there are 650 spectra per spectrograph. It needs to go in a separate HDU.

Comment by Masayuki Tanaka [ 27/Jan/23 ]

Sorry, I meant to say HDU instead of FITS header. It might be OK to include detector-wide QA in the header, though.

Comment by sogo.mineo [ 30/Jan/23 ]

Koike-san is concerned that JSON cannot distinguish between integers and floats. Many 64-bit integers, such as pfsDesignId, exceed 53 bits. If they appear in the saved data, the serialized JSON won't be portable, even though Python's json module can deal with them correctly. I would add Infinity and NaN to his list of concerns.
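
A short illustration of both concerns using only Python's standard json module. The value of big is a made-up stand-in for a 64-bit identifier, not a real pfsDesignId.

```python
import json

big = 2**53 + 1  # hypothetical identifier that needs more than 53 bits

# Python's json module keeps integers exact on a round trip ...
round_trip_ok = json.loads(json.dumps(big)) == big

# ... but a reader that parses every JSON number as an IEEE 754 double
# (53-bit significand) cannot represent this value exactly.
as_double = float(big)
lost_precision = int(as_double) != big

# NaN and Infinity are not part of the JSON grammar, yet Python's encoder
# emits them by default, producing text that strict parsers reject.
nan_text = json.dumps(float("nan"))
inf_text = json.dumps(float("inf"))

# With allow_nan=False the encoder refuses instead of writing invalid JSON.
try:
    json.dumps(float("nan"), allow_nan=False)
    strict_rejects = False
except ValueError:
    strict_rejects = True
```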

Comment by price [ 02/Feb/23 ]

I'm going to take this as generally approved, with the amendments:

  • The alternative proposal will be used for persistence.
  • The implementation will enforce a controlled schema.

I'll change the implementation on PIPE2D-1143, and close this once it's done.
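
For readers wondering what a controlled schema might look like, here is a purely hypothetical sketch; it is not the implementation on PIPE2D-1143. The class name, method names and field names are invented for illustration. Only declared fields with the declared type are accepted, and adding a field to the schema later leaves older data readable, because a field that was never set simply comes back as undefined.

```python
class SpectrumNotes:
    """Hypothetical notes container that only accepts fields declared in SCHEMA."""

    # field name -> required type; extend this dict to add new fields
    SCHEMA = {
        "blackspotCorrection": float,
        "apertureCorrectionFailed": bool,
    }

    def __init__(self):
        self._values = {}

    def set(self, name, value):
        if name not in self.SCHEMA:
            raise KeyError(f"{name!r} is not in the notes schema")
        expected = self.SCHEMA[name]
        # type() rather than isinstance(), so a bool is not accepted for an int field
        if type(value) is not expected:
            raise TypeError(
                f"{name!r} expects {expected.__name__}, got {type(value).__name__}"
            )
        self._values[name] = value

    def get(self, name):
        """Return the value, or None if the field is declared but was never set."""
        if name not in self.SCHEMA:
            raise KeyError(f"{name!r} is not in the notes schema")
        return self._values.get(name)


notes = SpectrumNotes()
notes.set("blackspotCorrection", 0.10)

try:
    notes.set("blackspotCorection", 0.10)  # misspelled on purpose
    typo_rejected = False
except KeyError:
    typo_rejected = True
```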

Comment by price [ 04/Feb/23 ]

There's a new implementation on PIPE2D-1143.

Comment by price [ 17/Feb/23 ]

PIPE2D-1143 merged to master.

Generated at Sat Feb 10 15:34:44 JST 2024 using Jira 8.3.4#803005-sha1:1f96e09b3c60279a408a2ae47be3c745f571388b.