[INSTRM-1367] obslog: global search on header keyword Created: 18/Sep/21  Updated: 14/Oct/22  Resolved: 14/Oct/22

Status: Done
Project: Instrument control development
Component/s: spt_operational_database
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Normal
Reporter: hassan Assignee: michitaro
Resolution: Done Votes: 0
Labels: EngRun
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2021-12-09 at 4.15.03.png    
Story Points: 6

 Description   

Following discussions in the #obslog slack channel during Engineering Run 3, it would be very useful to be able perform a global search on all raw FITS files for a keyword value matching a certain condition. For example:

W_CLNEOT>14

would return all visits/files matching the case where the keyword W_CLNEOT (Neon lamp request time) has a value of 14 or greater.



 Comments   
Comment by hassan [ 19/Sep/21 ]

Another example is

W_AITNEO='T'

(or corresponding correct boolean expression)

to find all exposures which have been taken under the Neon DCB lamp.

Comment by michitaro [ 20/Sep/21 ]

Kiyoto Yabe-san,
This task requires some changes on opdb and the component inserting exposure records into opdb.
I can provide codes for extra schema definition for opdb and a routine to insert FITS header information from a FITS file into the database.

How do we go ahead?

Comment by Kiyoto Yabe [ 20/Sep/21 ]

I still don't know how larger the fix is, but anyway if you make a fix easily please make a pull request.

Comment by michitaro [ 07/Oct/21 ]

I'm thinking about how to store FITS headers in the opdb.
There are the following 3 feasible ways to do so as far as I think.

Method 1 (naive)

To make a wide table with columns for all keywords the FITS headers have.

class FitsHeader(Base):
    __tablename__ = 'fitsheader'

    filename = Column(String, primary_key=True)
    hdu_index = Column(Integer, primary_key=True)

    SIMPLE = Column(String)
    BITPIX = Column(Integer)
    NAXIS = Column(Integer)
    EXTEND = Column(Boolean)
    DATA_TYP = Column(String)
    FRAMEID = Column(String)
    EXP_ID = Column(String)
    DETECTOR = Column(String)
    GAIN = Column(Float)
    DET_TMP = Column(Float)
    DET_ID = Column(String)
    DISPAXIS = Column(Integer)
    DISPRSR = Column(Float)
    WAV_MIN = Column(Float)
    WAV_MAX = Column(FLoat)
    ...

Pros

Simple and no violation of SQL standards.

Cons

Schema change always will be required when the set of keywords changes.
We cannot store cards with a keyword that does not exist in the table.

Method 2 (EAV)

To make a table whose each record represents a FITS card.
This method is known as the EAV pattern (considered a bad design in general.)

class FitsHdu(Base):
    __tablename__ = 'fits_hdu'
    id = Column(Integer, primary_key=True)

class FitsCardString(Base):
    __tablename__ = 'fits_card_string'
    hdu_id = Column(Integer, ForeignKey('fits_hdu.id'))
    keyword = Column(String, nullable=False)
    value = Column(String, nullable=False)

class FitsCardFloat(Base):
    __tablename__ = 'fits_card_float'
    hdu_id = Column(Integer, ForeignKey('fits_hdu.id'))
    keyword = Column(String, nullable=False)
    value = Column(Float, nullable=False)

class FitsCardBoolean(Base):
    __tablename__ = 'fits_card_boolean'
    hdu_id = Column(Integer, ForeignKey('fits_hdu.id'))
    keyword = Column(String, nullable=False)
    value = Column(Boolean, nullable=False)

Pros

No violation of SQL standards.
No schema change will be required when the set of keywords changes.

Cons

This approach requires a complicated implementation.

Method 3 (JSON)

To make a table with a column of JSON(B) type to store the header cards.

class FitsHeader(Base):
    __tablename__ = 'fitsheader'

    id = Column(Integer, primary_key=True)
    cards = Column(JSONB)

with Session(future=True) as session:
    records = [
        FitsHeader(cards={
            'FRAMEID': 'PFS001',
            'favorite_food': 'rice',
        }),
        FitsHeader(cards={
            'FRAMEID': 'PFS002',
            'favorite_food': 'noodle',
        }),
    ]
    session.add_all(records)
    session.commit()
    q = session.query(FitsHeader).filter(
        FitsHeader.cards['FRAMEID'].astext == 'PFS002')
    r = list(q)[0]
    assert r.cards['favorite_food'] == 'noodle'  # type: ignore

Pros

Simple implementation.
Expression index for a particular keyword helps quick search.

Cons

Less potable. A JSON column is out of SQL standards (, but Postgres, MySQL, SQLite and MSSQLServer support it.)
(https://docs.sqlalchemy.org/en/14/core/type_basics.html#sqlalchemy.types.JSON)
We may have to write some codes for specific DB backend.

Comment by michitaro [ 07/Oct/21 ]

If we have no plan to run the code on DBMS other than Postgres, I like the 3rd method.

Comment by michitaro [ 08/Oct/21 ]

cloomis-san, what do you think about introducing JSON to opdb?

Comment by Kiyoto Yabe [ 14/Oct/21 ]

I'm now positively thinking about using JSON columns. We do not probably use any backends other than Postgres.

Comment by michitaro [ 15/Oct/21 ]

Thank you for the comment.

I will make a prototype using JSON column approach.

Comment by michitaro [ 18/Nov/21 ]

I have made a PR for this.

https://github.com/Subaru-PFS/spt_operational_database/pull/54

cloomis san, to be able to search by FITS headers, it's necessary to add an `obslog_fits_header` record when a new `pfs_visit` record is issued.

As the PR shows, codes to add a `obslog_fits_header ` record from a `pyfits.HDUList` will be something like this:

import opdb.obslog
import astropy.io.fits as pyfits
from opdb.models import pfs_visit

db = SessionClass()

pfs_visit_id = 66453
filestem = 'PFSA06645311'

hdul = pyfits.HDUList([
  pyfits.PrimaryHDU(header=pyfits.Header([
    ('PI', 3.14, "ratio of a circle's circumference to its diameter"),
  ])),
])

db.add_all(opdb.obslog.fits_headers_from_hdulist(hdul, pfs_visit_id, filestem):
db.commit()

Is it feasible to add some codes like the above to your code?

Sorry for the delayed progress.

Comment by cloomis [ 18/Nov/21 ]

[ Sorry, I missed your much much earlier request. Yes, I'm fine with the clever JSONB encoding. ]

I have opened INSTRM-1463 to insert the headers into opdb just before gen2 archiving – that is not the earliest possible time but close.

Comment by michitaro [ 19/Nov/21 ]

Thanks! 

Comment by michitaro [ 09/Dec/21 ]

I have added the header cards in existing sps and mcs FITS files to opdb.

Now we can query visits by their FITS header value on obslog like this:

 

where fits_header['CHECKSUM'] = 'V4faV3fYV3faV3fW'

where fits_header['OBSERVER'] like '%tamura%'

where fits_header['DET-TMP']::float between 163 and 164

where fits_header['W_ENIISH'] = 'true' -- not 'T' or 'F'

Query on a boolean value, we should use 'true' and 'false' instead of 'T' or 'F.'

 

 

Comment by Kiyoto Yabe [ 09/Dec/21 ]

Thank you, Koike-san! That would be very useful.

Can we combine the previous search method (for instance just typing `defocused` to select defocused data) and this new query method (for instance, `where fits_header['EXPTIME']::float < 10`)?

In other words, how can we query to obslog columns such as description and notes? Sorry, but I probably don't understand the search mechanism.

Comment by michitaro [ 09/Dec/21 ]

I have to make some documents for this.

If the text in the search box starts with `where ` (like the example above), the search query will be sent to the server as is.

But if the search query does not start with `where ` (for instance, just `defocused`) the search query will be compiled into a particular format before the query will be sent to the server.

For the case of `defocused`, it will be compiled into `where any_column like '%defocused%'.

`any_column` is a special virtual column that will be expanded to some real columns in the opdb.

# any_column will be expanded the columns below
M.pfs_visit.pfs_visit_id, String,
M.pfs_visit.pfs_visit_description,
M.obslog_visit_note.body,
visit_note_user.account_name,
M.obslog_visit_set_note.body,
visit_set_note_user.account_name,
M.iic_sequence.name,
M.iic_sequence.sequence_type,
M.sps_annotation.notes,
M.obslog_mcs_exposure_note.body,
mcs_exposure_note_user.account_name,

In addition to `fits_header`, we can use more virtual columns below.

(ast.String('any_column'),): AnyColumn,
(ast.String('visit_id'),): M.pfs_visit.pfs_visit_id,
(ast.String('id'),): M.pfs_visit.pfs_visit_id,
(ast.String('sequence_type'),): M.iic_sequence.sequence_type,
(ast.String('issued_at'),): M.pfs_visit.issued_at,
(ast.String('is_sps_visit'),): M.sps_visit.pfs_visit_id != None,
(ast.String('is_mcs_visit'),): M.mcs_exposure.mcs_frame_id != None,
(ast.String('is_agc_visit'),): M.agc_exposure.agc_exposure_id != None,
(ast.String('fits_header'),): M.obslog_fits_header.cards_dict,

 
> Can we combine the previous search method
Yes, `where any_column like '%defocus%' and fits_header['EXPTIME']::float < 10` will work for you.
 
Anyway, I will make documents for this soon.
 

Comment by Kiyoto Yabe [ 09/Dec/21 ]

Thank you! That information is very helpful. And thank you for documenting.

Comment by hassan [ 30/Mar/22 ]

michitaro Kiyoto Yabe Is this ticket complete? Can we close this?

Comment by michitaro [ 30/Mar/22 ]

Yes, we can close this ticket, but we need to register FITS files on the DB manually periodically.
The automatic registration will be handled in this ticket↓.
https://pfspipe.ipmu.jp/jira/browse/INSTRM-1463

Comment by yuki.moritani [ 14/Oct/22 ]

Following the above comments, I close this ticket.

Generated at Sat Feb 10 16:34:43 JST 2024 using Jira 8.3.4#803005-sha1:1f96e09b3c60279a408a2ae47be3c745f571388b.