[PIPE2D-571] Ingest into a common data repo Created: 05/May/20  Updated: 05/Jan/21  Resolved: 22/May/20

Status: Done
Project: DRP 2-D Pipeline
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Story Priority: Normal
Reporter: price Assignee: price
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File ingest-LAM.log     Text File ingest-subaru.log    
Issue Links:
Relates
relates to INSTRM-823 Restructure /data at Subaru Done
Story Points: 3
Sprint: 2DDRP-2021 A
Reviewers: hassan

 Description   

We have been generating data repos individually. Instead, we should use a shared data repo, as we do for HSC. This makes reducing data easier, and also eases collaboration.

  • Clean up the different raw repos on /tigress
  • Ingest all data (LAM, Subaru)
    • Identify and fix data that doesn't ingest
  • Send cloomis the command for ingestion so that his script can run it regularly on new data.


 Comments   
Comment by price [ 16/May/20 ]

Cleaning up all the raw files first.

pprice@tiger2-sumire:/projects/HSC/PFS/LAM-raw $ find . | xargs -n 20 -P 10 md5sum > ~/LAM-raw.txt
pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw $ find . | xargs -n 20 -P 10 md5sum > ~/raw.txt

>>> with open("raw.txt") as fd:
...     raw = dict(line.strip().split() for line in fd.readlines())
...
>>> with open("LAM-raw.txt") as fd:
...     lamRaw = dict(line.strip().split() for line in fd.readlines())
...
>>> len(set(lamRaw.keys()) - set(raw.keys()))
0
>>> len(set(lamRaw.values()) - set(raw.values()))
0

==> Blew away /projects/HSC/PFS/LAM-raw, as everything is in /projects/HSC/PFS/LAM/raw.
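
The same check can be scripted. A minimal sketch, assuming each manifest line is the usual md5sum output of a hash followed by a path; load_manifest and missing_hashes are hypothetical helper names, not part of any PFS package. Splitting only on the first run of whitespace keeps paths that contain spaces intact, and keeping a list of paths per hash means duplicate files are not silently dropped.

```python
from collections import defaultdict


def load_manifest(path):
    """Map each MD5 hash in an md5sum manifest to the list of paths that have it."""
    by_hash = defaultdict(list)
    with open(path) as fd:
        for line in fd:
            line = line.rstrip("\n")
            if not line.strip():
                continue
            # md5sum prints "<hash>  <path>"; split once so spaces in paths survive.
            digest, name = line.split(None, 1)
            by_hash[digest].append(name)
    return dict(by_hash)


def missing_hashes(candidate, reference):
    """Return the hashes present in candidate but absent from reference."""
    return set(candidate) - set(reference)
```

An empty result from missing_hashes(load_manifest("LAM-raw.txt"), load_manifest("raw.txt")) corresponds to the first 0 printed above: every file content in LAM-raw also exists somewhere in LAM/raw.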


pprice@tiger2-sumire:/projects/HSC/PFS/JHU/raw $ find . | xargs -n 100 -P 20 md5sum > ~/jhu.txt
pprice@tiger2-sumire:/projects/HSC/PFS/raw $ find . | xargs -n 100 -P 20 md5sum > ~/raw.txt
pprice@tiger2-sumire:~ $ diff --new-line-format="" --unchanged-line-format="" <(sort raw.txt) <(sort jhu.txt)
761aad7694f64002e40f6b7005eb90ea  ./INVENTORY_NOTES.txt
c56bfdedd8287bc8cba86b14de520076  ./INVENTORY_NOTES.txt~

>>> with open("raw.txt") as fd:
...     raw = dict(line.strip().split() for line in fd.readlines())
... 
>>> with open("jhu.txt") as fd:
...     jhu = dict(line.strip().split() for line in fd.readlines())
... 
>>> len(raw)
1523
>>> len(jhu)
4477
>>> len(set(raw.keys()) - set(jhu.keys()))
1
>>> set(raw.keys()) - set(jhu.keys())
{'c56bfdedd8287bc8cba86b14de520076'}
>>> raw['c56bfdedd8287bc8cba86b14de520076']
'./INVENTORY_NOTES.txt~'

Craig Loomis  3:32 PM
That `raw` directory and INVENTORY_NOTES.txt represent quite a bit of time.  Basically, it selects the very early r1 and b1 which are actually usable. So I’d put it in JHU/keep/

==> Moved /projects/HSC/PFS/raw to /projects/HSC/PFS/JHU/keep


pprice@tiger2-sumire:/projects/HSC/PFS/LAM $ for dd in HgArFeb2019_raw KrFeb2019_raw NeonApr2019_raw NeonFeb2019_raw ; do find $dd | xargs -n 10 -P 20 md5sum > $dd.txt ; done
pprice@tiger2-sumire:/projects/HSC/PFS/LAM $ find raw | xargs -n 100 -P 20 md5sum > raw.txt
pprice@tiger2-sumire:/projects/HSC/PFS/LAM $ diff --new-line-format="" --unchanged-line-format="" <(awk '{print $1}' HgArFeb2019_raw.txt | sort -u) <(awk '{print $1}' raw.txt | sort -u)
pprice@tiger2-sumire:/projects/HSC/PFS/LAM $ diff --new-line-format="" --unchanged-line-format="" <(awk '{print $1}' KrFeb2019_raw.txt | sort -u) <(awk '{print $1}' raw.txt | sort -u)
pprice@tiger2-sumire:/projects/HSC/PFS/LAM $ diff --new-line-format="" --unchanged-line-format="" <(awk '{print $1}' NeonApr2019_raw.txt | sort -u) <(awk '{print $1}' raw.txt | sort -u)
pprice@tiger2-sumire:/projects/HSC/PFS/LAM $ diff --new-line-format="" --unchanged-line-format="" <(awk '{print $1}' NeonFeb2019_raw.txt | sort -u) <(awk '{print $1}' raw.txt | sort -u)

==> Can blow away HgArFeb2019_raw KrFeb2019_raw NeonApr2019_raw NeonFeb2019_raw.
==> Done.
Comment by price [ 22/May/20 ]
Neven Caplar  4:56 PM
I suggest Sep 09 2018 as starting date (start of https://people.lam.fr/madec.fabrice/pfs/ait_logbook_SM1.html)


(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw $ ingestPfsImages.py /projects/HSC/PFS/LAM --pfsConfigDir /projects/HSC/PFS/LAM/raw/pfsDesign --config parse.pfsDesignId=1099528409104 clobber=True -- '201[89]-*/PFLA*.fits'


RuntimeError: Unable to find PfsConfig or PfsDesign for pfsDesignId=0x0000010001001000
RuntimeError: Unable to find PfsConfig or PfsDesign for pfsDesignId=0x0000100000001111

>>> pfs.utils.dummyCableB.DummyCableBDatabase().interpret(0x0000010001001000)
['red1', 'red4', 'red8']
>>> pfs.utils.dummyCableB.DummyCableBDatabase().interpret(0x0000100000001111)
['blue', 'green', 'orange', 'red1', 'yellow']

(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw/pfsDesign $ makeDummyCableBDesign.py red1 red4 red8
(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw/pfsDesign $ makeDummyCableBDesign.py blue green orange red1 yellow

ValueError: could not convert string to float: 'NO CURRENT VALUE'
Fixed on tickets/PIPE2D-571

(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw $ ingestPfsImages.py /projects/HSC/PFS/LAM --pfsConfigDir /projects/HSC/PFS/LAM/raw/pfsDesign --config parse.pfsDesignId=1099528409104 -- '201[89]-*/PFLA*.fits' | tee ingest.log

(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw $ grep "Unable to find PfsConfig" ingest.log | sed -e 's|^.*pfsDesignId=||' | sort -u
0x0000000000000001
0x0000000000000010
0x0000000001000000

>>> pfs.utils.dummyCableB.DummyCableBDatabase().interpret(0x0000000000000001)
['blue']
>>> pfs.utils.dummyCableB.DummyCableBDatabase().interpret(0x0000000000000010)
['green']
>>> pfs.utils.dummyCableB.DummyCableBDatabase().interpret(0x0000000001000000)
['red4']

(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw/pfsDesign $ makeDummyCableBDesign.py blue
Wrote pfsDesign-0x0000000000000001.fits
(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw/pfsDesign $ makeDummyCableBDesign.py green
Wrote pfsDesign-0x0000000000000010.fits
(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw/pfsDesign $ makeDummyCableBDesign.py red4
Wrote pfsDesign-0x0000000001000000.fits
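
The hex IDs above are bitmasks, so the missing designs can be listed and decoded in one pass. A rough sketch, assuming the log lines end with pfsDesignId=0x... as in the errors quoted here; missing_design_ids and set_bits are illustrative names, and the bit-to-name table holds only the three single-bit cases confirmed by interpret() in this comment, not the full mapping in pfs.utils.dummyCableB.

```python
import re

# Only the single-bit cases confirmed by interpret() in this ticket; the
# complete table lives in pfs.utils.dummyCableB and is not reproduced here.
KNOWN_BITS = {0: "blue", 4: "green", 24: "red4"}

_DESIGN_RE = re.compile(r"Unable to find PfsConfig.*pfsDesignId=(0x[0-9a-fA-F]+)")


def missing_design_ids(lines):
    """Return the sorted unique pfsDesignId values named in 'Unable to find' errors."""
    found = set()
    for line in lines:
        match = _DESIGN_RE.search(line)
        if match:
            found.add(int(match.group(1), 16))
    return sorted(found)


def set_bits(design_id):
    """Return the indices of the bits set in a non-negative pfsDesignId."""
    return [bit for bit in range(design_id.bit_length()) if (design_id >> bit) & 1]
```

For example, set_bits(0x0000010001001000) gives [12, 24, 40], and the default parse.pfsDesignId=1099528409104 decodes to bits [4, 12, 24, 40].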


66 files have multiple header problems (e.g., missing DATE-OBS, EXPTIME, IMAGETYP)
and aren't worth rescuing for now.

price@price-laptop:~/pfs/obs_pfs (tickets/PIPE2D-571=) $ grep "No locations for get" ingest.log | sed -e 's|^.*Failed to ingest file \(.*\.fits\):.*$|\1|'
2019-03-01/PFLA01340412.fits
2019-03-01/PFLA01340812.fits
2019-03-01/PFLA01340712.fits
2019-03-01/PFLA01340612.fits
2019-03-01/PFLA01340512.fits
2019-02-22/PFLA01253114.fits
2019-02-22/PFLA01254114.fits
2019-02-22/PFLA01253714.fits
2019-02-22/PFLA01253214.fits
2019-02-22/PFLA01253314.fits
2019-02-22/PFLA01252914.fits
2019-02-22/PFLA01253814.fits
2019-02-22/PFLA01253914.fits
2019-02-22/PFLA01253014.fits
2019-02-22/PFLA01254014.fits
2019-02-28/PFLA01317312.fits
2019-02-28/PFLA01316212.fits
2019-02-28/PFLA01316912.fits
2019-02-28/PFLA01315912.fits
2019-02-28/PFLA01318212.fits
2019-02-28/PFLA01317712.fits
2019-02-28/PFLA01318112.fits
2019-02-28/PFLA01316612.fits
2019-02-28/PFLA01317512.fits
2019-02-28/PFLA01316112.fits
2019-02-28/PFLA01316812.fits
2019-02-28/PFLA01317212.fits
2019-02-28/PFLA01317012.fits
2019-02-28/PFLA01317912.fits
2019-02-28/PFLA01317612.fits
2019-02-28/PFLA01316312.fits
2019-02-28/PFLA01317112.fits
2019-02-28/PFLA01316512.fits
2019-02-28/PFLA01317412.fits
2019-02-28/PFLA01318012.fits
2019-02-28/PFLA01317812.fits
2019-02-28/PFLA01316012.fits
2019-02-28/PFLA01316412.fits
2019-02-28/PFLA01316712.fits
2019-03-05/PFLA01351311.fits
2019-02-27/PFLA01314112.fits
2019-02-27/PFLA01313912.fits
2019-02-27/PFLA01314612.fits
2019-02-27/PFLA01314012.fits
2019-02-27/PFLA01314512.fits
2019-02-27/PFLA01315412.fits
2019-02-27/PFLA01313812.fits
2019-02-27/PFLA01292014.fits
2019-02-27/PFLA01315712.fits
2019-02-27/PFLA01315612.fits
2019-02-27/PFLA01292114.fits
2019-02-27/PFLA01315312.fits
2019-02-27/PFLA01315212.fits
2019-02-27/PFLA01314712.fits
2019-02-27/PFLA01291714.fits
2019-02-27/PFLA01315812.fits
2019-02-27/PFLA01314312.fits
2019-02-27/PFLA01315512.fits
2019-02-27/PFLA01315112.fits
2019-02-27/PFLA01291814.fits
2019-02-27/PFLA01314812.fits
2019-02-27/PFLA01291914.fits
2019-02-27/PFLA01314212.fits
2019-02-27/PFLA01314412.fits
2019-02-27/PFLA01314912.fits
2019-02-27/PFLA01315012.fits
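
To see how the failures cluster by night, the same log lines can be tallied per date directory. A sketch under the assumption that each warning contains "Failed to ingest file <date>/<name>.fits:" (the pattern the sed expression above matches); failures_by_date is a hypothetical helper, not an obs_pfs function.

```python
import re
from collections import Counter

_FAILED_RE = re.compile(r"Failed to ingest file (\S+\.fits):")


def failures_by_date(lines, needle="No locations for get"):
    """Count failed files per date directory among log lines containing needle."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = _FAILED_RE.search(line)
        if match:
            # The path looks like "2019-02-27/PFLA01314112.fits"; keep the directory.
            date_dir = match.group(1).split("/", 1)[0]
            counts[date_dir] += 1
    return counts
```

Run over the log that produced the list above, this should report 26 files for 2019-02-27, 24 for 2019-02-28, 10 for 2019-02-22, 5 for 2019-03-01 and 1 for 2019-03-05, which adds up to the 66 files.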


ingestPfs WARN: Failed to ingest file 2019-02-18/PFLA01249512.fits: 
  File "src/PropertySet.cc", line 486, in void lsst::daf::base::PropertySet::add(const string&, const T&) [with T = double; std::string = std::basic_string<char>]
    W_ENFCAX has mismatched type {0}
lsst::pex::exceptions::TypeError: 'W_ENFCAX has mismatched type'

Also for 2019-02-18/PFLA01249612.fits

They have LOTS of duplicate header values with different types, which kills lsst.afw.fits.readMetadata.
Not going to bother fixing them either.



(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/Subaru/raw $ gethead */PFSA*.fits W_PFDSGN > pfsDesignId.txt
(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/Subaru/raw $ awk '{print $2}' pfsDesignId.txt | sort -u
1099528409104
16
4503599627370496
72057594037927936
-9998
(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/Subaru/raw $ grep -- '-9998' pfsDesignId.txt 
PFSA00000214.fits -9998

A single file has a bad pfsDesignId (negative!?). It looks like one of the
first exposures, so it can be ignored.
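
A quick way to flag such files is to parse the gethead output directly. A minimal sketch, assuming each line holds a file name and a value as in pfsDesignId.txt above, and treating any negative or non-integer value as invalid (an assumption made here, not an obs_pfs rule); invalid_design_ids is an illustrative name.

```python
def invalid_design_ids(lines):
    """Return (filename, raw value) pairs whose pfsDesignId is not a non-negative integer."""
    bad = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            # Skip blank or malformed lines.
            continue
        name, value = fields
        try:
            ok = int(value) >= 0
        except ValueError:
            ok = False
        if not ok:
            bad.append((name, value))
    return bad
```

With the values listed above, only PFSA00000214.fits (-9998) would be reported.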


(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/Subaru/raw $ ingestPfsImages.py /projects/HSC/PFS/Subaru --pfsConfigDir /projects/HSC/PFS/Subaru/drp/pfsDesign --config parse.pfsDesignId=1099528409104 -- '*/PFSA*.fits' 2>&1 | tee ingest.log
Comment by price [ 22/May/20 ]

hassan, could you please review these changes to obs_pfs to support the ingestion of LAM data?

Comment by price [ 22/May/20 ]

cloomis: The command used to ingest was:

ingestPfsImages.py /projects/HSC/PFS/Subaru --pfsConfigDir /projects/HSC/PFS/Subaru/drp/pfsDesign --config parse.pfsDesignId=1099528409104 -- '*/PFSA*.fits'

I didn't add -c clobber=True register.ignore=True because this ingest was from scratch. You shouldn't need those options if you only run the command on new data.

Comment by price [ 22/May/20 ]

Merged to master.

Comment by cloomis [ 22/May/20 ]

One question, one point:

Is that --config parse.pfsDesignId=1099528409104 required? Or does it usually use the W_PFDSGN value?

So we need to have a populated /projects/HSC/PFS/Subaru/drp/pfsDesign/ directory before calling ingest. That will take a bit of attention on the Subaru side.

Comment by price [ 22/May/20 ]

It usually uses the W_PFDSGN value. parse.pfsDesignId provides a default value in the event W_PFDSGN isn't supplied (which was necessary for old LAM data, but probably less necessary for modern Subaru data).
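
In other words, the configured value acts only as a fallback. A minimal sketch of that behaviour, assuming the FITS header is available as a plain mapping; resolve_design_id is a hypothetical function for illustration and not the obs_pfs implementation.

```python
def resolve_design_id(header, default):
    """Return the W_PFDSGN header value if present, else the configured default."""
    value = header.get("W_PFDSGN")
    if value is None:
        # Header keyword missing (e.g. old LAM data): fall back to parse.pfsDesignId.
        return default
    return int(value)
```

So a header carrying W_PFDSGN=16 resolves to 16 regardless of the configured default, while a header without the keyword resolves to the default.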

Generated at Sat Feb 10 15:54:59 JST 2024 using Jira 8.3.4#803005-sha1:1f96e09b3c60279a408a2ae47be3c745f571388b.