[PIPE2D-571] Ingest into a common data repo Created: 05/May/20 Updated: 05/Jan/21 Resolved: 22/May/20 |
|
| Status: | Done |
| Project: | DRP 2-D Pipeline |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Story | Priority: | Normal |
| Reporter: | price | Assignee: | price |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Story Points: | 3 |
| Sprint: | 2DDRP-2021 A |
| Reviewers: | hassan |
| Description |
|
We have been generating data repos individually. Instead, we should use a shared data repo, as we do for HSC. This makes reducing data easier, and also eases collaboration.
|
| Comments |
| Comment by price [ 16/May/20 ] |
|
Cleaning up all the raw files first.

pprice@tiger2-sumire:/projects/HSC/PFS/LAM-raw $ find . | xargs -n 20 -P 10 md5sum > ~/LAM-raw.txt
pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw $ find . | xargs -n 20 -P 10 md5sum > ~/raw.txt

>>> with open("raw.txt") as fd:
...     raw = dict(line.strip().split() for line in fd.readlines())
>>> with open("LAM-raw.txt") as fd:
...     lamRaw = dict(line.strip().split() for line in fd.readlines())
>>> len(set(lamRaw.keys()) - set(raw.keys()))
0
>>> len(set(lamRaw.values()) - set(raw.values()))
0

==> Blew away /projects/HSC/PFS/LAM-raw, as everything is in /projects/HSC/PFS/LAM/raw.

pprice@tiger2-sumire:/projects/HSC/PFS/JHU/raw $ find . | xargs -n 100 -P 20 md5sum > ~/jhu.txt
pprice@tiger2-sumire:/projects/HSC/PFS/raw $ find . | xargs -n 100 -P 20 md5sum > ~/raw.txt
pprice@tiger2-sumire:~ $ diff --new-line-format="" --unchanged-line-format="" <(sort raw.txt) <(sort jhu.txt)
761aad7694f64002e40f6b7005eb90ea ./INVENTORY_NOTES.txt
c56bfdedd8287bc8cba86b14de520076 ./INVENTORY_NOTES.txt~

>>> with open("raw.txt") as fd:
...     raw = dict(line.strip().split() for line in fd.readlines())
...
>>> with open("jhu.txt") as fd:
...     jhu = dict(line.strip().split() for line in fd.readlines())
...
>>> len(raw)
1523
>>> len(jhu)
4477
>>> len(set(raw.keys()) - set(jhu.keys()))
1
>>> set(raw.keys()) - set(jhu.keys())
{'c56bfdedd8287bc8cba86b14de520076'}
>>> raw['c56bfdedd8287bc8cba86b14de520076']
'./INVENTORY_NOTES.txt~'

Craig Loomis 3:32 PM
That `raw` directory and INVENTORY_NOTES.txt represents quite a bit of time. Basically, it selects the very early r1 and b1 which are actually useable.
So I’d put in JHU/keep/

==> Moved /projects/HSC/PFS/raw to /projects/HSC/PFS/JHU/keep

pprice@tiger2-sumire:/projects/HSC/PFS/LAM $ for dd in HgArFeb2019_raw KrFeb2019_raw NeonApr2019_raw NeonFeb2019_raw ; do find $dd | xargs -n 10 -P 20 md5sum > $dd.txt ; done
pprice@tiger2-sumire:/projects/HSC/PFS/LAM $ find raw | xargs -n 100 -P 20 md5sum > raw.txt
pprice@tiger2-sumire:/projects/HSC/PFS/LAM $ diff --new-line-format="" --unchanged-line-format="" <(awk '{print $1}' HgArFeb2019_raw.txt | sort -u) <(awk '{print $1}' raw.txt | sort -u)
pprice@tiger2-sumire:/projects/HSC/PFS/LAM $ diff --new-line-format="" --unchanged-line-format="" <(awk '{print $1}' KrFeb2019_raw.txt | sort -u) <(awk '{print $1}' raw.txt | sort -u)
pprice@tiger2-sumire:/projects/HSC/PFS/LAM $ diff --new-line-format="" --unchanged-line-format="" <(awk '{print $1}' NeonApr2019_raw.txt | sort -u) <(awk '{print $1}' raw.txt | sort -u)
pprice@tiger2-sumire:/projects/HSC/PFS/LAM $ diff --new-line-format="" --unchanged-line-format="" <(awk '{print $1}' NeonFeb2019_raw.txt | sort -u) <(awk '{print $1}' raw.txt | sort -u)

==> Can blow away HgArFeb2019_raw KrFeb2019_raw NeonApr2019_raw NeonFeb2019_raw.
==> Done. |
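The duplicate check in the session above (hash every file, then confirm that each checksum in the directory to be deleted also appears in the directory being kept) could be wrapped up roughly as follows. This is only a sketch: the function names `load_md5_listing` and `missing_from` are made up for illustration, and it assumes the listings are plain `md5sum` output such as `~/raw.txt` and `~/jhu.txt`.

```python
from pathlib import Path


def load_md5_listing(path):
    """Parse md5sum output into a {checksum: filename} mapping.

    Each line looks like '<checksum>  <filename>'. Files with identical
    content share a checksum, so a later line overwrites an earlier one,
    as in the dict(...) one-liner used in the session.
    """
    listing = {}
    for line in Path(path).read_text().splitlines():
        # maxsplit=1 keeps filenames that contain spaces in one piece.
        parts = line.strip().split(maxsplit=1)
        if len(parts) == 2:
            checksum, filename = parts
            listing[checksum] = filename
    return listing


def missing_from(candidate, reference):
    """Return the entries of candidate whose checksum is absent from reference."""
    return {md5: name for md5, name in candidate.items() if md5 not in reference}
```

An empty result from `missing_from(load_md5_listing("raw.txt"), load_md5_listing("jhu.txt"))` would mean every file's content in the first listing is also present in the second, whatever the file is called there.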
| Comment by price [ 22/May/20 ] |
Neven Caplar 4:56 PM
I suggest Sep 09 2018 as starting date (start of https://people.lam.fr/madec.fabrice/pfs/ait_logbook_SM1.html)

(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw $ ingestPfsImages.py /projects/HSC/PFS/LAM --pfsConfigDir /projects/HSC/PFS/LAM/raw/pfsDesign --config parse.pfsDesignId=1099528409104 clobber=True -- '201[89]-*/PFLA*.fits'
RuntimeError: Unable to find PfsConfig or PfsDesign for pfsDesignId=0x0000010001001000
RuntimeError: Unable to find PfsConfig or PfsDesign for pfsDesignId=0x0000100000001111

>>> pfs.utils.dummyCableB.DummyCableBDatabase().interpret(0x0000010001001000)
['red1', 'red4', 'red8']
>>> pfs.utils.dummyCableB.DummyCableBDatabase().interpret(0x0000100000001111)
['blue', 'green', 'orange', 'red1', 'yellow']

(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw/pfsDesign $ makeDummyCableBDesign.py red1 red4 red8
(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw/pfsDesign $ makeDummyCableBDesign.py blue green orange red1 yellow
ValueError: could not convert string to float: 'NO CURRENT VALUE'

Fixed on tickets/PIPE2D-571

(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw $ ingestPfsImages.py /projects/HSC/PFS/LAM --pfsConfigDir /projects/HSC/PFS/LAM/raw/pfsDesign --config parse.pfsDesignId=1099528409104 -- '201[89]-*/PFLA*.fits' | tee ingest.log
(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw $ grep "Unable to find PfsConfig" ingest.log | sed -e 's|^.*pfsDesignId=||' | sort -u
0x0000000000000001
0x0000000000000010
0x0000000001000000

>>> pfs.utils.dummyCableB.DummyCableBDatabase().interpret(0x0000000000000001)
['blue']
>>> pfs.utils.dummyCableB.DummyCableBDatabase().interpret(0x0000000000000010)
['green']
>>> pfs.utils.dummyCableB.DummyCableBDatabase().interpret(0x0000000001000000)
['red4']

(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw/pfsDesign $ makeDummyCableBDesign.py blue
Wrote pfsDesign-0x0000000000000001.fits
(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw/pfsDesign $ makeDummyCableBDesign.py green
Wrote pfsDesign-0x0000000000000010.fits
(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/LAM/raw/pfsDesign $ makeDummyCableBDesign.py red4
Wrote pfsDesign-0x0000000001000000.fits

66 files have multiple header problems (e.g., no DATE-OBS, EXPTIME, IMAGETYP) that aren't worth saving for now.

price@price-laptop:~/pfs/obs_pfs (tickets/PIPE2D-571=) $ grep "No locations for get" ingest.log | sed -e 's|^.*Failed to ingest file \(.*\.fits\):.*$|\1|'
2019-03-01/PFLA01340412.fits
2019-03-01/PFLA01340812.fits
2019-03-01/PFLA01340712.fits
2019-03-01/PFLA01340612.fits
2019-03-01/PFLA01340512.fits
2019-02-22/PFLA01253114.fits
2019-02-22/PFLA01254114.fits
2019-02-22/PFLA01253714.fits
2019-02-22/PFLA01253214.fits
2019-02-22/PFLA01253314.fits
2019-02-22/PFLA01252914.fits
2019-02-22/PFLA01253814.fits
2019-02-22/PFLA01253914.fits
2019-02-22/PFLA01253014.fits
2019-02-22/PFLA01254014.fits
2019-02-28/PFLA01317312.fits
2019-02-28/PFLA01316212.fits
2019-02-28/PFLA01316912.fits
2019-02-28/PFLA01315912.fits
2019-02-28/PFLA01318212.fits
2019-02-28/PFLA01317712.fits
2019-02-28/PFLA01318112.fits
2019-02-28/PFLA01316612.fits
2019-02-28/PFLA01317512.fits
2019-02-28/PFLA01316112.fits
2019-02-28/PFLA01316812.fits
2019-02-28/PFLA01317212.fits
2019-02-28/PFLA01317012.fits
2019-02-28/PFLA01317912.fits
2019-02-28/PFLA01317612.fits
2019-02-28/PFLA01316312.fits
2019-02-28/PFLA01317112.fits
2019-02-28/PFLA01316512.fits
2019-02-28/PFLA01317412.fits
2019-02-28/PFLA01318012.fits
2019-02-28/PFLA01317812.fits
2019-02-28/PFLA01316012.fits
2019-02-28/PFLA01316412.fits
2019-02-28/PFLA01316712.fits
2019-03-05/PFLA01351311.fits
2019-02-27/PFLA01314112.fits
2019-02-27/PFLA01313912.fits
2019-02-27/PFLA01314612.fits
2019-02-27/PFLA01314012.fits
2019-02-27/PFLA01314512.fits
2019-02-27/PFLA01315412.fits
2019-02-27/PFLA01313812.fits
2019-02-27/PFLA01292014.fits
2019-02-27/PFLA01315712.fits
2019-02-27/PFLA01315612.fits
2019-02-27/PFLA01292114.fits
2019-02-27/PFLA01315312.fits
2019-02-27/PFLA01315212.fits
2019-02-27/PFLA01314712.fits
2019-02-27/PFLA01291714.fits
2019-02-27/PFLA01315812.fits
2019-02-27/PFLA01314312.fits
2019-02-27/PFLA01315512.fits
2019-02-27/PFLA01315112.fits
2019-02-27/PFLA01291814.fits
2019-02-27/PFLA01314812.fits
2019-02-27/PFLA01291914.fits
2019-02-27/PFLA01314212.fits
2019-02-27/PFLA01314412.fits
2019-02-27/PFLA01314912.fits
2019-02-27/PFLA01315012.fits

ingestPfs WARN: Failed to ingest file 2019-02-18/PFLA01249512.fits:
  File "src/PropertySet.cc", line 486, in void lsst::daf::base::PropertySet::add(const string&, const T&) [with T = double; std::string = std::basic_string<char>]
    W_ENFCAX has mismatched type {0}
lsst::pex::exceptions::TypeError: 'W_ENFCAX has mismatched type'

Also for 2019-02-18/PFLA01249612.fits.
They have LOTS of duplicate header values with different types that kills lsst.afw.fits.readMetadata. Not going to bother fixing them either.

(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/Subaru/raw $ gethead */PFSA*.fits W_PFDSGN > pfsDesignId.txt
(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/Subaru/raw $ awk '{print $2}' pfsDesignId.txt | sort -u
1099528409104
16
4503599627370496
72057594037927936
-9998
(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/Subaru/raw $ grep -- '-9998' pfsDesignId.txt
PFSA00000214.fits -9998

A single file has a bad pfsDesignId (negative!?). Looks like it's one of the first exposures, so don't care.

(lsst-scipipe) pprice@tiger2-sumire:/projects/HSC/PFS/Subaru/raw $ ingestPfsImages.py /projects/HSC/PFS/Subaru --pfsConfigDir /projects/HSC/PFS/Subaru/drp/pfsDesign --config parse.pfsDesignId=1099528409104 -- '*/PFSA*.fits' 2>&1 | tee ingest.log |
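The two grep and sed pipelines run against ingest.log in this comment could be done in one pass with a short Python script. The sketch below assumes only the message formats that appear in the comment ("Unable to find PfsConfig ... pfsDesignId=0x..." and "Failed to ingest file <name>.fits:"); the function name `summarize_ingest_log` is invented for illustration and is not part of obs_pfs.

```python
import re

# Patterns taken from the log messages quoted in the comment.
MISSING_DESIGN = re.compile(r"Unable to find PfsConfig.*pfsDesignId=(0x[0-9a-fA-F]+)")
FAILED_FILE = re.compile(r"Failed to ingest file (\S+\.fits):")


def summarize_ingest_log(lines, reason=None):
    """Collect unique missing pfsDesignIds and the files that failed to ingest.

    If reason is given (e.g. "No locations for get"), only failures whose
    log line contains that text are reported, like the grep filter.
    """
    missing_ids = set()
    failed_files = []
    for line in lines:
        match = MISSING_DESIGN.search(line)
        if match:
            missing_ids.add(match.group(1))
        match = FAILED_FILE.search(line)
        if match and (reason is None or reason in line):
            failed_files.append(match.group(1))
    # Sorted, de-duplicated IDs mirror the `sort -u` step.
    return sorted(missing_ids), failed_files
```

Typical use would be `summarize_ingest_log(open("ingest.log"), reason="No locations for get")`, which gives both the designs still to be generated and the files to inspect.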
| Comment by price [ 22/May/20 ] |
|
hassan, could you please review these changes to obs_pfs to support the ingestion of LAM data? |
| Comment by price [ 22/May/20 ] |
|
cloomis: The command used to ingest was:
ingestPfsImages.py /projects/HSC/PFS/Subaru --pfsConfigDir /projects/HSC/PFS/Subaru/drp/pfsDesign --config parse.pfsDesignId=1099528409104 -- '*/PFSA*.fits'
I didn't add -c clobber=True register.ignore=True because this ingest was from scratch. You shouldn't need those options if you only run the command on new data. |
| Comment by price [ 22/May/20 ] |
|
Merged to master. |
| Comment by cloomis [ 22/May/20 ] |
|
One question, one point: Is that --config parse.pfsDesignId=1099528409104 required? Or does it usually use the W_PFDSGN value? So we need to have a populated /projects/HSC/PFS/Subaru/drp/pfsDesign/ directory before calling ingest. That will take a bit of attention at the Subaru side. |
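One way to check ahead of time that the pfsDesign directory is populated is to compare the pfsDesignId values found in the headers against the files on disk. The sketch below assumes the file naming seen earlier in this ticket ("pfsDesign-0x0000000000000001.fits", i.e. the ID as 16 zero-padded hex digits); the helper names are hypothetical and this is not how ingestPfsImages.py itself performs the lookup.

```python
from pathlib import Path


def design_filename(pfs_design_id):
    """Build the file name for a pfsDesignId, assuming 16 zero-padded hex digits."""
    return f"pfsDesign-0x{pfs_design_id:016x}.fits"


def missing_designs(design_dir, pfs_design_ids):
    """Return the IDs (sorted, de-duplicated) with no matching file in design_dir."""
    design_dir = Path(design_dir)
    return [
        design_id
        for design_id in sorted(set(pfs_design_ids))
        if not (design_dir / design_filename(design_id)).exists()
    ]
```

For example, the decimal values reported by gethead (such as 1099528409104 or 16) can be passed directly to `missing_designs("/projects/HSC/PFS/Subaru/drp/pfsDesign", ids)` to list the designs that still need to be created before ingesting.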
| Comment by price [ 22/May/20 ] |
|
It usually uses the W_PFDSGN value. parse.pfsDesignId provides a default value in the event W_PFDSGN isn't supplied (which was necessary for old LAM data, but probably less necessary for modern Subaru data). |
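The fallback described here can be pictured with a few lines of Python. This is purely illustrative and is not the obs_pfs implementation: the function name `resolve_pfs_design_id` is made up, and the header is modelled as a plain dict.

```python
def resolve_pfs_design_id(header, default=None):
    """Prefer the W_PFDSGN header value; fall back to the configured default.

    header is a dict-like set of FITS header cards (an assumption for this
    sketch); default plays the role of parse.pfsDesignId.
    """
    value = header.get("W_PFDSGN")
    if value is not None:
        return int(value)
    if default is None:
        raise KeyError("W_PFDSGN is missing and no default pfsDesignId was configured")
    return default
```

With this picture, a modern Subaru exposure carrying W_PFDSGN ignores the default entirely, while an old LAM exposure without the keyword picks up the value given on the command line.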