[PIPE2D-1272] Switch drp summit registry to postgres Created: 26/Jul/23  Updated: 30/Aug/23

Status: Open
Project: DRP 2-D Pipeline
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Normal
Reporter: arnaud.lefur Assignee: arnaud.lefur
Resolution: Unresolved Votes: 0
Labels: engRun12
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to PIPE2D-1058 Avoid/fix registry sqlite deadlocks Done
Story Points: 1
Sprint: Eng12July, Eng13Oct

 Description   

Since summit drp is not heavily used we though we could get away with using sqlite but some ingest failures started to happen few days ago.

(rubin3_ics) pfs@shell-ics:/data/logs/actors/drp$ cat 2023-*|grep 'Failed to register'
2023-07-22 04:02:35.951Z actor.ingestPfs  30 ingest.py:627 Failed to ingest file /data/raw/2023-07-22/ramps/PFSB09719733.fits: Failed to register file /data/raw/2023-07-22/ramps/PFSB09719733.fits
2023-07-22 10:23:59.725Z actor.ingestPfs  30 ingest.py:627 Failed to ingest file /data/raw/2023-07-22/ramps/PFSB09731433.fits: Failed to register file /data/raw/2023-07-22/ramps/PFSB09731433.fits
2023-07-23 06:42:20.056Z actor.ingestPfs  30 ingest.py:627 Failed to ingest file /data/raw/2023-07-23/ramps/PFSB09748533.fits: Failed to register file /data/raw/2023-07-23/ramps/PFSB09748533.fits
2023-07-23 13:06:58.276Z actor.ingestPfs  30 ingest.py:627 Failed to ingest file /data/raw/2023-07-23/ramps/PFSB09757133.fits: Failed to register file /data/raw/2023-07-23/ramps/PFSB09757133.fits
2023-07-23 14:10:09.782Z actor.ingestPfs  30 ingest.py:627 Failed to ingest file /data/raw/2023-07-23/ramps/PFSB09758713.fits: Failed to register file /data/raw/2023-07-23/ramps/PFSB09758713.fits
2023-07-24 12:43:33.564Z actor.ingestPfs  30 ingest.py:627 Failed to ingest file /data/raw/2023-07-24/ramps/PFSB09781333.fits: Failed to register file /data/raw/2023-07-24/ramps/PFSB09781333.fits
2023-07-24 21:06:02.068Z actor.ingestPfs  30 ingest.py:627 Failed to ingest file /data/raw/2023-07-24/ramps/PFSB09789733.fits: Failed to register file /data/raw/2023-07-24/ramps/PFSB09789733.fits
2023-07-25 11:05:10.116Z actor.ingestPfs  30 ingest.py:627 Failed to ingest file /data/raw/2023-07-25/ramps/PFSB09801433.fits: Failed to register file /data/raw/2023-07-25/ramps/PFSB09801433.fits
2023-07-25 13:24:19.964Z actor.ingestPfs  30 ingest.py:627 Failed to ingest file /data/raw/2023-07-25/sps/PFSA09804332.fits: Failed to register file /data/raw/2023-07-25/sps/PFSA09804332.fits
2023-07-25 21:38:30.050Z actor.ingestPfs  30 ingest.py:627 Failed to ingest file /data/raw/2023-07-25/ramps/PFSB09813513.fits: Failed to register file /data/raw/2023-07-25/ramps/PFSB09813513.fits
2023-07-25 22:30:51.503Z actor.ingestPfs  30 ingest.py:627 Failed to ingest file /data/raw/2023-07-25/ramps/PFSB09814833.fits: Failed to register file /data/raw/2023-07-25/ramps/PFSB09814833.fits
2023-07-25 22:42:22.287Z actor.ingestPfs  30 ingest.py:627 Failed to ingest file /data/raw/2023-07-25/ramps/PFSB09815133.fits: Failed to register file /data/raw/2023-07-25/ramps/PFSB09815133.fits
2023-07-26 05:26:15.771Z actor.ingestPfs  30 ingest.py:627 Failed to ingest file /data/raw/2023-07-26/ramps/PFSB09820413.fits: Failed to register file /data/raw/2023-07-26/ramps/PFSB09820413.fits
2023-07-26 08:42:29.993Z actor.ingestPfs  30 ingest.py:627 Failed to ingest file /data/raw/2023-07-26/sps/PFSA09833631.fits: Failed to register file /data/raw/2023-07-26/sps/PFSA09833631.fits

My guess is that it's failing because of concurrent ingest.
price I remember you did the migration on pfsa, if you have some instructions/scripts ready, I'd be happy to do that on shell2-ics.



 Comments   
Comment by cloomis [ 26/Jul/23 ]

The db name in Hilo is registry_db. We need to be sure that either that db on shell2-ics does not get replicated to Hilo, or that it gets a different name.....

Comment by price [ 27/Jul/23 ]

My notes from PIPE2D-1058:

PIPE2D-1058: Avoid/fix registry sqlite deadlocks

We're going to move the registry to postgresql. I've asked CSES for a new database on tiger2-sumire.

Testing on my laptop while waiting for CSES:

createuser -h localhost pfs
createdb -h localhost --owner=pfs pfs_gen2

sqlite3 registry.sqlite3 .dump | sed -e 's|^PRAGMA.*$||' -e 's|integer primary key autoincrement|serial primary key|g' -e 's|double|double precision|g' -e 's|^.*sqlite.*$||' -e 's|pfsDesignId int|pfsDesignId bigint|g' | tee registry.sql | psql -h localhost -U pfs pfs_gen2

Here's how to delete all the tables in a database in postgresql:

DO $$ DECLARE
    r RECORD;
BEGIN
    FOR r IN (SELECT tablename FROM pg_tables WHERE schemaname = current_schema()) LOOP
        EXECUTE 'DROP TABLE IF EXISTS ' || quote_ident(r.tablename) || ' CASCADE';
    END LOOP;
END $$;

pfs_gen2=> select field, pfsDesignId, count(*) from raw group by field, pfsDesignId order by field, pfsDesignId;

Results look identical (at least for the first page!) between postresql and sqlite.

Yabe-san created a database at Hilo:

    psql -h 133.40.152.102 -p 5433 -U pfs -d registry_gen2

The password is XXX

Save that password:

price@pfsa-usr01-gb:~ $ cat .pgpass 
*:*:registry_gen2:pfs:XXX

I'm working on pfsa-usr01-gb (via "ssh pfs" with the VPN), which I'm told is at Hilo. Looks like there's a data repo in /work/drp, and raw data in /data/raw, and the stack is /work/stack.

price@pfsa-usr01-gb:~ $ sqlite3 /work/drp/registry.sqlite3 .dump | sed -e 's|^PRAGMA.*$||' -e 's|integer primary key autoincrement|serial primary key|g' -e 's|double|double precision|g' -e 's|^.*sqlite.*$||' -e 's|pfsDesignId int|pfsDesignId bigint|g' | tee registry.sql | psql -h 133.40.152.102 -p 5433 -U pfs registry_gen2

Well, that worked just fine.

To use it, we need to put a "registry.pgsql" file in the repo:

price@pfsa-usr01-gb:~ $ cat /work/drp/registry.pgsql 
host: "133.40.152.102"
port: 5433
user: "pfs"
database: "registry_gen2"

(Password will be supplied via ~/.pgpass, which is better protected.)

price@pfsa-usr01-gb:~ $ . /work/stack/loadLSST.bash 
(lsst-scipipe-1172c30) price@pfsa-usr01-gb:~ $ setup pfs_pipe2d
(lsst-scipipe-1172c30) price@pfsa-usr01-gb:~ $ reduceExposure.py /work/drp --calib /work/drp/CALIB-20220630 --rerun price/pipe2d-1058 --id visit=72016 arm=r
root INFO: Loading config overrride file '/work/stack_INFRA-290/stack/miniconda3-4.5.12-1172c30/Linux64/obs_pfs/w.2022.17/config/reduceExposure.py'
CameraMapper INFO: Loading exposure registry from /work/drp/registry.pgsql
CameraMapper INFO: Loading calib registry from /work/drp/CALIB-20220630/calibRegistry.sqlite3
CameraMapper INFO: Loading calib registry from /work/drp/CALIB-20220630/calibRegistry.sqlite3
root WARN: No data found for dataId=OrderedDict([('visit', 72016), ('arm', 'r')])
root INFO: Running: /work/stack_INFRA-290/stack/miniconda3-4.5.12-1172c30/Linux64/drp_stella/w.2022.17/bin/reduceExposure.py /work/drp --calib /work/drp/CALIB-20220630 --rerun price/pipe2d-1058 --id visit=72016 arm=r
root WARN: Not running the task because there is no data to process; you may preview data using "--show data"

Well, that worked, sort of. It went to the correct registry, but there's no visit=72016 in there. I guess that means we can test the ingest!

(lsst-scipipe-1172c30) price@pfsa-usr01-gb:~ $ ingestPfsImages.py /work/drp "/data/raw/2022-02-22/sps/PFSA07*.fits" --mode=copy
CameraMapper INFO: Loading exposure registry from /work/drp/registry.pgsql
CameraMapper INFO: Loading calib registry from /work/drp/CALIB/calibRegistry.sqlite3
Traceback (most recent call last):
  File "/work/stack_INFRA-290/stack/miniconda3-4.5.12-1172c30/Linux64/obs_pfs/w.2022.17/bin/ingestPfsImages.py", line 3, in <module>
    PfsIngestTask.parseAndRun()
  File "/work/stack_INFRA-290/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_tasks/18.1.0/python/lsst/pipe/tasks/ingest.py", line 416, in parseAndRun
    task.run(args)
  File "/work/stack_INFRA-290/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_tasks/18.1.0/python/lsst/pipe/tasks/ingest.py", line 542, in run
    context = self.register.openRegistry(root, create=args.create, dryrun=args.dryrun)
  File "/work/stack_INFRA-290/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_tasks/18.1.0/python/lsst/pipe/tasks/ingest.py", line 298, in openRegistry
    context = RegistryContext(registryName, self.createTable, create, self.config.permissions)
  File "/work/stack_INFRA-290/stack/miniconda3-4.5.12-1172c30/Linux64/obs_pfs/w.2022.17/python/lsst/obs/pfs/ingest.py", line 53, in __init__
    os.chmod(registryName, permissions)
PermissionError: [Errno 1] Operation not permitted: '/work/drp/registry.sqlite3'

So we need to fix the ingest. Basing off w.2022.17 to be consistent with what's in use at Hilo.

(lsst-scipipe-1172c30) price@pfsa-usr01-gb:~ $ ingestPfsImages.py /work/drp "/data/raw/2022-02-22/sps/PFSA07*.fits" --mode=copy --pfsConfigDir /work/drp/pfsDesign -c clobber=True

Success! But the visits are not in the psql registry... and it's ended up in the sqlite registry!
Oh! There's a separate script for ingesting into a psql registry. It has a different RegisterTask with slightly different SQL commands for creating the tables, and such.

(lsst-scipipe-1172c30) price@pfsa-usr01-gb:~/pfs/obs_pfs[tickets/PIPE2D-1058] $ ingestPfsImagesPgsql.py /work/drp "/data/raw/2022-02-22/sps/PFSA07*.fits" --mode=copy --pfsConfigDir /work/drp/pfsDesign -c clobber=True
CameraMapper INFO: Loading exposure registry from /work/drp/registry.pgsql
CameraMapper INFO: Loading calib registry from /work/drp/CALIB/calibRegistry.sqlite3
ingestPfs INFO: /data/raw/2022-02-22/sps/PFSA07206314.fits --<copy>--> /work/drp/2022-02-22/PFSA072063m1.fits

  File "/home/price/pfs/obs_pfs/bin/ingestPfsImagesPgsql.py", line 3, in <module>
    PfsPgsqlIngestTask.parseAndRun()
  File "/work/stack_INFRA-290/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_tasks/18.1.0/python/lsst/pipe/tasks/ingest.py", line 416, in parseAndRun
    task.run(args)
  File "/work/stack_INFRA-290/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_tasks/18.1.0/python/lsst/pipe/tasks/ingest.py", line 553, in run
    self.register.addRow(registry, info, dryrun=args.dryrun, create=args.create)
  File "/work/stack_INFRA-290/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_tasks/18.1.0/python/lsst/pipe/tasks/ingest.py", line 368, in addRow
    conn.cursor().execute(sql, values)
psycopg2.IntegrityError: duplicate key value violates unique constraint "raw_pkey"
DETAIL:  Key (id)=(2) already exists.

We need to reset the auto increment start point.

registry_gen2=# select pg_get_serial_sequence('raw', 'id');
 pg_get_serial_sequence 
------------------------
 public.raw_id_seq
(1 row)

registry_gen2=# select max(id) from raw;
  max  
-------
 94666
(1 row)

registry_gen2=# ALTER SEQUENCE raw_id_seq RESTART WITH 94667;
ALTER SEQUENCE

(lsst-scipipe-1172c30) price@pfsa-usr01-gb:~/pfs/obs_pfs[tickets/PIPE2D-1058] $ ingestPfsImagesPgsql.py /work/drp "/data/raw/2022-02-22/sps/PFSA07*.fits" --mode=copy --pfsConfigDir /work/drp/pfsDesign -c clobber=True
(lsst-scipipe-1172c30) price@pfsa-usr01-gb:~/pfs/obs_pfs[tickets/PIPE2D-1058] $ reduceExposure.py /work/drp --calib /work/drp/CALIB-20220630 --rerun price/pipe2d-1058 --id visit=72016 arm=r -c isr.doFlat=False 

That works!
Generated at Sat Feb 10 16:05:21 JST 2024 using Jira 8.3.4#803005-sha1:1f96e09b3c60279a408a2ae47be3c745f571388b.