[PIPE2D-1058] Avoid/fix registry sqlite deadlocks Created: 07/Jul/22 Updated: 26/Jul/23 Resolved: 22/Sep/22 |
|
| Status: | Done |
| Project: | DRP 2-D Pipeline |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Normal |
| Reporter: | cloomis | Assignee: | price |
| Resolution: | Done | Votes: | 0 |
| Labels: | EngRun | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Story Points: | 3 | ||||||||||||||||
| Sprint: | preEngRun07Sep | ||||||||||||||||
| Reviewers: | hassan | ||||||||||||||||
| Description |
|
During the June engineering run we had a few (2? 3?) sqlite "deadlock"s on the Hilo machines, where butlers running in notebooks could not open the registry (apologies for not having actual logs, etc – I did not think clearly enough to grab anything useful). The "fix" was to kill processes with butlers/ingest tasks until the problem cleared. I'll point out a few things:
One specific recommendation from someone believable is to open write transactions with "begin immediate" or "begin exclusive". It is not clear to me whether that would make things more or less robust in our case. Could certainly try. That may also depend on whether we are using WAL or journal mode for sqlite. We should not be using WAL since it is known not to work over NFS, but any connection can change that for all users.... Umm, https://www.sqlite.org/lang_transaction.html and https://www.sqlite.org/wal.html among others. One question is whether Gen3 will have addressed this for us by the November run. Would we still run sqlite, or could we switch to postgres? |
| Comments |
| Comment by rhl [ 07/Jul/22 ] |
|
Would switching to postgres be a solution? It isn't clear that the problem is gen2/gen3, and both can run against sqlite or postgres. As a reference point, we're switching from sqlite to postgres (within gen3) on Cerro Pachón to avoid similar problems. |
| Comment by michitaro [ 08/Jul/22 ] |
|
A minimum code to produce a deadlock is here. bash-3.2$ python ./deadlock.py Process Process-1: Traceback (most recent call last): File "/Users/michitaro/opt/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/Users/michitaro/opt/anaconda3/lib/python3.8/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/Users/michitaro/Desktop/deadlock.py", line 26, in run_transaction db.execute('insert into t values (0)') sqlite3.OperationalError: database is locked |
| Comment by price [ 30/Aug/22 ] |
|
I don't know how a sqlite-based registry can possibly work if there are "many" notebooks holding locks, but I'm not sure a simple connection is the same thing as a lock. michitaro's demonstration seems to suggest that the problem is too many concurrent writes, perhaps because the ingest process is taking longer than the time between exposures? You say that there are pairs of exposures; does this produce pairs of ingest operations, which would have the same effect as the deadlock demonstration? If any of this is the problem, then I think the calling pattern needs to change from a direct trigger to regular polling so that there's only ever one ingest process in flight at any time. |
| Comment by cloomis [ 30/Aug/22 ] |
|
One ingest task is launched per visit, not one per cam. I'm pretty sure that is the only writer. What is involved in switching to postgresql? |
| Comment by price [ 30/Aug/22 ] |
|
Are there any protections against having more than one ingest process running at once? How easy would that be to implement? Switching to Postgresql involves creating a new database, setting up a registry.pgsql YAML file in the repo (with entries host, port, database, user and optional password), then re-ingesting everything. |
| Comment by rhl [ 30/Aug/22 ] |
|
We're doing this switch on Cerro Pachón. I think that KT has a script to do the migration |
| Comment by cloomis [ 30/Aug/22 ] |
|
If the mechanism is already developed and tested, I propose that we try using it. Basically, investigating sqlite3 locking issues feels pretty open-ended to me, and we know we have issues with NFS in any case. |
| Comment by price [ 02/Sep/22 ] |
|
Discussed with rhl and cloomis. We believe that the Gen2 registry works with PostgresQL (we understand the HSC summit processing uses this), so we just need to transfer the contents from SQLite to a new PostgresQL database (we suspect KTL's script used for LSST@Cerro Pachon is particular to the Gen3 middleware). |
| Comment by price [ 14/Sep/22 ] |
|
I have verified migration of the sqlite registry to postgresql using the following command: sqlite3 registry.sqlite3 .dump | sed -e 's|^PRAGMA.*$||' -e 's|integer primary key autoincrement|serial primary key|g' -e 's|double|double precision|g' -e 's|^.*sqlite.*$||' -e 's|pfsDesignId int|pfsDesignId bigint|g' | tee registry.sql | psql -h localhost -U pfs pfs_gen2 I still need to verify operation of the pipeline using the postgresql registry. I am awaiting provision of postgresql databases on tiger and the summit. |
| Comment by price [ 14/Sep/22 ] |
|
I’ve got the postgresql registry working at Hilo. In order to use it, you need to have a ~/.pgpass entry. The alternative is to put the database password in plaintext in the configuration file in the repo, which I think is a bad idea since the database user has admin privs. |
| Comment by cloomis [ 14/Sep/22 ] |
|
+many on using .pgpass vs. repo: I think we all already do. And good to know about the ingest – same calling conventions, but just a new version/tag? Is it on this ticket for now? |
| Comment by price [ 14/Sep/22 ] |
|
The ingest updates are on the ticket branch of obs_pfs. You'll need to call ingestPfsImagesPgsql.py instead of ingestPfsImages.py. |
| Comment by price [ 15/Sep/22 ] |
|
Asking for review of the code changes. These will need to be put into a release that's deployed and used for the ingest. Everyone who uses it will need to have the database details in their ~/.pgpass. |
| Comment by hassan [ 16/Sep/22 ] |
|
The code changes look fine. A rebase is needed, but that is already been understood. |
| Comment by price [ 16/Sep/22 ] |
|
Merged to master. Leaving this ticket open until we can do the transition at Hilo. |
| Comment by price [ 22/Sep/22 ] |
|
The migration is complete from my side. /work/drp on pfsa-usr01-gb is configured to use the postgresql registry. All future ingests into that repo should use the ingestPfsImagesPgsql.py script, or it won’t be registered correctly. All users of /work/drp will need to ensure that they have the correct entry in their ~/.pgpass file. The password is available from me or Kiyoto Yabe. To roll back to the old system, rename the registry.pgsql file to _registry.pgsql, and rename the registry.sqlite3.OLD file to registry.sqlite3. |