  1. DRP 2-D Pipeline
  2. PIPE2D-1058

Avoid/fix registry sqlite deadlocks

    Details

    • Type: Task
    • Status: Done
    • Priority: Normal
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Story Points:
      3
    • Sprint:
      preEngRun07Sep

      Description

      During the June engineering run we had a few (2? 3?) sqlite "deadlocks" on the Hilo machines, where butlers running in notebooks could not open the registry. (Apologies for not having actual logs, etc.; I did not think clearly enough to grab anything useful.) The "fix" was to kill processes with butlers/ingest tasks until the problem cleared. I'll point out a few things:

      • The drpActor was ingesting frames as they came in. We were almost always running windowed reads, so a pair of frames arrived every 20-30 s or so. That should have been the only process running INSERTs (i.e. write transactions).
      • Each notebook with butlers seemed to have tens of open sqlite "connections", per lsof, and there were many such notebooks. Obviously, all of those were running SELECTs (i.e. read transactions).
      • Yes yes, the registry is saved on an NFS filesystem.
      • There are a few LSST tickets referring to sqlite deadlocks.
      • The Web has all manner of lore, but many agree that sqlite can deadlock.
      • The obs_pfs ingest method wraps the INSERT in a context manager. I do not see any such context managers in the daf_persistence registry code. No idea what that means.
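
      For reference, a minimal sketch of what wrapping an INSERT in a context manager does with Python's standard sqlite3 module. This is not the obs_pfs code: the table, columns and the insert_frame helper are made up for illustration. The connection's context manager commits if the block succeeds and rolls back if it raises; it does not close the connection.

```python
import sqlite3


def insert_frame(conn, visit, arm):
    """Insert one row; commit on success, roll back if the INSERT raises.

    Hypothetical helper for illustration only, not the obs_pfs ingest method.
    """
    with conn:  # sqlite3 connection context manager: COMMIT or ROLLBACK on exit
        conn.execute("INSERT INTO raw (visit, arm) VALUES (?, ?)", (visit, arm))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw (visit INTEGER, arm TEXT, UNIQUE (visit, arm))")

insert_frame(conn, 1, "r")
try:
    insert_frame(conn, 1, "r")  # violates UNIQUE, so the transaction is rolled back
except sqlite3.IntegrityError:
    pass

rows = conn.execute("SELECT visit, arm FROM raw").fetchall()
still_in_transaction = conn.in_transaction
```

      Without the context manager (or an explicit commit/rollback), a failed or forgotten write would leave the implicit transaction open, and with it the write lock.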

      One specific recommendation from someone believable is to open write transactions with "begin immediate" or "begin exclusive". It is not clear to me whether that would make things more or less robust in our case. Could certainly try.
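
      A rough sketch of the "begin immediate" idea, again with the standard sqlite3 module and a temporary local file rather than our registry. The immediate_transaction helper and the table are hypothetical. The point is that the writer takes the write lock when the transaction starts, so a competing writer fails (or waits, given a non-zero timeout) at BEGIN instead of part-way through its transaction.

```python
import os
import sqlite3
import tempfile
from contextlib import contextmanager


@contextmanager
def immediate_transaction(conn):
    """Take the write lock up front; commit on success, roll back on error.

    Hypothetical helper; assumes conn was opened with isolation_level=None.
    """
    conn.execute("BEGIN IMMEDIATE")
    try:
        yield conn
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")


path = os.path.join(tempfile.mkdtemp(), "registry.sqlite3")
# isolation_level=None stops the sqlite3 module from issuing its own BEGINs;
# timeout=0 makes a lock conflict fail at once instead of retrying.
writer = sqlite3.connect(path, timeout=0, isolation_level=None)
other = sqlite3.connect(path, timeout=0, isolation_level=None)
writer.execute("CREATE TABLE raw (visit INTEGER)")

blocked = False
with immediate_transaction(writer):
    writer.execute("INSERT INTO raw VALUES (1)")
    try:
        other.execute("BEGIN IMMEDIATE")  # write lock is already held by writer
    except sqlite3.OperationalError:
        blocked = True

# Once the first writer has committed, the second one can proceed.
with immediate_transaction(other):
    other.execute("INSERT INTO raw VALUES (2)")

count = writer.execute("SELECT COUNT(*) FROM raw").fetchone()[0]
```

      In real use the timeout would presumably be non-zero so that the ingest waits for the lock rather than failing. Whether any of this behaves sensibly over NFS is exactly the open question.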

      That may also depend on whether we are using WAL or rollback-journal mode for sqlite. We should not be using WAL since it is known not to work over NFS, but any connection can change the journal mode for all users.
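
      One way to check which mode a database is in, sketched against a temporary local file (the path and table are placeholders, not our registry). PRAGMA journal_mode reports the current mode, and because WAL mode is recorded in the database file itself, a switch made by one connection is seen by connections opened afterwards.

```python
import os
import sqlite3
import tempfile


def journal_mode(conn):
    """Return the journal mode SQLite reports for this connection's database."""
    return conn.execute("PRAGMA journal_mode").fetchone()[0].lower()


path = os.path.join(tempfile.mkdtemp(), "registry.sqlite3")

first = sqlite3.connect(path)
first.execute("CREATE TABLE raw (visit INTEGER)")
default_mode = journal_mode(first)  # mode of a freshly created database

# Any connection may switch the mode; the setting is stored in the file.
first.execute("PRAGMA journal_mode=WAL").fetchone()

second = sqlite3.connect(path)
mode_seen_by_second = journal_mode(second)

second.close()
first.close()
```

      Running just the journal_mode() query against the real registry would tell us which case we are in.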

      See https://www.sqlite.org/lang_transaction.html and https://www.sqlite.org/wal.html, among others.

      One question is whether Gen3 will have addressed this for us by the November run. Would we still run sqlite, or could we switch to postgres?

              People

              • Assignee:
                price
              • Reporter:
                cloomis
              • Reviewers:
                hassan