<!-- 
RSS generated by JIRA (8.3.4#803005-sha1:1f96e09b3c60279a408a2ae47be3c745f571388b) at Sat Feb 10 16:02:09 JST 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>PFS-JIRA</title>
    <link>https://pfspipe.ipmu.jp/jira</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>8.3.4</version>
        <build-number>803005</build-number>
        <build-date>13-09-2019</build-date>
    </build-info>


<item>
            <title>[PIPE2D-1058] Avoid/fix registry sqlite deadlocks</title>
                <link>https://pfspipe.ipmu.jp/jira/browse/PIPE2D-1058</link>
                <project id="10002" key="PIPE2D">DRP 2-D Pipeline</project>
                    <description>&lt;p&gt;During the June engineering run we had a few (2? 3?) sqlite &quot;deadlock&quot;s on the Hilo machines, where butlers running in notebooks could not open the registry (apologies for not having actual logs, etc &#8211; I did not think clearly enough to grab anything useful). The &quot;fix&quot; was to kill processes with butlers/ingest tasks until the problem cleared. I&apos;ll point out a few things:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;the drpActor was ingesting frames as they came in. We were almost always running windowed reads, so pairs of frames every 20-30s or so. That should have been the only process running INSERTs (i.e. write transactions).&lt;/li&gt;
	&lt;li&gt;each notebook with butlers seemed to have tens of open sqlite &quot;connections&quot;, per lsof. And there were many such notebooks. Obviously, all SELECTs (read transactions).&lt;/li&gt;
	&lt;li&gt;Yes yes, the registry is saved on an NFS filesystem.&lt;/li&gt;
	&lt;li&gt;there are a few LSST tickets referring to sqlite deadlocks.&lt;/li&gt;
	&lt;li&gt;the Web has all manner of lore, but many agree that sqlite can deadlock.&lt;/li&gt;
	&lt;li&gt;the obs_pfs ingest method wraps the INSERT in a context manager. I do not see those in the daf_persistence registry code. No idea what that means.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;One specific recommendation from someone believable is to open write transactions with &quot;begin immediate&quot; or &quot;begin exclusive&quot;. It is not clear to me whether that would make things more or less robust in our case. Could certainly try.&lt;/p&gt;

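&lt;p&gt;A minimal sketch of the &quot;begin immediate&quot; suggestion, assuming Python&apos;s standard &lt;tt&gt;sqlite3&lt;/tt&gt; module. The table &lt;tt&gt;demo_raw&lt;/tt&gt; and the helpers &lt;tt&gt;open_registry&lt;/tt&gt; and &lt;tt&gt;insert_visit&lt;/tt&gt; are hypothetical names for illustration, not taken from obs_pfs or daf_persistence.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-python&quot;&gt;
```python
import sqlite3


def open_registry(path):
    # isolation_level=None puts the connection in autocommit mode, so the
    # sqlite3 module does not issue an implicit BEGIN of its own.
    # timeout=30 asks SQLite to wait up to 30 seconds on a competing lock.
    return sqlite3.connect(path, timeout=30, isolation_level=None)


def insert_visit(conn, visit):
    # BEGIN IMMEDIATE takes the write lock when the transaction starts,
    # rather than when the first write statement runs.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("INSERT INTO demo_raw (visit) VALUES (?)", (visit,))
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")


# Hypothetical usage against an in-memory database.
conn = open_registry(":memory:")
conn.execute("CREATE TABLE demo_raw (visit INTEGER)")
insert_visit(conn, 12345)
print(conn.execute("SELECT visit FROM demo_raw").fetchall())
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Whether this changes the behaviour over NFS has not been tested here; it only shows where the write lock would be requested.&lt;/p&gt;
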
&lt;p&gt;That may also depend on whether we are using WAL or journal mode for sqlite. We should &lt;b&gt;not&lt;/b&gt; be using WAL since it is known not to work over NFS, but any connection can change that for all users....&lt;/p&gt;

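&lt;p&gt;One way to check which mode a registry file is in, sketched with Python&apos;s standard &lt;tt&gt;sqlite3&lt;/tt&gt; module. The helper name &lt;tt&gt;journal_mode&lt;/tt&gt; is hypothetical, and note that connecting to a path that does not exist creates an empty database there.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-python&quot;&gt;
```python
import sqlite3


def journal_mode(path):
    """Return the journal mode SQLite reports for the database at path."""
    conn = sqlite3.connect(path)
    try:
        # With no argument, PRAGMA journal_mode only queries the current
        # mode (for example "delete" or "wal"); it does not change it.
        return conn.execute("PRAGMA journal_mode").fetchone()[0]
    finally:
        conn.close()


# Hypothetical usage: an in-memory database, so no file is touched.
print(journal_mode(":memory:"))
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
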
&lt;p&gt;Umm, &lt;a href=&quot;https://www.sqlite.org/lang_transaction.html&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://www.sqlite.org/lang_transaction.html&lt;/a&gt; and &lt;a href=&quot;https://www.sqlite.org/wal.html&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://www.sqlite.org/wal.html&lt;/a&gt; among others.&lt;/p&gt;

&lt;p&gt;One question is whether Gen3 will have addressed this for us by the November run. Would we still run sqlite, or could we switch to postgres?&lt;/p&gt;</description>
                <environment></environment>
        <key id="22867">PIPE2D-1058</key>
            <summary>Avoid/fix registry sqlite deadlocks</summary>
                <type id="3" iconUrl="https://pfspipe.ipmu.jp/jira/secure/viewavatar?size=xsmall&amp;avatarId=10518&amp;avatarType=issuetype">Task</type>
                                            <priority id="10000" iconUrl="https://pfspipe.ipmu.jp/jira/images/icons/priorities/medium.svg">Normal</priority>
                        <status id="10002" iconUrl="https://pfspipe.ipmu.jp/jira/images/icons/statuses/generic.png" description="The issue is resolved, reviewed, and merged">Done</status>
                    <statusCategory id="3" key="done" colorName="green"/>
                                    <resolution id="10000">Done</resolution>
                                        <assignee username="price">price</assignee>
                                    <reporter username="cloomis">cloomis</reporter>
                        <labels>
                            <label>EngRun</label>
                    </labels>
                <created>Thu, 7 Jul 2022 05:29:15 +0000</created>
                <updated>Wed, 26 Jul 2023 11:42:51 +0000</updated>
                            <resolved>Wed, 21 Sep 2022 17:35:41 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                <comments>
                            <comment id="31085" author="rhl" created="Thu, 7 Jul 2022 08:12:41 +0000"  >&lt;p&gt;Would switching to postgres be a solution?  It isn&apos;t clear that the problem is gen2/gen3, and both can run against sqlite or postgres.  As a reference point, we&apos;re switching from sqlite to postgres (within gen3) on Cerro Pach&#243;n to avoid similar problems.&lt;/p&gt;</comment>
                            <comment id="31087" author="michitaro" created="Fri, 8 Jul 2022 05:04:38 +0000"  >&lt;p&gt;Minimal code to reproduce a deadlock is attached:&lt;/p&gt;

&lt;p&gt; &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://pfspipe.ipmu.jp/jira/secure/attachment/15220/15220_deadlock.py&quot; title=&quot;deadlock.py attached to PIPE2D-1058&quot;&gt;deadlock.py&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://pfspipe.ipmu.jp/jira/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; &lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-bash&quot;&gt;
bash-3.2$ python ./deadlock.py
Process Process-1:
Traceback (most recent call last):
  File &quot;/Users/michitaro/opt/anaconda3/lib/python3.8/multiprocessing/process.py&quot;, line 315, in _bootstrap
    self.run()
  File &quot;/Users/michitaro/opt/anaconda3/lib/python3.8/multiprocessing/process.py&quot;, line 108, in run
    self._target(*self._args, **self._kwargs)
  File &quot;/Users/michitaro/Desktop/deadlock.py&quot;, line 26, in run_transaction
    db.execute(&apos;insert into t values (0)&apos;)
sqlite3.OperationalError: database is locked
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="31253" author="price" created="Mon, 29 Aug 2022 17:54:54 +0000"  >&lt;p&gt;I don&apos;t know how a sqlite-based registry can possibly work if there are &quot;many&quot; notebooks holding locks, but I&apos;m not sure a simple connection is the same thing as a lock.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://pfspipe.ipmu.jp/jira/secure/ViewProfile.jspa?name=michitaro&quot; class=&quot;user-hover&quot; rel=&quot;michitaro&quot;&gt;michitaro&lt;/a&gt;&apos;s demonstration seems to suggest that the problem is too many concurrent writes, perhaps because the ingest process is taking longer than the time between exposures? You say that there are &lt;em&gt;pairs&lt;/em&gt; of exposures; does this produce pairs of ingest operations, which would have the same effect as the deadlock demonstration? If any of this is the problem, then I think the calling pattern needs to change from a direct trigger to regular polling so that there&apos;s only ever one ingest process in flight at any time.&lt;/p&gt;</comment>
                            <comment id="31255" author="cloomis" created="Mon, 29 Aug 2022 18:38:08 +0000"  >&lt;p&gt;One ingest task is launched per visit, not one per cam. I&apos;m pretty sure that is the only writer.&lt;/p&gt;

&lt;p&gt;What is involved in switching to postgresql?&lt;/p&gt;</comment>
                            <comment id="31256" author="price" created="Mon, 29 Aug 2022 18:50:39 +0000"  >&lt;p&gt;Are there any protections against having more than one ingest process running at once? How easy would that be to implement?&lt;/p&gt;

&lt;p&gt;Switching to Postgresql involves creating a new database, setting up a &lt;tt&gt;registry.pgsql&lt;/tt&gt; YAML file in the repo (with entries &lt;tt&gt;host&lt;/tt&gt;, &lt;tt&gt;port&lt;/tt&gt;, &lt;tt&gt;database&lt;/tt&gt;, &lt;tt&gt;user&lt;/tt&gt; and optional &lt;tt&gt;password&lt;/tt&gt;), then re-ingesting everything.&lt;/p&gt;</comment>
                            <comment id="31258" author="rhl" created="Mon, 29 Aug 2022 19:05:16 +0000"  >&lt;p&gt;We&apos;re doing this switch on Cerro Pach&#243;n. I &lt;em&gt;think&lt;/em&gt; that KT has a script to do the migration.&lt;/p&gt;</comment>
                            <comment id="31260" author="cloomis" created="Mon, 29 Aug 2022 20:51:50 +0000"  >&lt;p&gt;If the mechanism is already developed and tested, I propose that we try using it. Basically, investigating sqlite3 locking issues feels pretty open-ended to me, and we know we have issues with NFS in any case. &lt;/p&gt;</comment>
                            <comment id="31273" author="price" created="Thu, 1 Sep 2022 19:01:17 +0000"  >&lt;p&gt;Discussed with &lt;a href=&quot;https://pfspipe.ipmu.jp/jira/secure/ViewProfile.jspa?name=rhl&quot; class=&quot;user-hover&quot; rel=&quot;rhl&quot;&gt;rhl&lt;/a&gt; and &lt;a href=&quot;https://pfspipe.ipmu.jp/jira/secure/ViewProfile.jspa?name=cloomis&quot; class=&quot;user-hover&quot; rel=&quot;cloomis&quot;&gt;cloomis&lt;/a&gt;. We believe that the Gen2 registry works with PostgreSQL (we understand the HSC summit processing uses this), so we just need to transfer the contents from SQLite to a new PostgreSQL database (we suspect KTL&apos;s script used for LSST@Cerro Pach&#243;n is particular to the Gen3 middleware).&lt;/p&gt;</comment>
                            <comment id="31322" author="price" created="Tue, 13 Sep 2022 15:26:44 +0000"  >&lt;p&gt;I have verified migration of the sqlite registry to postgresql using the following command:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
sqlite3 registry.sqlite3 .dump | sed -e &apos;s|^PRAGMA.*$||&apos; -e &apos;s|integer primary key autoincrement|serial primary key|g&apos; -e &apos;s|double|double precision|g&apos; -e &apos;s|^.*sqlite.*$||&apos; -e &apos;s|pfsDesignId int|pfsDesignId bigint|g&apos; | tee registry.sql | psql -h localhost -U pfs pfs_gen2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
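&lt;p&gt;For readers less familiar with sed, the rewrite rules in the command above can be sketched in Python roughly as follows. This is an illustration only, not the code that was run; the names &lt;tt&gt;RULES&lt;/tt&gt; and &lt;tt&gt;translate_line&lt;/tt&gt; are hypothetical, and the sample line is made up.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-python&quot;&gt;
```python
import re

# The five sed expressions from the command above, in the same order.
RULES = [
    (r"^PRAGMA.*$", ""),  # drop sqlite PRAGMA statements
    (r"integer primary key autoincrement", "serial primary key"),
    (r"double", "double precision"),
    (r"^.*sqlite.*$", ""),  # drop lines that mention sqlite internals
    (r"pfsDesignId int", "pfsDesignId bigint"),
]


def translate_line(line):
    """Apply each rule in turn to one line of the sqlite dump."""
    for pattern, replacement in RULES:
        line = re.sub(pattern, replacement, line)
    return line


# Hypothetical sample line, not copied from the real registry schema.
sample = "CREATE TABLE demo (id integer primary key autoincrement, ra double, pfsDesignId int);"
print(translate_line(sample))
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;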
&lt;p&gt;I still need to verify operation of the pipeline using the postgresql registry. I am awaiting provision of postgresql databases on tiger and the summit.&lt;/p&gt;</comment>
                            <comment id="31326" author="price" created="Tue, 13 Sep 2022 20:17:45 +0000"  >&lt;p&gt;I&#8217;ve got the postgresql registry working at Hilo. In order to use it, you need to have a &lt;tt&gt;~/.pgpass&lt;/tt&gt; entry. The alternative is to put the database password in plaintext in the configuration file in the repo, which I think is a bad idea since the database user has admin privs.&lt;br/&gt;
Then there&#8217;s a new script that needs to be used for ingesting images. That means that we&#8217;d need to update the stack (after we merge and cut a new release).&lt;/p&gt;</comment>
                            <comment id="31327" author="cloomis" created="Tue, 13 Sep 2022 20:22:20 +0000"  >&lt;p&gt;+many on using .pgpass vs.  repo: I think we all already do.&lt;/p&gt;

&lt;p&gt;And good to know about the ingest &#8211; same calling conventions, but just a new version/tag? Is it on this ticket for now?&lt;/p&gt;</comment>
                            <comment id="31328" author="price" created="Tue, 13 Sep 2022 20:24:17 +0000"  >&lt;p&gt;The ingest updates are on the ticket branch of obs_pfs. You&apos;ll need to call &lt;tt&gt;ingestPfsImagesPgsql.py&lt;/tt&gt; instead of &lt;tt&gt;ingestPfsImages.py&lt;/tt&gt;.&lt;/p&gt;</comment>
                            <comment id="31366" author="price" created="Thu, 15 Sep 2022 14:55:06 +0000"  >&lt;p&gt;Asking for review of the code changes. These will need to be put into a release that&apos;s deployed and used for the ingest. Everyone who uses it will need to have the database details in their &lt;tt&gt;~/.pgpass&lt;/tt&gt;.&lt;/p&gt;</comment>
                            <comment id="31368" author="hassan" created="Thu, 15 Sep 2022 16:32:50 +0000"  >&lt;p&gt;The code changes look fine. A rebase is needed, but that has already been understood.&lt;/p&gt;</comment>
                            <comment id="31370" author="price" created="Thu, 15 Sep 2022 19:24:43 +0000"  >&lt;p&gt;Merged to master.&lt;/p&gt;

&lt;p&gt;Leaving this ticket open until we can do the transition at Hilo.&lt;/p&gt;</comment>
                            <comment id="31423" author="price" created="Wed, 21 Sep 2022 17:35:41 +0000"  >&lt;p&gt;The migration is complete from my side. &lt;tt&gt;/work/drp&lt;/tt&gt; on &lt;tt&gt;pfsa-usr01-gb&lt;/tt&gt; is configured to use the postgresql registry. All future ingests into that repo should use the &lt;tt&gt;ingestPfsImagesPgsql.py&lt;/tt&gt; script; otherwise the images won&#8217;t be registered correctly.&lt;/p&gt;

&lt;p&gt;All users of &lt;tt&gt;/work/drp&lt;/tt&gt; will need to ensure that they have the correct entry in their &lt;tt&gt;~/.pgpass&lt;/tt&gt; file. The password is available from me or &lt;a href=&quot;https://pfspipe.ipmu.jp/jira/secure/ViewProfile.jspa?name=kiyoto.yabe&quot; class=&quot;user-hover&quot; rel=&quot;kiyoto.yabe&quot;&gt;Kiyoto Yabe&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To roll back to the old system, rename the &lt;tt&gt;registry.pgsql&lt;/tt&gt; file to &lt;tt&gt;_registry.pgsql&lt;/tt&gt;, and rename the &lt;tt&gt;registry.sqlite3.OLD&lt;/tt&gt; file to &lt;tt&gt;registry.sqlite3&lt;/tt&gt;.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10003">
                    <name>Relates</name>
                                            <outwardlinks description="relates to">
                                        <issuelink>
            <issuekey id="23755">PIPE2D-1272</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="22962">PIPE2D-1083</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="22942">INSTRM-1699</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="15220" name="deadlock.py" size="888" author="michitaro" created="Fri, 8 Jul 2022 04:59:06 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                <customfield id="customfield_10500" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                            <customfield id="customfield_10010" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>0|02qpio:i</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10100" key="com.atlassian.jira.plugin.system.customfieldtypes:userpicker">
                        <customfieldname>Reviewers</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>hassan</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10005" key="com.pyxis.greenhopper.jira:gh-sprint">
                        <customfieldname>Sprint</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue id="152">preEngRun07Sep</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10002" key="com.atlassian.jira.plugin.system.customfieldtypes:float">
                        <customfieldname>Story Points</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>3.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                        </customfields>
    </item>
</channel>
</rss>