[INSTRM-1180] ingest using shell call to ingestPfsImages Created: 13/Feb/21  Updated: 16/Feb/21  Resolved: 16/Feb/21

Status: Done
Project: Instrument control development
Component/s: ics_drpActor
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Normal
Reporter: cloomis Assignee: cloomis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to PIPE2D-715 Remove updates to raw_visit table fro... Done
Story Points: 1

 Description   

We currently ingest using a Task. Switch to a shell call to ingestPfsImages.

Hoping to make new file processing faster and less prone to registry failures.
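For reference, a minimal sketch of what the switch could look like, assuming a Gen2-style ingestPfsImages.py command-line entry point; the repo path, flags, and wrapper name here are illustrative, not the actual drpActor code:

    import subprocess

    def ingest_file(repo, filepath):
        """Ingest one raw file by shelling out to ingestPfsImages.py instead of
        running the ingest Task in-process (hypothetical wrapper)."""
        cmd = ['ingestPfsImages.py', repo,
               '--mode', 'link',   # assumption: link rather than copy the raw file
               filepath]
        # check=True raises CalledProcessError on a non-zero exit, so the actor
        # can fail the command cleanly instead of silently losing the file.
        subprocess.run(cmd, check=True, capture_output=True, text=True)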



 Comments   
Comment by cloomis [ 14/Feb/21 ]

Doesn't really help. The delay depends on the size of the registry, so we do have to figure this out. It is not obvious that the copy-update-rename hack fixed by the DM ticket is the problem, as much as I hate and want to blame that. It does seem to be at least partially tied up with a sequential scan of the db right before closing. More to come.
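One way to confirm the sequential-scan suspicion (a sketch, assuming the usual Gen2 registry.sqlite3 file and a raw_visit table; the path and the query are placeholders):

    import sqlite3

    # Assumed registry location; point this at the actual repo root.
    conn = sqlite3.connect('/path/to/repo/registry.sqlite3')

    # EXPLAIN QUERY PLAN reports "SCAN TABLE ..." when sqlite falls back to a
    # full sequential scan instead of using an index.
    for row in conn.execute("EXPLAIN QUERY PLAN SELECT * FROM raw_visit WHERE visit = 12345"):
        print(row)
    conn.close()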

Comment by cloomis [ 14/Feb/21 ]

Good news: ingest really does take a long time due to a fixable obs_pfs problem updating the raw_visit table. PIPE2D-715

Bad news: the drpActor does have associated sequencing problems which still need to be addressed. But I think addressing those will either take care of INSTRM-1075 or at least make things less confusing.

Comment by arnaud.lefur [ 14/Feb/21 ]

Great, it used to be ~atomic, so that makes sense.
Yeah, basically the command thread is the bottleneck; as a result, ingest + detrend are done sequentially.
I thought it was safer to avoid dealing with threads, especially with a DB behind the scenes.
We still want to make sure that detrend is fired once, after all files are ingested, right?
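One way to pin down that "fire once, after all files" behaviour (hypothetical names; the real drpActor bookkeeping may differ) is to track which cameras are expected for a visit and trigger detrend only when the last one arrives:

    from collections import defaultdict

    class VisitTracker:
        """Fire detrend exactly once per visit, after every expected file is ingested."""

        def __init__(self, expected_per_visit):
            self.expected = expected_per_visit   # e.g. {visit: {'b1', 'r1'}}, from knowing which cameras are live
            self.ingested = defaultdict(set)

        def file_ingested(self, visit, camera):
            self.ingested[visit].add(camera)
            # True only once, when the last expected camera arrives.
            return self.ingested[visit] == self.expected[visit]

    # usage sketch:
    #   if tracker.file_ingested(visit, camera):
    #       fire_detrend(visit)   # hypothetical hook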

Comment by cloomis [ 15/Feb/21 ]

This is still a mess. I am running obs_pfs cpl on top of pfs_pipe2d w.2021.06b, which disables the raw_visit updates that cause ingest to slow down to tens of seconds (everywhere, not just on shell2-ics). There may still be gains to be made on ingest.

I re-ordered the internal processing of ingest&detrend in order to guarantee that only one operation is running at a time.
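For reference, the "only one operation at a time" guarantee can be as simple as a single lock shared by both steps; a sketch, not the actual actor code, with ingest()/detrend() as hypothetical wrappers around the external calls:

    import threading

    _drp_lock = threading.Lock()   # shared by every ingest/detrend caller

    def ingest_then_detrend(visit, filepath):
        # Whichever thread picks up the new file, only one ingest or detrend
        # runs at any moment; other callers block here until it finishes.
        with _drp_lock:
            ingest(filepath)   # hypothetical wrapper around ingestPfsImages.py
            detrend(visit)     # hypothetical wrapper around detrend.py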

Added isr.doDefect=False to get around a temporary detrend bug. See PIPE2D-715.

Even though shell-2 is under-provisioned (4-core, 4GB) I do not see bad system contention (per dstat 1), but detrend is still slow (12-30s).

There are two oddities which need to be figured out before I will trust or suspect anything else. One of the loggers (not the main actorcore/python logger, nor the LSST logger) multiplies itself, adding one line for each file processed. And when you quit the drpActor, there is a correspondingly increasing number of complaints from sqlite and from the multiplying logger. But according to the main thread we only ever get four threads (not many), and the thread ids printed by the sqlite complaint at the end are all the same:

2021-02-14 18:40:49.316Z actor            20 Actor.py:550 reactor dead, cleaning up...
actor INFO: reactor dead, cleaning up...
[... the line "actor INFO: reactor dead, cleaning up..." repeats 13 more times ...]
Exception ignored in: <function SqlRegistry.__del__ at 0x7fd4fea88400>
Traceback (most recent call last):
  File "/software/drp-5.2/stack/miniconda3-4.5.12-1172c30/Linux64/daf_persistence/18.1.0/python/lsst/daf/persistence/registries.py", line 323, in __del__
    self.conn.close()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140551400564480 and this is thread id 140553341876032.
[... the identical SqlRegistry.__del__ traceback and sqlite3.ProgrammingError repeat nine more times, always with the same two thread ids ...]
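The sqlite complaint itself is reproducible in isolation: a default sqlite3 connection refuses to be used, even just closed, from any thread other than the one that created it, which is what SqlRegistry.__del__ ends up doing when the registry is garbage-collected on a different thread. A minimal reproduction:

    import sqlite3
    import threading

    conn = None

    def make_conn():
        global conn
        conn = sqlite3.connect(':memory:')   # connection created in the worker thread

    t = threading.Thread(target=make_conn)
    t.start()
    t.join()

    try:
        conn.close()   # ...but closed from the main thread
    except sqlite3.ProgrammingError as e:
        print(e)   # "SQLite objects created in a thread can only be used in that same thread..."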

Comment by cloomis [ 16/Feb/21 ]

A cute thing to notice in that last log: the "actor" logger is an actorcore thing, along with the "cmds" logger. Both always use the formatting you see in that top line: the date, logger name, etc. But after the detrend task is called the first time, every output line on the actor logger then also generates the multiplying lines as formatted by the LSST logging system: the logger name, the level, the message.
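That pattern is consistent with something attaching an extra handler to the same named "actor" logger every time the detrend task runs. A toy illustration of the effect with plain Python logging (not the actual actorcore/LSST code):

    import logging
    import sys

    actor = logging.getLogger('actor')
    actor.setLevel(logging.INFO)
    actor.addHandler(logging.StreamHandler(sys.stdout))   # the actorcore-style handler

    def run_detrend_task():
        # If each task invocation attaches another handler to the same named
        # logger (e.g. by re-initialising a logging bridge), messages multiply.
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter('%(name)s %(levelname)s: %(message)s'))
        actor.addHandler(handler)

    actor.info('reactor dead, cleaning up...')   # printed once
    run_detrend_task()
    run_detrend_task()
    actor.info('reactor dead, cleaning up...')   # now printed three times, twice in the LSST-style format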

Also, detrend was being called on each camera, but without specifying the arm: the second call processed both arms. Fixed.

FWIW, with the ingest raw_visit problem fixed, ingest+detrend take ~20-25s, which is fast enough to avoid piling up when taking, say, a sequence of biases. So I propose:

  • I'll close and merge this ticket, which includes some hack-y crap to simplify and track the problems.
  • Arnaud starts from that and works on the detrend ticket, which might fix this logger stealing-and-multiplying problem.
  • If that does not work by the day SuNSS goes back on the sky, move detrend out into an external process, like ingest now is. Sure, we eat some overhead, but at least we can stay ahead of the incoming data and not eat ourselves.
  • Separately, have the listener gather all files for a given visit and process them all at once. Yes, that requires knowing which cameras are live, but that is doable.

Frankly, grouping all exposures of a visit and calling detrend.py externally is probably as good as we can do in any case: the detrend.py overhead will be amortized over the 8 exposures.
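A sketch of what that per-visit external call could look like; the command-line arguments, data-id keys, and rerun layout are illustrative, not a confirmed detrend.py interface:

    import subprocess

    def detrend_visit(repo, rerun, visit, arms):
        """Run detrend.py once for all of a visit's exposures, so the startup
        overhead is shared by the ~8 files instead of paid per file."""
        cmd = ['detrend.py', repo,
               '--rerun', rerun,
               '--id', 'visit={}'.format(visit), 'arm={}'.format('^'.join(arms)),
               '--config', 'isr.doDefect=False']   # the temporary workaround noted above
        subprocess.run(cmd, check=True)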

And finally, I will request that shell2-ics be beefed up some more. This configuration is silly.

Comment by arnaud.lefur [ 16/Feb/21 ]

"Also, detrend was being called on each camera, but without specifying the arm: the second call processed both arms. Fixed."
That's right, but I did it on purpose. The idea is that since I'm calling detrend without the arm, the first call processes all matching ingested files.
But in fact I check for the file's existence before actually calling detrend, so when the second call kicks in the file is already there, hence it just generates the filepath keyword for gingaActor.
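In other words, something like the following, with hypothetical names just to pin down the behaviour being described:

    import os

    def maybe_detrend(calexp_path, visit):
        # If the detrended file already exists (produced by the earlier call,
        # which covered both arms), skip the work and only broadcast the path.
        if not os.path.exists(calexp_path):
            run_detrend(visit)                  # hypothetical detrend call
        generate_filepath_keyword(calexp_path)  # hypothetical: filepath keyword for gingaActor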

Agreed, I will try to fix INSTRM-1178 asap, and the external call will definitely be my safety net.

Comment by cloomis [ 16/Feb/21 ]

Merged at 76ee97a. Intended to be a sloppy mess.
