Type: Task
Status: Open
Priority: Normal
Resolution: Unresolved
Component/s: ics_drpActor
Labels: (none)
Story Points: 2
Sprint: PreRun21Mar
Twice in the last few days, the drpActor running at the summit has "hung up". The actor was unresponsive to `ping` commands, suggesting that the main user thread was hung. In both cases, there were three live `drpActor` processes. In one case, killing the newest one freed the actor to continue. In the second case, that didn't work and the actor had to be restarted.
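For diagnosing future occurrences, something like the following could list the live drpActor processes before deciding which one to kill. This is just a sketch, not part of the actor; the helper name and the use of `pgrep -af` are my own assumptions:

```python
import subprocess

def list_actor_processes(pattern="drpActor"):
    """Return [(pid, cmdline), ...] for processes whose command line
    matches pattern; empty list if none are running."""
    # pgrep exits nonzero when nothing matches, so don't use check=True
    out = subprocess.run(["pgrep", "-af", pattern],
                         capture_output=True, text=True)
    procs = []
    for line in out.stdout.splitlines():
        pid, _, cmd = line.partition(" ")
        procs.append((int(pid), cmd))
    return procs
```

With three matches returned, one could then compare start times (e.g. via `ps -o lstart= -p PID`) to identify the newest process, as was done in the first incident.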
For the second one, the actor was running reductions on visit=121963; only the n1 and n2 detrends had not finished, and the detrends started finishing at 2025-03-22 17:11:30.019Z.
For the first, the log showed that the actor had a failure on one of the reductions (b2):
2025-03-22 08:09:49.183Z lsst.ctrl.mpexec.mpGraphExecutor 20 mpGraphExecutor.py:650 Executed 11 quanta successfully, 1 failed and 0 remain out of total 12 quanta.
2025-03-22 08:09:49.183Z lsst.ctrl.mpexec.mpGraphExecutor 40 mpGraphExecutor.py:666 Failed jobs:
2025-03-22 08:09:49.184Z lsst.ctrl.mpexec.mpGraphExecutor 40 mpGraphExecutor.py:669 - FAILED: <TaskDef(lsst.obs.pfs.isrTask.PfsIsrTask, label=isr) dataId={instrument: 'PFS', arm: 'b', spectrograph: 2, visit: 121509, ...}>
2025-03-22 08:09:49.197Z actor 20 engine.py:175 New pfsConfig available: /data/raw/2025-03-21/pfsConfig/pfsConfig-0x5b4744a63e7757a2-121510.fits
2025-03-22 08:09:49.297Z cmds 20 Actor.py:524 new cmd: ping
2025-03-22 08:09:49.299Z cmds 20 CommandLink.py:122 > 2 43 : text='Present and (probably) well'
But I happen to know that the ping was sent to the actor at 08:06:39, yet it was not processed until 08:09:49: the actor really did hang during the reductions.
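Since a hang like this shows up as a silent stretch in the actor log, a scan for timestamp gaps could flag it after the fact. A minimal sketch; the function name, the one-minute threshold, and the sample lines are mine (the sample is illustrative, not the real log), based on the timestamp format in the excerpt above:

```python
from datetime import datetime, timedelta

def find_log_gaps(lines, threshold=timedelta(minutes=1)):
    """Yield (start, end) pairs where consecutive log timestamps
    differ by more than threshold, i.e. the actor went quiet."""
    fmt = "%Y-%m-%d %H:%M:%S.%fZ"  # matches the log excerpt above
    prev = None
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            ts = datetime.strptime(" ".join(fields[:2]), fmt)
        except ValueError:
            continue  # not a timestamped line
        if prev is not None and ts - prev > threshold:
            yield prev, ts
        prev = ts

# Illustrative lines: a ~3-minute silence shows up as one gap.
log = [
    "2025-03-22 08:06:39.000Z cmds 20 Actor.py:524 new cmd: ping",
    "2025-03-22 08:09:49.183Z lsst.ctrl.mpexec.mpGraphExecutor 20 ...",
    "2025-03-22 08:09:49.297Z cmds 20 Actor.py:524 new cmd: ping",
]
gaps = list(find_log_gaps(log))  # one gap: 08:06:39 -> 08:09:49
```

This would not catch a hang in real time (nothing is logged while the actor is stuck), but it makes the stall window easy to find when reviewing a log afterwards.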
relates to: INSTRM-2467 Improve drpActor Reliability (MKO and Hilo) (Open)