[INSTRM-2500] drpActor at summit hanging during reductions Created: 23/Mar/25  Updated: 15/Apr/25

Status: Open
Project: Instrument control development
Component/s: ics_drpActor
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Normal
Reporter: cloomis Assignee: arnaud.lefur
Resolution: Unresolved Votes: 0
Labels: EngRun
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to INSTRM-2467 Improve drpActor Reliability (MKO and... Open
Story Points: 2
Sprint: PreRun21Mar, PostRun21Mar

 Description   

Twice in the last few days, the drpActor running at the summit has "hung up". The actor was unresponsive to `ping` commands, suggesting that the main user thread was hung. In both cases, there were three live `drpActor` processes. In one case, killing the newest one freed the actor to continue. In the second case, that didn't work and the actor had to be restarted.

For the second one, the actor was running reductions on visit=121963. Only the n1 and n2 detrends did not finish; the detrends started finishing at 2025-03-22 17:11:30.019Z.

For the first, the log showed that the actor had a failure on one of the reductions (b2):

2025-03-22 08:09:49.183Z lsst.ctrl.mpexec.mpGraphExecutor 20 mpGraphExecutor.py:650 Executed 11 quanta successfully, 1 failed and 0 remain out of total 12 quanta.
2025-03-22 08:09:49.183Z lsst.ctrl.mpexec.mpGraphExecutor 40 mpGraphExecutor.py:666 Failed jobs:
2025-03-22 08:09:49.184Z lsst.ctrl.mpexec.mpGraphExecutor 40 mpGraphExecutor.py:669   - FAILED: <TaskDef(lsst.obs.pfs.isrTask.PfsIsrTask, label=isr) dataId={instrument: 'PFS', arm: 'b', spectrograph: 2, visit: 121509, ...}>
2025-03-22 08:09:49.197Z actor            20 engine.py:175 New pfsConfig available: /data/raw/2025-03-21/pfsConfig/pfsConfig-0x5b4744a63e7757a2-121510.fits
2025-03-22 08:09:49.297Z cmds             20 Actor.py:524 new cmd: ping
2025-03-22 08:09:49.299Z cmds             20 CommandLink.py:122 > 2 43 : text='Present and (probably) well'

But I happen to know that the ping was sent to the actor at 08:06:39, about three minutes before the log shows it arriving at 08:09:49: the actor really did hang during the reductions.
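
The gap between when the ping was sent and when the actor logged it can be checked with a few lines of Python. This is only a sketch: `parse_log_timestamp` and `ping_delay` are hypothetical helper names, not part of the drpActor, and it assumes the 08:06:39 send time is on the same UTC date and clock as the log timestamps.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the log lines quoted above, e.g. "2025-03-22 08:09:49.297Z".
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S.%fZ"


def parse_log_timestamp(line: str) -> datetime:
    """Parse the leading timestamp (first two whitespace-separated fields) of a log line."""
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", LOG_TS_FORMAT)


def ping_delay(sent_at: str, log_line: str) -> timedelta:
    """Return how long after `sent_at` (HH:MM:SS) the command was logged.

    Assumption: the command was sent on the same date as the log line.
    """
    logged = parse_log_timestamp(log_line)
    sent = datetime.strptime(f"{logged.date()} {sent_at}", "%Y-%m-%d %H:%M:%S")
    return logged - sent


line = "2025-03-22 08:09:49.297Z cmds             20 Actor.py:524 new cmd: ping"
delay = ping_delay("08:06:39", line)
print(f"ping was logged {delay.total_seconds():.1f} s after it was sent")
```

For the log excerpt above this gives a delay of a little over three minutes, consistent with the command loop being blocked until the reductions finished.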


Generated at Fri Apr 18 22:12:55 JST 2025 using Jira 8.3.4#803005-sha1:1f96e09b3c60279a408a2ae47be3c745f571388b.