Type: Task
Status: Open
Priority: Normal
Resolution: Unresolved
Component/s: ics_drpActor
Labels: (none)
Story Points: 2
Sprint: PreRun21Mar
Twice in the last few days, the drpActor running at the summit has "hung up". The actor was unresponsive to `ping` commands, suggesting that the main user thread was hung. In both cases, there were three live `drpActor` processes. In one case, killing the newest one freed the actor to continue. In the second case, that didn't work and the actor had to be restarted.
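For diagnosing future occurrences, something like the following could list the live drpActor processes before deciding which one to kill. This is just a sketch, not part of the actor; the helper name and the use of `pgrep -af` are my own assumptions:

```python
import subprocess

def list_actor_processes(pattern="drpActor"):
    """Return [(pid, cmdline), ...] for processes whose command line
    matches pattern; empty list if none are running."""
    # pgrep exits nonzero when nothing matches, so don't use check=True
    out = subprocess.run(["pgrep", "-af", pattern],
                         capture_output=True, text=True)
    procs = []
    for line in out.stdout.splitlines():
        pid, _, cmd = line.partition(" ")
        procs.append((int(pid), cmd))
    return procs
```

With three matches returned, one could then compare start times (e.g. via `ps -o lstart= -p PID`) to identify the newest process, as was done in the first incident.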
For the second one, the actor was running reductions on visit=121963; only the n1 and n2 detrends had not finished, and the detrends started finishing at 2025-03-22 17:11:30.019Z.
For the first, the log showed that the actor had a failure on one of the reductions (b2):
2025-03-22 08:09:49.183Z lsst.ctrl.mpexec.mpGraphExecutor 20 mpGraphExecutor.py:650 Executed 11 quanta successfully, 1 failed and 0 remain out of total 12 quanta.
2025-03-22 08:09:49.183Z lsst.ctrl.mpexec.mpGraphExecutor 40 mpGraphExecutor.py:666 Failed jobs:
2025-03-22 08:09:49.184Z lsst.ctrl.mpexec.mpGraphExecutor 40 mpGraphExecutor.py:669 - FAILED: <TaskDef(lsst.obs.pfs.isrTask.PfsIsrTask, label=isr) dataId={instrument: 'PFS', arm: 'b', spectrograph: 2, visit: 121509, ...}>
2025-03-22 08:09:49.197Z actor 20 engine.py:175 New pfsConfig available: /data/raw/2025-03-21/pfsConfig/pfsConfig-0x5b4744a63e7757a2-121510.fits
2025-03-22 08:09:49.297Z cmds 20 Actor.py:524 new cmd: ping
2025-03-22 08:09:49.299Z cmds 20 CommandLink.py:122 > 2 43 : text='Present and (probably) well'
But I happen to know that the ping was sent to the actor at 08:06:39, yet it was not processed until 08:09:49: the actor really did hang during the reductions.
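Since a hang like this shows up as a silent stretch in the actor log, a scan for timestamp gaps could flag it after the fact. A minimal sketch; the function name, the one-minute threshold, and the sample lines are mine (the sample is illustrative, not the real log), based on the timestamp format in the excerpt above:

```python
from datetime import datetime, timedelta

def find_log_gaps(lines, threshold=timedelta(minutes=1)):
    """Yield (start, end) pairs where consecutive log timestamps
    differ by more than threshold, i.e. the actor went quiet."""
    fmt = "%Y-%m-%d %H:%M:%S.%fZ"  # matches the log excerpt above
    prev = None
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            ts = datetime.strptime(" ".join(fields[:2]), fmt)
        except ValueError:
            continue  # not a timestamped line
        if prev is not None and ts - prev > threshold:
            yield prev, ts
        prev = ts

# Illustrative lines: a ~3-minute silence shows up as one gap.
log = [
    "2025-03-22 08:06:39.000Z cmds 20 Actor.py:524 new cmd: ping",
    "2025-03-22 08:09:49.183Z lsst.ctrl.mpexec.mpGraphExecutor 20 ...",
    "2025-03-22 08:09:49.297Z cmds 20 Actor.py:524 new cmd: ping",
]
gaps = list(find_log_gaps(log))  # one gap: 08:06:39 -> 08:09:49
```

This would not catch a hang in real time (nothing is logged while the actor is stuck), but it makes the stall window easy to find when reviewing a log afterwards.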
relates to: INSTRM-2467 Improve drpActor Reliability (MKO and Hilo) (Open)