- Type: Task
- Status: Open
- Priority: Normal
- Resolution: Unresolved
- Component/s: tron_tron
- Labels: None
We had a tron hub shutdown on 2023-07-26, triggered by the /data NFS mount going away. Recovery was pretty clean, but I don't think we can count on that.
- restarted the tron hub on pfs@mhs-ics, with {setup tron_tron; tron restart}
- found that all actors reconnected automatically, verified with {oneCmd.py hub actors} and some manual inspection. I am actually surprised this worked so well.
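For reference, that check could be scripted instead of eyeballed, reusing the same actors= parsing as the loop below. A minimal sketch; the "expected" actor list is a hypothetical placeholder, not the real PFS set:
{code}
# Compare the hub's reported actor list against an expected set.
# "expected" is a placeholder; fill in the real actor names.
expected="actor1 actor2 actor3"
running=$(oneCmd.py hub actors | sed -n '/actors=/{s/^.*,msg,//; s/,/ /g; p}' | tr '\n' ' ')
for a in $expected; do
    case " $running " in
        *" $a "*) : ;;               # actor is connected
        *) echo "MISSING: $a" ;;     # actor did not reconnect
    esac
done
{code}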
The /data NFS bounce also caused all the actors to stop writing their logs. Not sure how we would notice this in general (one idea is sketched after the loop below). That was fixed by telling all the non-hub actors to reload their configuration, which restarts their logging:
{code}
for a in $(oneCmd.py hub actors | sed -n '/actors=/{s/^.*,msg,//; s/,/ /g; p}'); do
    echo "==== $a"
    oneCmd.py $a reloadConfiguration
    sleep 1
    oneCmd.py $a status
done
{code}
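On the noticing question: one low-tech option is to alarm when an actor's log file goes quiet. A minimal sketch, assuming GNU stat and a hypothetical per-actor log layout under /data/logs/actors; the path and threshold are guesses, not the real configuration:
{code}
#!/bin/sh
# Flag actor log files that have not been written recently.
LOGROOT=/data/logs/actors    # hypothetical layout; adjust to the real one
MAXAGE=600                   # seconds of silence before we complain
now=$(date +%s)
for f in "$LOGROOT"/*/*.log; do
    [ -e "$f" ] || continue
    age=$(( now - $(stat -c %Y "$f") ))    # GNU stat; BSD would be stat -f %m
    [ "$age" -gt "$MAXAGE" ] && echo "STALE: $f (${age}s since last write)"
done
{code}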
Will attach more tickets to this.
relates to:
- INSTRM-2046 Add tron watchdog (Open; rough sketch below)
- INSTRM-322 Switch actor logging to rsyslog (Open)
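As a starting point for the INSTRM-2046 watchdog, a crude sketch: poll the hub with the same oneCmd.py call used above and restart it on failure. The assumption that oneCmd.py exits nonzero when the hub is unreachable is unverified:
{code}
#!/bin/sh
# Crude tron hub watchdog; run periodically (e.g. from cron) as pfs@mhs-ics.
# Assumes oneCmd.py exits nonzero when the hub does not answer, and that the
# EUPS environment providing 'setup' is available in this shell.
if ! oneCmd.py hub actors >/dev/null 2>&1; then
    logger -t tron-watchdog "tron hub not responding; restarting"
    setup tron_tron && tron restart
fi
{code}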