- Type: Task
- Status: Open
- Priority: Normal
- Resolution: Unresolved
- Component/s: tron_tron
- Labels: None
We had a tron hub shutdown on 2023-07-26, triggered by the /data NFS mount going away. Recovery was pretty clean, but I don't think we can count on that.
- restarted the tron hub on pfs@mhs-ics, with {setup tron_tron; tron restart}
- found that all actors reconnected automatically, verified with {oneCmd.py hub actors} and some manual inspection. I am actually surprised this worked so well.
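For reference, that check could be scripted instead of eyeballed, reusing the same actors= parsing as the loop below. A minimal sketch; the "expected" actor list is a hypothetical placeholder, not the real PFS set:
{code}
# Compare the hub's reported actor list against an expected set.
# "expected" is a placeholder; fill in the real actor names.
expected="actor1 actor2 actor3"
running=$(oneCmd.py hub actors | sed -n '/actors=/{s/^.*,msg,//; s/,/ /g; p}' | tr '\n' ' ')
for a in $expected; do
    case " $running " in
        *" $a "*) : ;;               # actor is connected
        *) echo "MISSING: $a" ;;     # actor did not reconnect
    esac
done
{code}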
The /data NFS bounce also caused all the actors to stop writing their logs. Not sure how we would notice this in general (one idea is sketched after the loop below). That was fixed by telling all the non-hub actors to reload their configuration, which restarts their logging:
{code}
for a in $(oneCmd.py hub actors | sed -n '/actors=/{s/^.*,msg,//; s/,/ /g; p}'); do
    echo "==== $a"
    oneCmd.py $a reloadConfiguration
    sleep 1
    oneCmd.py $a status
done
{code}
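On the noticing question: one low-tech option is to alarm when an actor's log file goes quiet. A minimal sketch, assuming GNU stat and a hypothetical per-actor log layout under /data/logs/actors; the path and threshold are guesses, not the real configuration:
{code}
#!/bin/sh
# Flag actor log files that have not been written recently.
LOGROOT=/data/logs/actors    # hypothetical layout; adjust to the real one
MAXAGE=600                   # seconds of silence before we complain
now=$(date +%s)
for f in "$LOGROOT"/*/*.log; do
    [ -e "$f" ] || continue
    age=$(( now - $(stat -c %Y "$f") ))    # GNU stat; BSD would be stat -f %m
    [ "$age" -gt "$MAXAGE" ] && echo "STALE: $f (${age}s since last write)"
done
{code}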
Will attach more tickets to this.
relates to:
- INSTRM-2046 Add tron watchdog (Open; rough sketch below)
- INSTRM-322 Switch actor logging to rsyslog (Open)
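As a starting point for the INSTRM-2046 watchdog, a crude sketch: poll the hub with the same oneCmd.py call used above and restart it on failure. The assumption that oneCmd.py exits nonzero when the hub is unreachable is unverified:
{code}
#!/bin/sh
# Crude tron hub watchdog; run periodically (e.g. from cron) as pfs@mhs-ics.
# Assumes oneCmd.py exits nonzero when the hub does not answer, and that the
# EUPS environment providing 'setup' is available in this shell.
if ! oneCmd.py hub actors >/dev/null 2>&1; then
    logger -t tron-watchdog "tron hub not responding; restarting"
    setup tron_tron && tron restart
fi
{code}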