[INSTRM-878] actors cannot connect to tron Created: 14/Jan/20 Updated: 19/Feb/20 Resolved: 13/Feb/20 |
|
| Status: | Done |
| Project: | Instrument control development |
| Component/s: | ics_xcuActor |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Normal |
| Reporter: | fmadec | Assignee: | arnaud.lefur |
| Resolution: | Done | Votes: | 0 |
| Labels: | SM1, SPS |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Story Points: | 1 |
| Sprint: | SM1PD-2020 A, SM1PD-2020 B |
| Description |
|
tron crashed during the night (3:54am UTC on Jan 13 2020). After a restart of tron, some actors (xcu b1 and r1, enu_sm1) were not able to reconnect to tron, but some were:

2020-01-13T15:46:28.182 sent hub status
2020-01-13T15:46:28.186 hub i version="need_to_read_git_version"
2020-01-13T15:46:28.189 hub i actors=hub,keys,msg,gen2,spsait,seqno,meb,sps
2020-01-13T15:46:28.190 hub i commanders=client.v1,client.v3,client.v4,client.v5,client.v8,client.v13,client.

The error message is:

2020-01-13 15:46:13.997Z cmdr 30 CmdrConnection.py:79 in doStart
2020-01-13 15:46:14.001Z cmdr 30 CmdrConnection.py:108 CmdrConnection failed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionRefusedError'>: Connection was refused by other side: 111: Connection refused.

I rebooted bee_r1 but got the same error. Arnaud tried to start enu_sm1 on another machine (shell-ics) and it worked, but it does not work on rack5, even though rough works on that same machine. One strange thing is that the command oneCmd.py hub status does not work on rack5 or on bee-r1, so we do not have the current status of the two cameras, which are supposed to be pumping down (the r1 gatevalve certainly closed during the reboot).
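For reference, the failure above is reported through Twisted's client-connection errback; the following is a minimal sketch that triggers the same ConnectionRefusedError against an unreachable hub. It is not the actorcore CmdrConnection code, and the host and port are placeholders.

```python
# Minimal sketch, not the actorcore CmdrConnection code: show where a
# refused TCP connection ("Connection was refused by other side: 111")
# surfaces in a Twisted client. Host and port below are placeholders.
from twisted.internet import reactor
from twisted.internet.protocol import ClientFactory, Protocol

class Probe(Protocol):
    def connectionMade(self):
        print("connected to the hub; closing probe connection")
        self.transport.loseConnection()

class ProbeFactory(ClientFactory):
    protocol = Probe

    def clientConnectionFailed(self, connector, reason):
        # A refused connection arrives here as ConnectionRefusedError.
        print("connection failed:", reason.getErrorMessage())
        reactor.stop()

    def clientConnectionLost(self, connector, reason):
        reactor.stop()

reactor.connectTCP("tron", 6093, ProbeFactory())  # placeholder host/port
reactor.run()
```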
|
| Comments |
| Comment by cloomis [ 14/Jan/20 ] |
|
Reconnections failed about half the time, regardless of which host the connections were made from, because I had mistakenly added a second IP address for tron. That should now be fixed, so I will drop the Blocker level. The crash may or may not be related; I'll look. |
| Comment by hassan [ 15/Jan/20 ] |
|
arnaud.lefur will try to reproduce the problem at LAM. Ticket will be handed back to Craig on his return. |
| Comment by cloomis [ 16/Jan/20 ] |
|
Wait – is there any point in working on this? There were two IP addresses for the hostname 'tron'. Actors (or oneCmd.py) which resolved to the bad one could not connect. A 50/50 chance of a connection working. Or are you trying to make tron crash? That could be interesting, I suppose. But given the history surely there are more important issues to work on, no? |
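A quick way to check for that kind of stale DNS entry is to resolve the hostname and try a TCP connection to each address it returns. The sketch below assumes a placeholder hub port.

```python
# Sketch: list every IPv4 address 'tron' resolves to and test whether each
# one accepts TCP connections. HUB_PORT is a placeholder, not taken from
# the ticket; use the port the actors actually connect to.
import socket

HOST = "tron"
HUB_PORT = 6093  # placeholder

addrs = sorted({info[4][0] for info in socket.getaddrinfo(HOST, HUB_PORT, socket.AF_INET)})
print(f"{HOST} resolves to: {addrs}")

for addr in addrs:
    try:
        with socket.create_connection((addr, HUB_PORT), timeout=2):
            print(f"{addr}:{HUB_PORT} accepts connections")
    except OSError as exc:
        # A stale A record typically shows up here as 'Connection refused',
        # matching the failures seen by the actors.
        print(f"{addr}:{HUB_PORT} failed: {exc}")
```

If more than one address comes back and only one of them accepts connections, any client that happens to resolve to the stale address will fail, which matches the roughly 50/50 behaviour described above.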
| Comment by arnaud.lefur [ 16/Jan/20 ] |
|
Yes, I was thinking about the crash itself; I'm nearly sure that… But I can run some tests in parallel, it won't take much time. |
| Comment by arnaud.lefur [ 21/Jan/20 ] |
|
With this bug, the log size would reach about n * (n+1)/2 = 1.5e10 rows with only one data item, and we have about 30. I bet that was the issue!
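For scale, under the reading that each new entry re-logs everything written so far (an interpretation of the comment above, not something stated in the ticket), the arithmetic looks like this:

```python
# Scale check for the suspected quadratic log growth: if n entries each
# re-log all previous ones, the total is n*(n+1)/2 rows.
# The "~30 data items" multiplier is an interpretation of the comment above.
import math

def total_rows(n):
    return n * (n + 1) // 2

target = 1.5e10
# Invert n*(n+1)/2 = target with the quadratic formula.
n = int((math.sqrt(8 * target + 1) - 1) / 2)

print(n, total_rows(n))    # ~1.7e5 entries already give ~1.5e10 rows
print(30 * total_rows(n))  # ~4.5e11 rows if ~30 data items grow the same way
```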
| Comment by cloomis [ 13/Feb/20 ] |
|
There are three problems, and we fixed two:
- the hostname 'tron' had a stale second IP address, so roughly half of all connection attempts were refused (fixed);
- a bug made the log grow roughly quadratically, which is almost certainly what crashed tron (fixed);
- tron is not resilient to clients that flood it.

We could open a new ticket asking for tron to be made more resilient to denial-of-service attacks. |
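One possible shape for that resilience, purely as an illustration and not a description of tron's actual implementation, is a per-connection rate limit that disconnects a client which floods the hub:

```python
# Illustrative only: a per-connection rate limiter. The protocol, names,
# limit, and port are assumptions, not tron's actual design.
import time
from twisted.internet import reactor
from twisted.internet.protocol import ServerFactory
from twisted.protocols.basic import LineReceiver

class ThrottledLineProtocol(LineReceiver):
    MAX_LINES_PER_SECOND = 100  # arbitrary example limit

    def connectionMade(self):
        self.window_start = time.monotonic()
        self.lines_in_window = 0

    def lineReceived(self, line):
        now = time.monotonic()
        if now - self.window_start >= 1.0:
            self.window_start = now
            self.lines_in_window = 0
        self.lines_in_window += 1
        if self.lines_in_window > self.MAX_LINES_PER_SECOND:
            # Drop a flooding client instead of letting it grow the hub's
            # queues and logs without bound.
            self.transport.loseConnection()
            return
        self.handleCommand(line)

    def handleCommand(self, line):
        pass  # real command dispatch would go here

factory = ServerFactory()
factory.protocol = ThrottledLineProtocol
reactor.listenTCP(9999, factory)  # placeholder port
reactor.run()
```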