[INSTRM-878] actors cannot connect to tron Created: 14/Jan/20  Updated: 19/Feb/20  Resolved: 13/Feb/20

Status: Done
Project: Instrument control development
Component/s: ics_xcuActor
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Normal
Reporter: fmadec Assignee: arnaud.lefur
Resolution: Done Votes: 0
Labels: SM1, SPS
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to INSTRM-879 stsBuffer get never cleared and grow ... Done
Story Points: 1
Sprint: SM1PD-2020 A, SM1PD-2020 B

 Description   

tron crashed during the night (3:54 am UTC on Jan 13, 2020).

After a restart of tron, some actors (xcu b1 and r1, enu_sm1) were not able to reconnect, but others were:

 2020-01-13T15:46:28.182 sent hub status
2020-01-13T15:46:28.186 hub i version="need_to_read_git_version"
2020-01-13T15:46:28.189 hub i actors=hub,keys,msg,gen2,spsait,seqno,meb,sps
2020-01-13T15:46:28.190 hub i commanders=client.v1,client.v3,client.v4,client.v5,client.v8,client.v13,client.

 

The error message is:

 2020-01-13 15:46:13.997Z cmdr 30 CmdrConnection.py:79 in doStart
2020-01-13 15:46:14.001Z cmdr 30 CmdrConnection.py:108 CmdrConnection failed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionRefusedError'>: Connection was refused by other side: 111: Connection refused.

 

I rebooted bee_r1 but got the same error.

Arnaud tried to start enu_sm1 on another machine (shell-ics) and it worked, but it does not work on rack5, even though rough is working on that same machine.

One strange thing is that the command oneCmd.py hub status does not work on rack5 or bee-r1.
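For reference, here is a minimal sketch (standard library only) of a probe that could be run from rack5 or bee-r1 to check raw TCP reachability of the hub, independently of oneCmd.py. The port value is a placeholder, not something stated in this ticket; the real tron command port would have to be substituted.

import socket
import sys

TRON_HOST = "tron"    # hub hostname as used in this ticket
TRON_PORT = 6093      # PLACEHOLDER: substitute the real tron command port

def probe(host, port, timeout=5.0):
    try:
        # Resolve once so the report shows which address was actually tried.
        addr = socket.gethostbyname(host)
        sock = socket.create_connection((addr, port), timeout=timeout)
        sock.close()
        print("OK: connected to %s (%s) port %d" % (host, addr, port))
        return 0
    except OSError as exc:
        print("FAILED: %s port %d: %s" % (host, port, exc))
        return 1

if __name__ == "__main__":
    sys.exit(probe(TRON_HOST, TRON_PORT))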

So we do not have the current status of the two cameras, which are supposed to be pumping down (the r1 gatevalve certainly closed during the reboot).

 



 Comments   
Comment by cloomis [ 14/Jan/20 ]

Reconnections failed about half the time, regardless of which host they were made from, because I had mistakenly added a second IP address for tron. That should now be fixed, so I will drop the Blocker level. The crash may or may not be related; I'll look.

Comment by hassan [ 15/Jan/20 ]

arnaud.lefur will try to reproduce the problem at LAM. Ticket will be handed back to Craig on his return.

Comment by cloomis [ 16/Jan/20 ]

Wait, is there any point in working on this? There were two IP addresses for the hostname 'tron'. Actors (or oneCmd.py) that resolved to the bad one could not connect, so there was a 50/50 chance of any connection working.
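For the record, a quick way to spot this kind of double DNS entry from any affected host is to list every address the resolver returns for the name. This is only a sketch using the standard library; the hostname 'tron' is the one from the ticket, everything else is illustrative.

import socket

def resolve_all(host):
    # Collect every distinct address the resolver returns for this name.
    infos = socket.getaddrinfo(host, None)
    return sorted({info[4][0] for info in infos})

addresses = resolve_all("tron")
print("tron resolves to:", addresses)
if len(addresses) > 1:
    print("WARNING: multiple addresses; connections will fail intermittently")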

Or are you trying to make tron crash? That could be interesting, I suppose. But given the history surely there are more important issues to work on, no?

Comment by arnaud.lefur [ 16/Jan/20 ]

Yes, I was thinking about the crash itself; I'm nearly sure that INSTRM-879 was the root cause.

But I can run some tests in parallel; it won't take much time.

Comment by arnaud.lefur [ 21/Jan/20 ]

With INSTRM-879, the logfile grew very fast. Basically, data with a sample rate of 15 s over 30 days gives 30 * 24 * 60 * 4 = 172800 samples.

With this bug, the log size would reach about n * (n+1)/2 ≈ 1.5e10 rows

for only one data field, and we have about 30. I bet that was the issue!
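A back-of-the-envelope check of those numbers, assuming (per INSTRM-879) that the buffer is never cleared, so each new sample re-sends everything accumulated so far:

n = 30 * 24 * 60 * 4            # 30 days at one sample every 15 s
print(n)                        # 172800 samples
rows_one_field = n * (n + 1) // 2
print(rows_one_field)           # 14930006400, i.e. ~1.5e10 rows for one field
print(30 * rows_one_field)      # ~4.5e11 rows for the ~30 fields mentioned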

Comment by cloomis [ 13/Feb/20 ]

There are three problems, and we fixed two:

  • The tron server probably crashed due to a huge flood of traffic from the alerts actor. The tron crash itself has not been fixed.
  • The alertsActor was flooding tron. That bug has been fixed.
  • The DNS configuration at Subaru had two addresses for the tron host, so connections failed about half the time. This has been fixed.

We could open a new ticket asking tron to be more resilient to denial-of-service attacks.
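As an illustration of what "more resilient" could mean in practice, here is a minimal sketch of a per-connection message-rate guard written with Twisted (the framework the ICS client code already uses, e.g. CmdrConnection.py). The class names, threshold, and drop-the-noisy-peer policy are assumptions made for this sketch, not tron's actual design.

from twisted.internet import reactor
from twisted.internet.protocol import Factory
from twisted.protocols.basic import LineReceiver

class RateLimitedLineProtocol(LineReceiver):
    # Illustrative threshold: how many lines per second one peer may send.
    MAX_LINES_PER_SECOND = 200

    def connectionMade(self):
        self._window_start = reactor.seconds()
        self._lines_in_window = 0

    def lineReceived(self, line):
        now = reactor.seconds()
        if now - self._window_start >= 1.0:
            self._window_start = now
            self._lines_in_window = 0
        self._lines_in_window += 1
        if self._lines_in_window > self.MAX_LINES_PER_SECOND:
            # Drop a flooding peer (e.g. a runaway alerts actor) instead of
            # letting it take the whole hub down.
            self.transport.loseConnection()
            return
        self.handleLine(line)

    def handleLine(self, line):
        # Normal hub dispatch would happen here.
        pass

class RateLimitedFactory(Factory):
    protocol = RateLimitedLineProtocol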
