[INSTRM-643] ccd r1 fee crash Created: 04/Apr/19  Updated: 24/Dec/20

Status: In Progress
Project: Instrument control development
Component/s: ics_ccdActor
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: fmadec Assignee: cloomis
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Blocks
is blocked by INSTRM-626 give up using ccd expose bias|dark co... Done

 Description   

ccd r1 crashed during bias exposures:

2019-04-04 10:40:19 ccd_r1 f text="command failed: KeyError('fee',) in fee() at python/ccdActor/main.py:47"

Then tried disconnect/connect:

2019-04-04 10:40:48 cmdIn=ccd_r1 connect controller=fee
2019-04-04 10:40:49 ccd_r1 w controllers="ccd"
2019-04-04 10:40:49 ccd_r1 w text="failed to connect controller fee"
2019-04-04 10:40:49 ccd_r1 f text="command failed: RuntimeError('failed to arm for readout)',) in pyFPGA.FPGA.setClockLevels() at fpga/pyFPGA.pyx:127"

Then tried power off/on and connect, but:

2019-04-04 10:45:14 ccd_r1 w controllers="ccd"
2019-04-04 10:45:14 ccd_r1 w text="failed to connect controller fee"
2019-04-04 10:45:14 ccd_r1 f text="command failed: RuntimeError('failed to arm for readout)',) in pyFPGA.FPGA.setClockLevels() at fpga/pyFPGA.pyx:127"

 Comments   
Comment by fmadec [ 04/Apr/19 ]

A restart of the ccd actor solved the issue.

Comment by cloomis [ 04/Apr/19 ]

Slightly confused by the output; will look more carefully at the logs.

Comment by cloomis [ 05/Apr/19 ]

This is an amusing cluster of problems....

Problem 0: the exposure after 15317 "hung". Specifically, the background thread started to run the integration and readout but never started the readout (hence no exposureState=readout). This command/code path (running exposures in a background thread) is not used in production (or will not be after INSTRM-626). But the thread holds a reference to an instance of the FEE object, which caused the later problems. No, I do not know why readout did not start, but it feels thread-y. There is a small chance that INSTRM-640 is implicated, but since the transition to integration requires that the INSTRM-640 header be complete, I don't see how right now.

Problem 1: the exposure did not finish, so the exposure was cleared and the fee reconnected. But it couldn't be reconnected, because the exposure thread was not really dead (Python cannot kill stuck threads) and still held a reference to the old fee object, which had the hardware connection open.

The two problems explain why there are two "concurrent" and very different complaints about the FEE.

What to do? I'm inclined to instrument a few things better and make fee disconnection more violent/effective, but that's it. At first these look like very serious problems, but it is only because of the background thread which is going away. So understanding exactly why that thread hung up is not worth working on. If there is an underlying problem, it will be much easier to see and fix without the thread.
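
One way a more forceful disconnect could look, as a rough sketch only: close the hardware link explicitly on disconnect instead of relying on the last reference going away, so a stale reference held by a stuck thread is left with a dead handle. The names `FakeDevice`, `Fee`, `disconnect` and `send` are hypothetical stand-ins, not the actual ccdActor or FEE classes.

```python
class FakeDevice:
    """Hypothetical stand-in for the hardware connection."""

    def __init__(self):
        self.is_open = True

    def close(self):
        self.is_open = False


class Fee:
    """Hypothetical controller wrapper; not the real ccdActor FEE class."""

    def __init__(self, device):
        self._device = device

    def disconnect(self):
        # Close the link now, rather than waiting for garbage collection,
        # which never happens while a stuck thread still references us.
        device, self._device = self._device, None
        if device is not None:
            device.close()

    def send(self, cmd):
        # Any stale holder of this object fails loudly after disconnect().
        if self._device is None:
            raise RuntimeError("fee is disconnected")
        return f"sent {cmd}"
```

With this shape, a new controller object can reopen the hardware even while the old object is still referenced somewhere, because the old object no longer owns an open connection.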

Comment by arnaud.lefur [ 17/Dec/19 ]

It happened today, but I had an issue with spsaitActor, so it's probably my fault. I have to check the logs...

Comment by cloomis [ 24/Dec/20 ]

Bump, maybe. Tx, arnaud.lefur. The 2020-12-17 problems might well have happened because the file writing took forever, so the per-exposure thread never closed out.

At the very least, new exposures should be rejected if an old exposure thread exists. What to do to better kill a stuck thread is still not clear.
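
A minimal sketch of the "reject new exposures" part, assuming the actor keeps a handle on the per-exposure thread. `ExposureRunner` and `start` are hypothetical names, not existing ccdActor code, and this does nothing about killing a stuck thread.

```python
import threading


class ExposureRunner:
    """Hypothetical guard: at most one live exposure thread at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._thread = None

    def start(self, target):
        # Check and start under one lock so two commands cannot both pass.
        with self._lock:
            if self._thread is not None and self._thread.is_alive():
                raise RuntimeError("previous exposure thread is still running")
            self._thread = threading.Thread(target=target, daemon=True)
            self._thread.start()
            return self._thread
```

The command handler would catch the `RuntimeError` and fail the new exposure command with a clear message, rather than letting it collide with the old thread's hold on the fee.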

Generated at Sat Feb 10 16:27:09 JST 2024 using Jira 8.3.4#803005-sha1:1f96e09b3c60279a408a2ae47be3c745f571388b.