[INSTRM-1993] Fix pixel drops at start of ramps Created: 15/Jun/23  Updated: 28/Jul/23  Resolved: 28/Jul/23

Status: Done
Project: Instrument control development
Component/s: hxhal, ics_hxActor
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: cloomis Assignee: cloomis
Resolution: Done Votes: 0
Labels: near-term
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to INSTRM-2013 reported readTimes incorrect Done
Sprint: Eng12July

 Description   

Ramps sporadically fail when the actor cannot get enough pixels from the DAQ to complete the ramp. There is no per-read framing, so this is only discovered at the end of the ramp, and the drops could have come at any time.

On n1 at Subaru, we are seeing quite a few of these: 19/516 from 2023-05 to now, ~4%. It turns out that most/all of the failed ramps are simply missing pixels at the beginning of the ramp, visible even in the RESET frame. This is probably good news: even if we cannot fully recover that frame, it is essentially never used.

I have not yet found anything stupid on my side, sadly, nor seen warnings of SAM FIFO overflows. There are various counters on the SAM and the ASIC which can be reported more enthusiastically; I'll work on adding those under this ticket.



 Comments   
Comment by cloomis [ 07/Jul/23 ]

Partly inspired by INSTRM-2013, I will completely validate the ASIC configuration which we read back at the start of each ramp. It can also be rewritten/reloaded, but that can take a while (10s of seconds in some cases) so we do not want to reconfigure unnecessarily. These values are not supposed to change, but I have seen the row window and number of outputs registers unexpectedly change, either of which would completely mess things up.
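
A minimal sketch of that readback validation, assuming the configuration can be held as a mapping of register name to value. The register names, the values, and the `find_config_mismatches` helper are hypothetical and are not taken from hxhal.

```python
def find_config_mismatches(written, readback):
    """Return {name: (written, readback)} for every register that differs.

    A register missing from the readback is reported with a readback of None.
    """
    mismatches = {}
    for name, expected in written.items():
        actual = readback.get(name)
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches


# Hypothetical register names and values, for illustration only.
written = {"row_window_start": 0, "row_window_stop": 4095, "n_outputs": 32}
readback = {"row_window_start": 0, "row_window_stop": 4095, "n_outputs": 4}

bad = find_config_mismatches(written, readback)
if bad:
    # Only a mismatch would justify the slow rewrite/reload of the ASIC.
    print("ASIC configuration mismatch:", bad)
```

Reporting the mismatching registers, rather than a single pass/fail flag, keeps the slow reload limited to the cases that might need it and leaves a record of which registers changed.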

Comment by cloomis [ 11/Jul/23 ]

About a third of the failures happened because the number of pixels expected for the ramp is calculated based on values read back from the ASIC, and sometimes those values are read back wrong. Simple fix for these particular ramps: calculate based on the values we wrote to the ASIC when configuring it.
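
The fix can be sketched as follows, assuming the expected count is simply frames times rows times columns and ignoring any reference or housekeeping pixels. `WrittenGeometry`, its field names, and `expected_ramp_pixels` are hypothetical names for illustration; the real calculation in the actor may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WrittenGeometry:
    """Geometry as written to the ASIC (all field names are hypothetical)."""
    row_start: int   # first row of the row window, inclusive
    row_stop: int    # last row of the row window, inclusive
    n_cols: int      # pixels per row


def expected_ramp_pixels(geom, n_reads, n_resets=1):
    """Pixels expected for a ramp of n_resets reset frames plus n_reads reads."""
    n_rows = geom.row_stop - geom.row_start + 1
    return (n_resets + n_reads) * n_rows * geom.n_cols


# Built from the values we wrote when configuring, never from a readback.
geom = WrittenGeometry(row_start=0, row_stop=4095, n_cols=4096)
total = expected_ramp_pixels(geom, n_reads=10)
```

Because the geometry object is created once at configuration time, a bad register read later in the night cannot change the number of pixels the ramp waits for.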

This seems moderately likely to come from errors on the SAM (ASIC register reads are done by programming some SAM registers; in one case the value returned for the register was the register's own address). I think we will find similar problems with some of the other failed ramps.

Comment by cloomis [ 14/Jul/23 ]

I have made most of the changes required to notice failed ramps quickly, and restructured the inner takeRamp method to support resuming a stopped ramp. But I need a failure at this point before I can actually implement that.
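
One way to notice a short ramp early, sketched under the assumption that the actor can poll a cumulative pixel count from the DAQ at the expected end of each read. That polling interface and the `first_short_read` helper are hypothetical; this is not the takeRamp implementation.

```python
def first_short_read(received_after_each_read, pixels_per_read):
    """Return the index of the first read whose cumulative count is short.

    received_after_each_read[i] is the total number of pixels received by
    the expected end of read i. Returns None if every read is complete.
    """
    for i, received in enumerate(received_after_each_read):
        if received < (i + 1) * pixels_per_read:
            return i
    return None


# Illustrative numbers only: 10 pixels go missing during the third read.
counts = [100, 200, 290, 390]
short_at = first_short_read(counts, pixels_per_read=100)
```

Checking the cumulative count per read lets the failure be flagged at the read where pixels first go missing, instead of only at the end of the ramp.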

Comment by cloomis [ 25/Jul/23 ]

I am going to close this one, although I do expect to have smaller followup tickets. Did four things, basically:

  • ignore read back of the geometry registers, and instead use the values we wrote to them. This is the one problem which looks like hardware (our hardware, at least).
  • be gentler about stopping a ramp in the middle, as we often do from sunss. Basically, set the ASIC to idle, but do not reconfigure it. I still don't think I got this entirely right for all cases.
  • be more aggressive about initializing the SAM right before starting a ramp, and reporting errors.
  • start cleaning up after total failures in the inner loop. I will make a new ticket for finishing this work.

Comment by cloomis [ 26/Jul/23 ]

Yeah, close it, but also look at 98434 on n1.

Comment by cloomis [ 28/Jul/23 ]

As commented, there will be more work, but this ticket should be closed.

hxhal: tagged 3.1.1
ics_hxActor: tagged 2.7.9

which is what we have been running for a couple of nights at the end of the 2023-07 run.

Generated at Sat Feb 10 16:41:53 JST 2024 using Jira 8.3.4#803005-sha1:1f96e09b3c60279a408a2ae47be3c745f571388b.