[INSTRM-1800] Characterize/fix reqframes response Created: 21/Nov/22  Updated: 02/Feb/24

Status: Open
Project: Instrument control development
Component/s: Gen2
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Normal
Reporter: cloomis Assignee: eric
Resolution: Unresolved Votes: 0
Labels: EngRun
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

The reqframes service sometimes takes quite a long time to respond. In the latest case it took about 100s:

2022-11-21 01:56:20,661 | I | PFS.py:654 (reqframes) | reqframes num=100 type='A'
[ ... ]
2022-11-21 01:57:59,327 | I | PFS.py:658 (reqframes) | framelist:

Out of the 44230 reqframes calls made in 2021-2022, here are the times for all calls which took more than 1 second (a sketch of how these were pulled from the logs follows the list). We currently have a 10s timeout, so that is fourteen real failures for us.

2021-01-12 14:18:25.453000 2
2021-01-14 12:45:13.123000 2
2021-01-17 10:30:41.375000 2
2021-01-19 12:39:47.543000 2
2021-01-22 06:38:42.979000 3
2021-01-24 22:11:03.833000 5
2021-01-30 05:31:01.442000 3
2021-02-02 23:38:03.112000 4
2021-02-06 02:36:59.753000 4
2021-02-22 00:58:19.438000 5
2021-02-25 15:09:51.480000 2
2021-03-31 12:34:28.069000 3
2021-05-04 15:53:02.737000 2
2021-05-06 20:06:15.625000 2
2021-05-07 15:59:53.094000 2
2021-05-08 00:09:19.605000 2
2021-05-10 09:17:46.669000 2
2021-05-10 21:48:53.330000 3
2021-05-13 16:22:39.357000 5
2021-05-15 18:08:46.252000 4
2021-05-17 12:21:06.005000 5
2021-05-20 12:59:46.067000 7
2021-05-21 19:20:02.311000 7
2021-05-25 13:38:35.847000 9
2021-05-31 05:23:14.546000 11
2021-06-03 00:07:58.940000 15
2021-06-07 11:04:03.625000 15
2021-06-24 01:58:23.225000 26
2021-09-12 22:30:38.936000 2
2021-09-15 02:28:04.480000 5
2021-10-22 01:17:29.340000 2
2021-11-20 05:27:04.965000 2
2021-11-21 05:44:35.302000 3
2021-11-21 16:05:50.479000 14
2021-11-21 17:12:52.306000 3
2021-11-21 17:17:16.721000 2
2021-11-21 17:18:32.524000 2
2021-11-21 17:33:25.710000 2
2021-11-21 18:28:36.297000 4
2021-11-21 19:24:34.245000 3
2021-11-21 21:38:29.590000 90
2021-11-22 03:08:22.385000 3
2022-04-30 22:21:00.018000 3
2022-05-06 19:47:37.766000 17
2022-05-11 12:00:19.542000 5
2022-05-11 12:00:27.926000 5
2022-05-11 12:00:35.300000 5
2022-05-11 12:00:40.870000 3
2022-05-11 12:10:12.553000 2
2022-05-11 14:26:41.336000 2
2022-05-11 20:21:44.130000 3
2022-05-11 20:23:24.809000 2
2022-05-11 20:24:34.191000 2
2022-05-11 20:28:02.424000 2
2022-05-11 20:31:45.505000 2
2022-05-11 20:34:15.027000 3
2022-05-11 20:49:41.250000 3
2022-05-11 20:53:52.303000 2
2022-05-11 20:56:53.594000 127
2022-05-12 07:39:15.808000 26
2022-05-12 10:13:46.341000 2
2022-05-12 10:21:57.150000 7
2022-05-12 11:41:35.705000 15
2022-05-16 19:14:26.932000 10
2022-05-16 19:35:25.384000 2
2022-05-16 19:56:23.078000 2
2022-05-16 20:17:20.380000 2
2022-05-17 16:08:31.585000 2
2022-05-17 16:34:17.428000 2
2022-05-17 16:53:57.678000 2
2022-06-09 09:51:07.808000 11
2022-06-14 10:36:39.119000 8
2022-06-19 20:13:01.616000 2
2022-06-22 07:26:24.785000 3
2022-09-28 18:36:28.480000 3
2022-11-07 22:06:21.761000 3
2022-11-13 23:57:46.388000 7
2022-11-14 03:15:06.610000 2
2022-11-16 19:12:48.739000 38
2022-11-17 07:57:02.254000 7
2022-11-18 20:16:55.378000 2
2022-11-21 01:57:59.327000 98
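
For reference, these timings were pulled by pairing each "reqframes num=..." request line in the actor log with the following "framelist:" response line and taking the difference. A minimal sketch of that kind of extraction (the log path and exact line format are assumptions based on the excerpt above, not the actual analysis script):

import re
from datetime import datetime

# Assumed log location and timestamp format, taken from the excerpt above.
LOG_PATH = "PFS.log"
TS_FMT = "%Y-%m-%d %H:%M:%S,%f"
REQ_RE = re.compile(r"^(\S+ \S+) \| I \| PFS\.py:\d+ \(reqframes\) \| reqframes num=")
RESP_RE = re.compile(r"^(\S+ \S+) \| I \| PFS\.py:\d+ \(reqframes\) \| framelist:")

slow = []
pending = None
with open(LOG_PATH) as log:
    for line in log:
        m = REQ_RE.match(line)
        if m:
            pending = datetime.strptime(m.group(1), TS_FMT)
            continue
        m = RESP_RE.match(line)
        if m and pending is not None:
            done = datetime.strptime(m.group(1), TS_FMT)
            took = (done - pending).total_seconds()
            if took > 1.0:
                slow.append((done, took))
            pending = None

# Print one "response-timestamp seconds" line per slow call, as in the list above.
for when, took in slow:
    print(when, round(took))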


 Comments   
Comment by eric [ 24/Nov/22 ]

Thanks, cloomis.  We are already checking the logs on our side.

 

Comment by eric [ 10/Feb/23 ]

I've updated the frame service software on the summit Gen2 to include the fix for occasional slow frame allocation. Let me know if you experience slow allocations from now on.

Installation time: 2023-02-09 16:00

Comment by cloomis [ 26/Apr/23 ]

A couple more longish ones recently. For 2023, here are the five out of 2415 requests which took more than 5s:

2023-04-03 13:09:41    7.521
2023-04-13 16:04:29    8.456
2023-04-17 04:24:04    6.24
2023-04-25 04:28:42   72.544
2023-04-26 01:12:32   78.298

Comment by cloomis [ 23/Jul/23 ]

For the 2023-07 run, so far:

2023-07-19 21:15:55 21.154
2023-07-20 01:51:06 35.0
2023-07-20 20:33:58 145.434
2023-07-21 03:53:55 120.745
2023-07-21 19:35:49 83.595
2023-07-22 02:04:43 190.612
2023-07-22 19:25:29 232.99
2023-07-23 04:40:06 211.413
2023-07-23 22:43:30 5.054
2023-07-24 01:36:57 7.844
2023-07-24 02:25:08 15.56
2023-07-24 12:39:45 7.584
2023-07-24 20:41:26 7.886
2023-07-25 00:42:21 6.915
2023-07-25 01:26:03 16.094
2023-07-25 02:35:40 19.909
2023-07-25 04:21:36 22.694
2023-07-25 05:42:38 29.027
2023-07-25 12:12:12 5.095
2023-07-25 12:36:08 6.486
2023-07-25 23:03:53 14.431

Even though the Gen2 side is tracking this themselves, I'll keep this list up to date, both to allow comparison with their numbers and for our records.

Comment by eric [ 29/Jul/23 ]

It's too soon to close this issue, but it looks like the changes made during the 2023-07 run were fairly effective. The last two nights we had no timeouts.

Comment by yuki.moritani [ 17/Dec/23 ]

I'm just reporting that "FetchVisitFromGen2(failed to get a visit number in 10s !))" occurred around 2023-12-16 11:24 (HST).

Comment by eric [ 19/Dec/23 ]

@cloomis, can you run your script that gives the times > 1s? I'm mostly interested in post-October. We had to make a few changes (again) in the code for SQLAlchemy >= 2.x and I'm interested to see how things are going.

Comment by eric [ 19/Dec/23 ]

These are the ones (>= 1s) we see on our side:

2023-10-12 22:06:03,423 | I | framesvc.py:166 (getFrames) | db allocation took 1.061 sec
2023-10-13 15:20:47,132 | I | framesvc.py:166 (getFrames) | db allocation took 1.109 sec
2023-10-14 10:26:34,241 | I | framesvc.py:166 (getFrames) | db allocation took 1.198 sec
2023-10-15 07:40:34,006 | I | framesvc.py:166 (getFrames) | db allocation took 1.230 sec
2023-10-16 06:14:37,252 | I | framesvc.py:166 (getFrames) | db allocation took 1.321 sec
2023-11-13 15:56:44,205 | I | framesvc.py:166 (getFrames) | db allocation took 1.469 sec
2023-11-13 15:56:44,205 | I | framesvc.py:166 (getFrames) | db allocation took 1.469 sec
2023-11-18 09:12:47,585 | I | framesvc.py:166 (getFrames) | db allocation took 1.032 sec
2023-11-18 23:55:00,358 | I | framesvc.py:166 (getFrames) | db allocation took 1.429 sec
2023-11-29 17:52:47,186 | I | framesvc.py:166 (getFrames) | db allocation took 1.595 sec
2023-12-14 05:33:06,939 | I | framesvc.py:166 (getFrames) | db allocation took 1.047 sec
2023-12-14 06:40:50,591 | I | framesvc.py:166 (getFrames) | db allocation took 1.105 sec
2023-12-15 17:05:55,818 | I | framesvc.py:166 (getFrames) | db allocation took 1.077 sec
2023-12-16 11:24:15,280 | I | framesvc.py:166 (getFrames) | db allocation took 37.494 sec

The one on 12-16 stands out.  Most requests are averaging < 250 ms.

Comment by cloomis [ 19/Dec/23 ]

That matches our list (minus your duplicate, and plus a couple where the few ms to get to PFS brought the transaction over 1.0s).

Comment by eric [ 19/Dec/23 ]

Good. Still concerned about such an outlier, but it looks like as a whole timings are still going in the right direction.

Comment by cloomis [ 19/Dec/23 ]

Do you have any ideas about what caused the outlier?

Comment by eric [ 19/Dec/23 ]

Not yet. By any chance, are your frame requests all coming from the same thread in the same g2cam process?

Comment by cloomis [ 29/Dec/23 ]

I believe so, regarding the thread.
There are still outliers in the 2023-12 allocations > 1s:

2023-12-14 01:51:24 1.062
2023-12-14 05:33:05 1.131
2023-12-14 06:40:49 1.135
2023-12-15 17:05:54 1.145
2023-12-16 11:23:37 37.545
2023-12-19 21:12:39 1.384
2023-12-21 20:58:51 1.34
2023-12-21 21:00:16 1.206
2023-12-21 21:05:00 1.287
2023-12-21 23:19:48 75.016
2023-12-22 03:08:43 58.98
2023-12-22 18:42:10 1.205
2023-12-22 19:01:21 1.021
2023-12-22 22:31:36 80.436
2023-12-23 01:23:20 116.709
2023-12-23 04:29:35 1.03
2023-12-23 04:43:15 1.126
2023-12-23 05:06:26 1.013
2023-12-23 06:53:23 128.693
2023-12-27 18:11:48 3.275
2023-12-27 18:13:00 1.795
2023-12-27 18:13:53 1.962
2023-12-27 18:14:47 1.245
2023-12-27 18:15:40 1.313
2023-12-27 18:16:34 1.082
2023-12-27 18:17:27 1.085
2023-12-27 18:31:39 1.149
2023-12-27 19:23:28 115.652
2023-12-28 00:29:47 25.147
2023-12-28 03:14:42 5.175
2023-12-28 05:22:38 8.18
2023-12-28 20:45:37 1.668
2023-12-28 21:04:10 10.697
total: 1909 frames

Comment by eric [ 30/Dec/23 ]

According to the logging I added yesterday, the slow time is happening in an SQLAlchemy commit() call, after the inserts. We are looking at instrumenting the Postgres side with some of the usual tools for tracking down slow operations.
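
For reference, commit-level timing like that can be hooked in with SQLAlchemy session events; a minimal sketch of the idea, not the actual logging added on the Gen2 side (the logger name and the 1s threshold are assumptions):

import logging
import time

from sqlalchemy import event
from sqlalchemy.orm import Session

logger = logging.getLogger("framesvc.timing")

@event.listens_for(Session, "before_commit")
def _stamp_commit_start(session):
    # Record wall-clock time just before the commit (and its flush) starts.
    session.info["commit_start"] = time.time()

@event.listens_for(Session, "after_commit")
def _log_commit_time(session):
    # Log how long the flush + COMMIT took; flag anything over a second.
    start = session.info.pop("commit_start", None)
    if start is None:
        return
    took = time.time() - start
    if took >= 1.0:
        logger.warning("slow commit: %.3f sec", took)

Any commit taking more than a second would then show up in the service log with its duration.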

Comment by eric [ 02/Feb/24 ]

We made one more change in December (a slight change to the flush/commit sequences). So far, on the Gen2 side, we have seen no allocations >= 1 sec since December 30, although the number of allocations since then is not high. Hopefully we can get a better sense during the next engineering run in March.
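
Purely as an illustration of what a flush/commit-sequence change of this general shape can look like (none of the actual frame-service code is shown in this ticket, so every name below is hypothetical): batching the pending frame rows into one flush and one commit instead of flushing/committing per row.

from sqlalchemy.orm import Session

def allocate_frames(session: Session, frame_rows):
    # Hypothetical illustration only: stage all new frame rows, emit the
    # INSERTs in a single flush, and finish with a single COMMIT, rather
    # than flushing/committing once per row.
    for row in frame_rows:
        session.add(row)
    session.flush()   # one batch of INSERT statements
    session.commit()  # one COMMIT for the whole allocation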
