[INSTRM-1800] Characterize/fix reqframes response Created: 21/Nov/22 Updated: 02/Feb/24 |
|
| Status: | Open |
| Project: | Instrument control development |
| Component/s: | Gen2 |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Normal |
| Reporter: | cloomis | Assignee: | eric |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | EngRun | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
The reqframes service sometimes takes quite a long time to respond. In the latest case about 100s: 2022-11-21 01:56:20,661 | I | PFS.py:654 (reqframes) | reqframes num=100 type='A' [ ... ] 2022-11-21 01:57:59,327 | I | PFS.py:658 (reqframes) | framelist: Out of the 44230 reqframes calls made in 2021-2, here are the times for all calls which took more than 1 second. We currently have a 10s timeout, so fourteen real failures for us. 2021-01-12 14:18:25.453000 2 2021-01-14 12:45:13.123000 2 2021-01-17 10:30:41.375000 2 2021-01-19 12:39:47.543000 2 2021-01-22 06:38:42.979000 3 2021-01-24 22:11:03.833000 5 2021-01-30 05:31:01.442000 3 2021-02-02 23:38:03.112000 4 2021-02-06 02:36:59.753000 4 2021-02-22 00:58:19.438000 5 2021-02-25 15:09:51.480000 2 2021-03-31 12:34:28.069000 3 2021-05-04 15:53:02.737000 2 2021-05-06 20:06:15.625000 2 2021-05-07 15:59:53.094000 2 2021-05-08 00:09:19.605000 2 2021-05-10 09:17:46.669000 2 2021-05-10 21:48:53.330000 3 2021-05-13 16:22:39.357000 5 2021-05-15 18:08:46.252000 4 2021-05-17 12:21:06.005000 5 2021-05-20 12:59:46.067000 7 2021-05-21 19:20:02.311000 7 2021-05-25 13:38:35.847000 9 2021-05-31 05:23:14.546000 11 2021-06-03 00:07:58.940000 15 2021-06-07 11:04:03.625000 15 2021-06-24 01:58:23.225000 26 2021-09-12 22:30:38.936000 2 2021-09-15 02:28:04.480000 5 2021-10-22 01:17:29.340000 2 2021-11-20 05:27:04.965000 2 2021-11-21 05:44:35.302000 3 2021-11-21 16:05:50.479000 14 2021-11-21 17:12:52.306000 3 2021-11-21 17:17:16.721000 2 2021-11-21 17:18:32.524000 2 2021-11-21 17:33:25.710000 2 2021-11-21 18:28:36.297000 4 2021-11-21 19:24:34.245000 3 2021-11-21 21:38:29.590000 90 2021-11-22 03:08:22.385000 3 2022-04-30 22:21:00.018000 3 2022-05-06 19:47:37.766000 17 2022-05-11 12:00:19.542000 5 2022-05-11 12:00:27.926000 5 2022-05-11 12:00:35.300000 5 2022-05-11 12:00:40.870000 3 2022-05-11 12:10:12.553000 2 2022-05-11 14:26:41.336000 2 2022-05-11 20:21:44.130000 3 2022-05-11 20:23:24.809000 2 2022-05-11 20:24:34.191000 2 2022-05-11 20:28:02.424000 2 2022-05-11 20:31:45.505000 2 2022-05-11 20:34:15.027000 3 2022-05-11 20:49:41.250000 3 2022-05-11 20:53:52.303000 2 2022-05-11 20:56:53.594000 127 2022-05-12 07:39:15.808000 26 2022-05-12 10:13:46.341000 2 2022-05-12 10:21:57.150000 7 2022-05-12 11:41:35.705000 15 2022-05-16 19:14:26.932000 10 2022-05-16 19:35:25.384000 2 2022-05-16 19:56:23.078000 2 2022-05-16 20:17:20.380000 2 2022-05-17 16:08:31.585000 2 2022-05-17 16:34:17.428000 2 2022-05-17 16:53:57.678000 2 2022-06-09 09:51:07.808000 11 2022-06-14 10:36:39.119000 8 2022-06-19 20:13:01.616000 2 2022-06-22 07:26:24.785000 3 2022-09-28 18:36:28.480000 3 2022-11-07 22:06:21.761000 3 2022-11-13 23:57:46.388000 7 2022-11-14 03:15:06.610000 2 2022-11-16 19:12:48.739000 38 2022-11-17 07:57:02.254000 7 2022-11-18 20:16:55.378000 2 2022-11-21 01:57:59.327000 98 |
| Comments |
| Comment by eric [ 24/Nov/22 ] |
|
Thanks, cloomis. We are already checking the logs on our side.
|
| Comment by eric [ 10/Feb/23 ] |
|
I've updated the frame service software on summit Gen2 to include the fix for occasional slow frame allocation. LMK from now if you experience slow allocations. Installation time: 2023-02-09 16:00 |
| Comment by cloomis [ 26/Apr/23 ] |
|
A couple more longish ones recently. For 2023, the five out of 2415 requests which took more than 5s: 2023-04-03 13:09:41 7.521 2023-04-13 16:04:29 8.456 2023-04-17 04:24:04 6.24 2023-04-25 04:28:42 72.544 2023-04-26 01:12:32 78.298 |
| Comment by cloomis [ 23/Jul/23 ] |
|
For the 2023-07 run, so far: 2023-07-19 21:15:55 21.154 2023-07-20 01:51:06 35.0 2023-07-20 20:33:58 145.434 2023-07-21 03:53:55 120.745 2023-07-21 19:35:49 83.595 2023-07-22 02:04:43 190.612 2023-07-22 19:25:29 232.99 2023-07-23 04:40:06 211.413 2023-07-23 22:43:30 5.054 2023-07-24 01:36:57 7.844 2023-07-24 02:25:08 15.56 2023-07-24 12:39:45 7.584 2023-07-24 20:41:26 7.886 2023-07-25 00:42:21 6.915 2023-07-25 01:26:03 16.094 2023-07-25 02:35:40 19.909 2023-07-25 04:21:36 22.694 2023-07-25 05:42:38 29.027 2023-07-25 12:12:12 5.095 2023-07-25 12:36:08 6.486 2023-07-25 23:03:53 14.431 Even though the Gen2 side is tracking this themselves, I'll keep this list up to date both to allow comparing with them, and for our records. |
| Comment by eric [ 29/Jul/23 ] |
|
It's too soon to close this issue, but it looks like the changes made during the 2023-07 run were fairly effective. The last two nights we had no timeouts. |
| Comment by yuki.moritani [ 17/Dec/23 ] |
|
I'm just reporting that "FetchVisitFromGen2(failed to get a visit number in 10s !))" occurred around 2023-12-16 11:24 (HST) |
| Comment by eric [ 19/Dec/23 ] |
|
@cloomis, can you run your script that gives the times > 1s? I'm mostly interested in post-october. We had to make a few changes (again) in the code for sqlalchemy >= 2.x and I'm interested to see how things are going. |
| Comment by eric [ 19/Dec/23 ] |
|
These are the ones (>= 1s) .we see on our side: 2023-10-12 22:06:03,423 | I | framesvc.py:166 (getFrames) | db allocation took 1.061 sec The one on 12-16 stands out. Most requests are averaging < 250 ms. |
| Comment by cloomis [ 19/Dec/23 ] |
|
That matches our list (minus your duplicate and plus a couple where the few ms to get to PFS brought the transaction > 1.0s) |
| Comment by eric [ 19/Dec/23 ] |
|
Good. Still concerned about such an outlier, but it looks like as a whole timings are still going in the right direction. |
| Comment by cloomis [ 19/Dec/23 ] |
|
Do you have any ideas about what caused the outlier? |
| Comment by eric [ 19/Dec/23 ] |
|
Not yet. Per chance, are your frame requests all coming from the same thread in the same g2cam process? |
| Comment by cloomis [ 29/Dec/23 ] |
|
I believe so, on the thread. 2023-12-14 01:51:24 1.062 2023-12-14 05:33:05 1.131 2023-12-14 06:40:49 1.135 2023-12-15 17:05:54 1.145 2023-12-16 11:23:37 37.545 2023-12-19 21:12:39 1.384 2023-12-21 20:58:51 1.34 2023-12-21 21:00:16 1.206 2023-12-21 21:05:00 1.287 2023-12-21 23:19:48 75.016 2023-12-22 03:08:43 58.98 2023-12-22 18:42:10 1.205 2023-12-22 19:01:21 1.021 2023-12-22 22:31:36 80.436 2023-12-23 01:23:20 116.709 2023-12-23 04:29:35 1.03 2023-12-23 04:43:15 1.126 2023-12-23 05:06:26 1.013 2023-12-23 06:53:23 128.693 2023-12-27 18:11:48 3.275 2023-12-27 18:13:00 1.795 2023-12-27 18:13:53 1.962 2023-12-27 18:14:47 1.245 2023-12-27 18:15:40 1.313 2023-12-27 18:16:34 1.082 2023-12-27 18:17:27 1.085 2023-12-27 18:31:39 1.149 2023-12-27 19:23:28 115.652 2023-12-28 00:29:47 25.147 2023-12-28 03:14:42 5.175 2023-12-28 05:22:38 8.18 2023-12-14 01:51:24 1.062 2023-12-14 05:33:05 1.131 2023-12-14 06:40:49 1.135 2023-12-15 17:05:54 1.145 2023-12-16 11:23:37 37.545 2023-12-19 21:12:39 1.384 2023-12-21 20:58:51 1.34 2023-12-21 21:00:16 1.206 2023-12-21 21:05:00 1.287 2023-12-21 23:19:48 75.016 2023-12-22 03:08:43 58.98 2023-12-22 18:42:10 1.205 2023-12-22 19:01:21 1.021 2023-12-22 22:31:36 80.436 2023-12-23 01:23:20 116.709 2023-12-23 04:29:35 1.03 2023-12-23 04:43:15 1.126 2023-12-23 05:06:26 1.013 2023-12-23 06:53:23 128.693 2023-12-27 18:11:48 3.275 2023-12-27 18:13:00 1.795 2023-12-27 18:13:53 1.962 2023-12-27 18:14:47 1.245 2023-12-27 18:15:40 1.313 2023-12-27 18:16:34 1.082 2023-12-27 18:17:27 1.085 2023-12-27 18:31:39 1.149 2023-12-27 19:23:28 115.652 2023-12-28 00:29:47 25.147 2023-12-28 03:14:42 5.175 2023-12-28 05:22:38 8.18 2023-12-28 20:45:37 1.668 2023-12-28 21:04:10 10.697 total: 1909 frames |
| Comment by eric [ 30/Dec/23 ] |
|
By the logging I added yesterday, the slow time is happening in an SQLAlchemy commit() call, after inserts. We are looking at instrumenting the Postgres side with some of the tools used to figure out slow operations. |
| Comment by eric [ 02/Feb/24 ] |
|
We made one more change in December (slight change to flush/commit sequences). So far on Gen2 side we see no allocations >= 1sec since December 30. Although the number of allocations since then is not high. Hopefully we can get a better sense during the next engineering in March. |