[INSTRM-66] archiver startup is slow Created: 06/Jan/17  Updated: 03/Jul/18  Resolved: 03/Jul/18

Status: Done
Project: Instrument control development
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major
Reporter: cloomis Assignee: arnaud.lefur
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

The archiver takes a long time to start, evidently because of checking/crawling all the existing rows. At LAM, the raw tables have ~170M rows:

2017-01-05 17:31:36+0000 Log opened.
2017-01-05 17:31:36+0000 importing postgres dbapi module psycopg2
2017-01-05 17:31:36+0000 database: table enu__slit contains 1 rows
2017-01-05 17:33:50+0000 database: table reply_raw contains 170736146 rows
2017-01-05 17:35:06+0000 database: table reply_hdr contains 170686655 rows
2017-01-05 17:35:06+0000 database: table hub__commanders contains 104527 rows
2017-01-05 17:35:06+0000 database: table hub__users contains 104526 rows
2017-01-05 17:35:06+0000 database: table actors contains 44 rows

It is not clear whether that scan is necessary. If not perhaps we can remove it. If it is, perhaps we need to use a union database or something.



 Comments   
Comment by arnaud.lefur [ 23/May/17 ]

"select count from %s" % tableName" is really slow with a billion row.
But it appears that's not really necessary to do it every single tables.

However it's required for reply_raw, and reply_hdr, in that case, i'm just taking the last row_id, and it seems ok.

https://github.com/Subaru-PFS/ics_archiver/commit/ea01a7421cba16da3054e7ac3d0d3316191ca266

I'll merge later, if it's fine.

Comment by shimono [ 23/May/17 ]

Or if it is int/bigint type, you could use MAX() instead of??

  1. In pgsql, count could be quite slow if index only scan is not enabled, due to MVCC... (I don't have clear idea that MAX() can solve this,,, sorry.)
Comment by cloomis [ 23/May/17 ]

I agree with using MAX():


archiver=# explain select count(id) from reply_raw;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Aggregate  (cost=1695406.31..1695406.32 rows=1 width=8)
   ->  Seq Scan on reply_raw  (cost=0.00..1533464.05 rows=64776905 width=8)
(2 rows)

versus:

archiver=# explain select max(id) from reply_raw;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.61..0.62 rows=1 width=0)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.56..0.61 rows=1 width=8)
           ->  Index Only Scan Backward using reply_raw_pkey on reply_raw  (cost=0.56..2732366.40 rows=64776905 width=8)
                 Index Cond: (id IS NOT NULL)
(5 rows)

Note that we should perhaps be using postgresql's `BIGSERIAL` type for the ids. Probably does not matter for single-writer tables. But it does make the monotonically increasing integer constraint explicit.

Comment by shimono [ 24/May/17 ]

Due to MVCC used by pgsql, you need to use some trick or configuration to get rid from seq scan into index only scan for count(xxx). For just getting max number from auto_increment-ish column, the easiest way in pgsql is max()...

Comment by arnaud.lefur [ 24/May/17 ]

Yes, I should have used max(), what I did is equivalent but not very clean.

Comment by shimono [ 28/Jun/17 ]

arnaud.lefur close this?

Comment by cloomis [ 03/Jul/18 ]

This has been fixed for a year.

Generated at Sat Feb 10 16:21:09 JST 2024 using Jira 8.3.4#803005-sha1:1f96e09b3c60279a408a2ae47be3c745f571388b.