Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slow execution of complex query #4281

Closed
knutwannheden opened this issue Sep 7, 2022 · 16 comments
Closed

slow execution of complex query #4281

knutwannheden opened this issue Sep 7, 2022 · 16 comments

Comments

@knutwannheden
Copy link

The following somewhat complex query is very slow on Dolt, when compared to MySQL. In MySQL the query runs in about 110ms (on my machine) and in Dolt the query takes around 35s to complete (so some factor 300+):

with kommission_interessengruppe as (select k.id kommission_id, i.id interessengruppe_id
                                     from kommission k
                                              join branche b on k.id = b.kommission_id
                                              join interessengruppe i on b.id = i.branche_id
                                     where k.abkuerzung in ('SGK-NR', 'SGK-SR'))
select count(*)
from parlamentarier pa
         join partei p on pa.partei_id = p.id
         join interessenbindung ib on ib.parlamentarier_id = pa.id
         left join (select *
                    from interessenbindung_jahr
                    where id in (select max(id) from interessenbindung_jahr group by interessenbindung_id)) ibj
                   on ibj.interessenbindung_id = ib.id
         join organisation o on ib.organisation_id = o.id
         join in_kommission ik on pa.id = ik.parlamentarier_id
         join kommission_interessengruppe ki on ki.kommission_id = ik.kommission_id and ki.interessengruppe_id in
                                                                                        (o.interessengruppe_id,
                                                                                         o.interessengruppe2_id,
                                                                                         o.interessengruppe3_id)
where now() between pa.im_rat_seit and coalesce(pa.im_rat_bis, now())
  and now() between coalesce(ib.von, ib.created_date) and coalesce(ib.bis, now())
  and now() between ib.von and coalesce(ik.bis, now())
  and ib.art != 'mitglied'
  and o.rechtsform not in ('Parlamentarische Gruppe', 'Informelle Gruppe')
;

I have attached the SQL script required to create the schema and data here: lobbywatch.zip

Here is the execution plan in MySQL 8:

SELECT (select)          
UNKNOWN (Aggregate)   1 53.65   count(0)
UNKNOWN (Nested loop left join)   39 49.8    
NESTED_LOOPS (Nested loop inner join)   12 42.9    
NESTED_LOOPS (Nested loop inner join)   4 37.8    
NESTED_LOOPS (Nested loop inner join)   5 36.02    
NESTED_LOOPS (Nested loop inner join)   2 17.25    
NESTED_LOOPS (Nested loop inner join)   2 16.63    
NESTED_LOOPS (Nested loop inner join)   16 8.41    
NESTED_LOOPS (Nested loop inner join)   1 6.55    
FILTER (filter)   14 1.65   (b.kommission_id is not null)
FULL_INDEX_SCAN (Index scan) table: b; index: idx_kommission_freigabe; 14 1.65    
FILTER (filter)   0.08 0.25   (k.abkuerzung in ('SGK-NR','SGK-SR'))
UNIQUE_INDEX_SCAN (Single-row index lookup) table: k; index: PRIMARY; 1 0.25   (id=b.kommission_id)
INDEX_SCAN (Covering index lookup) table: ik; index: idx_kommission; 15 1.62   (kommission_id=b.kommission_id)
FILTER (filter)   0.1 0.42   (((now()) between pa.im_rat_seit and coalesce(pa.im_rat_bis,(now()))) and (pa.partei_id is not null))
UNIQUE_INDEX_SCAN (Single-row index lookup) table: pa; index: PRIMARY; 1 0.42   (id=ik.parlamentarier_id)
UNIQUE_INDEX_SCAN (Single-row covering index lookup) table: p; index: PRIMARY; 1 0.31   (id=pa.partei_id)
FILTER (filter)   3 7.79   (((now()) between coalesce(ib.von,ib.created_date) and coalesce(ib.bis,(now()))) and ((now()) between ib.von and coalesce(ik.bis,(now()))) and (ib.art <> 'mitglied'))
UNIQUE_INDEX_SCAN (Index lookup) table: ib; index: parlamentarier_id; 31 7.79   (parlamentarier_id=ik.parlamentarier_id)
FILTER (filter)   1 0.27   (o.rechtsform not in ('Parlamentarische Gruppe','Informelle Gruppe'))
UNIQUE_INDEX_SCAN (Single-row index lookup) table: o; index: PRIMARY; 1 0.27   (id=ib.organisation_id)
FILTER (filter)   3 0.33   (i.id in (o.interessengruppe_id,o.interessengruppe2_id,o.interessengruppe3_id))
INDEX_SCAN (Covering index lookup) table: i; index: idx_branche_freigabe; 10 0.33   (branche_id=b.id)
FILTER (filter)   3 0.28   (interessenbindung_jahr.id,interessenbindung_jahr.id in (select #3))
INDEX_SCAN (Covering index lookup) table: interessenbindung_jahr; index: idx_jahr_unique; 3 0.28   (interessenbindung_id=ib.id)
SUBQUERY (Select)         #3 (subquery in condition; run only once)
FILTER (filter)   10550 3189.25 3189.25 ((interessenbindung_jahr.id = ``.max(id)))
UNKNOWN (Limit)         1 row(s)
UNIQUE_INDEX_SCAN (Index lookup) table: ; index: ;       (max(id)=interessenbindung_jahr.id)
UNKNOWN (Materialize with deduplication)   10550 3189.25 3189.25  
AGGREGATE (Group aggregate)   10550 2134.25   max(interessenbindung_jahr.id)
FULL_INDEX_SCAN (Index scan) table: interessenbindung_jahr; index: idx_jahr_unique; 10550 1079.25    

When I try to run EXPLAIN for the query in Dolt I get an error: 2022-09-07T15:59:47+02:00 WARN [conn 3] error running query {connectTime=2022-09-07T15:59:11+02:00, connectionDb=lobbywatch_public, error=string ' │ │ │ │ │ │ └─ columns: [id nachname vorname vorname_kurz zweiter_vorname buergerorte rat_id kanton_id kommissionen partei_id parteifunktion fraktion_id fraktionsfunktion im_rat_seit im_rat_bis ratswechsel ratsunterbruch_von ratsunterbruch_bis beruf beruf_fr beruf_interessengruppe_id titel aemter weitere_aemter zivilstand anzahl_kinder militaerischer_grad_id geschlecht geburtstag photo_dateiname photo_dateierweiterung photo_dateiname_voll photo_mime_type kleinbild sitzplatz email_2 homepage homepage_2 parlament_biografie_id parlament_number parlament_beruf_json parlament_interessenbindungen parlament_interessenbindungen_json parlament_interessenbindungen_updated twitter_name instagram_profil youtube_user linkedin_profil_url xing_profil_name facebook_name wikipedia wikidata_qid sprache arbeitssprache adresse_firma adresse_plz adresse_ort erfasst autorisierung_reminder_verschickt_visa autorisierung_reminder_verschickt_datum autorisiert_datum freigabe_datum created_date updated_date]' is too large for column 'varchar(1000)', query=/* ApplicationName=IntelliJ IDEA 2022.2.2 Preview */ explain with kommission_interessengruppe as (select k.id kommission_id, i.id interessengruppe_id from kommission k join branche b on k.id = b.kommission_id join interessengruppe i on b.id = i.branche_id where k.abkuerzung in ('SGK-NR', 'SGK-SR')) select count(*) from parlamentarier pa join partei p on pa.partei_id = p.id join interessenbindung ib on ib.parlamentarier_id = pa.id left join (select * from interessenbindung_jahr where id in (select max(id) from interessenbindung_jahr group by interessenbindung_id)) ibj on ibj.interessenbindung_id = ib.id join organisation o on ib.organisation_id = o.id join in_kommission ik on pa.id = ik.parlamentarier_id join kommission_interessengruppe ki on ki.kommission_id = ik.kommission_id and ki.interessengruppe_id in (o.interessengruppe_id, o.interessengruppe2_id, o.interessengruppe3_id) where now() between pa.im_rat_seit and coalesce(pa.im_rat_bis, now()) and now() between coalesce(ib.von, ib.created_date) and coalesce(ib.bis, now()) and now() between ib.von and coalesce(ik.bis, now()) and ib.art != 'mitglied' and o.rechtsform not in ('Parlamentarische Gruppe', 'Informelle Gruppe')}

I am sure the query can be boiled down a bit to a much simpler query, which also has performance problems. Please let me know if I can help with anything.

@timsehn
Copy link
Contributor

timsehn commented Sep 7, 2022

@max-hoffman will dig into this one when he gets a chance.

@max-hoffman
Copy link
Contributor

max-hoffman commented Sep 7, 2022

Some additional notes:

  1. We panic trying to import the dump, with a Found dangling references to HashSet error logged. The script imports successfully with SET AUTOCOMMIT = 1;

  2. The explain error is due to a limitation in the EXPLAIN schema, which was varchar(1000). I will fix that.

  3. If I run the EXPLAIN with count(p.id) instead of count(*), we no longer try to read hundreds of columns, which bypasses the schema limitation, runs in about ~5s (from 25s for original query on my machine), and gives plan below. We are not using primary keys for lookups, which if fixed would probably make this query sub-second.

| GroupBy                                                                                                                                                                                                                                            |
|  ├─ SelectedExprs(COUNT(p.id))                                                                                                                                                                                                                     |
|  ├─ Grouping()                                                                                                                                                                                                                                     |
|  └─ Filter(2022-09-07 09:17:30.48366 -0700 PDT m=+57.525852929 BETWEEN ib.von AND coalesce(ik.bis, 2022-09-07 09:17:30.48366 -0700 PDT m=+57.525852929))                                                                                           |
|      └─ InnerJoin((ki.kommission_id = ik.kommission_id) AND (ki.interessengruppe_id IN (o.interessengruppe_id, o.interessengruppe2_id, o.interessengruppe3_id)))                                                                                   |
|          ├─ InnerJoin(pa.id = ik.parlamentarier_id)                                                                                                                                                                                                |
|          │   ├─ InnerJoin(ib.organisation_id = o.id)                                                                                                                                                                                               |
|          │   │   ├─ LeftJoin(obj.interessenbindung_id = ib.id)                                                                                                                                                                                     |
|          │   │   │   ├─ InnerJoin(ib.parlamentarier_id = pa.id)                                                                                                                                                                                    |
|          │   │   │   │   ├─ InnerJoin(pa.partei_id = p.id)                                                                                                                                                                                         |
|          │   │   │   │   │   ├─ Exchange                                                                                                                                                                                                           |
|          │   │   │   │   │   │   └─ Filter(2022-09-07 09:17:30.48366 -0700 PDT m=+57.525852929 BETWEEN pa.im_rat_seit AND coalesce(pa.im_rat_bis, 2022-09-07 09:17:30.48366 -0700 PDT m=+57.525852929))                                            |
|          │   │   │   │   │   │       └─ TableAlias(pa)                                                                                                                                                                                             |
|          │   │   │   │   │   │           └─ Table(parlamentarier)                                                                                                                                                                                  |
|          │   │   │   │   │   │               └─ columns: [id partei_id im_rat_seit im_rat_bis]                                                                                                                                                     |
|          │   │   │   │   │   └─ Exchange                                                                                                                                                                                                           |
|          │   │   │   │   │       └─ TableAlias(p)                                                                                                                                                                                                  |
|          │   │   │   │   │           └─ Table(partei)                                                                                                                                                                                              |
|          │   │   │   │   │               └─ columns: [id]                                                                                                                                                                                          |
|          │   │   │   │   └─ Exchange                                                                                                                                                                                                               |
|          │   │   │   │       └─ Filter((2022-09-07 09:17:30.48366 -0700 PDT m=+57.525852929 BETWEEN coalesce(ib.von, ib.created_date) AND coalesce(ib.bis, 2022-09-07 09:17:30.48366 -0700 PDT m=+57.525852929)) AND (NOT((ib.art = 'mitglied')))) |
|          │   │   │   │           └─ TableAlias(ib)                                                                                                                                                                                                 |
|          │   │   │   │               └─ Table(interessenbindung)                                                                                                                                                                                   |
|          │   │   │   │                   └─ columns: [id parlamentarier_id organisation_id art von bis created_date]                                                                                                                               |
|          │   │   │   └─ HashLookup(child: (obj.interessenbindung_id), lookup: (ib.id))                                                                                                                                                             |
|          │   │   │       └─ CachedResults                                                                                                                                                                                                          |
|          │   │   │           └─ SubqueryAlias(obj)                                                                                                                                                                                                 |
|          │   │   │               └─ Project(interessenbindung_jahr.id, interessenbindung_jahr.interessenbindung_id)                                                                                                                                |
|          │   │   │                   └─ IndexedInSubqueryFilter(interessenbindung_jahr.id IN ((GroupBy                                                                                                                                             |
|          │   │   │                       ├─ SelectedExprs(MAX(interessenbindung_jahr.id))                                                                                                                                                          |
|          │   │   │                       ├─ Grouping(interessenbindung_jahr.interessenbindung_id)                                                                                                                                                  |
|          │   │   │                       └─ Table(interessenbindung_jahr)                                                                                                                                                                          |
|          │   │   │                           └─ columns: [id interessenbindung_id]                                                                                                                                                                 |
|          │   │   │                      )))                                                                                                                                                                                                        |
|          │   │   │                       └─ IndexedTableAccess(interessenbindung_jahr)                                                                                                                                                             |
|          │   │   │                           └─ index: [interessenbindung_jahr.id]                                                                                                                                                                 |
|          │   │   └─ Exchange                                                                                                                                                                                                                       |
|          │   │       └─ Filter(NOT((o.rechtsform HASH IN ('Parlamentarische Gruppe', 'Informelle Gruppe'))))                                                                                                                                       |
|          │   │           └─ TableAlias(o)                                                                                                                                                                                                          |
|          │   │               └─ Table(organisation)                                                                                                                                                                                                |
|          │   │                   └─ columns: [id rechtsform interessengruppe_id interessengruppe2_id interessengruppe3_id]                                                                                                                         |
|          │   └─ Exchange                                                                                                                                                                                                                           |
|          │       └─ TableAlias(ik)                                                                                                                                                                                                                 |
|          │           └─ Table(in_kommission)                                                                                                                                                                                                       |
|          │               └─ columns: [parlamentarier_id kommission_id bis]                                                                                                                                                                         |
|          └─ CachedResults                                                                                                                                                                                                                          |
|              └─ SubqueryAlias(ki)                                                                                                                                                                                                                  |
|                  └─ Project(k.id as kommission_id, i.id as interessengruppe_id)                                                                                                                                                                    |
|                      └─ IndexedJoin(b.id = i.branche_id)                                                                                                                                                                                           |
|                          ├─ IndexedJoin(k.id = b.kommission_id)                                                                                                                                                                                    |
|                          │   ├─ Exchange                                                                                                                                                                                                           |
|                          │   │   └─ TableAlias(b)                                                                                                                                                                                                  |
|                          │   │       └─ Table(branche)                                                                                                                                                                                             |
|                          │   │           └─ columns: [id kommission_id]

We can force index lookups in other ways if this is a sensitive query that needs to be optimized in a hurry.

@knutwannheden
Copy link
Author

Thanks for the feedback. I had noticed the import error and was going to report that separately. No need now, I suppose :-)

Thanks for the count() trick. I suppose that will be fixed? The first non-null value should be enough per tuple.

The query is not sensitive, but I imagine I will have many other queries which would exhibit the same performance problems. Because actually my regular queries wouldn't do a count(). I only did the for the reproducer, as depending on the client being used, only the first page would get loaded.

@max-hoffman
Copy link
Contributor

Small status update, I've identified the issue and am making progress in this PR dolthub/go-mysql-server#1247.

If I move ki.interessengruppe_id in (o.interessengruppe_id, o.interessengruppe2_id, o.interessengruppe3_id) into the where filter (sorting out conjunctions issues now) locally I'm logging 680ms with count(*), and 480ms with count(p.id).

@knutwannheden
Copy link
Author

Sounds excellent! I tried doing what you describe (inline CTE in FROM clause) but with my version of Dolt I didn't see much of a difference.

Regarding the other two problems we briefly discussed here, do you need issues for them?

@max-hoffman
Copy link
Contributor

Yes the main issue is the join order, which will only be in that feature branch for now. Moving the condition to a WHERE will be unnecessary after I am finished, but lets me give you an idea right now about speed expectations. I will roll the EXPECT error into this PR, but creating a separate issue for the import panic might be useful for whoever picks that one up.

We appreciate the bug reports! Keep them coming.

@max-hoffman
Copy link
Contributor

You should usually file a bug before resorting to this, but you also manually set the join order with SELECT /*+ JOIN_ORDER(pa, p, ib, obj, o, ik, ki) */ count(*) ... if you feel so inclined. This should give you the fast query on main right now.

@knutwannheden
Copy link
Author

I see, Oracle-style hints. Are these documented somewhere?

@timsehn
Copy link
Contributor

timsehn commented Sep 9, 2022

No. Maybe we should document them?

@timsehn
Copy link
Contributor

timsehn commented Sep 12, 2022

Resolving. Made docs ticket for the join hints.

@timsehn timsehn closed this as completed Sep 12, 2022
@timsehn
Copy link
Contributor

timsehn commented Sep 12, 2022

Oops. Reopening until PR is merged.

@timsehn timsehn reopened this Sep 12, 2022
@max-hoffman
Copy link
Contributor

I need to rewrite join ordering generation to get the final edge cases. I hope to finish by Friday.

@max-hoffman
Copy link
Contributor

@knutwannheden I ended up rewriting our join ordering algorithm to fix this. I am currently working on adding our missing join operators, SemiJoin, AntiJoin, FullOuterJoin to make it easier to test the rewrite, and then I am redoing our index selection, which should be easier now that we only generate correctly indexable plans. The combination of these improvements will make us lighter on our feet reacting to join queries in the future. Thank you for your patience! Keep the bugs coming.

@max-hoffman
Copy link
Contributor

The new runtime with the joins PR (dolthub/go-mysql-server#1290.) and pending up subquery bug fix (dolthub/go-mysql-server#1350) should be more satisfactory, though there is more room for improving the join plan when we get to costing filters:

time dolt sql -q "with kommission_interessengruppe as (select k.id kommission_id, i.id interessengruppe_id                                      from kommission k                                               join branche b on k.id = b.kommission_id                                               join interessengruppe i on b.id = i.branche_id                                      where k.abkuerzung in ('SGK-NR', 'SGK-SR')) select count(*) from parlamentarier pa          join partei p on pa.partei_id = p.id          join interessenbindung ib on ib.parlamentarier_id = pa.id          left join (select *                     from interessenbindung_jahr                     where id in (select max(id) from interessenbindung_jahr group by interessenbindung_id)) ibj                    on ibj.interessenbindung_id = ib.id          join organisation o on ib.organisation_id = o.id          join in_kommission ik on pa.id = ik.parlamentarier_id          join kommission_interessengruppe ki on ki.kommission_id = ik.kommission_id and ki.interessengruppe_id in                                                                                         (o.interessengruppe_id,                                                                                          o.interessengruppe2_id,                                                                                          o.interessengruppe3_id) where now() between pa.im_rat_seit and coalesce(pa.im_rat_bis, now())   and now() between coalesce(ib.von, ib.created_date) and coalesce(ib.bis, now())   and now() between ib.von and coalesce(ik.bis, now())   and ib.art != 'mitglied'   and o.rechtsform not in ('Parlamentarische Gruppe', 'Informelle Gruppe') ;"
+----------+
| count(*) |
+----------+
| 62       |
+----------+


________________________________________________________
Executed in  709.37 millis    fish           external
   usr time  733.86 millis  205.00 micros  733.65 millis
   sys time   75.79 millis  667.00 micros   75.12 millis

@knutwannheden
Copy link
Author

Sounds like excellent progress! Thanks for the update.

@max-hoffman
Copy link
Contributor

Release 0.50.9 is out, thank you for the quality of your repros, and let me know when you find more issues. Keep in mind 1) we do not consider filters and index distributions in the process of costing join orders, and 2) we do not generate a transitive closure of join conditions yet, and those projects are probably post-2022 targets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants