Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move away from orConditionGroup to UNION-ized queries #1068

Closed
wants to merge 8 commits into from

Conversation

adam-vessey
Copy link

GitHub Issue: (link)

  • Other Relevant Links (Google Groups discussion, related pull requests,
    Release pull requests, etc.)

What does this Pull Request do?

Issues reporting slowness in queries that was being patched by effectively avoiding the orConditionGroup, so, it seems we want to avoid them. At the same time, submitting the queries separately can be avoided by constructing the set of results by UNION-ing all the queries together so, let's make it so.

What's new?

A in-depth description of the changes made by this PR. Technical details and
possible side effects.

  • Changes x feature to such that y
  • Added x
  • Removed y
  • Does this change add any new dependencies?
  • Does this change require any other modifications to be made to the repository
    (i.e. Regeneration activity, etc.)?
  • Could this change impact execution of existing code?

How should this be tested?

A description of what steps someone could take to:

  • Reproduce the problem you are fixing (if applicable)
  • Test that the Pull Request does what is intended.
  • Please be as detailed as possible.
  • Good testing instructions help get your PR completed faster.

Documentation Status

  • Does this change existing behaviour that's currently documented?
  • Does this change require new pages or sections of documentation?
  • Who does this need to be documented for?
  • Associated documentation pull request(s): ___ or documentation issue ___

Additional Notes:

Any additional information that you think would be helpful when reviewing this
PR.

Interested parties

Tag (@ mention) interested parties or, if unsure, @Islandora/committers

Was looking at other callers of the original method; however, missed
that the caller as a DGI-specific module. No need to roll such a method
in the core code... not that it would really hurt per-se, just not
wanting to add method for it at the moment.
Also, handle other return-types.
@joecorall joecorall closed this Jan 22, 2025
@joecorall joecorall reopened this Jan 22, 2025
@joecorall
Copy link
Member

I tested this PR locally and 100 calls to getTermForUri is still taking over 10s

time drush scr scripts/1067.php 
real    0m 12.29s
user    0m 0.56s
sys     0m 0.35s

for getTermForUri this PR creates a query like

SELECT base_table.revision_id AS revision_id, base_table.tid AS tid
FROM
taxonomy_term_data base_table
INNER JOIN taxonomy_term_field_data taxonomy_term_field_data ON taxonomy_term_field_data.tid = base_table.tid
WHERE taxonomy_term_field_data.tid IN (SELECT f.entity_id AS entity_id
FROM
taxonomy_term__field_authority_link f
WHERE f.field_authority_link_uri = 'http://purl.org/coar/resource_type/c_18cc' UNION SELECT f.entity_id AS entity_id
FROM
taxonomy_term__field_external_uri f
WHERE f.field_external_uri_uri = 'http://purl.org/coar/resource_type/c_18cc')
GROUP BY base_table.revision_id, base_table.tid
LIMIT 1 OFFSET 0;

Which will have the same result of running N number of queries, where N is the number of external URI fields in the site config. Which was the original issue with #1067 (comment)

Just yeah, I find that I'm really not a fan of the idea of starting to make arbitrarily many queries as more fields are introduced

@adam-vessey
Copy link
Author

Which will have the same result of running N number of queries, where N is the number of external URI fields in the site config.

Unsure what you're trying to say. The same result is expected; however, it's the actual execution of the query with which I'm concerned. Like, sure there will be N parts under the UNION; however, the query as a whole is submitted to the server as a single query, with only a single pass of the query tagging. Building and tagging multiple queries, if there were actually other conditions being added for access control or whatever could start to overshadow.

That said, I am unable to see the same query performance you're seeing between any of:

  • our internal envs (though we use PostgreSQL)
  • my ddev branch of islandora-starter-site (https://github.com/adam-vessey/islandora-starter-site/tree/feature/ddev; should be able to use with a ddev delete -Oy ; ddev full-install... IIRC, there's presently issues generating derivatives, but it uses MariaDB, more in-line with what would be running elsewhere, possibly?)
    adam@compy:~/repos/islandora-starter-site$ git log
    commit 6449ba063b86ff19e4fe2496a90ed32796d08289 (HEAD -> feature/ddev, me/feature/ddev)
    Merge: a3f782f 9bf6d90
    Author: Adam Vessey <...>
    Date:   Wed Dec 11 14:38:26 2024 -0400
    [...]
    adam@compy:~/repos/islandora-starter-site$ ddev delete -Oy ; ddev full-install
    [...]
    adam@compy:~/repos/islandora-starter-site$ /usr/bin/time ddev drush php:eval "
    \$i = 0;
    while (++\$i < 100) {
      \$term = \Drupal::service('islandora.utils')->getTermForUri('http://purl.org/coar/resource_type/c_18cc');
      if (\$term === NULL) {
        print 'Failed to query term
    ';
        exit(1);
      }
    }
    print 'Finished
    ';
    "
    Finished
            0.86 real         0.08 user         0.12 sys
    adam@compy:~/repos/islandora-starter-site$
    
  • a fresh isle-dc env (@ origin/development, commit current at Islandora-Devops/isle-dc@57bbd3f ): (adding in print statements for paranoia, just to be sure results were being returned)
    adam@DockerBusiness:~/repos/isle-dc$ git log -n1
    commit 57bbd3f21da96385243bda364322ec26b59bb176 (HEAD -> development, origin/development, origin/HEAD)
    Author: Don Richards <[email protected]>
    Date:   Tue Dec 10 18:55:55 2024 -0500
    [...]
    adam@DockerBusiness:~/repos/isle-dc$ make starter
    [...]
    make[2]: Leaving directory '/home/adam/repos/isle-dc'
    make wait-for-drupal-locally
    make[2]: Entering directory '/home/adam/repos/isle-dc'
    make[2]: Leaving directory '/home/adam/repos/isle-dc'
    make[1]: Leaving directory '/home/adam/repos/isle-dc'
    adam@DockerBusiness:~/repos/isle-dc$ docker exec -it isle-dc-drupal-1 /bin/bash
    27eed0f0f257:/var/www/drupal# /usr/bin/time drush php:eval "
    \$i = 0;
    while (++\$i < 100) {
      \$term = \Drupal::service('islandora.utils')->getTermForUri('http://purl.org/coar/resource_type/c_18cc');
      if (\$term === NULL) {
        print 'Failed to query term
    ';
        exit(1);
      }
    }
    print 'Finished
    ';
    "
    Finished
    real	0m 0.56s
    user	0m 0.24s
    sys	0m 0.09s
    27eed0f0f257:/var/www/drupal#
    exit
    adam@DockerBusiness:~/repos/isle-dc$
    

The islandora-starter-site and isle-dc tests timings here were using the current (non-UNION) code.

@joecorall
Copy link
Member

joecorall commented Jan 23, 2025

however, it's the actual execution of the query with which I'm concerned. ... Building and tagging multiple queries, if there were actually other conditions being added for access control or whatever could start to overshadow.

Oh, I see. Yeah, that's reasonable. I thought you were concered about running N number of queries vs just one in a general sense. Yeah, makes sense since Drupal can tack on a lot of extras on a per-query basis.


I am unable to see the same query performance you're seeing between any of:

With this script

<?php

$i = 0;
while (++$i < 100) {
  $term = \Drupal::service('islandora.utils')->getTermForUri("http://purl.org/coar/resource_type/c_18cc");
}

on a fresh install I get

That's because there's pretty much nothing in taxonomy_term_data and taxonomy_term_field_data. I was testing this against a repo with 105K taxonomy terms.

On a fresh isle-site-template install I imported taxonomy_term_* from my prod database and ran the drush script again and got


So I think we still have a performance issue with this PR as-is.

@adam-vessey
Copy link
Author

adam-vessey commented Jan 23, 2025

So I think we still have a performance issue with this PR as-is.

I mean, sure, but between either here or the orConditionGroup, I feel like the query is as a whole more in-line with how Drupal expects to run queries. For this method in particular, where we expect a single term for a URI, I find myself thinking of the repository design pattern? Rather tempting to break the lookup out to another service (with caching set appropriately, even just caching statically)?

On the other hand, maybe the method in question should be avoided, since it doesn't really confine to a given taxonomy and/or field, which is entirely the root cause of the complex query? For example, if you're looking for a model term, you could query more directly against the singular field of the model taxonomy, instead of calling against this method that queries against all fields authority link/external_uri fields for all taxonomies. (this already being a part of the implementation, just highlighting that for the signature of the method, either the present orConditionGroup or UNION implementations seem more idiomatic than performing separate discrete queries)

Somewhat along the same line, the idea of using the same URI being possible, but varying access control on the terms meaning different users can get different terms (from different taxonomies!) when they perform a lookup seems like a loaded foot-gun.

Copy link
Member

@joecorall joecorall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly agree with your concerns, and the solution(s) probably need to go into a 3+ release (and we should carve out an issue for those concerns).

As for today, for those running on 2.x: Given the getTermForUri function is on a hot path I think we should go for faster is better and use #1067 and consider adding a similar fix to the media/file fix you added in this PR.


Somewhat along the same line, the idea of using the same URI being possible, but varying access control on the terms meaning different users can get different terms (from different taxonomies!) when they perform a lookup seems like a loaded foot-gun.

This would be a great thing for a status report warning @ /admin/reports/status and/or in documentation

@adam-vessey adam-vessey dismissed joecorall’s stale review January 24, 2025 13:24

Is non-sensical on a draft PR accompanying the recommendation to adopt another PR?

@adam-vessey adam-vessey deleted the fix/union-query branch January 24, 2025 13:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants