Skip to content

[dataquery] Optimize filtering of valid candidates for large datasets #9344

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 24, 2024

Conversation

driusan
Copy link
Collaborator

@driusan driusan commented Sep 19, 2024

This optimizes the filtering of inaccessible candidates, building on PR#9334

Previously, the entire query was loaded into memory, and then manipulated in a way that made it easier to see which candidates are valid, then returned as an array by putting the valid ones in a temporary table and selecting from it.

With this change, instead of an array, a generator is used so that only one candidate (and its sessions) need to be loaded into memory at a time. This is done on an unbuffered connection, which necessitates a different database connection since only one query can be run at a time on a connection and the caller may make other queries before the generator finished.

This result should be significantly less memory required for filtering candidates.

@driusan driusan force-pushed the MoreEfficientCandidateFilter branch 2 times, most recently from 3450f73 to e9e2bcf Compare September 19, 2024 15:54
Copy link
Contributor

@regisoc regisoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, should the new db connection needs to be in lorisintance since it is only used in dataquery?

*
* @return \Database
*/
public function getNewDatabaseConnection() : \Database
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function useful in all places that use lorisinstance object?
Shouldn't it be placed only in the dataquery module query.class.inc because that is the only one using it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's useful any place that calls setBuffering(false) and then iterates over a query result because the connection can't be re-used until the iteration is finished.

It's the only one now, but there will be others in other PRs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(ie. #9347 is another example that will use it.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok yes, thanks. I will make a quick test today ideally.

Dave MacFarlane added 3 commits September 23, 2024 08:59
This optimizes the filtering of inaccessible candidates, building on
PR#9334

Previously, the entire query was loaded into memory, and then manipulated
in a way that made it easier to see which candidates are valid, then returned
as an array by putting the valid ones in a temporary table and selecting it.

With this change, instead of an array, a generator is used so that only one
candidate (and its sessions) need to be loaded into memory at a time. This is
done on an unbuffered connection, which necessitates a different database
connection since only one query can be run at a time on a connection and the
caller may make other queries before the generator finished.
@regisoc
Copy link
Contributor

regisoc commented Sep 24, 2024

@driusan tried several queries but I do not have something big enough to make a thorough testing. I think it is good noneless.

@driusan
Copy link
Collaborator Author

driusan commented Sep 24, 2024

I think the important thing is that you tested the queries. I've tested the performance on large datasets (albiet in combination with other optimizations..)

@driusan driusan merged commit 246e8fc into aces:main Sep 24, 2024
10 checks passed
ZhichGaming pushed a commit to ZhichGaming/Loris that referenced this pull request Nov 25, 2024
…aces#9344)

This optimizes the filtering of inaccessible candidates, building on
PR#9334

Previously, the entire query was loaded into memory, and then
manipulated in a way that made it easier to see which candidates are
valid, then returned as an array by putting the valid ones in a
temporary table and selecting from it.

With this change, instead of an array, a generator is used so that only
one candidate (and its sessions) need to be loaded into memory at a
time. This is done on an unbuffered connection, which necessitates a
different database connection since only one query can be run at a time
on a connection and the caller may make other queries before the
generator finished.

This result should be significantly less memory required for filtering
candidates.
driusan added a commit that referenced this pull request Feb 3, 2025
This updates DBRowProvisioner to remove the strval (and fixes #9335).

After doing that, it became clear that the inner class could be replaced
by a generator while still fulfilling the contract of returning a
`\Traversable`. Doing this made it more obvious that since #9334,
pselect could be used directly while maintaining the lazy evaluation
instead of bypassing the Database object to use the PDO directly.
However this required injecting a `LorisInstance` object to the
constructor to get the database connection.

With access to the `LorisInstance` object, the `DBRowProvisioner` can
also get a new database connection and disable query buffering to lazily
retrieve rows since PR #9344. This may theoretically improve performance
and memory usage on large provisioners but has not been tested.
(Performance did not decrease in raisinbread but it is not a large
dataset where it would be noticeable.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants