Implement issues de-duplication in raw SQL #2512

Merged
merged 1 commit into from
Nov 13, 2024

Conversation

La0
Collaborator

@La0 La0 commented Nov 12, 2024

Even after the cleanup, migration 0014 (issue deduplication) was still slow on the testing DB.

First, I noticed that we lacked an index on Issue.hash, so I added one. Query runtime improved significantly, but not enough: the migration would still have required a week of processing.

Then I thought of creating temporary tables to write only the needed rows, but while doing that I realised there was a simpler way: partition the issues by hash to identify duplicates, update the IssueLink references in place so they point to the Issue kept for each hash, and finally delete the duplicate issues.

I'm currently running this on the testing DB and on a copy of the production DB. I'll report back with timings.
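
For readers following along, the partition-based approach described above can be sketched roughly as follows. This is a minimal illustration, not the SQL from the merged commit: it uses Python's standard `sqlite3` module so it can be run locally, and the table and column names (`issue`, `issue_link`, `issue_id`, `hash`), the index name, and the rule of keeping the lowest `id` in each group are all assumptions made for the example.

```python
import sqlite3


def deduplicate_issues(conn: sqlite3.Connection) -> int:
    """Collapse issues sharing a hash onto one row; return how many were deleted.

    Hypothetical schema: issue(id, hash) and issue_link(id, issue_id).
    The kept row is assumed to be the lowest id in each hash group.
    """
    before = conn.execute("SELECT COUNT(*) FROM issue").fetchone()[0]

    # Step 1: repoint links from duplicate issues to the kept issue.
    # Partitioning by hash gives every issue the id of the first row in its group.
    conn.execute(
        """
        UPDATE issue_link
        SET issue_id = (
            SELECT keeper_id FROM (
                SELECT id,
                       FIRST_VALUE(id) OVER (PARTITION BY hash ORDER BY id) AS keeper_id
                FROM issue
            ) AS ranked
            WHERE ranked.id = issue_link.issue_id
        )
        WHERE issue_id IN (
            SELECT id FROM (
                SELECT id,
                       ROW_NUMBER() OVER (PARTITION BY hash ORDER BY id) AS rn
                FROM issue
            ) AS numbered
            WHERE rn > 1
        )
        """
    )

    # Step 2: delete every issue that is not the first row of its hash group.
    conn.execute(
        """
        DELETE FROM issue
        WHERE id IN (
            SELECT id FROM (
                SELECT id,
                       ROW_NUMBER() OVER (PARTITION BY hash ORDER BY id) AS rn
                FROM issue
            ) AS numbered
            WHERE rn > 1
        )
        """
    )

    # Step 3: with duplicates gone, uniqueness on hash can be enforced.
    conn.execute("CREATE UNIQUE INDEX issue_hash_unique ON issue (hash)")
    conn.commit()

    after = conn.execute("SELECT COUNT(*) FROM issue").fetchone()[0]
    return before - after
```

The order matters in this sketch: links are repointed before the duplicates are deleted, so no link is left referencing a removed issue. A production database would likely use its own dialect for these statements, and the window-function syntax shown here needs SQLite 3.25 or later.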

@La0 La0 requested a review from vrigal November 12, 2024 13:20
@La0 La0 self-assigned this Nov 12, 2024
@La0
Collaborator Author

La0 commented Nov 13, 2024

This ran on a copy of the production DB in 4h19min, allowing the unique constraint on Issue.hash to be applied.

@La0
Collaborator Author

La0 commented Nov 13, 2024

This also ran successfully on the Heroku testing DB.

@La0 La0 requested review from Archaeopteryx and marco-c and removed request for vrigal November 13, 2024 08:35
@Archaeopteryx Archaeopteryx merged commit cb146f3 into mozilla:master Nov 13, 2024
12 checks passed