Implement issues de-duplication in raw SQL #2512

La0 · 2024-11-12T13:20:25Z

Even after the cleanup, migration 0014 (issue deduplication) was still slow on the testing DB.

First I noticed that we lacked an index on Issue.hash, so I added one. Query runtime improved significantly, but not enough: the migration would still have required a week of processing.

Then I thought of creating temporary tables to write only the needed rows, but while doing that I realised there was a simpler way: partition the issues by hash to identify the duplicates, update the IssueLink references in place so they point to the issue that is kept, and finally delete the duplicate issues.
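As a rough illustration of the three steps described above (not the actual migration code), here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`issue`, `issue_link`, `issue_id`), the choice of keeping the lowest `id` in each hash group, and the sample rows are all assumptions made for the example. It also assumes an SQLite build with window function support (3.25 or later).

```python
import sqlite3

# Hypothetical query: for every issue, find the id to keep within its hash group.
# Keeping the lowest id per hash is an assumption made for this sketch.
RANKED = "SELECT id, MIN(id) OVER (PARTITION BY hash) AS keep_id FROM issue"


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few duplicated hashes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE issue (id INTEGER PRIMARY KEY, hash TEXT NOT NULL);
        CREATE TABLE issue_link (
            id INTEGER PRIMARY KEY,
            issue_id INTEGER NOT NULL REFERENCES issue(id)
        );
        CREATE INDEX issue_hash_idx ON issue(hash);
        INSERT INTO issue (id, hash) VALUES (1, 'a'), (2, 'a'), (3, 'b'), (4, 'a');
        INSERT INTO issue_link (id, issue_id) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
        """
    )
    return conn


def deduplicate(conn: sqlite3.Connection) -> None:
    """Repoint links to the kept issue, delete duplicates, then enforce uniqueness."""
    with conn:  # commit on success, roll back on error
        # Step 1: update links in place so they reference the kept issue.
        conn.execute(
            f"""
            WITH ranked AS ({RANKED})
            UPDATE issue_link
            SET issue_id = (
                SELECT keep_id FROM ranked WHERE ranked.id = issue_link.issue_id
            )
            WHERE issue_id IN (SELECT id FROM ranked WHERE id <> keep_id)
            """
        )
        # Step 2: delete every issue that is not the kept one of its group.
        conn.execute(
            f"""
            WITH ranked AS ({RANKED})
            DELETE FROM issue
            WHERE id IN (SELECT id FROM ranked WHERE id <> keep_id)
            """
        )
        # Step 3: the unique constraint on the hash can now be applied.
        conn.execute("CREATE UNIQUE INDEX issue_hash_uniq ON issue(hash)")


if __name__ == "__main__":
    db = build_demo_db()
    deduplicate(db)
    print(db.execute("SELECT id, hash FROM issue ORDER BY id").fetchall())
    print(db.execute("SELECT id, issue_id FROM issue_link ORDER BY id").fetchall())
```

The sketch ignores any other constraint on the link table (for example a uniqueness rule involving the issue reference), which a real migration would have to account for before repointing rows.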

I'm currently running this on the testing DB and on a copy of the production DB. I'll report back with timings.

La0 · 2024-11-13T08:04:18Z

This ran on a copy of the production DB in 4h19min, which allowed the unique constraint on Issue.hash to be applied.

La0 · 2024-11-13T08:28:48Z

This also ran successfully on the Heroku testing DB.

La0 requested a review from vrigal November 12, 2024 13:20

La0 self-assigned this Nov 12, 2024

Implement issues de-duplication in raw SQL

ac9c9f1

La0 force-pushed the faster-0014 branch from edcdb47 to ac9c9f1 Compare November 13, 2024 08:33

La0 requested review from Archaeopteryx and marco-c and removed request for vrigal November 13, 2024 08:35

Archaeopteryx approved these changes Nov 13, 2024


Archaeopteryx merged commit cb146f3 into mozilla:master Nov 13, 2024
12 checks passed
