Description
In pre-release versions of Rollup we used to concatenate all the rollup doc keys together to form the doc's ID, but at some point this was changed to a rolling CRC32. I don't recall the specifics about why we did this, probably just to save space and avoid giant IDs.
Unfortunately, this was poorly thought out. 32-bit IDs lead to collisions very quickly as the doc count grows: by the birthday bound, an existing Rollup index with over 200k docs has a very high chance (roughly 99%) of at least one collision, meaning a leaf node of a rollup interval overwrote a previous leaf node. Ultimately, this leads to data loss and subtly incorrect results. It also limits the total number of docs the rollup index can hold.
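For a sense of scale, the birthday bound gives P(collision) ~ 1 - e^(-n^2 / (2 * 2^32)) for n random 32-bit IDs. A quick, self-contained Java snippet (illustrative arithmetic only, not Rollup code) evaluates it for a few index sizes:

```java
// Back-of-the-envelope birthday-problem check: probability of at least
// one collision among n uniformly random 32-bit IDs.
public class CollisionOdds {
    public static void main(String[] args) {
        double idSpace = Math.pow(2, 32);
        for (long n : new long[] {50_000, 100_000, 200_000, 500_000}) {
            double p = 1.0 - Math.exp(-((double) n * n) / (2.0 * idSpace));
            System.out.printf("%,d docs -> P(collision) ~ %.1f%%%n", n, p * 100);
        }
    }
}
```

At 200k docs this already comes out around 99%, which is where the "very high chance" figure above comes from.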
Our current plan to fix this is:
- Move back to concatenating the job ID and composite keys together to form a unique ID (see the sketch after this list).
  - Insert a delimiter after the job ID to guarantee that collisions can't be created (accidentally or otherwise) via the job name.
  - Guarantee that the date histogram key comes immediately after the job ID. Due to date format restrictions, this means we don't need to blacklist the delimiter in job names; we just have to ensure it can't appear in a valid date pattern.
  - If the ID is >= 32k bytes (Lucene's hard limit for a single term), fall back to hashing with a sufficiently large hash.
- Add a flag to the persistent task's state that indicates whether the job is running on the old or the new ID scheme (see the checkpoint sketch below).
  - If the indexer is still running, continue to use the old ID scheme.
  - Whenever the job checkpoints (on stop, failure, or periodically), toggle the flag and switch to the new scheme. Since the flag is also persisted, any future triggers of the job will continue to use the new IDs.
  - All new jobs start with the flag toggled and use the new ID scheme.
- Bump the internal Rollup version so that we have some diagnostic power for future problem reports.
- Benchmark the size of concatenated IDs to make sure they don't bloat the index too much. Prefix compression should be strong for these IDs, but it's good to double check. If it's really bad, we can just hash the values directly instead and skip the 32k cutover.
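For illustration, here is a minimal sketch of what the concatenation-plus-fallback scheme could look like. The `$` delimiter, the key ordering, the `MAX_ID_BYTES` constant, and the SHA-256 fallback are all assumptions made for the sketch, not the actual implementation:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.List;

// Illustrative sketch only; delimiter choice, key ordering, and hash
// function are assumptions, not the shipped implementation.
public class RollupDocIdSketch {
    private static final char DELIMITER = '$';      // assumed; must be illegal in date patterns
    private static final int MAX_ID_BYTES = 32_766; // Lucene's hard limit on a single term

    static String buildId(String jobId, String dateHistoKey, List<String> otherKeys) throws Exception {
        // Job ID first, then the date histogram key, then the remaining composite keys.
        StringBuilder sb = new StringBuilder(jobId).append(DELIMITER).append(dateHistoKey);
        for (String key : otherKeys) {
            sb.append(DELIMITER).append(key);
        }
        String id = sb.toString();
        if (id.getBytes(StandardCharsets.UTF_8).length < MAX_ID_BYTES) {
            return id;
        }
        // Fall back to a wide hash so the ID always fits; SHA-256 stands in
        // for whatever "sufficiently large hash" is ultimately chosen.
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(id.getBytes(StandardCharsets.UTF_8));
        return jobId + DELIMITER + Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
    }
}
```

Keeping the job ID as a literal prefix even in the hashed case preserves per-job uniqueness and keeps prefix compression effective.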
Changing the ID scheme in the middle of a job is acceptable as long as we have checkpointed our position. Deterministic IDs are only required so that, if we are interrupted before we get to the next checkpoint, we can roll back to the last checkpoint and just overwrite existing docs. So as long as we change the ID scheme on checkpoint, we know there are no "partial results" that may need overwriting.
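A hedged sketch of that checkpoint cutover rule; the class, field, and method names here are invented for illustration, and the real persistent-task state is more involved:

```java
// Hypothetical persistent-task state; the real Rollup job status differs.
class RollupJobStateSketch {
    private boolean useNewIdScheme; // persisted with the rest of the task state

    // Called on stop, failure, or the periodic checkpoint. Because the
    // position is durably committed here, there can be no "partial results"
    // written under the old scheme that still need overwriting, so it is
    // safe to flip to the new IDs for everything written afterwards.
    void onCheckpoint(Object committedPosition) {
        persist(committedPosition, /* useNewIdScheme= */ true);
        useNewIdScheme = true;
    }

    boolean useNewIdScheme() {
        return useNewIdScheme;
    }

    private void persist(Object position, boolean newScheme) {
        // stand-in for writing the persistent task state
    }
}
```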
We'll deal with 7.0 upgrade issues in a followup.