
[Rollup] Replace ID generation scheme to avoid collisions #32372

@polyfractal

Description

In pre-release versions of Rollup we used to concatenate all the rollup doc keys together to form the doc's ID, but at some point this was changed to a rolling CRC32. I don't recall the specifics about why we did this; probably just to save space and avoid giant IDs.

Unfortunately, this was poorly thought out: 32-bit IDs lead to collisions very quickly as the doc count grows. Due to the birthday problem, an existing Rollup index with over 200k docs has a very high chance of at least one collision, meaning a leaf node of a rollup interval overwrote a previous leaf node. Ultimately, this leads to data loss and subtly incorrect results. It also limits the total number of docs the rollup index can safely hold.
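For a rough sense of scale: treating the CRC32 output as uniformly random, the birthday approximation gives P(collision) ≈ 1 − exp(−n² / (2·2³²)) for n docs. A quick back-of-the-envelope check (illustrative only, not from the Rollup codebase):

```java
// Birthday-bound estimate for n docs hashed into a 32-bit ID space.
public class CollisionOdds {
    public static void main(String[] args) {
        double idSpace = Math.pow(2, 32);
        for (long n : new long[] {50_000, 100_000, 200_000, 500_000}) {
            double p = 1.0 - Math.exp(-((double) n * n) / (2.0 * idSpace));
            System.out.printf("n=%,d -> P(>=1 collision) ~= %.3f%n", n, p);
        }
    }
}
```

At 200k docs this works out to roughly a 99% chance of at least one collision, which is what motivates the plan below.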

Our current plan to fix this is:

  1. Move back to concatenating the job ID and composite keys together to form a unique ID

    1. Insert a delimiter after the Job ID to guarantee that collisions can't be created (accidentally or otherwise) based on the job name.
    2. Guarantee that the date histo key is immediately after the job ID. Due to date format restrictions, this means we don't need to blacklist the delimiter in job names... just ensure it can't appear in a valid date pattern.
    3. If the ID is >= 32k bytes (the hard limit for a Lucene term), fall back to hashing with a sufficiently large hash (see the sketch after this list).
  2. Add a flag to the persistent task's state that indicates whether the job is running on the old or new ID scheme

    1. If the indexer is still running, continue to use the old ID scheme.
    2. Whenever the job checkpoints (on stop, failure, or periodically), toggle the flag and switch to the new scheme. Since the flag is also persisted, any future triggers of the job will continue to use the new ID scheme.
    3. All new jobs start with the flag toggled and use the new ID scheme
  3. Bump the internal Rollup version, so that we can better diagnose reports of problems in the future.

  4. Benchmark the size of the concatenated IDs to make sure they don't bloat the index too much. Prefix compression should be strong for these IDs, but it's good to double-check. If it's really bad, we can just hash the values directly instead and skip the part where we cut over at 32k.
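To make item 1 concrete, here is a minimal sketch of what the ID construction could look like. The `$` delimiter, the class/method names, and the use of MD5 (purely as a stand-in for "a sufficiently large hash") are all assumptions for illustration, not the actual implementation:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical sketch: jobId + delimiter + dateHistoKey + delimiter + remaining keys.
// The delimiter only needs to be a character that can't appear in a valid date
// pattern, so job names never have to be restricted.
public class RollupIdSketch {
    private static final char DELIMITER = '$';          // assumed delimiter
    private static final int LUCENE_MAX_TERM_BYTES = 32_766; // Lucene's hard term limit

    public static String buildId(String jobId, String dateHistoKey, List<String> otherKeys) {
        StringBuilder sb = new StringBuilder(jobId).append(DELIMITER).append(dateHistoKey);
        for (String key : otherKeys) {
            sb.append(DELIMITER).append(key);
        }
        String id = sb.toString();
        if (id.getBytes(StandardCharsets.UTF_8).length >= LUCENE_MAX_TERM_BYTES) {
            // Too big for a Lucene term: fall back to a 128-bit hash of the full key.
            return jobId + DELIMITER + toHex(md5(id));
        }
        return id;
    }

    private static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is guaranteed by the JDK
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

For example, `buildId("my-job", "1533081600000", List.of("host-a", "us-east"))` would produce `my-job$1533081600000$host-a$us-east`. Keeping the date histo key right after the job ID also means every ID for a job shares a common prefix, which is what should make prefix compression effective (the thing item 4 benchmarks).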

Changing the ID scheme in the middle of a job is acceptable as long as we have checkpointed our position. Deterministic IDs are only required so that, if we are interrupted before we reach the next checkpoint, we can roll back to the last checkpoint and simply overwrite the existing docs. So as long as we change the ID scheme at a checkpoint, we know there are no "partial results" that may need overwriting.
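A minimal sketch of that checkpoint-time cutover from item 2, assuming a persisted state object with a boolean flag (all names hypothetical, not the actual persistent-task plumbing):

```java
// Hypothetical sketch of the checkpoint-time cutover.
class JobStateSketch {
    private boolean newIdScheme; // persisted alongside the checkpoint

    boolean usesNewIdScheme() {
        return newIdScheme;
    }

    // Called on stop, failure, or a periodic checkpoint. Because the flag only
    // flips at a checkpoint, there is never a mix of old- and new-scheme docs
    // in an uncheckpointed window that a rollback would need to overwrite.
    void onCheckpoint() {
        newIdScheme = true; // all future triggers of the job use the new scheme
        // ... persist this state together with the rest of the checkpoint ...
    }
}
```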

We'll deal with 7.0 upgrade issues in a followup.

/cc @jimczi @colings86 @pcsanwald @clintongormley

Labels: :StorageEngine/Rollup, >bug
