Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue where JSON diff would fail under specific circumstances. #8534

Merged
merged 4 commits into from
Nov 6, 2024

Conversation

nicktobey
Copy link
Contributor

If all of the following are true during a JSON diff operation:

  1. One document fits in a single chunk and the other doesn't
  2. The location of the chunk boundary in the larger document is also present in the smaller document. (It doesn't correspond to a location that was added or removed)
  3. The chunk boundary falls at the end of a value in the document (before the next key or comma/right brace/etc)

Then the differ would fail to advance the prolly tree cursor and would incorrectly see the larger document as corrupt.

This fixes that issue and improves the error messaging to make it seem less like the database is corrupt, since the error is much more likely to be a parsing bug like this one.

Copy link
Contributor

@fulghum fulghum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@nicktobey
Copy link
Contributor Author

To explain why this happened and why this PR fixes the issue:

The JSON differ sits on top of a prolly tree differ, fetching blocks from the two prolly trees and scanning them while keeping track of the current location in both documents. The JSON scanner takes a reference to the underlying prolly tree cursor and advances it if it hits the end of a block: this is a code smell because it breaks encapsulation for the prolly tree differ, but it was the best way to keep both data structures in sync.

In the event that one or both documents fit entirely within a single block, we take a slightly different code path: we don't need to use the prolly tree differ anymore, so we create the JSON scanner without taking a reference to the prolly tree cursor.

The problem arose when one of the documents fit in a single block while the other one didn't. In that case, we still didn't need to use the prolly tree differ, so we would instantiate both scanners without taking a reference to the differ's cursors. However, if the JSON scanner hit a chunk boundary in the larger document while it wasn't in the middle of computing a diff (both documents were the same at the location where the chunk boundary occurred), we would need to fetch the next block of the larger document, and we would use the cursor originally created for the prolly tree differ to do so. Since the JSON scanner never took a reference to this cursor, it never advanced it, and the cursor still pointed to the beginning of the document. The scanner would subsequently fetch the first block in the document while it was expecting the next block and try to resume scanning from there: the scanner would quickly detect that it was reading unexpected values and would incorrectly assume that the underlying storage was corrupt.

The workaround in the PR is to make sure that if only one of the documents being diffed fits in a single chunk, we still create a JSON scanner for the other document that has access to the original underlying cursor and can advance it. I'm not 100% satisfied: a cleaner solution might be to avoid referencing that cursor at all if we're not using the differ, but that would make the JSON differ's already complicated state machine even more complicated.

Fixing this has me a bit worried that there might be other bugs in here. In particular I'm not convinced that we always do the correct thing when only conditions (1) and (2) above hold, but I'd need to look at it more and try to come up with an exapmle case.

@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
ad631c6 ok 5937457
version total_tests
ad631c6 5937457
correctness_percentage
100.0

@nicktobey nicktobey merged commit b241b58 into main Nov 6, 2024
34 of 35 checks passed
@nicktobey nicktobey deleted the nicktobey/json_fix branch November 6, 2024 02:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants