Fix issue where JSON diff would fail under specific circumstances. #8534

nicktobey · 2024-11-06T00:52:58Z

If all of the following are true during a JSON diff operation:

One document fits in a single chunk and the other doesn't
The location of the chunk boundary in the larger document is also present in the smaller document. (It doesn't correspond to a location that was added or removed)
The chunk boundary falls at the end of a value in the document (before the next key or comma/right brace/etc)

Then the differ would fail to advance the prolly tree cursor and would incorrectly see the larger document as corrupt.

This fixes that issue and improves the error messaging to make it seem less like the database is corrupt, since the error is much more likely to be a parsing bug like this one.

fulghum

LGTM!

nicktobey · 2024-11-06T01:14:16Z

To explain why this happened and why this PR fixes the issue:

The JSON differ sits on top of a prolly tree differ, fetching blocks from the two prolly trees and scanning them while keeping track of the current location in both documents. The JSON scanner takes a reference to the underlying prolly tree cursor and advances it if it hits the end of a block: this is a code smell because it breaks encapsulation for the prolly tree differ, but it was the best way to keep both data structures in sync.

In the event that one or both documents fit entirely within a single block, we take a slightly different code path: we don't need to use the prolly tree differ anymore, so we create the JSON scanner without taking a reference to the prolly tree cursor.

The problem arose when one of the documents fit in a single block while the other one didn't. In that case, we still didn't need to use the prolly tree differ, so we would instantiate both scanners without taking a reference to the differ's cursors. However, if the JSON scanner hit a chunk boundary in the larger document while it wasn't in the middle of computing a diff (both documents were the same at the location where the chunk boundary occurred), we would need to fetch the next block of the larger document, and we would use the cursor originally created for the prolly tree differ to do so. Since the JSON scanner never took a reference to this cursor, it never advanced it, and the cursor still pointed to the beginning of the document. The scanner would subsequently fetch the first block in the document while it was expecting the next block and try to resume scanning from there: the scanner would quickly detect that it was reading unexpected values and would incorrectly assume that the underlying storage was corrupt.

The workaround in the PR is to make sure that if only one of the documents being diffed fits in a single chunk, we still create a JSON scanner for the other document that has access to the original underlying cursor and can advance it. I'm not 100% satisfied: a cleaner solution might be to avoid referencing that cursor at all if we're not using the differ, but that would make the JSON differ's already complicated state machine even more complicated.

Fixing this has me a bit worried that there might be other bugs in here. In particular I'm not convinced that we always do the correct thing when only conditions (1) and (2) above hold, but I'd need to look at it more and try to come up with an exapmle case.

coffeegoddd · 2024-11-06T01:22:59Z

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`ad631c6`	ok	5937457

version	total_tests
`ad631c6`	5937457

correctness_percentage
100.0

nicktobey added 4 commits November 5, 2024 16:30

Add regression test for merging large doc with small doc.

143a1c6

Fix merging large doc with small doc.

cf28c79

Make json diff error message less scary.

bc98363

Add comments to json differ.

ad631c6

fulghum approved these changes Nov 6, 2024

View reviewed changes

coffeegoddd added the correctness_approved label Nov 6, 2024

nicktobey merged commit b241b58 into main Nov 6, 2024
34 of 35 checks passed

nicktobey deleted the nicktobey/json_fix branch November 6, 2024 02:53

BrewTestBot mentioned this pull request Nov 6, 2024

dolt 1.43.13 Homebrew/homebrew-core#196806

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix issue where JSON diff would fail under specific circumstances. #8534

Fix issue where JSON diff would fail under specific circumstances. #8534

nicktobey commented Nov 6, 2024

fulghum left a comment

nicktobey commented Nov 6, 2024

coffeegoddd commented Nov 6, 2024

Fix issue where JSON diff would fail under specific circumstances. #8534

Fix issue where JSON diff would fail under specific circumstances. #8534

Conversation

nicktobey commented Nov 6, 2024

fulghum left a comment

Choose a reason for hiding this comment

nicktobey commented Nov 6, 2024

coffeegoddd commented Nov 6, 2024