Suboptimal (O(D^2)) memory usage #396

quark-zju · 2023-03-03T23:16:46Z

It seems https://github.com/kpdecker/jsdiff/blob/3b654c2ed7d5262ed9946de841ad8dae990286c7/src/diff/base.js does not implement "4b. A Linear Space Refinement" mentioned in Myers' paper. So the memory usage is O(D^2). From the code it looks like O(len(bestPath) * len(components in bestPath)) might be that complexity.

This makes it practically too slow to complete diff calculation if there are many changes. For example, comparing the prettified Hearthstone cards.json in different languages wouldn't complete in 20 minutes using jsdiff:

Reproduce:

# Prepare test data
wget https://api.hearthstonejson.com/v1/168129/enUS/cards.json -O 1
wget https://api.hearthstonejson.com/v1/168129/zhCN/cards.json -O 2
python3 -m json.tool < 1 > a
python3 -m json.tool < 2 > b

// a.js - Calculate diff
const fs = require('fs')
const diff = require('diff');

const a = fs.readFileSync('a', {encoding: 'utf-8'});
const b = fs.readFileSync('b', {encoding: 'utf-8'});

console.log(`pid: ${process.pid}`);
console.log(diff.diffLines(a, b).length);

Profiler shows that nodejs spent 70% time on GC:

As a comparison, it seems diff-match-patch implements the linear space refinement and can completes the above test case in ~24s. (For reference, git diff --stat --minimal completes in ~0.3s).

The text was updated successfully, but these errors were encountered:

Summary: Before this change, there are 3 Myers diff algorithms used in the dependency tree: - diff-match-patch (1.0.5) - diff (4.0.1) - diff-sequences (via jest -> jest-diff -> diff-sequences) We'd like to simplify the dependency tree. The short answer is: - Use `diff-sequences`, or `jest-diff` which uses `diff-sequences` internally. For best performance, do: - Strip common prefix and suffix. - Make line comparison O(1), avoid `line1 === line2` which can be O(line length). - Consider skipping "cleanup" in `jest-diff` for long input. ---- Long answer of picking a diff library: I wrote a benchmark script to get some idea about their performance: const fs = require('fs') const dmp = new (require('diff-match-patch').diff_match_patch)(); const diff = require('diff'); const ds = require('diff-sequences').default; const jd = require('jest-diff'); dmp.Diff_Timeout = 120; // Diff functions. Output format: Chunk[] // Chunk is one of: // [0, n]: n common lines (same on both side) // [-1, n]: n left-side-only lines // [1, n]: n right-side-only lines function diff1(chars1, chars2) { return dmp.diff_main(chars1, chars2).map(v => [v[0], v[1].length]); } function diff1a(chars1, chars2) { return dmp.diff_main(chars1, chars2, false).map(v => [v[0], v[1].length]); } function diff2(chars1, chars2) { return diff.diffChars(chars1, chars2).map(v => { const d = v.added ? 1 : (v.removed ? -1 : 0); return [d, v.count]; }); } function diff3(chars1, chars2) { function isCommon(ai, bi) { return chars1[ai] == chars2[bi]; } const r = []; let lastA = 0, lastB = 0; function foundSequence(n, na, nb) { if (na > lastA) { r.push([-1, na - lastA]); lastA = na; } if (nb > lastB) { r.push([1, nb - lastB]); lastB = nb; } if (n > 0) { r.push([0, n]); lastA += n; lastB += n; } } ds(chars1.length, chars2.length, isCommon, foundSequence); foundSequence(0, chars1.length, chars2.length); return r; } function diff3a(chars1, chars2) { return jd.diffStringsRaw(chars1, chars2, false).map((d) => [d[0], d[1].length]); } function diff3b(chars1, chars2) { return jd.diffStringsRaw(chars1, chars2, true).map((d) => [d[0], d[1].length]); } function bench(a, b) { const {chars1, chars2} = dmp.diff_linesToChars_(a, b); function stringify(obj) { if (obj.length > 20) { return `${obj.length} items`; } else { return JSON.stringify(obj); } } [ ['diff-match-patch', diff1], ['diff-match-patch (checklines=false)', diff1a], ['diff-sequences', diff3], ['jest-diff (diff-sequences), no cleanup', diff3a], ['jest-diff (diff-sequences), with cleanup', diff3b], ['jsdiff', diff2], ].forEach(([name, diffFunc]) => { // node --expose_gc if (global.gc) { gc(); } const label = ` ${name}`; console.time(label); console.log(' ', stringify(diffFunc(chars1, chars2))); console.timeEnd(label); }); } let a, b; console.log('\nwith common prefix and suffix 1'); a = 'aaaaaaa\n'.repeat(50000) + 'bbbb\n' + 'dddd\n'.repeat(50000); b = 'aaaaaaa\n'.repeat(50000) + 'cccc\n' + 'dddd\n'.repeat(50000); bench(a, b); console.log('\nwith common prefix and suffix 2'); a = 'aaaaaaa\n'.repeat(50000) + 'bbbbbbb\n' + 'dddd\n'.repeat(50000); b = 'aaaaaaa\n'.repeat(50100) + 'cccc\n' + 'dddd\n'.repeat(49900); bench(a, b); console.log('\nwithout common prefix or suffix 1'); a = 'c\n' + 'aaaaaaa\n'.repeat(50000) + 'dddd\n'.repeat(50000); b = 'aaaaaaa\n'.repeat(50000) + 'dddd\n'.repeat(50100) + 'z\n'; bench(a, b); console.log('\nwithout common prefix or suffix 2'); a = 'cccc\n' + 'aaaaaaa\n'.repeat(50000) + 'bbbbbbb\n' + 'dddd\n'.repeat(50000) + 'z\n'; b = 'aaaaaaa\n'.repeat(50100) + 'cccc\n' + 'dddd\n'.repeat(49900) + 'z\ny\n'; bench(a, b); // Hearthstone cards.json in different languages. // This is somewhat challenging since many lines are changed. // wget https://api.hearthstonejson.com/v1/168129/enUS/cards.json -O 1 // wget https://api.hearthstonejson.com/v1/168129/zhCN/cards.json -O 2 // python3 -m json.tool < 1 > 1.json // python3 -m json.tool < 2 > 2.json console.log('\ncards.json with different languages'); a = fs.readFileSync('1.json', {encoding: 'utf-8'}); b = fs.readFileSync('2.json', {encoding: 'utf-8'}); bench(a, b); The output looks like: with common prefix and suffix 1 [[0,50000],[-1,1],[1,1],[0,50000]] diff-match-patch: 5.073ms [[0,50000],[-1,1],[1,1],[0,50000]] diff-match-patch (checklines=false): 0.481ms [[0,50000],[-1,1],[1,1],[0,50000]] diff-sequences: 7.589ms [[0,50000],[-1,1],[1,1],[0,50000]] jest-diff (diff-sequences), no cleanup: 10.915ms [[0,50000],[-1,1],[1,1],[0,50000]] jest-diff (diff-sequences), with cleanup: 10.588ms [[0,50000],[-1,1],[1,1],[0,50000]] jsdiff: 22.664ms with common prefix and suffix 2 [[0,50000],[-1,101],[1,101],[0,49900]] diff-match-patch: 10.688ms [[0,50000],[-1,101],[1,101],[0,49900]] diff-match-patch (checklines=false): 2.619ms [[0,50000],[-1,101],[1,101],[0,49900]] diff-sequences: 12.687ms [[0,50000],[-1,101],[1,101],[0,49900]] jest-diff (diff-sequences), no cleanup: 11.055ms [[0,50000],[-1,101],[1,101],[0,49900]] jest-diff (diff-sequences), with cleanup: 4.356ms [[0,50000],[-1,1],[1,101],[0,49900],[-1,100]] jsdiff: 59.359ms without common prefix or suffix 1 [[-1,1],[0,100000],[1,101]] diff-match-patch: 632.863ms [[-1,1],[0,100000],[1,101]] diff-match-patch (checklines=false): 607.796ms [[-1,1],[0,50000],[1,51],[0,50000],[1,50]] diff-sequences: 12.366ms [[-1,1],[0,50000],[1,51],[0,50000],[1,50]] jest-diff (diff-sequences), no cleanup: 11.096ms [[-1,1],[0,100000],[1,51],[1,50]] jest-diff (diff-sequences), with cleanup: 1.029s [[-1,1],[0,100000],[1,101]] jsdiff: 13.163ms without common prefix or suffix 2 [[-1,1],[0,50000],[-1,101],[1,101],[0,49901],[1,1]] diff-match-patch: 2.773s [[-1,1],[0,50000],[-1,101],[1,101],[0,49901],[1,1]] diff-match-patch (checklines=false): 1.402s [[-1,1],[0,50000],[-1,101],[1,101],[0,49901],[1,1]] diff-sequences: 22.216ms [[-1,1],[0,50000],[-1,101],[1,101],[0,49901],[1,1]] jest-diff (diff-sequences), no cleanup: 20.546ms [[-1,1],[0,50000],[-1,101],[1,101],[0,49901],[1,1]] jest-diff (diff-sequences), with cleanup: 19.222ms [[-1,1],[0,50000],[-1,1],[1,101],[0,49900],[-1,100],[0,1],[1,1]] jsdiff: 33.82ms cards.json with different languages 67781 items diff-match-patch: 1:04.122 (m:ss.mmm) 57514 items diff-match-patch (checklines=false): 2:00.283 (m:ss.mmm) 67781 items diff-sequences: 1:09.486 (m:ss.mmm) 67781 items jest-diff (diff-sequences), no cleanup: 1:06.452 (m:ss.mmm) 52937 items jest-diff (diff-sequences), with cleanup: 1:09.118 (m:ss.mmm) ... (jsdiff cannot complete this test case in 20+ minutes) Observations: - In the last test case, `jsdiff` does not implement O(D^2) -> O(D) space optimization so it is practically unusable (reported as kpdecker/jsdiff#396). `diff-match-patch` and `jest-diff` both implement the linear space optimization, and have similar performance. - `diff-match-patch` strips common prefix and suffix, which makes it faster than `jest-diff` in "common prefix and suffix" test cases. - Both `diff-match-patch` and `jest-diff` can take a long time on "cleanup". See the "without common prefix or suffix 1" test case. We probably want to only enable cleanup for smaller input. - `diff-match-patch` performs visibly worse on the "without common prefix or suffix 2" test case. From the code it looks like `diff-match-patch` uses some kind of heuristics that tries to speed up things but ends up slowing it down. - Without cleanup, `jest-diff` might output `[1,51],[1,50]` that can be "obviously" merged to `[1,101]`. We might use a lightweight cleanup logic for that. - Reading the code, `diff-match-patch` turns lines into char codes. It cannot handle 65536 unique lines. (https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L503) Conclusions: - `jest-diff` (and `diff-sequences` under the hood) is overall the best choice. It has expected time and space complexities, and provides flexibility to skip the potentially slow "cleanup", and can support >65k unique lines. - `jest-diff` misses the "skip common prefix / suffix" optimization that `diff-match-patch` has, and seems practically important (editing a line in the editor - all lines are common prefixes and suffixes except for the line being edited). The optimization is not hard to implement. This diff implements it. - For certain use-cases (ex. linelog) where the diff content is not needed (at least for the left / "a" side), it should use `diff-sequences` to avoid overhead preparing the diff content. - `jest-diff`'s `diffLines` outputs one line per `Diff` but we want one chunk per `Diff`. - `jest-diff`'s `diffStringsRaw` produces one `Diff` per chunk, and because [`string.slice` is O(1) in V8](https://stackoverflow.com/a/72545403), it has acceptable performance. But mapping lines to chars would introduce the 65535 unique line limit undesirably. Reviewed By: evangrayk Differential Revision: D43857949 fbshipit-source-id: 9a3d85ebf10c9b82da8ab5cba4e14e519bbf264d

tomByrer · 2023-04-09T16:44:57Z

I'm curious to see a better algorithm...

ExplodingCabbage · 2024-01-11T13:49:54Z

Hmmmmm...

I definitely think it makes sense to implement a version of the algorithm based on the linear space refinement in which we explore the edit graph in both directions (from the top-left, and from the bottom-right) until they join up in the middle. This will have the practical effect of making jsdiff faster and reducing its memory usage (though without, I think, reducing its worst-case time or space complexity).

I think the actual linear space refinement proposed by Myers is different to (and cleverer than) this, though, and I'm not convinced of the wisdom of fully implementing it; certainly I think it shouldn't be on by default. If I'm understanding it correctly - and it took me a while to get this - the key idea of that refinement is that you can navigate through the edit graph from both sides without keeping track of the paths you've taken, just how far they've got in each diagonal, until your two paths hit each other somewhere in the middle of the edit graph, and then the snake they hit each other on is guaranteed to be the middle snake of the optimal diff, and then the way you compensate for not having kept a record of the paths you took is by recursively running the algorithm again on the regions before and after the middle snake to find the middle snakes of those, and so on recursively until all the middle snakes you've found together amount to a full path through the edit graph. Effectively you're repeating lots of work, roughly doubling the time taken in some cases, to compensate for not keeping a record of the work you've done.

From a computer scientist's perspective, judging performance only in terms of time and space complexity, this is a brilliant trick that gets the space complexity down without worsening the time complexity. But pragmatically speaking, I'm not convinced there's any point? I have encountered situations where jsdiff basically hung "forever" trying to compute a large diff but never any situation where it ran out of memory or even used a nontrivial amount of my computer's memory. Any tradeoff that involves making jsdiff slower in order to reduce its memory requirements intuitively feels like a bad tradeoff to me (though I am open to being persuaded otherwise). I'd be hesitant accept a contribution adding this as an optional off-by-default feature without some persuasion that it's worth it.

(Do note that the perf issues around clonePath that you highlight are already basically fixed by a separate change, #411.)

ExplodingCabbage · 2024-01-11T14:06:43Z

I'm gonna label this as planned for next release with the intent that I'll implement the idea of navigating the edit graph from both ends, but not actually implement the linear space algorithm.

quark-zju · 2024-01-12T21:40:50Z

Part of the space complexity can manifest as time complexity via the GC. The hung "forever" cases might be actually caused by the space complexity. I didn't observe "out of memory" error either.

The hearthstone cards.json example is a simplified version of a production slow diff case we observed in the past. Both are large generated JSON files. For that specific production case, while I cannot share the file contents, I can share some numbers: even git diff without --minimal took 5 seconds.

ExplodingCabbage · 2024-01-15T10:02:41Z

Part of the space complexity can manifest as time complexity via the GC

No, I don't think so - or at least, I don't think GC can possibly increase the time complexity in a way you wouldn't expect if you analyzed the time complexity without regard for GC. Every change object creation is already accounted for in the time complexity and there are only as many object garbage collections are there are object creations.

Again, do also note that #411 (merged to master but not yet released to npm) has drastically reduced the amount of garbage collection needed and completely eliminated clonePath.

quark-zju · 2024-01-15T17:03:01Z

I don't think GC can possibly increase the time complexity in a way you wouldn't expect if you analyzed the time complexity without regard for GC

For example, if you append N element to a linked list. It is O(N) time complexity since appending is O(1) per element. However, if for some reason the GC pauses O(N) times and had to scan the whole linked list each time, GC would cause a O(N^2) time complexity.

ExplodingCabbage · 2024-01-15T17:06:25Z

Ah, interesting. Yeah, that makes sense; I was assuming the cost of garbage collection would be proportional to the number of objects in need of collecting but of course that's not the case because of the cost of finding the items that can be collected, which I was completely ignoring. Hmm...

ExplodingCabbage · 2024-01-16T16:45:23Z

I think I'm persuaded that it'd make sense from a performance perspective to make this change, but one thing still gives me pause, which is that the linear space refinement from the paper gives different results to the core Myers algorithm. Consider diffing old text A against new text AAA, for instance. With the core Myers diff algorithm, we should keep the first A and then insert two As after it. But now consider following this algorithm for finding the middle snake:

Here's our edit graph:

At D=0, the forward and reverse paths both just preserve one A (i.e. move diagonally):

Then when D=1, we do an insertion on the forward path and similarly a vertical move on the reverse path. (I'm omitting paths that go off the edge of the edit graph, although technically we also do those if we're following the algorithm faithfully.)

Now the forward path is at (x, y) = (1, 2) and the reverse path is at (u, v) = (0, 1), satisfying the definition of the paths "overlapping" in diagonal k=-1. (Note we use D = 2 when we apply this definition, even though the variable D in the algorithm is 1 at this point; they're not the same D.)

But now, per the algorithm, "The last snake of the reverse path is the middle snake." - i.e. the empty snake from (0, 1) to (0, 1). We're thus now forced to build a 2-path that goes through (0, 1), even though the core Myers algorithm would not go through that point.

I was slightly doubting my own interpretation, but the interactive visualisation at https://blog.robertelder.org/diff-algorithm/ agrees:

So if we made this change we'd be modifying the results jsdiff returns. Do you see a tweak to the published refinement that would let us avoid this?

quark-zju · 2024-01-16T19:10:55Z

Git's xdiff does some post-processing (ex. shifting the diff chunk up or down, merging adjacent diff chunks) to "normalize" the results.

ExplodingCabbage · 2024-01-17T11:24:10Z

Hmm. I wonder if that consistently achieves results identical to the core Myers diff algorithm, though? (It's not immediately obvious to me whether there's any reasonable way to achieve that at all.) Would need to scrutinise the post-processing logic carefully...

hyrious mentioned this issue Apr 14, 2023

Add timeout? #398

Closed

ExplodingCabbage added the planned-for-next-release label Jan 11, 2024

ExplodingCabbage removed the planned-for-next-release label Jan 16, 2024

ExplodingCabbage added the performance label Jan 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suboptimal (O(D^2)) memory usage #396

Suboptimal (O(D^2)) memory usage #396

quark-zju commented Mar 3, 2023 •

edited

Loading

tomByrer commented Apr 9, 2023

ExplodingCabbage commented Jan 11, 2024 •

edited

Loading

ExplodingCabbage commented Jan 11, 2024

quark-zju commented Jan 12, 2024

ExplodingCabbage commented Jan 15, 2024 •

edited

Loading

quark-zju commented Jan 15, 2024

ExplodingCabbage commented Jan 15, 2024

ExplodingCabbage commented Jan 16, 2024 •

edited

Loading

quark-zju commented Jan 16, 2024 •

edited

Loading

ExplodingCabbage commented Jan 17, 2024

Suboptimal (O(D^2)) memory usage #396

Suboptimal (O(D^2)) memory usage #396

Comments

quark-zju commented Mar 3, 2023 • edited Loading

tomByrer commented Apr 9, 2023

ExplodingCabbage commented Jan 11, 2024 • edited Loading

ExplodingCabbage commented Jan 11, 2024

quark-zju commented Jan 12, 2024

ExplodingCabbage commented Jan 15, 2024 • edited Loading

quark-zju commented Jan 15, 2024

ExplodingCabbage commented Jan 15, 2024

ExplodingCabbage commented Jan 16, 2024 • edited Loading

quark-zju commented Jan 16, 2024 • edited Loading

ExplodingCabbage commented Jan 17, 2024

quark-zju commented Mar 3, 2023 •

edited

Loading

ExplodingCabbage commented Jan 11, 2024 •

edited

Loading

ExplodingCabbage commented Jan 15, 2024 •

edited

Loading

ExplodingCabbage commented Jan 16, 2024 •

edited

Loading

quark-zju commented Jan 16, 2024 •

edited

Loading