Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index out of range panic in DiffCharsToLines on large JSON diff #89

Closed
krylovsk opened this issue Jan 16, 2018 · 16 comments
Closed

Index out of range panic in DiffCharsToLines on large JSON diff #89

krylovsk opened this issue Jan 16, 2018 · 16 comments

Comments

@krylovsk
Copy link

I've encountered this issue while using https://github.com/src-d/go-git, but the bug is easily reproducible with the code snippet below and the JSON file in attachment.

package main

import (
	"fmt"
	"io/ioutil"
	"os"

	"github.com/sergi/go-diff/diffmatchpatch"
)

func main() {
	f, err := os.Open("data.txt")
	defer f.Close()
	checkErr(err)
	data, err := ioutil.ReadAll(f)
	checkErr(err)

	// from https://github.com/src-d/go-git/blob/v4.0.0/utils/diff/diff.go#L17
	dmp := diffmatchpatch.New()
	wSrc, wDst, warray := dmp.DiffLinesToChars(string(data), "")
	diffs := dmp.DiffMain(wSrc, wDst, false)
	diffs = dmp.DiffCharsToLines(diffs, warray)
	fmt.Println(diffs)
}

func checkErr(err error) {
	if err != nil {
		panic(err)
	}
}

Output:

$ go run main.go
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/sergi/go-diff/diffmatchpatch.(*DiffMatchPatch).DiffCharsToLines(0xc420044ee8, 0xc420078390, 0x1, 0x2, 0xc4202ce000, 0xd802, 0xec00, 0x1, 0x2, 0xc4202ce000)
	/Users/krylovsk/src/github.com/sergi/go-diff/diffmatchpatch/diff.go:414 +0x394
main.main()
	/tmp/go-diff-debug/main.go:24 +0x29e
exit status 2

data.txt

@clafoutis42
Copy link

clafoutis42 commented Sep 22, 2018

Hi, I've also met this issue while iterating on large repositories (wanting to analyze diffs) with github.com/src-d/go-git
@krylovsk did you manage to find a workaround ? (I guess the problem is the same, it might be a large diff, and as I was testing this in linux git repository, it is hard and very long to reproduce)

@krylovsk
Copy link
Author

@clafoutis42 no sorry I haven't figured that one out. Eventually I switched to https://github.com/libgit2/git2go

@vmarkovtsev
Copy link
Contributor

vmarkovtsev commented May 20, 2019

We encountered the same problem at source{d}, and @mcuadros is currently debugging it and looking for the fix.

@mcuadros
Copy link

I limited the scope of the problem with one file (64k lines), it contains serveral language from Thai Script to Arabic.

This is the fixture: https://gist.github.com/mcuadros/6369103d4838a9042613596ee212f32c#file-fixture-go

And the code to replicate the issues is:

package main

import (
	"io/ioutil"
	"log"

	"github.com/sergi/go-diff/diffmatchpatch"
)

func main() {
	sNew, err := ioutil.ReadFile("/tmp/dst")
	if err != nil {
		panic(err)
	}

	dmp := diffmatchpatch.New()
	t1, t2, t := dmp.DiffLinesToChars("", string(sNew))
	diffs := dmp.DiffMain(t1, t2, false)
	diffs = dmp.DiffCharsToLines(diffs, t)
	for _, diff := range diffs {
		log.Println(diff.Type, diff.Text)
	}
}

@mcuadros
Copy link

The problem looks the same described at #4, so looks like the solution implement at #7, was a partial solution.

//cc @sergi

@Corefracture
Copy link

Anyone know if @sergi is still active? I was hoping to submit a PR for this fix (unless someone else is or has a fix). I'm running into this now consistently :(

@mcuadros
Copy link

mcuadros commented May 26, 2019 via email

@r-pai
Copy link
Contributor

r-pai commented Aug 20, 2019

I too faced the issue and had a look on why the issue is occurring. I was able to know what the issue but for the fix its not that easy as you need to know most of the logic. Its better people who wrote it can fix it better than me.

Please have a look on this example.
Here is a sample code of what happens when a rune array is converted to string and back to rune array. Sample: https://play.golang.org/p/XFVKayUhV27

func main() {
	r := []rune{55296}
	s := string(r)
	r1 := []rune(s)
	fmt.Println(r, s, r1)
}

Output :

[55296] � [65533]
In go-diff , similar thing is done and issue occurs because of this usage.

Issue:
If my file has lines is 55296 or more this issue arrises, and max 65530 should be there.
How the issue occurs?

In go-diff the lineArray index is stored in the rune array ( In diff.go , diffLinesToRunesMunge() function ).
This is converted to string and stored in 'Diff' structure 'Text' variable. (In diff.go , diffMainRunes function)
There is come processing with this Diff.Text and logic is a bit difficult to understand :) Time Constraint too
Now here in diff.go , DiffCharsToLines method we convert the Diff.Text to rune array and use it (as index) to get the line from line array. and this is when it get the 'index out of bound' error

I hope , I had put some light on this issue. :)

Thank You.... P@i

@sergi
Copy link
Owner

sergi commented Nov 19, 2019

Hi folks,

Sorry about the silence. I'm back and I'll be looking into this issue soon, but I'd definitely appreciate it if one of you wants to contribute a PR to it.

@r-pai
Copy link
Contributor

r-pai commented Nov 20, 2019

Hi All,

I have done a fix for this issue and with my testing its working.

Issue:
As I explained above issue is storing the index value in the rune.
Fix:
Moved the index stored from rune to string array. Using delimeter "," the string array is combined to a string . This string is converted to rune array and returned for DiffLinesToRunes.
DiffCharsToLines method, split the index string, which is the diff using the delimeter and then use the index to fetch data from linearray.

The fix is in the last two commits of my go-diff fork. I have also changed the test cases and added test to check the a huge line diff testcase .

Branch: https://github.com/r-pai/go-diff/

Any comments before I create a PR.

Thanks
Pai

@ronhab
Copy link

ronhab commented Feb 26, 2020

Hi all,

tl;dr I have a solution for this issue, but it requires lots of changes and also increase memory consumption by a factor of 2-4.
You can find it here.

Long analysis:
As @r-pai described, the issue is caused due to the limit on the values a rune type can hold.
So, we need to change the "character" type passed to the diff functions to something which can hold larger values (i.e. int).
But this is not enough - when using checklines mode (which is where the problem exists), the returned diffs are not "normal" text diffs, they are consist of runes which (supposedly) encode line numbers. So we also need a way to encode our new character type (i.e. int) inside strings to be returned as encoded diffs.

As I see it, there are two possible ways to solve it:

  1. Duplicate each function being called from DiffMainRunes to have a second version receiving []int instead of []rune as input, and returning []int instead of string as diffs. Call those functions on diffLineMode.
  2. Change all relevant functions to use []int instead of []rune for everything, convert input from []rune on external functions, and restore output back to []rune at the end.
    In addition, pass a bool argument to all relevant functions, stating whether the diffs should be encoded strings (each int encoded as 4 bytes) or "normal" text.
    This is how I've implemented it - mostly because I wanted to avoid code duplication.

This explanation sounds much more complicated than the real changes...
So please take a look at https://github.com/ronhab/go-diff/tree/FixCrashUsingDiffChar .

BTW, it might be possible to choose option 1 without code duplication (at least not human-written code duplication...) by generating the two versions of the function automatically.
This way we can avoid some of the performance/memory consumption hits of my solution - but I don't have time to work on it right now.

@ronhab
Copy link

ronhab commented Feb 26, 2020

For comparison, here are the benchmark results on both current master and my branch (running on my Windows 10 Intel i7-6700HQ laptop):
current master -
BenchmarkDiffCommonPrefix-8 5675502 222 ns/op 0 B/op 0 allocs/op
BenchmarkDiffCommonSuffix-8 5783984 199 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/prefix/empty-8 249642283 5.07 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/prefix/short-8 178517660 7.26 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/prefix/long-8 1313544 956 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/suffix/empty-8 189780298 5.70 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/suffix/short-8 131496415 8.71 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/suffix/long-8 1000000 1234 ns/op 0 B/op 0 allocs/op
BenchmarkDiffHalfMatch-8 7064 176486 ns/op 106496 B/op 2 allocs/op
BenchmarkDiffCleanupSemantic-8 775 1328013 ns/op 162149 B/op 538 allocs/op
BenchmarkDiffMain-8 1 1012293800 ns/op 16087728 B/op 81 allocs/op
BenchmarkDiffMainLarge-8 8 148727300 ns/op 4838944 B/op 45376 allocs/op
BenchmarkDiffMainRunesLargeLines-8 1366 811892 ns/op 156550 B/op 1056 allocs/op

My branch:
BenchmarkDiffCommonPrefix-8 3260402 369 ns/op 480 B/op 2 allocs/op
BenchmarkDiffCommonSuffix-8 3260413 365 ns/op 480 B/op 2 allocs/op
BenchmarkCommonLength/prefix/empty-8 298529592 4.09 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/prefix/short-8 209983783 5.72 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/prefix/long-8 1643733 727 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/suffix/empty-8 280116555 4.30 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/suffix/short-8 174342531 6.85 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/suffix/long-8 1000000 1010 ns/op 0 B/op 0 allocs/op
BenchmarkDiffHalfMatch-8 5728 215256 ns/op 311298 B/op 4 allocs/op
BenchmarkDiffCleanupSemantic-8 1054 1150403 ns/op 355219 B/op 875 allocs/op
BenchmarkDiffMain-8 1 1020316100 ns/op 177532736 B/op 32877 allocs/op
BenchmarkDiffMainLarge-8 8 127642438 ns/op 7230068 B/op 84529 allocs/op
BenchmarkDiffMainRunesLargeLines-8 1800 661175 ns/op 227841 B/op 3860 allocs/op

It seems like some of the operations are actually faster, but memory allocations are much higher.

@stamblerre
Copy link

We just ran into this in the Go language server (gopls): golang/go#42927. Is anybody working on a fix?

@sergi
Copy link
Owner

sergi commented Dec 1, 2020

Hi @stamblerre, I have a potential fix for this that I plan to apply later today if all tests pass.

@stamblerre
Copy link

Awesome, thank you for the quick update!

@stamblerre
Copy link

Thank you for the quick fix on this, @sergi! Do you plan to tag a new release with this fix?

schroederc added a commit to schroederc/go-diff that referenced this issue Mar 31, 2023
Revert implementation of diffLines to use runes to fix sergi#140.  In order
to not regress sergi#89, skip invalid utf8 runes when munging lines.
schroederc added a commit to schroederc/go-diff that referenced this issue Mar 31, 2023
Revert implementation of diffLines to use runes to fix sergi#140.  In order
to not regress sergi#89, skip invalid utf8 runes when munging lines.
schroederc added a commit to schroederc/go-diff that referenced this issue Mar 31, 2023
Revert implementation of diffLines to use runes to fix sergi#140.  In order
to not regress sergi#89, skip invalid utf8 runes when munging lines.
jazanne added a commit to launchdarkly/ld-find-code-refs that referenced this issue Jan 24, 2024
…e diffs (#425)

Adds a test case to verify the fix. The test data is taken from
sergi/go-diff#89 data.txt

---------

Co-authored-by: Lucas Shadler <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants