-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Speed up CompressedXContent.equals #79600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up CompressedXContent.equals #79600
Conversation
The hash code on this one is stable (we compute it from the uncompressed bytes). It's essentially free to compare it and early-break out in the false case. In case of equals actually working though there are cases where compressed and uncompressed bytes differ for the same content which made for a slow and potentially very allocation heavy comparison. This commit (at the cost of some complexity) makes the equality checks needed to deduplicate Beats style metadata about twice as fast in isolation and more importantly saves a massive amount of allocations in them which should make for a larger practical speedup. This has not been a huge deal in practice yet, but I would like to use the functionality to implement metadata deduplication in a follow-up that is fairly simple but requires that the equals check in these objects is safe to run in a hot loop on the master thread. Relates #77466
|
Pinging @elastic/es-distributed (Team:Distributed) |
|
Pinging @elastic/es-search (Team:Search) |
server/src/main/java/org/elasticsearch/common/compress/CompressedXContent.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/common/compress/CompressedXContent.java
Show resolved
Hide resolved
…x-content-comparison
DaveCTurner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I left a few small comments.
server/src/main/java/org/elasticsearch/common/compress/CompressedXContent.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/common/compress/CompressedXContent.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/common/compress/CompressedXContent.java
Show resolved
Hide resolved
| final CompressedXContent sameAsOne = | ||
| new CompressedXContent((builder, params) -> | ||
| builder.stringListField("arr", Arrays.asList(randomJSON)), XContentType.JSON, ToXContent.EMPTY_PARAMS); | ||
| assertEquals(one, sameAsOne); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we assert that the compressed bytes are not equal in this case?
Also we are not exercising equalsWhenUncompressed very hard in any of these tests since usually the bytes will be the same or the CRC is not going to match. Could we either construct some CRC collisions or else just test the method directly, checking in particular that does sometimes return false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see Alan already approved this. I think this is a blocker, the rest of my comments are less important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a test for this directly now :) I didn't want to hard code a collision (or brute force one to begin with :D) and finding one at runtime takes too long as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd quite like to have an assertFalse(Arrays.equals(one.compressed(), sameAsOne.compressed())); here too, mostly to document that the compressed representations are different.
server/src/main/java/org/elasticsearch/common/compress/CompressedXContent.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/common/compress/CompressedXContent.java
Outdated
Show resolved
Hide resolved
| final CompressedXContent sameAsOne = | ||
| new CompressedXContent((builder, params) -> | ||
| builder.stringListField("arr", Arrays.asList(randomJSON)), XContentType.JSON, ToXContent.EMPTY_PARAMS); | ||
| assertEquals(one, sameAsOne); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see Alan already approved this. I think this is a blocker, the rest of my comments are less important.
|
Thanks David, all points addressed now :) |
…x-content-comparison
DaveCTurner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with two further suggestions.
| final CompressedXContent sameAsOne = | ||
| new CompressedXContent((builder, params) -> | ||
| builder.stringListField("arr", Arrays.asList(randomJSON)), XContentType.JSON, ToXContent.EMPTY_PARAMS); | ||
| assertEquals(one, sameAsOne); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd quite like to have an assertFalse(Arrays.equals(one.compressed(), sameAsOne.compressed())); here too, mostly to document that the compressed representations are different.
| final CompressedXContent two = new CompressedXContent((builder, params) -> | ||
| builder.stringListField("arr", Arrays.asList(randomJSON2)), XContentType.JSON, ToXContent.EMPTY_PARAMS); | ||
| assertFalse(CompressedXContent.equalsWhenUncompressed(one.compressed(), two.compressed())); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you go (cheating slightly since the inputs aren't XContent, but I won't tell...) that irked me, I found a JSON collision instead.
| } | |
| } | |
| public void testEqualsCrcCollision() throws IOException { | |
| final CompressedXContent content1 = new CompressedXContent("{\"d\":\"68&A<\"}".getBytes(StandardCharsets.UTF_8)); | |
| final CompressedXContent content2 = new CompressedXContent("{\"d\":\"gZG- \"}".getBytes(StandardCharsets.UTF_8)); | |
| assertEquals(content1.hashCode(), content2.hashCode()); // the inputs are a known CRC32 collision | |
| assertNotEquals(content1, content2); | |
| } |
server/src/main/java/org/elasticsearch/common/compress/CompressedXContent.java
Outdated
Show resolved
Hide resolved
|
Thanks David & Alan! |
The hash code on this one is stable (we compute it from the uncompressed bytes). It's essentially free to compare it and early-break out in the false case. In case of equals actually working though there are cases where compressed and uncompressed bytes differ for the same content which made for a slow and potentially very allocation heavy comparison. This commit (at the cost of some complexity) makes the equality checks needed to deduplicate Beats style metadata about twice as fast in isolation and more importantly saves a massive amount of allocations in them which should make for a larger practical speedup. This has not been a huge deal in practice yet, but I would like to use the functionality to implement metadata deduplication in a follow-up that is fairly simple but requires that the equals check in these objects is safe to run in a hot loop on the master thread. Relates #77466
The hash code on this one is stable (we compute it from the uncompressed bytes). It's essentially free to compare it and early-break out in the false case. In case of equals actually working though there are cases where compressed and uncompressed bytes differ for the same content which made for a slow and potentially very allocation heavy comparison. This commit (at the cost of some complexity) makes the equality checks needed to deduplicate Beats style metadata about twice as fast in isolation and more importantly saves a massive amount of allocations in them which should make for a larger practical speedup. This has not been a huge deal in practice yet, but I would like to use the functionality to implement metadata deduplication in a follow-up that is fairly simple but requires that the equals check in these objects is safe to run in a hot loop on the master thread. Relates elastic#77466
The hash code on this one is stable (we compute it from the uncompressed bytes). It's essentially
free to compare it and early-break out in the false case.
In case of equals actually working though there are cases where compressed and uncompressed bytes
differ for the same content which made for a slow and potentially very allocation heavy comparison.
This commit (at the cost of some complexity) makes the equality checks needed to deduplicate Beats
style metadata about twice as fast in isolation and more importantly saves a massive amount of allocations
in them which should make for a larger practical speedup.
This has not been a huge deal in practice yet, but I would like to use the functionality to implement
metadata deduplication in a follow-up that is fairly simple but requires that the equals check in these
objects is safe to run in a hot loop on the master thread.
Relates #77466