@original-brownbear
Contributor

The hash code of CompressedXContent is stable (we compute it from the uncompressed bytes), so comparing it first
and returning early in the not-equal case is essentially free.
When the two objects are actually equal, though, there are cases where the compressed bytes differ for the same
uncompressed content, which made for a slow and potentially very allocation-heavy comparison.
This commit (at the cost of some complexity) makes the equality checks needed to deduplicate Beats-style
metadata about twice as fast in isolation and, more importantly, saves a massive number of allocations
in them, which should make for a larger practical speedup.
This has not been a huge deal in practice yet, but I would like to use the functionality to implement
metadata deduplication in a follow-up that is fairly simple but requires that the equals check on these
objects is safe to run in a hot loop on the master thread.

Relates #77466
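
For readers without the diff at hand, the comparison order described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: CompressedBlob is a hypothetical stand-in for CompressedXContent, it uses java.util.zip directly, and its slow path simply inflates both sides, so it does not reproduce the allocation savings of the actual change.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical stand-in for illustration only; not the real CompressedXContent.
final class CompressedBlob {
    private final byte[] compressed;
    private final int crc32; // checksum of the UNCOMPRESSED bytes, so it is stable

    CompressedBlob(byte[] uncompressed, int level) {
        CRC32 crc = new CRC32();
        crc.update(uncompressed, 0, uncompressed.length);
        this.crc32 = (int) crc.getValue();
        this.compressed = deflate(uncompressed, level);
    }

    byte[] compressed() {
        return compressed;
    }

    @Override
    public int hashCode() {
        return crc32;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CompressedBlob)) return false;
        CompressedBlob other = (CompressedBlob) o;
        if (crc32 != other.crc32) return false;                       // cheap early exit
        if (Arrays.equals(compressed, other.compressed)) return true; // same compressed bytes imply same content
        // Slow path: same checksum but different compressed bytes, so compare the content itself.
        return Arrays.equals(inflate(compressed), inflate(other.compressed));
    }

    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(input);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported input");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] doc = "{\"arr\":[\"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\"]}".getBytes(StandardCharsets.UTF_8);
        CompressedBlob stored = new CompressedBlob(doc, Deflater.NO_COMPRESSION);
        CompressedBlob packed = new CompressedBlob(doc, Deflater.BEST_COMPRESSION);
        System.out.println("compressed bytes equal: " + Arrays.equals(stored.compressed(), packed.compressed()));
        System.out.println("objects equal: " + stored.equals(packed));
    }
}
```

Compressing the same document at two different levels is only a convenient way to force the "same content, different compressed bytes" case in the sketch; the PR description does not say how that case arises in practice.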

@original-brownbear original-brownbear added >non-issue :Search Foundations/Mapping Index mappings, including merging and defining field types :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.16.1 labels Oct 21, 2021
@elasticmachine elasticmachine added Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. Team:Search Meta label for search team labels Oct 21, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

Contributor

@DaveCTurner DaveCTurner left a comment

Looks good, I left a few small comments.

    final CompressedXContent sameAsOne =
        new CompressedXContent((builder, params) ->
            builder.stringListField("arr", Arrays.asList(randomJSON)), XContentType.JSON, ToXContent.EMPTY_PARAMS);
    assertEquals(one, sameAsOne);
Contributor

Can we assert that the compressed bytes are not equal in this case?

Also, we are not exercising equalsWhenUncompressed very hard in any of these tests, since usually either the bytes will be the same or the CRC is not going to match. Could we either construct some CRC collisions or else just test the method directly, checking in particular that it does sometimes return false?
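
To make the request concrete, here is a hedged sketch of what testing such a method directly could look like. The class UncompressedEquality and its body are assumptions for illustration, built on java.util.zip streams (Java 9+ for readNBytes and the ranged Arrays.equals). Only the method name and its two byte-array arguments are taken from the test code in this thread. The sketch checks both outcomes: true for identical content with different compressed bytes, and false for different content.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper for illustration; not the implementation under review.
final class UncompressedEquality {

    // Compresses the input at the given level so that equal content can yield different bytes.
    static byte[] deflate(byte[] input, int level) throws IOException {
        Deflater deflater = new Deflater(level);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out, deflater)) {
            dos.write(input);
        } finally {
            deflater.end(); // a caller-supplied Deflater is not released by the stream
        }
        return out.toByteArray();
    }

    // Inflates both inputs chunk by chunk and compares the chunks, so neither
    // side is ever fully materialized as an uncompressed array.
    static boolean equalsWhenUncompressed(byte[] a, byte[] b) throws IOException {
        try (InputStream ia = new InflaterInputStream(new ByteArrayInputStream(a));
             InputStream ib = new InflaterInputStream(new ByteArrayInputStream(b))) {
            byte[] bufA = new byte[512];
            byte[] bufB = new byte[512];
            while (true) {
                int na = ia.readNBytes(bufA, 0, bufA.length);
                int nb = ib.readNBytes(bufB, 0, bufB.length);
                if (na != nb || !Arrays.equals(bufA, 0, na, bufB, 0, nb)) {
                    return false;
                }
                if (na < bufA.length) {
                    return true; // both streams ended at the same position
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] one = "{\"arr\":[\"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\"]}".getBytes(StandardCharsets.UTF_8);
        byte[] two = "{\"arr\":[\"bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb\"]}".getBytes(StandardCharsets.UTF_8);
        byte[] stored = deflate(one, Deflater.NO_COMPRESSION);
        byte[] packed = deflate(one, Deflater.BEST_COMPRESSION);
        System.out.println("compressed bytes equal: " + Arrays.equals(stored, packed));
        System.out.println("same content: " + equalsWhenUncompressed(stored, packed));
        System.out.println("different content: " + equalsWhenUncompressed(packed, deflate(two, Deflater.BEST_COMPRESSION)));
    }
}
```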

Contributor

Ah I see Alan already approved this. I think this is a blocker, the rest of my comments are less important.

Contributor Author

Added a test for this directly now :) I didn't want to hard code a collision (or brute force one to begin with :D) and finding one at runtime takes too long as well.

Contributor

I'd quite like to have an assertFalse(Arrays.equals(one.compressed(), sameAsOne.compressed())); here too, mostly to document that the compressed representations are different.

@original-brownbear
Contributor Author

Thanks David, all points addressed now :)

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM with two further suggestions.

    final CompressedXContent sameAsOne =
        new CompressedXContent((builder, params) ->
            builder.stringListField("arr", Arrays.asList(randomJSON)), XContentType.JSON, ToXContent.EMPTY_PARAMS);
    assertEquals(one, sameAsOne);
Contributor

I'd quite like to have an assertFalse(Arrays.equals(one.compressed(), sameAsOne.compressed())); here too, mostly to document that the compressed representations are different.

    final CompressedXContent two = new CompressedXContent((builder, params) ->
        builder.stringListField("arr", Arrays.asList(randomJSON2)), XContentType.JSON, ToXContent.EMPTY_PARAMS);
    assertFalse(CompressedXContent.equalsWhenUncompressed(one.compressed(), two.compressed()));
}
Contributor

@DaveCTurner DaveCTurner Oct 21, 2021

Here you go. I originally cheated slightly by using inputs that weren't XContent, but that irked me, so I found a JSON collision instead.

Suggested change
}
}
public void testEqualsCrcCollision() throws IOException {
    final CompressedXContent content1 = new CompressedXContent("{\"d\":\"68&A<\"}".getBytes(StandardCharsets.UTF_8));
    final CompressedXContent content2 = new CompressedXContent("{\"d\":\"gZG- \"}".getBytes(StandardCharsets.UTF_8));
    assertEquals(content1.hashCode(), content2.hashCode()); // the inputs are a known CRC32 collision
    assertNotEquals(content1, content2);
}
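
For anyone wanting to produce another such pair, a birthday-style search offline is one option. The sketch below is an assumption-laden illustration, not how the pair above was found: Crc32CollisionSearch, its alphabet, the 8-character value length, and the {"d": ...} shape are all hypothetical choices. It hashes random small JSON documents with java.util.zip.CRC32 and stops at the first two distinct documents sharing a checksum. If CRC32 behaves roughly like a random function on these inputs, the birthday bound suggests a collision after somewhere around 10^5 attempts, which is fine for a one-off search but not something to run inside a test.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.zip.CRC32;

// Hypothetical offline helper for illustration; not part of the PR.
final class Crc32CollisionSearch {
    // Alphanumerics only, so the generated values never need JSON escaping.
    private static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    static long crc32(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    // Returns two distinct JSON strings with the same CRC32, or throws if none is found in time.
    static String[] findCollision(long seed, int maxAttempts) {
        Random random = new Random(seed);
        Map<Long, String> seen = new HashMap<>();
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            StringBuilder value = new StringBuilder(8);
            for (int i = 0; i < 8; i++) {
                value.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
            }
            String json = "{\"d\":\"" + value + "\"}";
            String previous = seen.putIfAbsent(crc32(json), json);
            if (previous != null && !previous.equals(json)) {
                return new String[] { previous, json };
            }
        }
        throw new IllegalStateException("no collision found within " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        String[] pair = findCollision(42L, 2_000_000);
        System.out.println(pair[0] + " and " + pair[1] + " share CRC32 " + Long.toHexString(crc32(pair[0])));
    }
}
```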

@original-brownbear
Contributor Author

Thanks David & Alan!

@original-brownbear original-brownbear merged commit 528bcb9 into elastic:master Oct 22, 2021
@original-brownbear original-brownbear deleted the faster-compressed-x-content-comparison branch October 22, 2021 08:18
original-brownbear added a commit that referenced this pull request Oct 22, 2021
lockewritesdocs pushed a commit to lockewritesdocs/elasticsearch that referenced this pull request Oct 28, 2021
@original-brownbear original-brownbear restored the faster-compressed-x-content-comparison branch April 18, 2023 20:37
Labels

:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >non-issue :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. Team:Search Meta label for search team v7.16.0 v8.0.0-beta1

6 participants