[v2][adjuster] Enhance Span Hash Adjuster For Spans That Have Already Been Hashed #6393

mahadzaryab1 · 2024-12-22T16:29:21Z

Originally posted by @yurishkuro in #6391 (comment)

Some storage backends (Cassandra, in particular), perform similar deduping by computing a hash before the span is saved and using it as part of the partition key (it creates tombstones if identical span is saved 2 times or more but no dups on read). So we could make this hashing process to be a part of the ingestion pipeline (e.g. in sanitizers) and simply store it as an attribute on the span. Then this adjuster would be "lazy", it will only recompute the hash if it doesn't already exist in the storage.
If we do this on the write path, we would want this to be as efficient as possible, so we would need to implement manual hashing by iterating through the attributes (and pre-sorting them to avoid dependencies) and but manually going through all fields of the Span / SpanEvent / SpanLink. The reason I was reluctant to do that in the past was to avoid unintended bugs if the data model was changed, like a new field added that we'd forget to add to the hash function. To protect against that we probably could use some fuzzing tests, by setting / unsetting each field individually and making sure the hash code changes as a result.

yurishkuro · 2024-12-23T21:07:12Z

@suryaaprakassh What guidance do you need?

zzzk1 · 2024-12-25T12:34:25Z

@yurishkuro Has anyone started this? If not, I’d like to try.

ADI-ROXX · 2024-12-25T19:50:03Z

@yurishkuro Regarding the issue, I was able to understand the first point that you made that we want to add hash in the span as a new field so that we are calculating hashes of only those that are not yet calculated.

I have doubts regarding the 2nd point that you are making. Can you provide a sample testcase where, if we do not include the 2nd point in the final code, there might be an issue?

I think that if, in the future, a new attribute is added to the hash, it might not be that big of an issue because we just have to change the input to the hash function. That is, while adding a new field in the span, simultaneously change the hashing code so that the issue is resolved.
That is what I am able to see for now. If there is anything that I am missing, can you please point it out (especially on the 2nd point)?

yurishkuro · 2024-12-25T19:55:44Z

That is, while adding a new field in the span, simultaneously change the hashing code so that the issue is resolved.

What if we forget to do that? The intent of item (2) is to ensure that we will have some tests fail if we forget to add a new field to the hash function. A solution that does not provide this guarantee is unsafe and can lead to future bugs.

ADI-ROXX · 2024-12-25T20:33:08Z

A solution that does not provide this guarantee is unsafe and can lead to future bugs.

@yurishkuro
For this two cases are possible:

The new field is intended to be used in the hashing function: In this case, if there is a fail in the test case, it signifies that there is some issue in the code.
The new field added is not intended to be used in the hashing function: In this case, if there is a faile in a test case, it falsely signifies that there is some issue in the code.

Instead of adding unit tests, there is a simpler solution that I have that could prevent this issue. Keep an array that stores what all fields have to be used for hashing and what will be their order.

Lets say that following are the fields of a span:
[traceID, spanID, operationName, warnings, process]

and out of these [traceID, spanID, operationName] are the ones that are used for hashing.

So, what we can do is that we can define an array that will contain what field to be used.

So the array might look like the following:

Index-----------Value
0 --------------traceID
1 --------------spanID
2 -------------- operationName

and if we want to add a new field, we can just add it to the array as follows, and it will be adjusted automatically by the hashing function.

Index ------- Value
0 -------------- traceID
1 --------------newField
2 -------------- spanID
3 --------------operationName

yurishkuro · 2024-12-25T21:06:43Z

@ADI-ROXX

All public fields are meant to be used in a hash.
Your solution requires reflection, which is inefficient at runtime. I expect the hash functions to access data directly via compile-time functions. However, I am not opposed to the code for such functions to be auto-generated from the underlying data types, and during the generation it's fine to use reflection.

ADI-ROXX · 2024-12-26T06:08:10Z

Your solution requires reflection, which is inefficient at runtime.

@yurishkuro I understood the concern. The problem being that for all the spans during deployment, we cannot use reflection as consumes the resource.

So the solution I can think of is a combination of my proposed solution and yours.
There will be an array that would contain the order of parameters that have to be used for hashing:

Index-----------Value
0 --------------traceID
1 --------------spanID
2 -------------- operationName

And during the testing, it will be checked whether the hash of the span that comes out of the adjuster is the same or not.

So this will reduce the headache of adding custom tests, and the user just has to make changes in the ordering array, and everything will be done automatically.

Although this might not be efficient during the testing phase, I don't think it matters.

yurishkuro · 2024-12-26T06:59:44Z

it's ok to use reflection in the tests. The "array" is irrelevant since it can be constructed via reflection from the actual types.

mahadzaryab1 changed the title ~~Enhance Span Hash Adjuster For Spans That Have Already Been Hashed~~ [v2][adjuster] Enhance Span Hash Adjuster For Spans That Have Already Been Hashed Dec 22, 2024

dosubot bot added performance storage/cassandra v2 labels Dec 22, 2024

yurishkuro added help wanted Features that maintainers are willing to accept but do not have cycles to implement good first issue Good for beginners and removed performance storage/cassandra v2 labels Dec 22, 2024

chahatsagarmain linked a pull request Jan 7, 2025 that will close this issue

[v2][WIP] Span hash sanitizer and enhance span hash adjuster #6499

Draft

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[v2][adjuster] Enhance Span Hash Adjuster For Spans That Have Already Been Hashed #6393

[v2][adjuster] Enhance Span Hash Adjuster For Spans That Have Already Been Hashed #6393

mahadzaryab1 commented Dec 22, 2024 •

edited by yurishkuro

Loading

yurishkuro commented Dec 23, 2024

zzzk1 commented Dec 25, 2024

ADI-ROXX commented Dec 25, 2024

yurishkuro commented Dec 25, 2024

ADI-ROXX commented Dec 25, 2024 •

edited

Loading

yurishkuro commented Dec 25, 2024

ADI-ROXX commented Dec 26, 2024

yurishkuro commented Dec 26, 2024

[v2][adjuster] Enhance Span Hash Adjuster For Spans That Have Already Been Hashed #6393

[v2][adjuster] Enhance Span Hash Adjuster For Spans That Have Already Been Hashed #6393

Comments

mahadzaryab1 commented Dec 22, 2024 • edited by yurishkuro Loading

yurishkuro commented Dec 23, 2024

zzzk1 commented Dec 25, 2024

ADI-ROXX commented Dec 25, 2024

yurishkuro commented Dec 25, 2024

ADI-ROXX commented Dec 25, 2024 • edited Loading

yurishkuro commented Dec 25, 2024

ADI-ROXX commented Dec 26, 2024

yurishkuro commented Dec 26, 2024

mahadzaryab1 commented Dec 22, 2024 •

edited by yurishkuro

Loading

ADI-ROXX commented Dec 25, 2024 •

edited

Loading