-
Notifications
You must be signed in to change notification settings - Fork 524
QueryPlan : Fixes 410 Gone Exception on non-x64 platforms #5257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good!
Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/RetryHandlerTests.cs
Outdated
Show resolved
Hide resolved
|
Please include the issue# which it closes. |
Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/RetryHandlerTests.cs
Outdated
Show resolved
Hide resolved
Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/RetryHandlerTests.cs
Outdated
Show resolved
Hide resolved
FabianMeiswinkel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - I am fine with just using unit test instead of E2E test based on the reaosns in the PR descritpion. But I left a few minor comments - I am good once those are addressed.
FabianMeiswinkel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - Thanks
NaluTripician
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Customer faced an issue with their HPK container set up. Problem was when they set the root level partition key in the QueryRequestOptions and run a query they would get the data back when running on windows but for other environments(like docker) they were getting an exception
For Docker they were getting 410(Gone) exception
Microsoft.Azure.Cosmos.CosmosException : Response status code does not indicate success: Gone (410); Substatus: 1002; ActivityId: 623305e2-9a69-4b65-aa9b-911ee8ac3533; Reason: (Epk Range: [0DCEB8CE51C6BFE84F4BD9409F69B9BB,0DCEB8CE51C6BFE84F4BD9409F69B9BBFF) is gone.);
The root of the problem was that for non-window(x64) environment we were getting the query plan from gateway(unlike windows where we use the ServiceInterop dll) and after we got the plan we funnel it through RequestInvokerHandler pipeline. The pipeline has a check where it sees that if a feedRange is provided and if it returns more than 1 overlapping ranges it throws a gone exception but in this particular case this is a valid scenario(edge case).
e.g
datatable (MinEPK: string, MaxEPK: string, PKrangeId)
[
"0D4DC2CD8F49C65A8E0C5306B61B4343","0DCEB8CE51C6BFE84F4BD9409F69B9BB2164DEBD78C50C850E0C1E3E3F0579ED",0
"0DCEB8CE51C6BFE84F4BD9409F69B9BB2164DEBD78C50C850E0C1E3E3F0579ED","1080F600C27CF98DC13F8639E94E7676" 1
]
If our provided partition key results in a hash e.g 0DCEB8CE51C6BFE84F4BD9409F69B9BB that lies between the above 2 records, we will get overlapping ranges back.
Fix:
Fix it to explicitly make a check and see if its a Query Plan , if it is then allow it to progress.
Testing:
I was able to get access to the Defender Team DB where they were facing this issue. I manually tested the following E2E use cases and got the desired results.
On Windows -> with PartitionKey specified
Diagnostics is : {"Summary":{"DirectCalls":{"(449, 5350)":6,"(200, 0)":2},"GatewayCalls":{"(200, 0)":4,"(304, 0)":1}}
Count: 24025; Severity: Medium;
Count: 24114; Severity: Informational;
Count: 23; Severity: High;
On Non_windows - with PartitionKey specified(in the QueryRequestOptions)
Diagnostics is : {"Summary":{"DirectCalls":{"(449, 5350)":2,"(200, 0)":2},"GatewayCalls":{"(200, 0)":5,"(304, 0)":1,"(0, 0)":1}}
Count: 24025; Severity: Medium;
Count: 24114; Severity: Informational;
Count: 23; Severity: High;
On Windows - no QueryRequestOption specified
Diagnostics is : {"Summary":{"DirectCalls":{"(200, 0)":2},"GatewayCalls":{"(200, 0)":4,"(304, 0)":1}}
Count: 24025; Severity: Medium;
Count: 24114; Severity: Informational;
Count: 23; Severity: High;
On Non_windows - no QueryRequestOption specified
Diagnostics is : {"Summary":{"DirectCalls":{"(200, 0)":2},"GatewayCalls":{"(200, 0)":5,"(304, 0)":1}}
Count: 24025; Severity: Medium;
Count: 24114; Severity: Informational;
Count: 23; Severity: High;
Problem writing a new reliable E2E test in the pipeline is the set up needed to return overlapping ranges and we don't have control over when partition split happens.
Talked to Ananth on how he tested it and he worked with Elasticity team to create the database in the desired state, we can do the same in the pipeline but need to check if our database accounts/containers get cleaned up by some background job in the future(will check with Nalu on this one).
I also explored the PartitionKeyHashRangeSplitterAndMerger class but that works only with in-memory container and doesn't follow the RequestInvokerHandler pipeline flow.
for the time being I added a unit test which will avoid any regression in the future.
closes #5220
#5220