Skip to content

Conversation

@dibahlfi
Copy link
Contributor

@dibahlfi dibahlfi commented Jun 24, 2025

Customer faced an issue with their HPK container set up. Problem was when they set the root level partition key in the QueryRequestOptions and run a query they would get the data back when running on windows but for other environments(like docker) they were getting an exception
For Docker they were getting 410(Gone) exception
Microsoft.Azure.Cosmos.CosmosException : Response status code does not indicate success: Gone (410); Substatus: 1002; ActivityId: 623305e2-9a69-4b65-aa9b-911ee8ac3533; Reason: (Epk Range: [0DCEB8CE51C6BFE84F4BD9409F69B9BB,0DCEB8CE51C6BFE84F4BD9409F69B9BBFF) is gone.);

The root of the problem was that for non-window(x64) environment we were getting the query plan from gateway(unlike windows where we use the ServiceInterop dll) and after we got the plan we funnel it through RequestInvokerHandler pipeline. The pipeline has a check where it sees that if a feedRange is provided and if it returns more than 1 overlapping ranges it throws a gone exception but in this particular case this is a valid scenario(edge case).
e.g
datatable (MinEPK: string, MaxEPK: string, PKrangeId)
[
"0D4DC2CD8F49C65A8E0C5306B61B4343","0DCEB8CE51C6BFE84F4BD9409F69B9BB2164DEBD78C50C850E0C1E3E3F0579ED",0
"0DCEB8CE51C6BFE84F4BD9409F69B9BB2164DEBD78C50C850E0C1E3E3F0579ED","1080F600C27CF98DC13F8639E94E7676" 1
]
If our provided partition key results in a hash e.g 0DCEB8CE51C6BFE84F4BD9409F69B9BB that lies between the above 2 records, we will get overlapping ranges back.

Fix:
Fix it to explicitly make a check and see if its a Query Plan , if it is then allow it to progress.

Testing:
I was able to get access to the Defender Team DB where they were facing this issue. I manually tested the following E2E use cases and got the desired results.
On Windows -> with PartitionKey specified
Diagnostics is : {"Summary":{"DirectCalls":{"(449, 5350)":6,"(200, 0)":2},"GatewayCalls":{"(200, 0)":4,"(304, 0)":1}}
Count: 24025; Severity: Medium;
Count: 24114; Severity: Informational;
Count: 23; Severity: High;

On Non_windows - with PartitionKey specified(in the QueryRequestOptions)
Diagnostics is : {"Summary":{"DirectCalls":{"(449, 5350)":2,"(200, 0)":2},"GatewayCalls":{"(200, 0)":5,"(304, 0)":1,"(0, 0)":1}}
Count: 24025; Severity: Medium;
Count: 24114; Severity: Informational;
Count: 23; Severity: High;

On Windows - no QueryRequestOption specified
Diagnostics is : {"Summary":{"DirectCalls":{"(200, 0)":2},"GatewayCalls":{"(200, 0)":4,"(304, 0)":1}}
Count: 24025; Severity: Medium;
Count: 24114; Severity: Informational;
Count: 23; Severity: High;

On Non_windows - no QueryRequestOption specified
Diagnostics is : {"Summary":{"DirectCalls":{"(200, 0)":2},"GatewayCalls":{"(200, 0)":5,"(304, 0)":1}}
Count: 24025; Severity: Medium;
Count: 24114; Severity: Informational;
Count: 23; Severity: High;

Problem writing a new reliable E2E test in the pipeline is the set up needed to return overlapping ranges and we don't have control over when partition split happens.

Talked to Ananth on how he tested it and he worked with Elasticity team to create the database in the desired state, we can do the same in the pipeline but need to check if our database accounts/containers get cleaned up by some background job in the future(will check with Nalu on this one).
I also explored the PartitionKeyHashRangeSplitterAndMerger class but that works only with in-memory container and doesn't follow the RequestInvokerHandler pipeline flow.

for the time being I added a unit test which will avoid any regression in the future.

closes #5220
#5220

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good!

@dibahlfi dibahlfi changed the title fix: HPK QueryRequestOption fix QueryPlan : Fixes 410 Gone Exception on non-x64 platforms Jun 24, 2025
@kirankumarkolli
Copy link
Member

Please include the issue# which it closes.

Copy link
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I am fine with just using unit test instead of E2E test based on the reaosns in the PR descritpion. But I left a few minor comments - I am good once those are addressed.

Copy link
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - Thanks

Copy link
Contributor

@NaluTripician NaluTripician left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kirankumarkolli kirankumarkolli merged commit 0c12e8e into master Jun 27, 2025
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Hierarchical Partition Key] - Fix the 410/1002 Gone Exception during queries where customer passes a partial PK in "QueryRequestOptions"

5 participants