Diagnostics: Add Troubleshooting guide pages by j82w · Pull Request #1302 · Azure/azure-cosmos-dotnet-v3

j82w · 2020-03-25T15:30:58Z

Pull Request Template

Description

Adding an initial small set of troubleshooting pages. Links to these pages will be added to the service exception messages and to the client-generated exceptions. The pages will eventually move to Azure Docs once they have stabilized.

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update


kirankumarkolli · 2020-05-11T12:33:05Z

+There is a race condition with container creation. An application instance is trying to access the container before container creation is complete. The most common scenario for this is when the container is deleted and recreated with the same name while the application is running. The SDK will attempt to use the new container, but the container creation is still in progress so it does not have the keys.
+
+#### Symptoms
+A 401 MAC signature issue is seen shortly after a container creation, and only occurs until the container creation is completed.


Also, let's cover the clock-skew issue as well (not frequent of late, but possible) (NON BLOCKING).

@milismsft any others?
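
For the clock-skew case mentioned above, a minimal sketch of how a client could compare its local clock with the `Date` header of any HTTP response from the service. The function names and the 5-minute tolerance are assumptions made for illustration; they are not taken from the SDK or from the troubleshooting pages in this PR.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumption: illustrative tolerance only; consult the service
# documentation for the skew the service actually accepts.
MAX_SKEW = timedelta(minutes=5)


def clock_skew(server_date_header: str, local_now: datetime) -> timedelta:
    """Absolute difference between the local clock and an HTTP Date header."""
    server_time = parsedate_to_datetime(server_date_header)
    return abs(local_now - server_time)


def is_skew_suspect(server_date_header: str,
                    local_now: datetime,
                    max_skew: timedelta = MAX_SKEW) -> bool:
    """True when the local clock differs from the server by more than max_skew."""
    return clock_skew(server_date_header, local_now) > max_skew


if __name__ == "__main__":
    header = "Mon, 11 May 2020 12:33:05 GMT"  # sample value, not real telemetry
    print(is_skew_suspect(header, datetime.now(timezone.utc)))
```

If a check like this reports a large skew, synchronizing the machine clock is a cheaper first step than investigating keys or tokens.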

kirankumarkolli · 2020-05-11T12:41:16Z

+## Troubleshooting steps
+These are the known causes for this issue.
+
+### High CPU utilization


Also, the most practical scenario might be a cross-partition query doing much more work.

I haven't seen this scenario yet. It can be added later if we actually have incidents that were caused by it.

kirankumarkolli · 2020-05-11T12:42:09Z

+Follow the [performance tips](https://docs.microsoft.com/azure/cosmos-db/performance-tips), and use a single CosmosClient instance across an entire process.|
+
+### Hot partition key
+Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. One partition is having all of its resources consumed while other partitions go unused. Check portal metrics to see if the workload is encountering a hot [partition key](https://docs.microsoft.com/azure/cosmos-db/partition-data). This will cause the aggregate consumed throughput (RU/s) to appear to be under the provisioned RUs, but a single partition's consumed throughput (RU/s) will exceed the provisioned throughput.


Is it related to 408?

A client request timeout could be caused by a hot partition.
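
The hot-partition arithmetic in the quoted guide text can be illustrated with a small sketch. It assumes, as the guide states, that provisioned throughput is split evenly across physical partitions. The function name, the partition labels, and the sample numbers are hypothetical; real per-partition figures would come from the portal metrics.

```python
from typing import Dict, List, Tuple


def find_hot_partitions(provisioned_rus: float,
                        load_by_partition: Dict[str, float]) -> Tuple[float, float, List[str]]:
    """Return (per-partition budget, total load, partitions over their budget).

    load_by_partition maps a physical partition label to the RU/s the
    workload is directing at it (hypothetical input for illustration).
    """
    # Even split of the provisioned throughput across physical partitions.
    budget = provisioned_rus / len(load_by_partition)
    total = sum(load_by_partition.values())
    hot = sorted(p for p, rus in load_by_partition.items() if rus > budget)
    return budget, total, hot


if __name__ == "__main__":
    # 10,000 RU/s over 4 partitions gives each partition 2,500 RU/s.
    load = {"p0": 3200.0, "p1": 300.0, "p2": 250.0, "p3": 150.0}
    budget, total, hot = find_hot_partitions(10_000, load)
    # The total (3,900) is far below 10,000, yet p0 exceeds its 2,500 share.
    print(budget, total, hot)
```

This is why aggregate RU/s can look healthy while requests against one partition key are still throttled or slow.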


FabianMeiswinkel · 2020-06-02T17:52:24Z

+
+    Symptoms: A 401 MAC signature issue is seen shortly after a container creation, and only occurs until the container creation is completed.
+
+    Cause: There is a race condition with container creation. An application instance is trying to access the container before container creation is complete. The most common scenario for this is when the container is deleted and recreated with the same name while the application is running. The SDK will attempt to use the new container, but the container creation is still in progress so it does not have the keys.


What is the best practice to handle this scenario (I usually recommend not reusing names (like avoiding the drop/recreate scenario)) - what do others recommend?

That's a good point. This is pretty much what already exists in the current public TSG. We can update it based on feedback.

FabianMeiswinkel · 2020-06-02T17:53:44Z

+
+    Fix: Change the Time To Live to prevent the item from getting purged.
+
+### 4. Lazy indexing


Shouldn't exist anymore? Didn't we deprecate lazy indexing?

This is still in the official version, so I'm going to leave it. It's possible that users on old SDKs, or in some other scenario, still have it.

FabianMeiswinkel · 2020-06-02T17:55:43Z

+
+## Issue
+
+'Request rate too large' or error code 429 indicates that your requests are being throttled, because the consumed throughput (RU/s) has exceeded the [provisioned throughput](https://docs.microsoft.com/azure/cosmos-db/set-throughput). The SDK will automatically retry requests based on the specified retry policy. If you get this failure often, consider increasing the throughput on the collection. Check the portal's metrics to see if you are getting 429 errors. Review your partition key to ensure it results in an [even distribution of storage and request volume](https://docs.microsoft.com/azure/cosmos-db/partition-data).


I think it would be good to expand here, e.g. how do I check with the on-board telemetry whether the distribution is even? Don't block on this; we can iterate on it later. But given how commonly this comes up, I think this should be a very granular walkthrough of how to troubleshoot.
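
To make the retry behavior described in the quoted text concrete, here is a rough sketch of a bounded retry loop for throttled requests. It is not the SDK's retry policy: `ThrottledError`, `call_with_throttle_retry`, and the default limits are hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for a 429 response; not an SDK type."""

    def __init__(self, retry_after_seconds: float) -> None:
        super().__init__("429 Request rate too large")
        self.retry_after_seconds = retry_after_seconds


def call_with_throttle_retry(operation: Callable[[], T],
                             max_retries: int = 9,
                             max_total_wait_seconds: float = 30.0,
                             sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry a throttled operation, honoring the server-suggested delay.

    Gives up (re-raises) once the retry count or the cumulative wait
    budget would be exceeded. Both limits are illustrative assumptions.
    """
    total_wait = 0.0
    attempt = 0
    while True:
        try:
            return operation()
        except ThrottledError as err:
            attempt += 1
            over_budget = total_wait + err.retry_after_seconds > max_total_wait_seconds
            if attempt > max_retries or over_budget:
                raise
            sleep(err.retry_after_seconds)
            total_wait += err.retry_after_seconds
```

Retrying only hides occasional throttling. If the loop exhausts its budget regularly, that matches the guide's advice to raise the provisioned throughput or revisit the partition key.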

Jake Willey and others added 12 commits February 28, 2020 10:01

Adding initial troubleshooting pages.

de903b4

Updated id based on feedback

1def3af

Update TroubleshootingGuides/Cosmos503_9000SNATPortExhaustion.md

8f0d568

Co-Authored-By: Matias Quaranta <ealsur@users.noreply.github.com>

Update TroubleshootingGuides/Cosmos503_0000RequestTimeout.md

a8afbc7

Co-Authored-By: Matias Quaranta <ealsur@users.noreply.github.com>

Update TroubleshootingGuides/Cosmos503_0000RequestTimeout.md

f6b831d

Co-Authored-By: Matias Quaranta <ealsur@users.noreply.github.com>

Update TroubleshootingGuides/Cosmos429_0000RequestRateTooLarge.md

c773d02

Co-Authored-By: Matias Quaranta <ealsur@users.noreply.github.com>

Update TroubleshootingGuides/Cosmos404_0000NotFound.md

35928da

Co-Authored-By: Matias Quaranta <ealsur@users.noreply.github.com>

Update TroubleshootingGuides/Cosmos304_0000NotModified.md

72267bc

Co-Authored-By: Matias Quaranta <ealsur@users.noreply.github.com>

Update TroubleshootingGuides/Cosmos503_0000RequestTimeout.md

575e9de

Co-Authored-By: Matias Quaranta <ealsur@users.noreply.github.com>

Updated based on feedback

300c210

Merge remote-tracking branch 'origin/master' into users/jawilley/diag…

a7d4c3b

…nostics/tsg_pages

Added link to exception and update files to just use names

f316b68

j82w added the Diagnostics label (Issues around diagnostics and troubleshooting) Mar 25, 2020

j82w requested review from kirankumarkolli and kirillg as code owners March 25, 2020 15:30

j82w self-assigned this Mar 25, 2020

j82w changed the title ~~Users/jawilley/diagnostics/tsg pages~~ Adding Troubleshooting guide pages and links to CosmosException Mar 25, 2020

Updated naming

46aeead

ealsur previously approved these changes Mar 25, 2020


Comment thread TroubleshootingGuides/CosmosNotFound.md Outdated

Comment thread TroubleshootingGuides/CosmosRequestRateTooLarge.md Outdated

Update TroubleshootingGuides/CosmosRequestRateTooLarge.md

830e3e6

Co-Authored-By: Matias Quaranta <ealsur@users.noreply.github.com>

j82w dismissed ealsur’s stale review via 830e3e6 March 25, 2020 16:54

Jake Willey added 4 commits March 25, 2020 11:59

Adding tests

dfa1c44

Merging to latest

c9aafc8

Merge branch 'users/jawilley/diagnostics/tsg_pages' of https://github…

ac69f79

….com/Azure/azure-cosmos-dotnet-v3 into users/jawilley/diagnostics/tsg_pages

Updated not found based on comments

304c32c

bchong95 reviewed Mar 30, 2020


Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs Outdated

bchong95 reviewed Mar 30, 2020


Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs Outdated

bchong95 reviewed Mar 30, 2020


Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs Outdated

bchong95 reviewed Mar 30, 2020


Comment thread TroubleshootingGuides/CosmosClientRequestTimeout.md Outdated

bchong95 reviewed Mar 30, 2020


Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs Outdated

Merge branch 'users/jawilley/diagnostics/tsg_pages' of https://github…

06e21b0

….com/Azure/azure-cosmos-dotnet-v3 into users/jawilley/diagnostics/tsg_pages

j82w requested review from khdang and sboshra as code owners April 29, 2020 22:05

Merge branch 'master' into users/jawilley/diagnostics/tsg_pages

f5d156c