Diagnostics: Add Troubleshooting guide pages #1302
Co-Authored-By: Matias Quaranta <ealsur@users.noreply.github.com>
…nostics/tsg_pages
….com/Azure/azure-cosmos-dotnet-v3 into users/jawilley/diagnostics/tsg_pages
| There is a race condition with container creation. An application instance is trying to access the container before container creation is complete. The most common scenario is that the container is deleted and recreated with the same name while the application is running. The SDK will attempt to use the new container, but container creation is still in progress, so it does not have the keys. | ||
| #### Symptoms | ||
| A 401 MAC signature error is seen shortly after a container is created, and occurs only until container creation is complete. |
Also, let's cover the clock-skew issue as well (not frequent of late, but possible) (NON BLOCKING).
@milismsft any others?
| ## Troubleshooting steps | ||
| These are the known causes for this issue. | ||
| ### High CPU utilization |
Also, the most practical scenario might be a cross-partition query doing much more work.
I haven't seen this scenario yet. It can be added later if we actually have incidents caused by it.
| Follow the [performance tips](https://docs.microsoft.com/azure/cosmos-db/performance-tips), and use a single CosmosClient instance across an entire process.| | ||
| ### Hot partition key | ||
| Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. With a hot partition key, one partition has all of its resources consumed while other partitions go unused. Check portal metrics to see if the workload is encountering a hot [partition key](https://docs.microsoft.com/azure/cosmos-db/partition-data). This causes the aggregate consumed throughput (RU/s) to appear to be under the provisioned RU/s, while a single partition's consumed throughput (RU/s) exceeds the throughput provisioned for that partition. |
A client request timeout could also be caused by a hot partition.
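To make the hot partition arithmetic in the diff above concrete, here is a minimal sketch in Python. It is not SDK code: the function name `find_hot_partitions`, the partition ids, and all RU/s figures are made up for illustration, and it assumes only what the text states, namely that provisioned throughput is split evenly across physical partitions.

```python
def find_hot_partitions(provisioned_rus, consumed_per_partition):
    """Return (aggregate_consumed, per_partition_budget, hot_partition_ids).

    provisioned_rus: total provisioned throughput in RU/s.
    consumed_per_partition: dict of partition id -> consumed RU/s.
    """
    if not consumed_per_partition:
        raise ValueError("at least one partition is required")

    # Assumption taken from the text: the provisioned throughput is
    # distributed evenly across the physical partitions.
    per_partition_budget = provisioned_rus / len(consumed_per_partition)

    aggregate_consumed = sum(consumed_per_partition.values())

    # A partition is "hot" when it consumes more than its even share.
    hot_partition_ids = sorted(
        pid
        for pid, used in consumed_per_partition.items()
        if used > per_partition_budget
    )
    return aggregate_consumed, per_partition_budget, hot_partition_ids


# Hypothetical workload: 10,000 RU/s provisioned over four partitions.
consumed = {"p0": 2900.0, "p1": 150.0, "p2": 100.0, "p3": 50.0}
aggregate, budget, hot = find_hot_partitions(10_000, consumed)
print(f"aggregate={aggregate} RU/s, per-partition budget={budget} RU/s, hot={hot}")
```

With these numbers the aggregate consumption is well under the provisioned total, yet one partition is over its even share, which is the pattern the page asks readers to look for in the portal metrics.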
| Symptoms: A 401 MAC signature error is seen shortly after a container is created, and occurs only until container creation is complete. | ||
| Cause: There is a race condition with container creation. An application instance is trying to access the container before container creation is complete. The most common scenario is that the container is deleted and recreated with the same name while the application is running. The SDK will attempt to use the new container, but container creation is still in progress, so it does not have the keys. |
What is the best practice to handle this scenario (I usually recommend not reusing names (like avoiding the drop/recreate scenario)) - what do others recommend?
That's a good point. This is pretty much what already exists in the current public TSG. We can update it based on feedback.
| Fix: Change the Time To Live to prevent the item from getting purged. | ||
| ### 4. Lazy indexing |
Shouldn't exist anymore? Didn't we deprecate lazy indexing?
This is still in the official version so I'm going to leave it. It's possible with old SDKs or some other scenario users might still have it.
| ## Issue | ||
| 'Request rate too large' or error code 429 indicates that your requests are being throttled, because the consumed throughput (RU/s) has exceeded the [provisioned throughput](https://docs.microsoft.com/azure/cosmos-db/set-throughput). The SDK will automatically retry requests based on the specified retry policy. If you get this failure often, consider increasing the throughput on the collection. Check the portal's metrics to see if you are getting 429 errors. Review your partition key to ensure it results in an [even distribution of storage and request volume](https://docs.microsoft.com/azure/cosmos-db/partition-data). |
I think it would be good to extend this, for example with how to check via the on-board telemetry whether the distribution is even. Don't block on this; we can iterate on it later. But given how commonly this comes up, I think this should be a very granular walkthrough of how to troubleshoot.
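For readers unfamiliar with what "automatically retry requests based on the specified retry policy" means in practice, the general retry-on-429 pattern can be sketched as follows. This is a generic Python illustration, not the SDK's implementation: `ThrottledError`, `call_with_retry`, and the default limits are hypothetical names and values chosen for the example.

```python
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 429 'Request rate too large' response."""

    def __init__(self, retry_after_seconds):
        super().__init__("429: Request rate too large")
        self.retry_after_seconds = retry_after_seconds


def call_with_retry(operation, max_retries=9, max_wait_seconds=30.0, sleep=time.sleep):
    """Run operation(), retrying when it is throttled.

    Waits for the server-suggested delay before each retry and gives up
    (re-raising the error) once either the retry count or the cumulative
    wait budget would be exceeded. Limits here are illustrative only.
    """
    retries = 0
    waited = 0.0
    while True:
        try:
            return operation()
        except ThrottledError as err:
            retries += 1
            over_count = retries > max_retries
            over_time = waited + err.retry_after_seconds > max_wait_seconds
            if over_count or over_time:
                raise
            sleep(err.retry_after_seconds)
            waited += err.retry_after_seconds
```

If the error still surfaces after the retries are exhausted, that matches the guidance in the diff above: the workload is consistently exceeding the provisioned throughput, so the next step is to check the metrics and the partition key distribution rather than retry harder.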
Pull Request Template
Description
Adding an initial small set of troubleshooting pages. These pages will be linked from the service exception messages and from the client-generated exceptions. These pages will eventually move to Azure Docs once they have stabilized.
Type of change
Please delete options that are not relevant.