Diagnostics: Add Troubleshooting guide pages #1302

Merged: j82w merged 28 commits into master from users/jawilley/diagnostics/tsg_pages on Jun 2, 2020

j82w (Contributor) commented Mar 25, 2020

Pull Request Template

Description

Adds an initial small set of troubleshooting pages. Links to these pages will be added to service exception messages and to client-generated exceptions. The pages will eventually move to Azure Docs once they have stabilized.

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Jake Willey and others added 12 commits February 28, 2020 10:01
Co-Authored-By: Matias Quaranta <ealsur@users.noreply.github.com> (repeated for 7 commits)
j82w added the Diagnostics label (Issues around diagnostics and troubleshooting) Mar 25, 2020
j82w self-assigned this Mar 25, 2020
j82w changed the title from "Users/jawilley/diagnostics/tsg pages" to "Adding Troubleshooting guide pages and links to CosmosException" Mar 25, 2020
ealsur previously approved these changes Mar 25, 2020
Comment thread TroubleshootingGuides/CosmosNotFound.md Outdated
Comment thread TroubleshootingGuides/CosmosRequestRateTooLarge.md Outdated
Co-Authored-By: Matias Quaranta <ealsur@users.noreply.github.com>
Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs Outdated
Comment thread TroubleshootingGuides/CosmosClientRequestTimeout.md Outdated
Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs Outdated
j82w requested review from khdang and sboshra as code owners April 29, 2020 22:05
Comment thread TroubleshootingGuides/CosmosMacSignature.md Outdated
Comment thread TroubleshootingGuides/CosmosMacSignature.md Outdated
Comment thread TroubleshootingGuides/CosmosMacSignature.md Outdated
There is a race condition with container creation. An application instance is trying to access the container before container creation is complete. The most common scenario is a container being deleted and recreated with the same name while the application is running. The SDK will attempt to use the new container, but container creation is still in progress, so it does not have the keys.

#### Symptoms
A 401 MAC signature error is seen shortly after a container is created, and occurs only until container creation completes.
Member:

Let's also cover the clock-skew issue (not frequent of late, but possible). Non-blocking.

@milismsft any others?

Comment thread TroubleshootingGuides/CosmosNotFound.md Outdated
Comment thread TroubleshootingGuides/CosmosNotFound.md Outdated
Comment thread TroubleshootingGuides/CosmosRequestHeaderTooLarge.md Outdated
Comment thread TroubleshootingGuides/CosmosRequestHeaderTooLarge.md Outdated
Comment thread TroubleshootingGuides/CosmosRequestRateTooLarge.md
Comment thread TroubleshootingGuides/CosmosRequestTimeoutClient.md Outdated
## Troubleshooting steps
These are the known causes for this issue.

### High CPU utilization
Member:

Also, the most practical scenario might be a cross-partition query doing much more work.

Contributor, Author:

I haven't seen this scenario yet. It can be added later if we actually have incidents caused by it.

Follow the [performance tips](https://docs.microsoft.com/azure/cosmos-db/performance-tips), and use a single CosmosClient instance across an entire process.

### Hot partition key
Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. With a hot partition key, one partition has all of its resources consumed while other partitions go unused. Check portal metrics to see if the workload is encountering a hot [partition key](https://docs.microsoft.com/azure/cosmos-db/partition-data). A hot partition causes the aggregate consumed throughput (RU/s) to appear to be under the provisioned RU/s, while a single partition's consumed throughput (RU/s) exceeds that partition's share of the provisioned throughput.
Member:

Is it related to 408?

Contributor, Author:

A client request timeout could be caused by a hot partition.
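
The hot-partition effect described in the quoted guide text can be sketched with simple arithmetic. This is only an illustration: the partition names, the per-partition consumption figures, and the helper `find_hot_partitions` are hypothetical and are not part of the SDK or of this PR. The only assumption carried over from the guide is that provisioned throughput is split evenly across physical partitions.

```python
# Hypothetical numbers, chosen only to illustrate the effect described in the guide.
PROVISIONED_RU_PER_SEC = 10_000

# Consumed RU/s per physical partition (made-up sample data).
consumed_by_partition = {"p0": 2_400, "p1": 150, "p2": 100, "p3": 50, "p4": 300}


def find_hot_partitions(provisioned, consumed):
    """Return partitions whose consumption exceeds their even share of throughput."""
    # Assumption from the guide: throughput is distributed evenly across partitions.
    per_partition_budget = provisioned / len(consumed)
    return {pid: ru for pid, ru in consumed.items() if ru > per_partition_budget}


aggregate_consumed = sum(consumed_by_partition.values())
hot_partitions = find_hot_partitions(PROVISIONED_RU_PER_SEC, consumed_by_partition)

# The aggregate looks healthy, yet one partition is over its own budget.
print(f"aggregate consumed: {aggregate_consumed} of {PROVISIONED_RU_PER_SEC} RU/s")
print(f"hot partitions: {hot_partitions}")
```

In this sample the account as a whole uses well under its provisioned throughput, but `p0` alone exceeds its even share, which is the pattern the guide asks readers to look for in the portal metrics.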

Comment thread TroubleshootingGuides/CosmosRequestTimeoutClient.md Outdated
Comment thread TroubleshootingGuides/CosmosRequestTimeoutService.md
Comment thread TroubleshootingGuides/CosmosSNATPortExhaustion.md
j82w mentioned this pull request May 27, 2020

Symptoms: A 401 MAC signature error is seen shortly after a container is created, and occurs only until container creation completes.

Cause: There is a race condition with container creation. An application instance is trying to access the container before container creation is complete. The most common scenario is a container being deleted and recreated with the same name while the application is running. The SDK will attempt to use the new container, but container creation is still in progress, so it does not have the keys.
Member:

What is the best practice for handling this scenario? I usually recommend not reusing names (that is, avoiding the drop/recreate scenario). What do others recommend?

Contributor, Author:

That's a good point. This is pretty much what already exists in the current public TSG. We can update it based on feedback.


Fix: Change the Time To Live to prevent the item from getting purged.

### 4. Lazy indexing
Member:

Shouldn't exist anymore? Didn't we deprecate lazy indexing?

Contributor, Author:

This is still in the official version, so I'm going to leave it. With old SDKs or in some other scenario, users might still have it.


## Issue

'Request rate too large' or error code 429 indicates that your requests are being throttled, because the consumed throughput (RU/s) has exceeded the [provisioned throughput](https://docs.microsoft.com/azure/cosmos-db/set-throughput). The SDK will automatically retry requests based on the specified retry policy. If you get this failure often, consider increasing the throughput on the collection. Check the portal's metrics to see if you are getting 429 errors. Review your partition key to ensure it results in an [even distribution of storage and request volume](https://docs.microsoft.com/azure/cosmos-db/partition-data).
Member:

I think it would be good to expand here, for example on how to check with the on-board telemetry whether the distribution is even. Don't block on this; we can iterate on it later. But given how often this comes up, I think this should be a very granular walkthrough of how to troubleshoot.
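
As a rough illustration of what "retry requests based on the specified retry policy" means for a 429, here is a minimal sketch of a bounded retry loop. It is not the SDK's implementation: `ThrottledError`, `call_with_retries`, and the limit values are hypothetical names and placeholders invented for this example, and the sketch assumes the throttled response carries a suggested wait time.

```python
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 429 response; not an SDK type."""

    def __init__(self, retry_after_seconds):
        super().__init__("429: request rate too large")
        self.retry_after_seconds = retry_after_seconds


def call_with_retries(operation, max_retries=5, max_total_wait_seconds=10.0,
                      sleep=time.sleep):
    """Run operation, retrying on ThrottledError within placeholder limits."""
    total_wait = 0.0
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ThrottledError as err:
            out_of_retries = attempt == max_retries
            over_budget = total_wait + err.retry_after_seconds > max_total_wait_seconds
            if out_of_retries or over_budget:
                raise  # Surface the 429 to the caller once the policy is exhausted.
            sleep(err.retry_after_seconds)
            total_wait += err.retry_after_seconds


# Usage: an operation that is throttled twice and then succeeds.
attempts = {"count": 0}


def flaky_read():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise ThrottledError(retry_after_seconds=0.01)
    return "item"


waits = []  # Record the waits instead of sleeping, to keep the example fast.
result = call_with_retries(flaky_read, sleep=waits.append)
print(result, waits)
```

If throttling persists after the retry budget is spent, the error reaches the application, which is the point at which the guide suggests checking metrics and considering more throughput or a better partition key.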

j82w merged commit e9823a6 into master Jun 2, 2020
j82w deleted the users/jawilley/diagnostics/tsg_pages branch June 2, 2020 18:02
5 participants