Merged
Commits
28 commits
de903b4
Adding initial troubleshooting pages.
Feb 28, 2020
1def3af
Updated id based on feedback
Feb 28, 2020
8f0d568
Update TroubleshootingGuides/Cosmos503_9000SNATPortExhaustion.md
j82w Mar 17, 2020
a8afbc7
Update TroubleshootingGuides/Cosmos503_0000RequestTimeout.md
j82w Mar 17, 2020
f6b831d
Update TroubleshootingGuides/Cosmos503_0000RequestTimeout.md
j82w Mar 17, 2020
c773d02
Update TroubleshootingGuides/Cosmos429_0000RequestRateTooLarge.md
j82w Mar 17, 2020
35928da
Update TroubleshootingGuides/Cosmos404_0000NotFound.md
j82w Mar 17, 2020
72267bc
Update TroubleshootingGuides/Cosmos304_0000NotModified.md
j82w Mar 17, 2020
575e9de
Update TroubleshootingGuides/Cosmos503_0000RequestTimeout.md
j82w Mar 17, 2020
300c210
Updated based on feedback
Mar 25, 2020
a7d4c3b
Merge remote-tracking branch 'origin/master' into users/jawilley/diag…
Mar 25, 2020
f316b68
Added link to exception and update files to just use names
Mar 25, 2020
46aeead
Updated naming
Mar 25, 2020
830e3e6
Update TroubleshootingGuides/CosmosRequestRateTooLarge.md
j82w Mar 25, 2020
dfa1c44
Adding tests
Mar 25, 2020
c9aafc8
Merging to latest
Mar 30, 2020
ac69f79
Merge branch 'users/jawilley/diagnostics/tsg_pages' of https://github…
Mar 30, 2020
304c32c
Updated not found based on comments
Mar 30, 2020
864d1c7
Revert changes
Apr 23, 2020
cbf4ecb
Merge remote-tracking branch 'origin/master' into users/jawilley/diag…
Apr 23, 2020
2690410
Updated files based on feedback
Apr 23, 2020
08687ad
Merge branch 'master' into users/jawilley/diagnostics/tsg_pages
j82w Apr 24, 2020
e7aafa5
Add more TSGs and update docs
Apr 29, 2020
06e21b0
Merge branch 'users/jawilley/diagnostics/tsg_pages' of https://github…
Apr 29, 2020
f5d156c
Merge branch 'master' into users/jawilley/diagnostics/tsg_pages
j82w Apr 29, 2020
16b5f7f
Updated documentation based on feedback
May 12, 2020
1c6feb1
Merge branch 'users/jawilley/diagnostics/tsg_pages' of https://github…
May 12, 2020
eb16297
Fixed formatting
May 12, 2020
@@ -242,4 +242,4 @@ private string ToStringHelper(
return stringBuilder.ToString();
}
}
}
}
@@ -1304,7 +1304,8 @@ public async Task ItemReplaceAsyncTest()
partitionKey: new Cosmos.PartitionKey(originalStatus),
item: testItem);
Assert.Fail("Replace changing partition key is not supported.");
- }catch(CosmosException ce)
+ }
+ catch (CosmosException ce)
{
Assert.AreEqual((HttpStatusCode)400, ce.StatusCode);
}
32 changes: 32 additions & 0 deletions TroubleshootingGuides/CosmosMacSignature.md
@@ -0,0 +1,32 @@
## CosmosUnauthorized

| | |
|---|---|
|TypeName|CosmosUnauthorized|
|Status|401_0000|
|Category|Service|

## Description
HTTP 401: The MAC signature found in the HTTP request is not the same as the computed signature.

If you receive the 401 error message "The MAC signature found in the HTTP request is not the same as the computed signature.", it can be caused by one of the following scenarios.

## Troubleshooting steps

### 1. Key was not properly rotated.

Symptoms: The 401 MAC signature error is seen shortly after a key rotation and eventually stops without any changes.

Cause: The key was rotated without following the [best practices](secure-access-to-data.md#key-rotation). This is the most common cause. Cosmos DB account key rotation can take anywhere from a few seconds to possibly days, depending on the Cosmos DB account size.

### 2. The key is misconfigured

Symptoms: The 401 MAC signature error is consistent and occurs for all calls that use the key.

Cause: The key is misconfigured in the application, so it does not match the account, or the entire key was not copied.


### 3. Race condition with create container

Symptoms: The 401 MAC signature error is seen shortly after a container is created and occurs only until container creation completes.

Cause: There is a race condition with container creation. An application instance is trying to access the container before container creation is complete. The most common scenario is that the container is deleted and recreated with the same name while the application is running. The SDK attempts to use the new container, but container creation is still in progress, so the container does not have the keys yet.
Member

What is the best practice to handle this scenario (I usually recommend not reusing names (like avoiding the drop/recreate scenario)) - what do others recommend?

Contributor Author

That's a good point. This is pretty much what already exists in the current public TSG. We can update it based on feedback.

42 changes: 42 additions & 0 deletions TroubleshootingGuides/CosmosNotFound.md
@@ -0,0 +1,42 @@
## CosmosNotFound

| | |
|---|---|
|TypeName|CosmosNotFound|
|Status|404_0000|
|Category|Service|

## Description

This status code indicates that the resource no longer exists.

## Known causes

The document exists, but the request still returns a 404.

### 1. Race condition
Cause: There are multiple SDK client instances, and the read happened before the write.

Fix:
1. With session consistency, the create item operation returns a session token that can be passed between SDK instances to guarantee that the read request reads from a replica that has the change.
2. Change the [consistency level](https://docs.microsoft.com/azure/cosmos-db/consistency-levels-choosing) to a [stronger level](https://docs.microsoft.com/azure/cosmos-db/consistency-levels-tradeoffs).

### 2. Invalid Partition Key and ID combination
Cause: The partition key and id combination is not valid.

Fix: Fix the application logic that is causing the incorrect combination.

### 3. TTL purge
Cause: The item had the [Time To Live (TTL)](https://docs.microsoft.com/azure/cosmos-db/time-to-live) property set. The item was purged because the time to live had expired.

Fix: Change the Time To Live to prevent the item from getting purged.
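
As a rough illustration of the purge rule, the sketch below models when an item becomes eligible for expiry. It assumes the TTL semantics described in the linked documentation: a value in seconds counted from the item's last-modified timestamp, `-1` meaning "never expire", and an item-level value honored only when the container has a default TTL. The function name and parameters are hypothetical, and this is a model for reasoning about 404s, not SDK code.

```python
from typing import Optional


def is_expired(last_modified_ts: int, now_ts: int,
               container_default_ttl: Optional[int],
               item_ttl: Optional[int] = None) -> bool:
    """Model of TTL expiry; all timestamps and TTLs are in seconds."""
    # Assumption: TTL is off for the whole container when no default is set.
    if container_default_ttl is None:
        return False
    # An item-level TTL overrides the container default.
    ttl = item_ttl if item_ttl is not None else container_default_ttl
    # -1 means the item never expires.
    if ttl == -1:
        return False
    return now_ts - last_modified_ts >= ttl
```

A read issued after the point where this model returns `True` can get a 404 even though the earlier write succeeded.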

### 4. Lazy indexing
Member

Shouldn't exist anymore? Didn't we deprecate lazy indexing?

Contributor Author

This is still in the official version so I'm going to leave it. It's possible with old SDKs or some other scenario users might still have it.

Cause: [Lazy indexing](https://docs.microsoft.com/azure/cosmos-db/index-policy#indexing-mode) has not caught up.

Fix: Wait for the indexing to catch up, or change the indexing policy.

### 5. Parent resource deleted
Cause: The database and/or container that the item exists in has been deleted.

Fix: [Restore](https://docs.microsoft.com/azure/cosmos-db/online-backup-and-restore#backup-retention-period) the parent resource or recreate the resources.
14 changes: 14 additions & 0 deletions TroubleshootingGuides/CosmosNotModified.md
@@ -0,0 +1,14 @@
## CosmosNotModified

| | |
|---|---|
|TypeName|CosmosNotModified|
|Status|304_0000|
|Category|Service|

## Description

In change feed, this status code simply means there are no new items to process. This is expected, and the SDK is designed to handle it.

## Related documentation
* [Change feed overview](https://docs.microsoft.com/azure/cosmos-db/change-feed)
33 changes: 33 additions & 0 deletions TroubleshootingGuides/CosmosRequestHeaderTooLarge.md
@@ -0,0 +1,33 @@
## CosmosRequestHeaderTooLarge

| | |
|---|---|
|TypeName|CosmosRequestHeaderTooLarge|
|Status|400_0000|
|Category|Service|

## Description
The size of the header has grown too large and exceeds the maximum allowed size. It's always recommended to use the latest SDK. Make sure to use at least version 3.x or 2.x, which adds header size tracing to the exception message.

## Troubleshooting steps

### 1. Session Token too large
Symptoms: The 400 bad request is happening on point operations where the continuation token is not being used. The exception started without making any changes to the application.

Cause: The session token grows as the number of partitions in the container increases. The number of partitions increases as the amount of data grows or when the throughput is increased.

Temporary mitigation: Restarting the application resets all session tokens. The session token will eventually grow back to the size that causes the issue.

Fixes:
1. Follow the performance tips and convert the application to Direct + TCP connection mode. Direct + TCP does not have the header size restriction that HTTP has, which avoids this issue. Make sure to use an SDK version greater than 2.9.3, which has a fix for query operations when the service interop is not available.
2. If the application cannot be converted to Direct + TCP and the session token is the cause, then mitigation can be done by changing the client consistency level. The session token is only used for session consistency, which is the default for Cosmos DB. Any other consistency level will not use the session token.


### 2. Continuation token too large
Symptoms: The 400 bad request is happening on query operations where the continuation token is being passed in.

Cause: The continuation token has grown too large. Different queries will have different continuation token sizes.

Fixes:
1. Follow the performance tips and convert the application to Direct + TCP connection mode. Direct + TCP does not have the header size restriction like HTTP does which avoids this issue.
2. If the application cannot be converted to Direct + TCP and the continuation token is the cause, then try setting the ResponseContinuationTokenLimitInKb option. The option can be found in the FeedOptions for v2 or the QueryRequestOptions in v3.
19 changes: 19 additions & 0 deletions TroubleshootingGuides/CosmosRequestRateTooLarge.md
@@ -0,0 +1,19 @@
## CosmosRequestRateTooLarge

| | |
|---|---|
|TypeName|CosmosRequestRateTooLarge|
|Status|429_0000|
|Category|Service|

## Issue

'Request rate too large' or error code 429 indicates that your requests are being throttled, because the consumed throughput (RU/s) has exceeded the [provisioned throughput](https://docs.microsoft.com/azure/cosmos-db/set-throughput). The SDK will automatically retry requests based on the specified retry policy. If you get this failure often, consider increasing the throughput on the collection. Check the portal's metrics to see if you are getting 429 errors. Review your partition key to ensure it results in an [even distribution of storage and request volume](https://docs.microsoft.com/azure/cosmos-db/partition-data).
Member

I think it would be good to extend here - like how do I check with the on-board telemetry whether distribution is even etc. Don't block on this - we can iterate on it later. But given how common this comes up I think this should be a very granular walkthrough of how to troubleshoot.


## Solution

Use the portal or the SDK to increase the provisioned throughput.
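
The Issue section notes that throttled requests are retried according to a retry policy. To make that behavior concrete, here is a minimal, language-neutral sketch of such a policy: wait for the delay suggested by the 429 response, then try again, up to a fixed number of retries. `ThrottledError`, `call_with_retry`, and the default of three retries are hypothetical names and values chosen for illustration; this is not the SDK's implementation.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for a 429 response carrying a retry-after hint."""

    def __init__(self, retry_after_s: float):
        super().__init__("Request rate too large (429)")
        self.retry_after_s = retry_after_s


def call_with_retry(operation: Callable[[], T], max_retries: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, waiting for the suggested delay after each 429."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ThrottledError as err:
            if attempt == max_retries:
                raise  # retries exhausted; surface the 429 to the caller
            sleep(err.retry_after_s)
    raise AssertionError("unreachable")
```

If retries are exhausted regularly, retrying harder only adds latency; that is the signal to increase the provisioned throughput or revisit the partition key.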

## Related documentation
* [Provision throughput on containers and databases](https://docs.microsoft.com/azure/cosmos-db/set-throughput)
* [Request units in Azure Cosmos DB](https://docs.microsoft.com/azure/cosmos-db/request-units)
45 changes: 45 additions & 0 deletions TroubleshootingGuides/CosmosRequestTimeoutClient.md
@@ -0,0 +1,45 @@
## CosmosRequestTimeoutClient

| | |
|---|---|
|TypeName|CosmosRequestTimeoutClient|
|Status|408_0000|
|Category|Connectivity|


## Issue

The SDK was not able to connect to the Azure Cosmos DB service.

## Troubleshooting steps
These are the known causes for this issue.

### 1. High CPU utilization (most common case)
Cause: For optimal latency, CPU usage should stay at roughly 40%. Look at CPU utilization at 10-second intervals; with larger intervals, CPU spikes can be missed because they are averaged in with lower values. This issue is more common with cross-partition queries, which might open multiple connections for a single request.

Fix: The application should be scaled up/out.
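
A small worked example shows why the sampling interval matters. The readings below are made-up values, one per 10-second interval; averaged over a full minute, the same data hides two spikes that are obvious at the finer granularity.

```python
from statistics import mean

# Hypothetical CPU readings (%), one per 10-second interval over one minute.
samples_10s = [35, 38, 98, 97, 36, 34]

# A 60-second interval reports only the average of the same readings.
average_60s = mean(samples_10s)

# At 10-second granularity the spikes are visible individually.
spikes = [s for s in samples_10s if s > 90]

print(f"60 s average: {average_60s:.1f}%")
print(f"10 s samples above 90%: {spikes}")
```

With these numbers the one-minute average lands in the mid-50s, which looks only moderately busy, while two of the six 10-second samples are close to saturation.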

### 2. Socket / Port availability might be low
Cause: When running in Azure, clients using the .NET SDK can hit Azure SNAT (PAT) port exhaustion.

Fix: Follow the CosmosSNATPortExhaustion guide.

### 3. Creating multiple Client instances
Cause: This might lead to connection contention and timeout issues.

Fix: Follow the [performance tips](https://docs.microsoft.com/azure/cosmos-db/performance-tips), and use a single CosmosClient instance across an entire process.

### 4. Hot partition key
Cause: Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. With a hot partition key, one partition has all of its resources consumed while other partitions go unused. Check portal metrics to see if the workload is encountering a hot [partition key](https://docs.microsoft.com/azure/cosmos-db/partition-data). In this case the aggregate consumed throughput (RU/s) appears to be under the provisioned RU/s, but a single partition's consumed throughput (RU/s) exceeds the provisioned throughput for that partition.

Fix: The partition key should be changed to avoid the heavily used value.
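
The arithmetic behind a hot partition can be sketched with made-up numbers. The throughput, partition count, and per-partition consumption below are all hypothetical; the only assumption taken from the text above is that provisioned throughput is split evenly across physical partitions.

```python
# Hypothetical container: 10,000 RU/s spread evenly over 5 physical partitions.
provisioned_ru_s = 10_000
physical_partitions = 5
per_partition_limit = provisioned_ru_s / physical_partitions

# Hypothetical consumed RU/s per physical partition; partition 0 is hot.
consumed = [2600, 300, 250, 200, 150]

total_consumed = sum(consumed)
hot = [i for i, ru in enumerate(consumed) if ru > per_partition_limit]

print(f"Per-partition limit: {per_partition_limit:.0f} RU/s")
print(f"Aggregate consumed: {total_consumed} of {provisioned_ru_s} RU/s")
print(f"Partitions over their limit: {hot}")
```

Here the aggregate consumption is well under the provisioned total, yet partition 0 is over its share, which is why the aggregate metric alone can be misleading.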

### 5. High degree of concurrency
Cause: The application is running with a high degree of concurrency, which can lead to contention on the channel.

Fix: Try to scale the application up/out.

### 6. Large requests and/or responses
Cause: Large requests or responses can lead to head-of-line blocking on the channel and exacerbate contention, even with a relatively low degree of concurrency.

Fix: Try to scale the application up/out.
22 changes: 22 additions & 0 deletions TroubleshootingGuides/CosmosRequestTimeoutService.md
@@ -0,0 +1,22 @@
## CosmosRequestTimeoutService

| | |
|---|---|
|TypeName|CosmosRequestTimeoutService|
|Status|408_0000|
|Category|Service|

## Issue

The SDK was able to connect to the Azure Cosmos DB service, but the request timed out.

## Troubleshooting steps

### 1. Check the portal metrics
Use [Azure monitoring](https://docs.microsoft.com/azure/cosmos-db/monitor-cosmos-db) to check whether the 408 request timeout came from the service.

### 2. Failure rate is within Cosmos DB SLA
The application should be able to handle transient failures and retry when necessary.

### 3. Failure rate is violating the Cosmos DB SLA
Please contact Azure support.
25 changes: 25 additions & 0 deletions TroubleshootingGuides/CosmosSNATPortExhaustion.md
@@ -0,0 +1,25 @@
## CosmosSNATPortExhaustion

| | |
|---|---|
|TypeName|CosmosSNATPortExhaustion|
|Status|503_0000|
|Category|Connectivity|

## Issue

If your app is deployed on Azure Virtual Machines without a public IP address, by default [Azure SNAT ports](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections#preallocatedports) establish connections to any endpoint outside of your VM. The number of connections allowed from the VM to the Azure Cosmos DB endpoint is limited by the [Azure SNAT configuration](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections#preallocatedports).

Azure SNAT ports are used only when your VM has a private IP address and a process from the VM tries to connect to a public IP address.

## Troubleshooting steps

There are two workarounds to avoid the Azure SNAT limitation:

* Add your Azure Cosmos DB service endpoint to the subnet of your Azure Virtual Machines virtual network. For more information, see [Azure Virtual Network service endpoints](https://docs.microsoft.com/azure/virtual-network/virtual-network-service-endpoints-overview).

When the service endpoint is enabled, the requests are no longer sent from a public IP to Azure Cosmos DB. Instead, the virtual network and subnet identity are sent. This change might result in firewall drops if only public IPs are allowed. If you use a firewall, when you enable the service endpoint, add a subnet to the firewall by using [Virtual Network ACLs](https://docs.microsoft.com/azure/virtual-network/virtual-networks-acl).
* Assign a public IP to your Azure VM.

## Related documentation
* [Diagnose and troubleshoot issues when using Azure Cosmos DB .NET SDK](https://docs.microsoft.com/azure/cosmos-db/troubleshoot-dot-net-sdk)