From de903b4ad5084b9e9edd8b30b323221892c33a0f Mon Sep 17 00:00:00 2001 From: Jake Willey Date: Fri, 28 Feb 2020 10:01:53 -0800 Subject: [PATCH 01/20] Adding initial troubleshooting pages. --- .../Cosmos1000RequestRateTooLarge.md | 24 ++++++++ .../Cosmos1001NotModified.md | 20 ++++++ TroubleshootingGuides/Cosmos1002NotFound.md | 34 +++++++++++ .../Cosmos5000RequestTimeout.md | 61 +++++++++++++++++++ .../Cosmos5001SNATPortExhaustion.md | 31 ++++++++++ 5 files changed, 170 insertions(+) create mode 100644 TroubleshootingGuides/Cosmos1000RequestRateTooLarge.md create mode 100644 TroubleshootingGuides/Cosmos1001NotModified.md create mode 100644 TroubleshootingGuides/Cosmos1002NotFound.md create mode 100644 TroubleshootingGuides/Cosmos5000RequestTimeout.md create mode 100644 TroubleshootingGuides/Cosmos5001SNATPortExhaustion.md diff --git a/TroubleshootingGuides/Cosmos1000RequestRateTooLarge.md b/TroubleshootingGuides/Cosmos1000RequestRateTooLarge.md new file mode 100644 index 0000000000..035b4509a6 --- /dev/null +++ b/TroubleshootingGuides/Cosmos1000RequestRateTooLarge.md @@ -0,0 +1,24 @@ +## Cosmos1000 + + + + + + + + + + + + + + +
| TypeName | Cosmos1000RequestRateTooLarge |
| --- | --- |
| CheckId | Cosmos1000 |
| Category | Service |
+ +## Issue + +Request rate too large' or error code 429 indicates that your requests are being throttled, because the consumed throughput (RU/s) has exceeded the provisioned throughput. The SDK will automatically retry requests based on the specified retry policy. If you get this failure often, consider increasing the throughput on the collection. Check the portal’s metrics to see if you are getting 429 errors. Review your partition key to ensure it results in an even distribution of storage and request volume. + +## Solution + +Use the portal or the SDK to increase the throttling. \ No newline at end of file diff --git a/TroubleshootingGuides/Cosmos1001NotModified.md b/TroubleshootingGuides/Cosmos1001NotModified.md new file mode 100644 index 0000000000..df0a1b7a49 --- /dev/null +++ b/TroubleshootingGuides/Cosmos1001NotModified.md @@ -0,0 +1,20 @@ +## Cosmos1001 + + + + + + + + + + + + + + +
| TypeName | Cosmos1001NotModified |
| --- | --- |
| CheckId | Cosmos1001 |
| Category | Service |
+ +## Description + +This status code in the change feed simply means there are no new items to process. This is expected, and the SDK is designed to handle it. diff --git a/TroubleshootingGuides/Cosmos1002NotFound.md b/TroubleshootingGuides/Cosmos1002NotFound.md new file mode 100644 index 0000000000..788f4e8f64 --- /dev/null +++ b/TroubleshootingGuides/Cosmos1002NotFound.md @@ -0,0 +1,34 @@ +## Cosmos1002 + + + + + + + + + + + + + + +
| TypeName | Cosmos1002NotFound |
| --- | --- |
| CheckId | Cosmos1002 |
| Category | Service |
+ +## Description + +This status code indicates that the resource does not exist or no longer exists. + +## Known issues + +The document does exist, but the request still returns a 404. + +### Cause 1: Race condition +There are multiple SDK client instances, and the read happened before the write. + +### Solution +1. With session consistency, the create item operation returns a session token that can be passed between SDK instances to guarantee that the read request reads from a replica with that change. +2. Change the consistency level to a stronger level. + +### Cause 2: Invalid characters in id field +For this scenario, use a query to get the item, then replace or escape the invalid characters. diff --git a/TroubleshootingGuides/Cosmos5000RequestTimeout.md b/TroubleshootingGuides/Cosmos5000RequestTimeout.md new file mode 100644 index 0000000000..237efbe7ef --- /dev/null +++ b/TroubleshootingGuides/Cosmos5000RequestTimeout.md @@ -0,0 +1,61 @@ +## Cosmos1000 + + + + + + + + + + + + + + +
| TypeName | Cosmos5000TransportException |
| --- | --- |
| CheckId | Cosmos5000 |
| Category | Connectivity |
+ +## Issue + +The SDK was not able to connect to the Azure Cosmos DB service. + +## Troubleshooting steps + +These are the known causes for this issue. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Possible cause | Solution |
| --- | --- |
| High CPU utilization. This is the most common cause. Monitor CPU utilization at 10-second intervals; with longer intervals, CPU spikes can be missed because they are averaged in with lower values. | Scale the application up or out. |
| Socket or port availability might be low. When running in Azure, clients using the .NET SDK can hit Azure SNAT (PAT) port exhaustion. | Follow the Cosmos1001 (SNAT port exhaustion) guide. |
| Creating multiple DocumentClient instances might lead to connection contention and timeout issues. | Follow the [performance tips](performance-tips.md), and use a single DocumentClient instance across an entire process. |
| Retries occur from throttled requests. The SDK retries internally without surfacing this to the caller. | Check the [portal metrics](monitor-accounts.md) for 429 throttled requests. |
| Hot partition key. Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. Check portal metrics to see if the workload is encountering a hot [partition key](partition-data.md). A hot partition key causes the aggregate consumed throughput (RU/s) to appear to be under the provisioned RU/s, while a single partition's consumed throughput (RU/s) exceeds the throughput provisioned for that partition. | Change the partition key to avoid the heavily used value. |
| A high degree of concurrency can lead to contention on the channel. | Try to scale the application up or out. |
| Large requests or responses can lead to head-of-line blocking on the channel and exacerbate contention, even with a relatively low degree of concurrency. | Try to scale the application up or out. |
+ +SDK logs can be captured through [Trace Listener](https://github.com/Azure/azure-cosmosdb-dotnet/blob/master/docs/documentdb-sdk_capture_etl.md) to get more details. This can cause a significant performance impact. \ No newline at end of file diff --git a/TroubleshootingGuides/Cosmos5001SNATPortExhaustion.md b/TroubleshootingGuides/Cosmos5001SNATPortExhaustion.md new file mode 100644 index 0000000000..b58d411ae3 --- /dev/null +++ b/TroubleshootingGuides/Cosmos5001SNATPortExhaustion.md @@ -0,0 +1,31 @@ +## Cosmos1001 + + + + + + + + + + + + + + +
| TypeName | Cosmos1001SNATPortExhuastion |
| --- | --- |
| CheckId | Cosmos1001 |
| Category | Connectivity |
+ +## Issue + +If your app is deployed on Azure Virtual Machines without a public IP address, by default [Azure SNAT ports](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections#preallocatedports) establish connections to any endpoint outside of your VM. The number of connections allowed from the VM to the Azure Cosmos DB endpoint is limited by the [Azure SNAT configuration](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections#preallocatedports). + + Azure SNAT ports are used only when your VM has a private IP address and a process from the VM tries to connect to a public IP address. + +## Troubleshooting steps + +There are two workarounds to avoid Azure SNAT limitation: + +* Add your Azure Cosmos DB service endpoint to the subnet of your Azure Virtual Machines virtual network. For more information, see [Azure Virtual Network service endpoints](https://docs.microsoft.com/azure/virtual-network/virtual-network-service-endpoints-overview). + + When the service endpoint is enabled, the requests are no longer sent from a public IP to Azure Cosmos DB. Instead, the virtual network and subnet identity are sent. This change might result in firewall drops if only public IPs are allowed. If you use a firewall, when you enable the service endpoint, add a subnet to the firewall by using [Virtual Network ACLs](https://docs.microsoft.com/azure/virtual-network/virtual-networks-acl). +* Assign a public IP to your Azure VM. 
\ No newline at end of file From 1def3af86320601e2c76ec80c78aced07d479a23 Mon Sep 17 00:00:00 2001 From: Jake Willey Date: Fri, 28 Feb 2020 12:18:56 -0800 Subject: [PATCH 02/20] Updated id based on feedback --- ...osmos1001NotModified.md => Cosmos304_0000NotModified.md} | 6 +++--- .../{Cosmos1002NotFound.md => Cosmos404_0000NotFound.md} | 6 +++--- ...RateTooLarge.md => Cosmos429_0000RequestRateTooLarge.md} | 6 +++--- ...000RequestTimeout.md => Cosmos503_0000RequestTimeout.md} | 4 ++-- ...ortExhaustion.md => Cosmos503_9000SNATPortExhaustion.md} | 6 +++--- 5 files changed, 14 insertions(+), 14 deletions(-) rename TroubleshootingGuides/{Cosmos1001NotModified.md => Cosmos304_0000NotModified.md} (77%) rename TroubleshootingGuides/{Cosmos1002NotFound.md => Cosmos404_0000NotFound.md} (90%) rename TroubleshootingGuides/{Cosmos1000RequestRateTooLarge.md => Cosmos429_0000RequestRateTooLarge.md} (88%) rename TroubleshootingGuides/{Cosmos5000RequestTimeout.md => Cosmos503_0000RequestTimeout.md} (97%) rename TroubleshootingGuides/{Cosmos5001SNATPortExhaustion.md => Cosmos503_9000SNATPortExhaustion.md} (94%) diff --git a/TroubleshootingGuides/Cosmos1001NotModified.md b/TroubleshootingGuides/Cosmos304_0000NotModified.md similarity index 77% rename from TroubleshootingGuides/Cosmos1001NotModified.md rename to TroubleshootingGuides/Cosmos304_0000NotModified.md index df0a1b7a49..87364134c7 100644 --- a/TroubleshootingGuides/Cosmos1001NotModified.md +++ b/TroubleshootingGuides/Cosmos304_0000NotModified.md @@ -1,13 +1,13 @@ -## Cosmos1001 +## Cosmos304_0000 - + - + diff --git a/TroubleshootingGuides/Cosmos1002NotFound.md b/TroubleshootingGuides/Cosmos404_0000NotFound.md similarity index 90% rename from TroubleshootingGuides/Cosmos1002NotFound.md rename to TroubleshootingGuides/Cosmos404_0000NotFound.md index 788f4e8f64..42bf22bd58 100644 --- a/TroubleshootingGuides/Cosmos1002NotFound.md +++ b/TroubleshootingGuides/Cosmos404_0000NotFound.md @@ -1,13 +1,13 @@ -## Cosmos1002 +## 
Cosmos404_0000
TypeNameCosmos1001NotModifiedCosmos304_0000NotModified
CheckIdCosmos1001Cosmos304_0000
Category
- + - + diff --git a/TroubleshootingGuides/Cosmos1000RequestRateTooLarge.md b/TroubleshootingGuides/Cosmos429_0000RequestRateTooLarge.md similarity index 88% rename from TroubleshootingGuides/Cosmos1000RequestRateTooLarge.md rename to TroubleshootingGuides/Cosmos429_0000RequestRateTooLarge.md index 035b4509a6..39a7ea2268 100644 --- a/TroubleshootingGuides/Cosmos1000RequestRateTooLarge.md +++ b/TroubleshootingGuides/Cosmos429_0000RequestRateTooLarge.md @@ -1,13 +1,13 @@ -## Cosmos1000 +## Cosmos429_0000
TypeNameCosmos1002NotFoundCosmos404_0000NotFound
CheckIdCosmos1002Cosmos404_0000
Category
- + - + diff --git a/TroubleshootingGuides/Cosmos5000RequestTimeout.md b/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md similarity index 97% rename from TroubleshootingGuides/Cosmos5000RequestTimeout.md rename to TroubleshootingGuides/Cosmos503_0000RequestTimeout.md index 237efbe7ef..06abbb5a22 100644 --- a/TroubleshootingGuides/Cosmos5000RequestTimeout.md +++ b/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md @@ -3,11 +3,11 @@
TypeNameCosmos1000RequestRateTooLargeCosmos429_0000RequestRateTooLarge
CheckIdCosmos1000Cosmos429_0000
Category
- + - + diff --git a/TroubleshootingGuides/Cosmos5001SNATPortExhaustion.md b/TroubleshootingGuides/Cosmos503_9000SNATPortExhaustion.md similarity index 94% rename from TroubleshootingGuides/Cosmos5001SNATPortExhaustion.md rename to TroubleshootingGuides/Cosmos503_9000SNATPortExhaustion.md index b58d411ae3..1014b4baf4 100644 --- a/TroubleshootingGuides/Cosmos5001SNATPortExhaustion.md +++ b/TroubleshootingGuides/Cosmos503_9000SNATPortExhaustion.md @@ -1,13 +1,13 @@ -## Cosmos1001 +## Cosmos503_9000
TypeNameCosmos5000TransportExceptionCosmos503_0000TransportException
CheckIdCosmos5000Cosmos503_0000
Category
- + - + From 8f0d568d55f053b8a46e5f0772cbd923db2602a7 Mon Sep 17 00:00:00 2001 From: j82w Date: Tue, 17 Mar 2020 13:51:39 -0700 Subject: [PATCH 03/20] Update TroubleshootingGuides/Cosmos503_9000SNATPortExhaustion.md Co-Authored-By: Matias Quaranta --- TroubleshootingGuides/Cosmos503_9000SNATPortExhaustion.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/TroubleshootingGuides/Cosmos503_9000SNATPortExhaustion.md b/TroubleshootingGuides/Cosmos503_9000SNATPortExhaustion.md index 1014b4baf4..8eb1c2d8ea 100644 --- a/TroubleshootingGuides/Cosmos503_9000SNATPortExhaustion.md +++ b/TroubleshootingGuides/Cosmos503_9000SNATPortExhaustion.md @@ -28,4 +28,7 @@ There are two workarounds to avoid Azure SNAT limitation: * Add your Azure Cosmos DB service endpoint to the subnet of your Azure Virtual Machines virtual network. For more information, see [Azure Virtual Network service endpoints](https://docs.microsoft.com/azure/virtual-network/virtual-network-service-endpoints-overview). When the service endpoint is enabled, the requests are no longer sent from a public IP to Azure Cosmos DB. Instead, the virtual network and subnet identity are sent. This change might result in firewall drops if only public IPs are allowed. If you use a firewall, when you enable the service endpoint, add a subnet to the firewall by using [Virtual Network ACLs](https://docs.microsoft.com/azure/virtual-network/virtual-networks-acl). -* Assign a public IP to your Azure VM. \ No newline at end of file +* Assign a public IP to your Azure VM. 
+ +## Related documentation +* [Diagnose and troubleshoot issues when using Azure Cosmos DB .NET SDK](https://docs.microsoft.com/azure/cosmos-db/troubleshoot-dot-net-sdk) From a8afbc74ae68a78f242232735559e37875a56c06 Mon Sep 17 00:00:00 2001 From: j82w Date: Tue, 17 Mar 2020 13:51:47 -0700 Subject: [PATCH 04/20] Update TroubleshootingGuides/Cosmos503_0000RequestTimeout.md Co-Authored-By: Matias Quaranta --- TroubleshootingGuides/Cosmos503_0000RequestTimeout.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md b/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md index 06abbb5a22..7901383ccd 100644 --- a/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md +++ b/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md @@ -45,7 +45,7 @@ These are the known causes for this issue. - + @@ -58,4 +58,4 @@ These are the known causes for this issue.
TypeNameCosmos1001SNATPortExhuastionCosmos503_9000SNATPortExhuastion
CheckIdCosmos1001Cosmos503_9000
CategoryCheck the [portal metrics](monitor-accounts.md) for 429 throttled requests
Hot partition key. Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. Check portal metrics to see if the workload is encountering a hot [partition key](partition-data.md). This will cause the aggregate consumed throughput (RU/s) to be appear to be under the provisioned RUs, but a single partition consumed throughput (RU/s) will exceed the provisioned throughput.Hot partition key. Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. Check portal metrics to see if the workload is encountering a hot [partition key](https://docs.microsoft.com/azure/cosmos-db/partition-data). This will cause the aggregate consumed throughput (RU/s) to be appear to be under the provisioned RUs, but a single partition consumed throughput (RU/s) will exceed the provisioned throughput. The partition key should be changed to avoid the heavily used value.
-SDK logs can be captured through [Trace Listener](https://github.com/Azure/azure-cosmosdb-dotnet/blob/master/docs/documentdb-sdk_capture_etl.md) to get more details. This can cause a significant performance impact. \ No newline at end of file +SDK logs can be captured through [Trace Listener](https://github.com/Azure/azure-cosmosdb-dotnet/blob/master/docs/documentdb-sdk_capture_etl.md) to get more details. This can cause a significant performance impact. From f6b831d3575cc0150c3dd2e7fe709c0211fb29ef Mon Sep 17 00:00:00 2001 From: j82w Date: Tue, 17 Mar 2020 13:52:00 -0700 Subject: [PATCH 05/20] Update TroubleshootingGuides/Cosmos503_0000RequestTimeout.md Co-Authored-By: Matias Quaranta --- TroubleshootingGuides/Cosmos503_0000RequestTimeout.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md b/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md index 7901383ccd..4c935b84b4 100644 --- a/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md +++ b/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md @@ -42,7 +42,7 @@ These are the known causes for this issue. Retries occur from throttled requests. The SDK retries internally without surfacing this to the caller. - Check the [portal metrics](monitor-accounts.md) for 429 throttled requests + Check the [portal metrics](https://docs.microsoft.com/azure/cosmos-db/monitor-cosmos-db) for 429 throttled requests Hot partition key. Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. Check portal metrics to see if the workload is encountering a hot [partition key](https://docs.microsoft.com/azure/cosmos-db/partition-data). This will cause the aggregate consumed throughput (RU/s) to be appear to be under the provisioned RUs, but a single partition consumed throughput (RU/s) will exceed the provisioned throughput. 
From c773d02e243673da8a0c661d6a8da92e46a12ad2 Mon Sep 17 00:00:00 2001 From: j82w Date: Tue, 17 Mar 2020 13:52:11 -0700 Subject: [PATCH 06/20] Update TroubleshootingGuides/Cosmos429_0000RequestRateTooLarge.md Co-Authored-By: Matias Quaranta --- TroubleshootingGuides/Cosmos429_0000RequestRateTooLarge.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/TroubleshootingGuides/Cosmos429_0000RequestRateTooLarge.md b/TroubleshootingGuides/Cosmos429_0000RequestRateTooLarge.md index 39a7ea2268..180d8010bf 100644 --- a/TroubleshootingGuides/Cosmos429_0000RequestRateTooLarge.md +++ b/TroubleshootingGuides/Cosmos429_0000RequestRateTooLarge.md @@ -17,8 +17,8 @@ ## Issue -Request rate too large' or error code 429 indicates that your requests are being throttled, because the consumed throughput (RU/s) has exceeded the provisioned throughput. The SDK will automatically retry requests based on the specified retry policy. If you get this failure often, consider increasing the throughput on the collection. Check the portal’s metrics to see if you are getting 429 errors. Review your partition key to ensure it results in an even distribution of storage and request volume. +'Request rate too large' or error code 429 indicates that your requests are being throttled, because the consumed throughput (RU/s) has exceeded the [provisioned throughput](https://docs.microsoft.com/azure/cosmos-db/set-throughput). The SDK will automatically retry requests based on the specified retry policy. If you get this failure often, consider increasing the throughput on the collection. Check the portal's metrics to see if you are getting 429 errors. Review your partition key to ensure it results in an [even distribution of storage and request volume](https://docs.microsoft.com/azure/cosmos-db/partition-data). ## Solution -Use the portal or the SDK to increase the throttling. \ No newline at end of file +Use the portal or the SDK to increase the provisioned throughput.
From 35928dac6ca8a0e75da8caf121c0562c8eee0e3c Mon Sep 17 00:00:00 2001 From: j82w Date: Tue, 17 Mar 2020 13:52:17 -0700 Subject: [PATCH 07/20] Update TroubleshootingGuides/Cosmos404_0000NotFound.md Co-Authored-By: Matias Quaranta --- TroubleshootingGuides/Cosmos404_0000NotFound.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/TroubleshootingGuides/Cosmos404_0000NotFound.md b/TroubleshootingGuides/Cosmos404_0000NotFound.md index 42bf22bd58..d425c65211 100644 --- a/TroubleshootingGuides/Cosmos404_0000NotFound.md +++ b/TroubleshootingGuides/Cosmos404_0000NotFound.md @@ -30,5 +30,9 @@ There are multiple SDK client instances, and the read happened before the write. 1. With session consistency, the create item operation returns a session token that can be passed between SDK instances to guarantee that the read request reads from a replica with that change. 2. Change the consistency level to a stronger level. +### Related documentation +* [Consistency levels](https://docs.microsoft.com/azure/cosmos-db/consistency-levels) +* [Choose the right consistency level](https://docs.microsoft.com/azure/cosmos-db/consistency-levels-choosing) +* [Consistency, availability, and performance tradeoffs](https://docs.microsoft.com/azure/cosmos-db/consistency-levels-tradeoffs) ### Cause 2: Invalid characters in id field For this scenario, use a query to get the item, then replace or escape the invalid characters.
From 72267bcc27af409acf4dc25dfb4ed9996ac303de Mon Sep 17 00:00:00 2001 From: j82w Date: Tue, 17 Mar 2020 13:52:59 -0700 Subject: [PATCH 08/20] Update TroubleshootingGuides/Cosmos304_0000NotModified.md Co-Authored-By: Matias Quaranta --- TroubleshootingGuides/Cosmos304_0000NotModified.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/TroubleshootingGuides/Cosmos304_0000NotModified.md b/TroubleshootingGuides/Cosmos304_0000NotModified.md index 87364134c7..734bef7fc1 100644 --- a/TroubleshootingGuides/Cosmos304_0000NotModified.md +++ b/TroubleshootingGuides/Cosmos304_0000NotModified.md @@ -18,3 +18,6 @@ ## Description This status code in the change feed simply means there are no new items to process. This is expected, and the SDK is designed to handle it. + +## Related documentation +* [Change feed overview](https://docs.microsoft.com/azure/cosmos-db/change-feed) From 575e9deb2dea7d0e82f2161202ee2b78ea153bad Mon Sep 17 00:00:00 2001 From: j82w Date: Tue, 17 Mar 2020 13:53:06 -0700 Subject: [PATCH 09/20] Update TroubleshootingGuides/Cosmos503_0000RequestTimeout.md Co-Authored-By: Matias Quaranta --- TroubleshootingGuides/Cosmos503_0000RequestTimeout.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md b/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md index 4c935b84b4..ec0d7f5b05 100644 --- a/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md +++ b/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md @@ -38,7 +38,7 @@ These are the known causes for this issue. Creating multiple DocumentClient instances might lead to connection contention and timeout issues. - Follow the [performance tips](performance-tips.md), and use a single DocumentClient instance across an entire process. + Follow the [performance tips](https://docs.microsoft.com/azure/cosmos-db/performance-tips), and use a single CosmosClient instance across an entire process. Retries occur from throttled requests.
The SDK retries internally without surfacing this to the caller. From 300c210716443bd30a71e5458ced48a1a302b3ed Mon Sep 17 00:00:00 2001 From: Jake Willey Date: Wed, 25 Mar 2020 06:35:25 -0700 Subject: [PATCH 10/20] Updated based on feedback --- TroubleshootingGuides/Cosmos503_0000RequestTimeout.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md b/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md index ec0d7f5b05..d0709e681c 100644 --- a/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md +++ b/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md @@ -37,7 +37,7 @@ These are the known causes for this issue. Follow the Cosmos1001 guide. - Creating multiple DocumentClient instances might lead to connection contention and timeout issues. + Creating multiple Client instances might lead to connection contention and timeout issues. Follow the [performance tips](https://docs.microsoft.com/azure/cosmos-db/performance-tips), and use a single CosmosClient instance across an entire process. 
From f316b687d0425a94b2b38f69641995d5c4a748b0 Mon Sep 17 00:00:00 2001 From: Jake Willey Date: Wed, 25 Mar 2020 08:27:55 -0700 Subject: [PATCH 11/20] Added link to exception and update files to just use names --- .../Diagnostics/CosmosTroubleshootingLinks.cs | 138 ++++++++++++++++++ .../src/Microsoft.Azure.Cosmos.csproj | 2 +- .../CosmosExceptions/CosmosException.cs | 20 ++- ...meout.md => CosmosClientRequestTimeout.md} | 4 +- ...s404_0000NotFound.md => CosmosNotFound.md} | 0 ...000NotModified.md => CosmosNotModified.md} | 0 ...oLarge.md => CosmosRequestRateTooLarge.md} | 0 ...austion.md => CosmosSNATPortExhaustion.md} | 0 .../CosmosServiceRequestTimeout.md | 61 ++++++++ 9 files changed, 221 insertions(+), 4 deletions(-) create mode 100644 Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLinks.cs rename TroubleshootingGuides/{Cosmos503_0000RequestTimeout.md => CosmosClientRequestTimeout.md} (97%) rename TroubleshootingGuides/{Cosmos404_0000NotFound.md => CosmosNotFound.md} (100%) rename TroubleshootingGuides/{Cosmos304_0000NotModified.md => CosmosNotModified.md} (100%) rename TroubleshootingGuides/{Cosmos429_0000RequestRateTooLarge.md => CosmosRequestRateTooLarge.md} (100%) rename TroubleshootingGuides/{Cosmos503_9000SNATPortExhaustion.md => CosmosSNATPortExhaustion.md} (100%) create mode 100644 TroubleshootingGuides/CosmosServiceRequestTimeout.md diff --git a/Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLinks.cs b/Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLinks.cs new file mode 100644 index 0000000000..251ab00433 --- /dev/null +++ b/Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLinks.cs @@ -0,0 +1,138 @@ +//------------------------------------------------------------ +// Copyright (c) Microsoft Corporation. All rights reserved. 
+//------------------------------------------------------------ +namespace Microsoft.Azure.Cosmos +{ + using System; + using System.Collections.Generic; + using System.Net; + using Microsoft.Azure.Documents; + + internal sealed class CosmosTroubleshootingLinks + { + private static readonly IReadOnlyDictionary<(int, int), CosmosTroubleshootingLinks> StatusCodeToLink; + + internal string Link { get; } + internal int StatusCode { get; } + internal int SubStatusCode { get; } + internal bool IsServiceException { get; } + + private CosmosTroubleshootingLinks( + int statusCode, + int subStatusCode, + bool isServiceException, + string link) + { + this.StatusCode = statusCode; + this.SubStatusCode = subStatusCode; + this.IsServiceException = isServiceException; + this.Link = link ?? throw new ArgumentNullException(nameof(link)); + } + + private void AddToDictionary(Dictionary<(int, int), CosmosTroubleshootingLinks> dictionary) + { + dictionary.Add((this.StatusCode, this.SubStatusCode), this); + } + + static CosmosTroubleshootingLinks() + { + Dictionary<(int, int), CosmosTroubleshootingLinks> linkMap = new Dictionary<(int, int), CosmosTroubleshootingLinks>(); + NotFound.AddToDictionary(linkMap); + RequestRateTooLarge.AddToDictionary(linkMap); + NotModified.AddToDictionary(linkMap); + ClientTransportRequestTimeout.AddToDictionary(linkMap); + ServiceTransportRequestTimeout.AddToDictionary(linkMap); + TransportExceptionHighCpu.AddToDictionary(linkMap); + + CosmosTroubleshootingLinks.StatusCodeToLink = linkMap; + } + + internal static bool TryGetTroubleshootingLinks( + CosmosException cosmosException, + out CosmosTroubleshootingLinks troubleshootingLink) + { + if (TryGetTransportException(cosmosException, out troubleshootingLink)) + { + return true; + } + + return CosmosTroubleshootingLinks.StatusCodeToLink.TryGetValue( + ((int)cosmosException.StatusCode, cosmosException.SubStatusCode), + out troubleshootingLink); + } + + private static bool 
TryGetTransportException(CosmosException exception, out CosmosTroubleshootingLinks troubleshootingLink) + { + Exception innerException = exception.InnerException; + while (innerException != null) + { + if (innerException is TransportException transportException) + { + if (transportException.IsClientCpuOverloaded) + { + troubleshootingLink = TransportExceptionHighCpu; + return true; + } + + if (TransportException.IsTimeout(transportException.ErrorCode)) + { + if (transportException.UserRequestSent) + { + troubleshootingLink = ServiceTransportRequestTimeout; + return true; + } + else + { + troubleshootingLink = ClientTransportRequestTimeout; + return true; + } + } + innerException = innerException.InnerException; // No match: keep walking the chain so the loop cannot spin on this exception. + } + else + { + innerException = innerException.InnerException; + } + } + + troubleshootingLink = null; + return false; + } + + private static readonly CosmosTroubleshootingLinks NotFound = new CosmosTroubleshootingLinks( + statusCode: (int)HttpStatusCode.NotFound, + subStatusCode: default, + isServiceException: true, + link: "https://aka.ms/CosmosTsgNotFound"); + + private static readonly CosmosTroubleshootingLinks RequestRateTooLarge = new CosmosTroubleshootingLinks( + statusCode: 429, + subStatusCode: 3200, + isServiceException: true, + link: "https://aka.ms/CosmosTsgRequestRateTooLarge"); + + private static readonly CosmosTroubleshootingLinks NotModified = new CosmosTroubleshootingLinks( + statusCode: (int)HttpStatusCode.NotModified, + subStatusCode: default, + isServiceException: true, + link: "https://aka.ms/CosmosTsgNotModified"); + + private static readonly CosmosTroubleshootingLinks ClientTransportRequestTimeout = new CosmosTroubleshootingLinks( + statusCode: (int)HttpStatusCode.RequestTimeout, + subStatusCode: 8000, + isServiceException: false, + link: "https://aka.ms/CosmosTsgClientTransportRequestTimeout"); + + private static readonly CosmosTroubleshootingLinks ServiceTransportRequestTimeout = new CosmosTroubleshootingLinks( + statusCode: (int)HttpStatusCode.RequestTimeout, + subStatusCode:
9000, + isServiceException: true, + link: "https://aka.ms/CosmosTsgServiceTransportRequestTimeout"); + + private static readonly CosmosTroubleshootingLinks TransportExceptionHighCpu = new CosmosTroubleshootingLinks( + statusCode: (int)HttpStatusCode.ServiceUnavailable, + subStatusCode: 9001, + isServiceException: false, + link: "https://aka.ms/CosmosTsgTransportExceptionHighCpu"); + } +} \ No newline at end of file diff --git a/Microsoft.Azure.Cosmos/src/Microsoft.Azure.Cosmos.csproj b/Microsoft.Azure.Cosmos/src/Microsoft.Azure.Cosmos.csproj index 674faa6607..1ca1c6e839 100644 --- a/Microsoft.Azure.Cosmos/src/Microsoft.Azure.Cosmos.csproj +++ b/Microsoft.Azure.Cosmos/src/Microsoft.Azure.Cosmos.csproj @@ -71,7 +71,7 @@ - + diff --git a/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs b/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs index b6b5b6c09a..08e61249d2 100644 --- a/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs +++ b/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs @@ -180,13 +180,31 @@ internal ResponseMessage ToCosmosResponseMessage(RequestMessage request) diagnostics: this.DiagnosticsContext); } + private bool TryGetTroubleshootingLink(out string tsgLink) + { + if (CosmosTroubleshootingLinks.TryGetTroubleshootingLinks(this, out CosmosTroubleshootingLinks link)) + { + tsgLink = link.Link; + return true; + } + + tsgLink = null; + return false; + } + private string ToStringHelper(bool includeDiagnostics) { StringBuilder stringBuilder = new StringBuilder(); stringBuilder.Append(this.GetType().FullName); + stringBuilder.Append(" : "); + + if (this.TryGetTroubleshootingLink(out string tsgLink)) + { + stringBuilder.Append($" Troubleshooting Guide: \"{tsgLink}\"; "); + } + if (this.Message != null) { - stringBuilder.Append(" : "); stringBuilder.Append(this.Message); stringBuilder.AppendLine(); } diff --git a/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md 
b/TroubleshootingGuides/CosmosClientRequestTimeout.md similarity index 97% rename from TroubleshootingGuides/Cosmos503_0000RequestTimeout.md rename to TroubleshootingGuides/CosmosClientRequestTimeout.md index d0709e681c..b40b5af459 100644 --- a/TroubleshootingGuides/Cosmos503_0000RequestTimeout.md +++ b/TroubleshootingGuides/CosmosClientRequestTimeout.md @@ -3,11 +3,11 @@ - + - + diff --git a/TroubleshootingGuides/Cosmos404_0000NotFound.md b/TroubleshootingGuides/CosmosNotFound.md similarity index 100% rename from TroubleshootingGuides/Cosmos404_0000NotFound.md rename to TroubleshootingGuides/CosmosNotFound.md diff --git a/TroubleshootingGuides/Cosmos304_0000NotModified.md b/TroubleshootingGuides/CosmosNotModified.md similarity index 100% rename from TroubleshootingGuides/Cosmos304_0000NotModified.md rename to TroubleshootingGuides/CosmosNotModified.md diff --git a/TroubleshootingGuides/Cosmos429_0000RequestRateTooLarge.md b/TroubleshootingGuides/CosmosRequestRateTooLarge.md similarity index 100% rename from TroubleshootingGuides/Cosmos429_0000RequestRateTooLarge.md rename to TroubleshootingGuides/CosmosRequestRateTooLarge.md diff --git a/TroubleshootingGuides/Cosmos503_9000SNATPortExhaustion.md b/TroubleshootingGuides/CosmosSNATPortExhaustion.md similarity index 100% rename from TroubleshootingGuides/Cosmos503_9000SNATPortExhaustion.md rename to TroubleshootingGuides/CosmosSNATPortExhaustion.md diff --git a/TroubleshootingGuides/CosmosServiceRequestTimeout.md b/TroubleshootingGuides/CosmosServiceRequestTimeout.md new file mode 100644 index 0000000000..ce183033a1 --- /dev/null +++ b/TroubleshootingGuides/CosmosServiceRequestTimeout.md @@ -0,0 +1,61 @@ +## Cosmos1000 + +
|TypeName|Cosmos503_0000TransportException|Cosmos408_0000TransportException|
|CheckId|Cosmos503_0000|Cosmos408_0000|
Category
+ + + + + + + + + + + + +
|TypeName|Cosmos408_9001ServiceTransportException|
|CheckId|Cosmos408_0000|
|Category|Connectivity|
+ +## Issue + +The SDK was able to send the request to Cosmos DB, but the operation timed out. + +## Troubleshooting steps + +These are the known causes for this issue. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Possible cause | Solution |
|---|---|
| High CPU utilization. This is the most common cause. Look at CPU utilization at 10-second intervals. If the interval is larger, CPU spikes can be missed because they are averaged in with lower values. | The application should be scaled up/out. |
| Socket / port availability might be low. When running in Azure, clients using the .NET SDK can hit Azure SNAT (PAT) port exhaustion. | Follow the Cosmos1001 guide. |
| Creating multiple client instances might lead to connection contention and timeout issues. | Follow the [performance tips](https://docs.microsoft.com/azure/cosmos-db/performance-tips), and use a single CosmosClient instance across an entire process. |
| Retries occur from throttled requests. The SDK retries internally without surfacing this to the caller. | Check the [portal metrics](https://docs.microsoft.com/azure/cosmos-db/monitor-cosmos-db) for 429 throttled requests. |
| Hot partition key. Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. Check portal metrics to see if the workload is encountering a hot [partition key](https://docs.microsoft.com/azure/cosmos-db/partition-data). A hot partition key causes the aggregate consumed throughput (RU/s) to appear to be under the provisioned RU/s, while a single partition's consumed throughput (RU/s) exceeds the throughput provisioned for that partition. | The partition key should be changed to avoid the heavily used value. |
| A high degree of concurrency can lead to contention on the channel. | Try to scale the application up/out. |
| Large requests or responses can lead to head-of-line blocking on the channel and exacerbate contention, even with a relatively low degree of concurrency. | Try to scale the application up/out. |
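
The advice about sampling CPU at 10-second intervals can be illustrated with a little arithmetic. The sketch below is not part of the SDK: the `CpuSampling` class and the readings in it are made up for illustration. It shows how a one-minute average can look healthy while one of the underlying 10-second samples is saturated.

```csharp
using System;
using System.Linq;

public static class CpuSampling
{
    // Hypothetical CPU readings (percent), one every 10 seconds over a minute.
    public static readonly double[] TenSecondSamples = { 20, 25, 100, 22, 18, 21 };

    // What a 60-second interval would report: the mean of the six samples.
    public static double OneMinuteAverage() => TenSecondSamples.Average();

    // What 10-second intervals reveal: the highest individual sample.
    public static double Peak() => TenSecondSamples.Max();

    public static void Main()
    {
        Console.WriteLine($"1-minute average: {OneMinuteAverage():F1}%");
        Console.WriteLine($"10-second peak: {Peak():F0}%");
    }
}
```

With these assumed readings the average is roughly 34%, which would not raise concern on its own, while the peak shows that the CPU was fully used for at least one 10-second window.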
+ +SDK logs can be captured through [Trace Listener](https://github.com/Azure/azure-cosmosdb-dotnet/blob/master/docs/documentdb-sdk_capture_etl.md) to get more details. This can cause a significant performance impact. From 46aeead6f5978d23471fb1d4bf1a1dd8f7ed628e Mon Sep 17 00:00:00 2001 From: Jake Willey Date: Wed, 25 Mar 2020 08:32:36 -0700 Subject: [PATCH 12/20] Updated naming --- ...gLinks.cs => CosmosTroubleshootingLink.cs} | 32 +++++++++---------- .../CosmosExceptions/CosmosException.cs | 2 +- 2 files changed, 17 insertions(+), 17 deletions(-) rename Microsoft.Azure.Cosmos/src/Diagnostics/{CosmosTroubleshootingLinks.cs => CosmosTroubleshootingLink.cs} (76%) diff --git a/Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLinks.cs b/Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs similarity index 76% rename from Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLinks.cs rename to Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs index 251ab00433..ef1efd5b8f 100644 --- a/Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLinks.cs +++ b/Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs @@ -8,16 +8,16 @@ namespace Microsoft.Azure.Cosmos using System.Net; using Microsoft.Azure.Documents; - internal sealed class CosmosTroubleshootingLinks + internal sealed class CosmosTroubleshootingLink { - private static readonly IReadOnlyDictionary<(int, int), CosmosTroubleshootingLinks> StatusCodeToLink; + private static readonly IReadOnlyDictionary<(int, int), CosmosTroubleshootingLink> StatusCodeToLink; internal string Link { get; } internal int StatusCode { get; } internal int SubStatusCode { get; } internal bool IsServiceException { get; } - private CosmosTroubleshootingLinks( + private CosmosTroubleshootingLink( int statusCode, int subStatusCode, bool isServiceException, @@ -29,14 +29,14 @@ private CosmosTroubleshootingLinks( this.Link = link ?? 
throw new ArgumentNullException(nameof(link)); } - private void AddToDictionary(Dictionary<(int, int), CosmosTroubleshootingLinks> dictionary) + private void AddToDictionary(Dictionary<(int, int), CosmosTroubleshootingLink> dictionary) { dictionary.Add((this.StatusCode, this.SubStatusCode), this); } - static CosmosTroubleshootingLinks() + static CosmosTroubleshootingLink() { - Dictionary<(int, int), CosmosTroubleshootingLinks> linkMap = new Dictionary<(int, int), CosmosTroubleshootingLinks>(); + Dictionary<(int, int), CosmosTroubleshootingLink> linkMap = new Dictionary<(int, int), CosmosTroubleshootingLink>(); NotFound.AddToDictionary(linkMap); RequestRateTooLarge.AddToDictionary(linkMap); NotModified.AddToDictionary(linkMap); @@ -44,24 +44,24 @@ static CosmosTroubleshootingLinks() ServiceTransportRequestTimeout.AddToDictionary(linkMap); TransportExceptionHighCpu.AddToDictionary(linkMap); - CosmosTroubleshootingLinks.StatusCodeToLink = linkMap; + CosmosTroubleshootingLink.StatusCodeToLink = linkMap; } internal static bool TryGetTroubleshootingLinks( CosmosException cosmosException, - out CosmosTroubleshootingLinks troubleshootingLink) + out CosmosTroubleshootingLink troubleshootingLink) { if (TryGetTransportException(cosmosException, out troubleshootingLink)) { return true; } - return CosmosTroubleshootingLinks.StatusCodeToLink.TryGetValue( + return CosmosTroubleshootingLink.StatusCodeToLink.TryGetValue( ((int)cosmosException.StatusCode, cosmosException.SubStatusCode), out troubleshootingLink); } - private static bool TryGetTransportException(CosmosException exception, out CosmosTroubleshootingLinks troubleshootingLink) + private static bool TryGetTransportException(CosmosException exception, out CosmosTroubleshootingLink troubleshootingLink) { Exception innerException = exception.InnerException; while (innerException != null) @@ -99,37 +99,37 @@ private static bool TryGetTransportException(CosmosException exception, out Cosm return false; } - private static 
readonly CosmosTroubleshootingLinks NotFound = new CosmosTroubleshootingLinks( + private static readonly CosmosTroubleshootingLink NotFound = new CosmosTroubleshootingLink( statusCode: (int)HttpStatusCode.NotFound, subStatusCode: default, isServiceException: true, link: "https://aka.ms/CosmosTsgNotFound"); - private static readonly CosmosTroubleshootingLinks RequestRateTooLarge = new CosmosTroubleshootingLinks( + private static readonly CosmosTroubleshootingLink RequestRateTooLarge = new CosmosTroubleshootingLink( statusCode: 429, subStatusCode: 3200, isServiceException: true, link: "https://aka.ms/CosmosTsgRequestRateTooLarge"); - private static readonly CosmosTroubleshootingLinks NotModified = new CosmosTroubleshootingLinks( + private static readonly CosmosTroubleshootingLink NotModified = new CosmosTroubleshootingLink( statusCode: (int)HttpStatusCode.NotModified, subStatusCode: default, isServiceException: true, link: "https://aka.ms/CosmosTsgNotModified"); - private static readonly CosmosTroubleshootingLinks ClientTransportRequestTimeout = new CosmosTroubleshootingLinks( + private static readonly CosmosTroubleshootingLink ClientTransportRequestTimeout = new CosmosTroubleshootingLink( statusCode: (int)HttpStatusCode.RequestTimeout, subStatusCode: 8000, isServiceException: false, link: "https://aka.ms/CosmosTsgClientTransportRequestTimeout"); - private static readonly CosmosTroubleshootingLinks ServiceTransportRequestTimeout = new CosmosTroubleshootingLinks( + private static readonly CosmosTroubleshootingLink ServiceTransportRequestTimeout = new CosmosTroubleshootingLink( statusCode: (int)HttpStatusCode.RequestTimeout, subStatusCode: 9000, isServiceException: true, link: "https://aka.ms/CosmosTsgServiceTransportRequestTimeout"); - private static readonly CosmosTroubleshootingLinks TransportExceptionHighCpu = new CosmosTroubleshootingLinks( + private static readonly CosmosTroubleshootingLink TransportExceptionHighCpu = new CosmosTroubleshootingLink( statusCode: 
(int)HttpStatusCode.ServiceUnavailable, subStatusCode: 9001, isServiceException: false, diff --git a/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs b/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs index 08e61249d2..85dc3675c2 100644 --- a/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs +++ b/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs @@ -182,7 +182,7 @@ internal ResponseMessage ToCosmosResponseMessage(RequestMessage request) private bool TryGetTroubleshootingLink(out string tsgLink) { - if (CosmosTroubleshootingLinks.TryGetTroubleshootingLinks(this, out CosmosTroubleshootingLinks link)) + if (CosmosTroubleshootingLink.TryGetTroubleshootingLinks(this, out CosmosTroubleshootingLink link)) { tsgLink = link.Link; return true; From 830e3e634484a37f3ead7b6a485850ee334e35d0 Mon Sep 17 00:00:00 2001 From: j82w Date: Wed, 25 Mar 2020 09:54:05 -0700 Subject: [PATCH 13/20] Update TroubleshootingGuides/CosmosRequestRateTooLarge.md Co-Authored-By: Matias Quaranta --- TroubleshootingGuides/CosmosRequestRateTooLarge.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/TroubleshootingGuides/CosmosRequestRateTooLarge.md b/TroubleshootingGuides/CosmosRequestRateTooLarge.md index 180d8010bf..2c1212db0e 100644 --- a/TroubleshootingGuides/CosmosRequestRateTooLarge.md +++ b/TroubleshootingGuides/CosmosRequestRateTooLarge.md @@ -21,4 +21,8 @@ ## Solution -Use the portal or the SDK to increase the throttling. +Use the portal or the SDK to increase the provisioned throughput. 
+ +## Related documentation +* [Provision throughput on containers and databases](https://docs.microsoft.com/azure/cosmos-db/set-throughput) +* [Request units in Azure Cosmos DB](https://docs.microsoft.com/azure/cosmos-db/request-units) From dfa1c44a034aea778062ff222cc7e4a28c062c92 Mon Sep 17 00:00:00 2001 From: Jake Willey Date: Wed, 25 Mar 2020 11:59:03 -0700 Subject: [PATCH 14/20] Adding tests --- .../Microsoft.Azure.Cosmos.EmulatorTests/CosmosItemTests.cs | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/CosmosItemTests.cs b/Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/CosmosItemTests.cs index 3311c3fa92..1d2b61ed95 100644 --- a/Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/CosmosItemTests.cs +++ b/Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/CosmosItemTests.cs @@ -395,6 +395,7 @@ public async Task ReplaceItemStreamTest() Assert.IsFalse(response.IsSuccessStatusCode); Assert.IsNotNull(response); Assert.AreEqual(HttpStatusCode.NotFound, response.StatusCode, response.ErrorMessage); + Assert.IsTrue(response.ErrorMessage.Contains("https://aka.ms/CosmosTsgNotFound")); } } @@ -1543,6 +1544,7 @@ public async Task VerifyToManyRequestTest(bool isQuery) ResponseMessage failedResponseMessage = failedToManyRequests.First(); Assert.AreEqual(failedResponseMessage.StatusCode, (HttpStatusCode)429); Assert.IsNotNull(failedResponseMessage.ErrorMessage); + Assert.IsNotNull(failedResponseMessage.ErrorMessage.Contains("https://aka.ms/CosmosTsgRequestRateTooLarge")); string diagnostics = failedResponseMessage.Diagnostics.ToString(); Assert.IsNotNull(diagnostics); } From 304c32cc2cb6fdd259f9d4a91638d8e4fc129aa3 Mon Sep 17 00:00:00 2001 From: Jake Willey Date: Mon, 30 Mar 2020 08:01:43 -0700 Subject: [PATCH 15/20] Updated not found based on comments --- .../Diagnostics/CosmosTroubleshootingLink.cs | 11 +++---- .../CosmosExceptions/CosmosException.cs | 29 
++++++++++++++++--- TroubleshootingGuides/CosmosNotFound.md | 4 +-- 3 files changed, 32 insertions(+), 12 deletions(-) diff --git a/Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs b/Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs index ef1efd5b8f..7701debf2d 100644 --- a/Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs +++ b/Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs @@ -48,22 +48,23 @@ static CosmosTroubleshootingLink() } internal static bool TryGetTroubleshootingLinks( - CosmosException cosmosException, + int statusCodes, + int subStatusCode, + Exception innerException, out CosmosTroubleshootingLink troubleshootingLink) { - if (TryGetTransportException(cosmosException, out troubleshootingLink)) + if (TryGetTransportException(innerException, out troubleshootingLink)) { return true; } return CosmosTroubleshootingLink.StatusCodeToLink.TryGetValue( - ((int)cosmosException.StatusCode, cosmosException.SubStatusCode), + (statusCodes, subStatusCode), out troubleshootingLink); } - private static bool TryGetTransportException(CosmosException exception, out CosmosTroubleshootingLink troubleshootingLink) + private static bool TryGetTransportException(Exception innerException, out CosmosTroubleshootingLink troubleshootingLink) { - Exception innerException = exception.InnerException; while (innerException != null) { if (innerException is TransportException transportException) diff --git a/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs b/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs index ce9749b5f2..085d2ef798 100644 --- a/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs +++ b/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs @@ -32,7 +32,8 @@ internal CosmosException( statusCodes, subStatusCode, message, - activityId), innerException) + activityId, + innerException), innerException) { 
this.ResponseBody = message; this.stackTrace = stackTrace; @@ -186,9 +187,17 @@ internal ResponseMessage ToCosmosResponseMessage(RequestMessage request) diagnostics: this.DiagnosticsContext); } - private bool TryGetTroubleshootingLink(out string tsgLink) + private static bool TryGetTroubleshootingLink( + HttpStatusCode statusCode, + int subStatusCode, + Exception innerException, + out string tsgLink) { - if (CosmosTroubleshootingLink.TryGetTroubleshootingLinks(this, out CosmosTroubleshootingLink link)) + if (CosmosTroubleshootingLink.TryGetTroubleshootingLinks( + (int)statusCode, + subStatusCode, + innerException, + out CosmosTroubleshootingLink link)) { tsgLink = link.Link; return true; @@ -202,7 +211,8 @@ private static string GetMessageHelper( HttpStatusCode statusCode, int subStatusCode, string responseBody, - string activityId) + string activityId, + Exception innerException) { StringBuilder stringBuilder = new StringBuilder(); @@ -210,6 +220,17 @@ private static string GetMessageHelper( stringBuilder.Append($"{statusCode} ({(int)statusCode})"); stringBuilder.Append("; Substatus: "); stringBuilder.Append(subStatusCode); + + if (CosmosException.TryGetTroubleshootingLink( + statusCode, + subStatusCode, + innerException, + out string tsgLink)) + { + stringBuilder.Append("; Troubleshooting: "); + stringBuilder.Append(tsgLink); + } + stringBuilder.Append("; ActivityId: "); stringBuilder.Append(activityId ?? string.Empty); stringBuilder.Append("; Reason: ("); diff --git a/TroubleshootingGuides/CosmosNotFound.md b/TroubleshootingGuides/CosmosNotFound.md index d425c65211..6385d59aee 100644 --- a/TroubleshootingGuides/CosmosNotFound.md +++ b/TroubleshootingGuides/CosmosNotFound.md @@ -33,6 +33,4 @@ There is multiple SDK client instances and the read happened before the write. 
### Related documentation * [Consistency levels](https://docs.microsoft.com/azure/cosmos-db/consistency-levels) * [Choose the right consistency level](https://docs.microsoft.com/azure/cosmos-db/consistency-levels-choosing) -* [Consistency, availability, and performance tradeoffs](https://docs.microsoft.com/azure/cosmos-db/consistency-levels-tradeoffs) -### Cause 2: Invalid chacters in id field -For this scenario use query to get the item and replace/escape the invalid characters. +* [Consistency, availability, and performance tradeoffs](https://docs.microsoft.com/azure/cosmos-db/consistency-levels-tradeoffs) \ No newline at end of file From 864d1c77baa0ce0e99b400328c0726f63d76cc57 Mon Sep 17 00:00:00 2001 From: Jake Willey Date: Thu, 23 Apr 2020 07:30:57 -0700 Subject: [PATCH 16/20] Revert changes --- .../Diagnostics/CosmosTroubleshootingLink.cs | 139 ------------------ .../CosmosExceptions/CosmosException.cs | 41 +----- .../CosmosItemTests.cs | 17 ++- 3 files changed, 17 insertions(+), 180 deletions(-) delete mode 100644 Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs diff --git a/Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs b/Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs deleted file mode 100644 index 7701debf2d..0000000000 --- a/Microsoft.Azure.Cosmos/src/Diagnostics/CosmosTroubleshootingLink.cs +++ /dev/null @@ -1,139 +0,0 @@ -//------------------------------------------------------------ -// Copyright (c) Microsoft Corporation. All rights reserved. 
-//------------------------------------------------------------ -namespace Microsoft.Azure.Cosmos -{ - using System; - using System.Collections.Generic; - using System.Net; - using Microsoft.Azure.Documents; - - internal sealed class CosmosTroubleshootingLink - { - private static readonly IReadOnlyDictionary<(int, int), CosmosTroubleshootingLink> StatusCodeToLink; - - internal string Link { get; } - internal int StatusCode { get; } - internal int SubStatusCode { get; } - internal bool IsServiceException { get; } - - private CosmosTroubleshootingLink( - int statusCode, - int subStatusCode, - bool isServiceException, - string link) - { - this.StatusCode = statusCode; - this.SubStatusCode = subStatusCode; - this.IsServiceException = isServiceException; - this.Link = link ?? throw new ArgumentNullException(nameof(link)); - } - - private void AddToDictionary(Dictionary<(int, int), CosmosTroubleshootingLink> dictionary) - { - dictionary.Add((this.StatusCode, this.SubStatusCode), this); - } - - static CosmosTroubleshootingLink() - { - Dictionary<(int, int), CosmosTroubleshootingLink> linkMap = new Dictionary<(int, int), CosmosTroubleshootingLink>(); - NotFound.AddToDictionary(linkMap); - RequestRateTooLarge.AddToDictionary(linkMap); - NotModified.AddToDictionary(linkMap); - ClientTransportRequestTimeout.AddToDictionary(linkMap); - ServiceTransportRequestTimeout.AddToDictionary(linkMap); - TransportExceptionHighCpu.AddToDictionary(linkMap); - - CosmosTroubleshootingLink.StatusCodeToLink = linkMap; - } - - internal static bool TryGetTroubleshootingLinks( - int statusCodes, - int subStatusCode, - Exception innerException, - out CosmosTroubleshootingLink troubleshootingLink) - { - if (TryGetTransportException(innerException, out troubleshootingLink)) - { - return true; - } - - return CosmosTroubleshootingLink.StatusCodeToLink.TryGetValue( - (statusCodes, subStatusCode), - out troubleshootingLink); - } - - private static bool TryGetTransportException(Exception innerException, 
out CosmosTroubleshootingLink troubleshootingLink) - { - while (innerException != null) - { - if (innerException is TransportException transportException) - { - if (transportException.IsClientCpuOverloaded) - { - troubleshootingLink = TransportExceptionHighCpu; - return true; - } - - if (TransportException.IsTimeout(transportException.ErrorCode)) - { - if (transportException.UserRequestSent) - { - troubleshootingLink = ServiceTransportRequestTimeout; - return true; - } - else - { - troubleshootingLink = ClientTransportRequestTimeout; - return true; - } - } - - } - else - { - innerException = innerException.InnerException; - } - } - - troubleshootingLink = null; - return false; - } - - private static readonly CosmosTroubleshootingLink NotFound = new CosmosTroubleshootingLink( - statusCode: (int)HttpStatusCode.NotFound, - subStatusCode: default, - isServiceException: true, - link: "https://aka.ms/CosmosTsgNotFound"); - - private static readonly CosmosTroubleshootingLink RequestRateTooLarge = new CosmosTroubleshootingLink( - statusCode: 429, - subStatusCode: 3200, - isServiceException: true, - link: "https://aka.ms/CosmosTsgRequestRateTooLarge"); - - private static readonly CosmosTroubleshootingLink NotModified = new CosmosTroubleshootingLink( - statusCode: (int)HttpStatusCode.NotModified, - subStatusCode: default, - isServiceException: true, - link: "https://aka.ms/CosmosTsgNotModified"); - - private static readonly CosmosTroubleshootingLink ClientTransportRequestTimeout = new CosmosTroubleshootingLink( - statusCode: (int)HttpStatusCode.RequestTimeout, - subStatusCode: 8000, - isServiceException: false, - link: "https://aka.ms/CosmosTsgClientTransportRequestTimeout"); - - private static readonly CosmosTroubleshootingLink ServiceTransportRequestTimeout = new CosmosTroubleshootingLink( - statusCode: (int)HttpStatusCode.RequestTimeout, - subStatusCode: 9000, - isServiceException: true, - link: "https://aka.ms/CosmosTsgServiceTransportRequestTimeout"); - - private static 
readonly CosmosTroubleshootingLink TransportExceptionHighCpu = new CosmosTroubleshootingLink( - statusCode: (int)HttpStatusCode.ServiceUnavailable, - subStatusCode: 9001, - isServiceException: false, - link: "https://aka.ms/CosmosTsgTransportExceptionHighCpu"); - } -} \ No newline at end of file diff --git a/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs b/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs index 085d2ef798..7268f8e457 100644 --- a/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs +++ b/Microsoft.Azure.Cosmos/src/Resource/CosmosExceptions/CosmosException.cs @@ -32,8 +32,7 @@ internal CosmosException( statusCodes, subStatusCode, message, - activityId, - innerException), innerException) + activityId), innerException) { this.ResponseBody = message; this.stackTrace = stackTrace; @@ -187,32 +186,11 @@ internal ResponseMessage ToCosmosResponseMessage(RequestMessage request) diagnostics: this.DiagnosticsContext); } - private static bool TryGetTroubleshootingLink( - HttpStatusCode statusCode, - int subStatusCode, - Exception innerException, - out string tsgLink) - { - if (CosmosTroubleshootingLink.TryGetTroubleshootingLinks( - (int)statusCode, - subStatusCode, - innerException, - out CosmosTroubleshootingLink link)) - { - tsgLink = link.Link; - return true; - } - - tsgLink = null; - return false; - } - private static string GetMessageHelper( HttpStatusCode statusCode, int subStatusCode, string responseBody, - string activityId, - Exception innerException) + string activityId) { StringBuilder stringBuilder = new StringBuilder(); @@ -220,17 +198,6 @@ private static string GetMessageHelper( stringBuilder.Append($"{statusCode} ({(int)statusCode})"); stringBuilder.Append("; Substatus: "); stringBuilder.Append(subStatusCode); - - if (CosmosException.TryGetTroubleshootingLink( - statusCode, - subStatusCode, - innerException, - out string tsgLink)) - { - stringBuilder.Append("; Troubleshooting: "); 
- stringBuilder.Append(tsgLink); - } - stringBuilder.Append("; ActivityId: "); stringBuilder.Append(activityId ?? string.Empty); stringBuilder.Append("; Reason: ("); @@ -241,7 +208,7 @@ private static string GetMessageHelper( } private string ToStringHelper( - StringBuilder stringBuilder) + StringBuilder stringBuilder) { if (stringBuilder == null) { @@ -275,4 +242,4 @@ private string ToStringHelper( return stringBuilder.ToString(); } } -} +} \ No newline at end of file diff --git a/Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/CosmosItemTests.cs b/Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/CosmosItemTests.cs index 5015ba82b1..46f61392dd 100644 --- a/Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/CosmosItemTests.cs +++ b/Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/CosmosItemTests.cs @@ -79,6 +79,15 @@ public async Task NegativeCreateDropItemTest() Assert.AreEqual(HttpStatusCode.BadRequest, response.StatusCode); } + [TestMethod] + public async Task MemoryStreamBufferIsAccessibleOnResponse() + { + ToDoActivity testItem = ToDoActivity.CreateRandomToDoActivity(); + ResponseMessage response = await this.Container.CreateItemStreamAsync(streamPayload: TestCommon.SerializerCore.ToStream(testItem), partitionKey: new Cosmos.PartitionKey(testItem.status)); + Assert.IsNotNull(response); + Assert.IsTrue((response.Content as MemoryStream).TryGetBuffer(out _)); + } + [TestMethod] public async Task CustomSerilizerTest() { @@ -395,7 +404,6 @@ public async Task ReplaceItemStreamTest() Assert.IsFalse(response.IsSuccessStatusCode); Assert.IsNotNull(response); Assert.AreEqual(HttpStatusCode.NotFound, response.StatusCode, response.ErrorMessage); - Assert.IsTrue(response.ErrorMessage.Contains("https://aka.ms/CosmosTsgNotFound")); } } @@ -1057,7 +1065,8 @@ public async Task ItemEpkQuerySingleKeyRangeValidation() properties: new Dictionary() { {"x-ms-effective-partition-key-string", "AA" } - }); + }, + 
feedRangeInternal: null); Assert.IsTrue(partitionKeyRanges.Count == 1, "Only 1 partition key range should be selected since the EPK option is set."); } @@ -1295,7 +1304,8 @@ public async Task ItemReplaceAsyncTest() partitionKey: new Cosmos.PartitionKey(originalStatus), item: testItem); Assert.Fail("Replace changing partition key is not supported."); - }catch(CosmosException ce) + } + catch (CosmosException ce) { Assert.AreEqual((HttpStatusCode)400, ce.StatusCode); } @@ -1560,7 +1570,6 @@ public async Task VerifyToManyRequestTest(bool isQuery) ResponseMessage failedResponseMessage = failedToManyRequests.First(); Assert.AreEqual(failedResponseMessage.StatusCode, (HttpStatusCode)429); Assert.IsNotNull(failedResponseMessage.ErrorMessage); - Assert.IsNotNull(failedResponseMessage.ErrorMessage.Contains("https://aka.ms/CosmosTsgRequestRateTooLarge")); string diagnostics = failedResponseMessage.Diagnostics.ToString(); Assert.IsNotNull(diagnostics); } From 269041014413b4d44fb83e7217f91a6d630edff8 Mon Sep 17 00:00:00 2001 From: Jake Willey Date: Thu, 23 Apr 2020 08:06:09 -0700 Subject: [PATCH 17/20] Updated files based on feedback --- .../CosmosClientRequestTimeout.md | 61 ------------------- TroubleshootingGuides/CosmosNotFound.md | 23 +++---- TroubleshootingGuides/CosmosNotModified.md | 21 ++----- .../CosmosRequestRateTooLarge.md | 21 ++----- .../CosmosRequestTimeoutClient.md | 53 ++++++++++++++++ .../CosmosRequestTimeoutService.md | 26 ++++++++ .../CosmosSNATPortExhaustion.md | 2 +- .../CosmosServiceRequestTimeout.md | 61 ------------------- 8 files changed, 99 insertions(+), 169 deletions(-) delete mode 100644 TroubleshootingGuides/CosmosClientRequestTimeout.md create mode 100644 TroubleshootingGuides/CosmosRequestTimeoutClient.md create mode 100644 TroubleshootingGuides/CosmosRequestTimeoutService.md delete mode 100644 TroubleshootingGuides/CosmosServiceRequestTimeout.md diff --git a/TroubleshootingGuides/CosmosClientRequestTimeout.md 
b/TroubleshootingGuides/CosmosClientRequestTimeout.md deleted file mode 100644 index b40b5af459..0000000000 --- a/TroubleshootingGuides/CosmosClientRequestTimeout.md +++ /dev/null @@ -1,61 +0,0 @@ -## Cosmos1000 - - - - - - - - - - - - - - -
|TypeName|Cosmos408_0000TransportException|
|CheckId|Cosmos408_0000|
|Category|Connectivity|
- -## Issue - -The SDK was not able to connect to the Azure Cosmos DB service. - -## Troubleshooting steps - -These are the known causes for this issue. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Possible cause | Solution |
|---|---|
| High CPU utilization. This is the most common cause. Look at CPU utilization at 10-second intervals. If the interval is larger, CPU spikes can be missed because they are averaged in with lower values. | The application should be scaled up/out. |
| Socket / port availability might be low. When running in Azure, clients using the .NET SDK can hit Azure SNAT (PAT) port exhaustion. | Follow the Cosmos1001 guide. |
| Creating multiple client instances might lead to connection contention and timeout issues. | Follow the [performance tips](https://docs.microsoft.com/azure/cosmos-db/performance-tips), and use a single CosmosClient instance across an entire process. |
| Retries occur from throttled requests. The SDK retries internally without surfacing this to the caller. | Check the [portal metrics](https://docs.microsoft.com/azure/cosmos-db/monitor-cosmos-db) for 429 throttled requests. |
| Hot partition key. Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. Check portal metrics to see if the workload is encountering a hot [partition key](https://docs.microsoft.com/azure/cosmos-db/partition-data). A hot partition key causes the aggregate consumed throughput (RU/s) to appear to be under the provisioned RU/s, while a single partition's consumed throughput (RU/s) exceeds the throughput provisioned for that partition. | The partition key should be changed to avoid the heavily used value. |
| A high degree of concurrency can lead to contention on the channel. | Try to scale the application up/out. |
| Large requests or responses can lead to head-of-line blocking on the channel and exacerbate contention, even with a relatively low degree of concurrency. | Try to scale the application up/out. |
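
The hot partition key row describes a situation where the aggregate numbers look fine but one partition is over budget. The sketch below works through that arithmetic under assumed figures (10,000 RU/s provisioned, five physical partitions, invented per-partition consumption). `HotPartitionCheck` is a hypothetical helper for illustration, not an SDK or portal API.

```csharp
using System;
using System.Linq;

public static class HotPartitionCheck
{
    // Assumed for illustration: 10,000 RU/s provisioned on the container.
    public const double ProvisionedRuPerSecond = 10_000;

    // Assumed consumed RU/s per physical partition; the first one is "hot".
    public static readonly double[] ConsumedPerPartition = { 2_500, 500, 500, 500, 500 };

    // Provisioned throughput is split evenly across physical partitions.
    public static double PerPartitionBudget() =>
        ProvisionedRuPerSecond / ConsumedPerPartition.Length;

    // Total consumption across all partitions.
    public static double AggregateConsumed() => ConsumedPerPartition.Sum();

    // True when any single partition consumes more than its even share.
    public static bool HasHotPartition() =>
        ConsumedPerPartition.Any(ru => ru > PerPartitionBudget());

    public static void Main()
    {
        Console.WriteLine($"Aggregate consumed: {AggregateConsumed()} of {ProvisionedRuPerSecond} RU/s");
        Console.WriteLine($"Per-partition budget: {PerPartitionBudget()} RU/s");
        Console.WriteLine($"Hot partition detected: {HasHotPartition()}");
    }
}
```

In this made-up case the container uses 4,500 of its 10,000 RU/s overall, yet the first partition asks for 2,500 RU/s against a 2,000 RU/s share, so requests to that partition would be throttled.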
- -SDK logs can be captured through [Trace Listener](https://github.com/Azure/azure-cosmosdb-dotnet/blob/master/docs/documentdb-sdk_capture_etl.md) to get more details. This can cause a significant performance impact. diff --git a/TroubleshootingGuides/CosmosNotFound.md b/TroubleshootingGuides/CosmosNotFound.md index 6385d59aee..183a5597a0 100644 --- a/TroubleshootingGuides/CosmosNotFound.md +++ b/TroubleshootingGuides/CosmosNotFound.md @@ -1,19 +1,10 @@ -## Cosmos404_0000 - - - - - - - - - - - - - - -
|TypeName|Cosmos404_0000NotFound|
|CheckId|Cosmos404_0000|
|Category|Service|
+## CosmosNotFound + +| | | | +|---|---|---| +|TypeName|CosmosNotFound| +|Status|404_0000| +|Category|Service| ## Description diff --git a/TroubleshootingGuides/CosmosNotModified.md b/TroubleshootingGuides/CosmosNotModified.md index 734bef7fc1..2dd4c5d66d 100644 --- a/TroubleshootingGuides/CosmosNotModified.md +++ b/TroubleshootingGuides/CosmosNotModified.md @@ -1,19 +1,10 @@ -## Cosmos304_0000 +## CosmosNotModified - - - - - - - - - - - - - -
|TypeName|Cosmos304_0000NotModified|
|CheckId|Cosmos304_0000|
|Category|Service|
+| | | | +|---|---|---| +|TypeName|CosmosNotModified| +|Status|304_0000| +|Category|Service| ## Description diff --git a/TroubleshootingGuides/CosmosRequestRateTooLarge.md b/TroubleshootingGuides/CosmosRequestRateTooLarge.md index 2c1212db0e..ab3d72be24 100644 --- a/TroubleshootingGuides/CosmosRequestRateTooLarge.md +++ b/TroubleshootingGuides/CosmosRequestRateTooLarge.md @@ -1,19 +1,10 @@ -## Cosmos429_0000 +## CosmosRequestRateTooLarge - - - - - - - - - - - - - -
|TypeName|Cosmos429_0000RequestRateTooLarge|
|CheckId|Cosmos429_0000|
|Category|Service|
+| | | | +|---|---|---| +|TypeName|CosmosRequestRateTooLarge| +|Status|429_0000| +|Category|Service| ## Issue diff --git a/TroubleshootingGuides/CosmosRequestTimeoutClient.md b/TroubleshootingGuides/CosmosRequestTimeoutClient.md new file mode 100644 index 0000000000..b0a0e4b225 --- /dev/null +++ b/TroubleshootingGuides/CosmosRequestTimeoutClient.md @@ -0,0 +1,53 @@ +## CosmosRequestTimeoutClient + +| | | | +|---|---|---| +|TypeName|CosmosRequestTimeoutClient| +|Status|408_0000| +|Category|Connectivity| + + +## Issue + +The SDK was not able to connect to the Azure Cosmos DB service. + +## Troubleshooting steps +These are the known causes for this issue. + +### High CPU utilization +This is the most common cause. It is recommended to look at CPU utilization at 10-second intervals. If the interval is larger, CPU spikes can be missed because they are averaged in with lower values. + +#### Solution: +The application should be scaled up/out. + +### Socket / Port availability might be low +When running in Azure, clients using the .NET SDK can hit Azure SNAT (PAT) port exhaustion. + +#### Solution: +Follow the CosmosSNATPortExhaustion guide. + +### Creating multiple Client instances +This might lead to connection contention and timeout issues. + +#### Solution: +Follow the [performance tips](https://docs.microsoft.com/azure/cosmos-db/performance-tips), and use a single CosmosClient instance across an entire process. + +### Hot partition key +Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. One partition is having all of its resources consumed while other partitions go unused. Check portal metrics to see if the workload is encountering a hot [partition key](https://docs.microsoft.com/azure/cosmos-db/partition-data). 
This will cause the aggregate consumed throughput (RU/s) to be appear to be under the provisioned RUs, but a single partition consumed throughput (RU/s) will exceed the provisioned throughput + +#### Solution: +The partition key should be changed to avoid the heavily used value. + +### High degree of concurrency +The application is doing a high level of conccurrency which can lead to contention on the channel + +#### Solution: +Try to scale the application up/out. + +### Large requests and/or responses +Large requests or responses can lead to head-of-line blocking on the channel and exacerbate contention, even with a relatively low degree of concurrency. + +#### Solution: +Try to scale the application up/out. + +SDK logs can be captured through [Trace Listener](https://github.com/Azure/azure-cosmosdb-dotnet/blob/master/docs/documentdb-sdk_capture_etl.md) to get more details. This can cause a significant performance impact. diff --git a/TroubleshootingGuides/CosmosRequestTimeoutService.md b/TroubleshootingGuides/CosmosRequestTimeoutService.md new file mode 100644 index 0000000000..59e214627b --- /dev/null +++ b/TroubleshootingGuides/CosmosRequestTimeoutService.md @@ -0,0 +1,26 @@ +## CosmosRequestTimeoutService + +| | | | +|---|---|---| +|TypeName|CosmosRequestTimeoutService| +|Status|408_0000| +|Category|Service| + +## Issue + +The SDK was able to connect to the Azure Cosmos DB service, but the request timed out. + +## Troubleshooting steps +These are the known causes for this issue. + +### Transient +It most likely a transient issue. Check if the failure rate is violating the Cosmos DB SLA. If it is violating the SLA please contact Azure support. The application should be able to handle transient + +#### Failure rate is within Cosmos DB SLA + The application should be able to handle transient failures. + +#### Failure rate is violating the Cosmos DB SLA +Please contact Azure support. 
+ + +SDK logs can be captured through [Trace Listener](https://github.com/Azure/azure-cosmosdb-dotnet/blob/master/docs/documentdb-sdk_capture_etl.md) to get more details. This can cause a significant performance impact. diff --git a/TroubleshootingGuides/CosmosSNATPortExhaustion.md b/TroubleshootingGuides/CosmosSNATPortExhaustion.md index 8eb1c2d8ea..e571e88235 100644 --- a/TroubleshootingGuides/CosmosSNATPortExhaustion.md +++ b/TroubleshootingGuides/CosmosSNATPortExhaustion.md @@ -3,7 +3,7 @@ - + diff --git a/TroubleshootingGuides/CosmosServiceRequestTimeout.md b/TroubleshootingGuides/CosmosServiceRequestTimeout.md deleted file mode 100644 index ce183033a1..0000000000 --- a/TroubleshootingGuides/CosmosServiceRequestTimeout.md +++ /dev/null @@ -1,61 +0,0 @@ -## Cosmos1000 - -
TypeNameCosmos503_9000SNATPortExhuastionCosmosSNATPortExhuastion
CheckId
- - - - - - - - - - - - -
|TypeName|Cosmos408_9001ServiceTransportException|
|CheckId|Cosmos408_0000|
|Category|Connectivity|
- -## Issue - -The SDK was able to send the request to Cosmos DB, but the operation timed out. - -## Troubleshooting steps - -These are the known causes for this issue. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
|Possible cause|Solution|
|---|---|
|High CPU utilization. This is the most common cause. It is recommended to look at CPU utilization at 10 second intervals. If the interval is larger then CPU spikes can be missed by getting averaged in with lower values.|The application should be scaled up/out.|
|Socket / Port availability might be low. When running in Azure, clients using the .NET SDK can hit Azure SNAT (PAT) port exhaustion.|Follow the Cosmos1001 guide.|
|Creating multiple Client instances might lead to connection contention and timeout issues.|Follow the [performance tips](https://docs.microsoft.com/azure/cosmos-db/performance-tips), and use a single CosmosClient instance across an entire process.|
|Retries occur from throttled requests. The SDK retries internally without surfacing this to the caller.|Check the [portal metrics](https://docs.microsoft.com/azure/cosmos-db/monitor-cosmos-db) for 429 throttled requests|
|Hot partition key. Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. Check portal metrics to see if the workload is encountering a hot [partition key](https://docs.microsoft.com/azure/cosmos-db/partition-data). This will cause the aggregate consumed throughput (RU/s) to be appear to be under the provisioned RUs, but a single partition consumed throughput (RU/s) will exceed the provisioned throughput.|The partition key should be changed to avoid the heavily used value.|
|A high degree of concurrency can lead to contention on the channel|Try to scale the application up/out.|
|Large requests or responses can lead to head-of-line blocking on the channel and exacerbate contention, even with a relatively low degree of concurrency.|Try to scale the application up/out.|
- -SDK logs can be captured through [Trace Listener](https://github.com/Azure/azure-cosmosdb-dotnet/blob/master/docs/documentdb-sdk_capture_etl.md) to get more details. This can cause a significant performance impact. From e7aafa51892ff029fb2c62edee4ed38e4a668e46 Mon Sep 17 00:00:00 2001 From: Jake Willey Date: Wed, 29 Apr 2020 14:52:12 -0700 Subject: [PATCH 18/20] Add more TSGs and update docs --- TroubleshootingGuides/CosmosMacSignature.md | 31 +++++++++++++++++++ .../CosmosRequestHeaderTooLarge.md | 31 +++++++++++++++++++ .../CosmosSNATPortExhaustion.md | 23 +++++--------- 3 files changed, 69 insertions(+), 16 deletions(-) create mode 100644 TroubleshootingGuides/CosmosMacSignature.md create mode 100644 TroubleshootingGuides/CosmosRequestHeaderTooLarge.md diff --git a/TroubleshootingGuides/CosmosMacSignature.md b/TroubleshootingGuides/CosmosMacSignature.md new file mode 100644 index 0000000000..3e95b42b33 --- /dev/null +++ b/TroubleshootingGuides/CosmosMacSignature.md @@ -0,0 +1,31 @@ +## CosmosNotModified + +| | | | +|---|---|---| +|TypeName|CosmosNotModified| +|Status|401_0000| +|Category|Service| + +## Description +HTTP 401: The MAC signature found in the HTTP request is not the same as the computed signature +If you received the following 401 error message: "The MAC signature found in the HTTP request is not the same as the computed signature." it can be caused by the following scenarios. + +## Troubleshooting steps + +### Key was not properly rotated +The key was rotated and did not follow the [best practices](secure-access-to-data.md#key-rotation). This is usually the case. Cosmos DB account key rotation can take anywhere from a few seconds to possibly days depending on the Cosmos DB account size. + +#### Symptoms +401 MAC signature is seen shortly after a key rotation and eventually stops without any changes. + +### The key is misconfigured +The key is misconfigured on the application so the key does not match the account or entire key was not copied. 
+ +#### Symptoms +401 MAC signature issue will be consistent and happens for all calls + +### Race condition with create container +There is a race condition with container creation. An application instance is trying to access the container before container creation is complete. The most common scenario for this if the application is running, and the container is deleted and recreated with the same name while the application is running. The SDK will attempt to use the new container, but the container creation is still in progress so it does not have the keys. + +#### Symptoms +401 MAC signature issue is seen shortly after a container creation, and only occur until the container creation is completed. \ No newline at end of file diff --git a/TroubleshootingGuides/CosmosRequestHeaderTooLarge.md b/TroubleshootingGuides/CosmosRequestHeaderTooLarge.md new file mode 100644 index 0000000000..d56c3a0d02 --- /dev/null +++ b/TroubleshootingGuides/CosmosRequestHeaderTooLarge.md @@ -0,0 +1,31 @@ +## CosmosNotModified + +| | | | +|---|---|---| +|TypeName|CosmosNotModified| +|Status|304_0000| +|Category|Service| + +## Description +HTTP 401: The MAC signature found in the HTTP request is not the same as the computed signature +If you received the following 401 error message: "The MAC signature found in the HTTP request is not the same as the computed signature." it can be caused by the following scenarios. + +## Troubleshooting steps + +### Key was not properly rotated +The key was rotated and did not follow the [best practices](secure-access-to-data.md#key-rotation). This is usually the case. Cosmos DB account key rotation can take anywhere from a few seconds to possibly days depending on the Cosmos DB account size. + +#### Symptoms +401 MAC signature is seen shortly after a key rotation and eventually stops without any changes. + +### The key is misconfigured +The key is misconfigured on the application so the key does not match the account or entire key was not copied. 
+ +#### Symptoms +401 MAC signature issue will be consistent and happens for all calls + +### Race condition with create container +There is a race condition with container creation. An application instance is trying to access the container before container creation is complete. The most common scenario for this if the application is running, and the container is deleted and recreated with the same name while the application is running. The SDK will attempt to use the new container, but the container creation is still in progress so it does not have the keys. + +#### Symptoms +401 MAC signature issue is seen shortly after a container creation, and only occur until the container creation is completed. \ No newline at end of file diff --git a/TroubleshootingGuides/CosmosSNATPortExhaustion.md b/TroubleshootingGuides/CosmosSNATPortExhaustion.md index e571e88235..00d2c902bc 100644 --- a/TroubleshootingGuides/CosmosSNATPortExhaustion.md +++ b/TroubleshootingGuides/CosmosSNATPortExhaustion.md @@ -1,19 +1,10 @@ -## Cosmos503_9000 - - - - - - - - - - - - - - -
|TypeName|CosmosSNATPortExhuastion|
|CheckId|Cosmos503_9000|
|Category|Connectivity|
+## CosmosSNATPortExhuastion + +| | | | +|---|---|---| +|TypeName|CosmosSNATPortExhuastion| +|Status|503_0000| +|Category|Connectivity| ## Issue From 16b5f7fa5bba32453eae9c8438330c0713181d24 Mon Sep 17 00:00:00 2001 From: Jake Willey Date: Tue, 12 May 2020 10:17:30 -0700 Subject: [PATCH 19/20] Updated documentation based on feedback --- TroubleshootingGuides/CosmosMacSignature.md | 27 ++++++------ TroubleshootingGuides/CosmosNotFound.md | 33 ++++++++++---- .../CosmosRequestHeaderTooLarge.md | 34 +++++++------- .../CosmosRequestTimeoutClient.md | 44 ++++++++----------- .../CosmosRequestTimeoutService.md | 16 +++---- 5 files changed, 79 insertions(+), 75 deletions(-) diff --git a/TroubleshootingGuides/CosmosMacSignature.md b/TroubleshootingGuides/CosmosMacSignature.md index 3e95b42b33..1e54361890 100644 --- a/TroubleshootingGuides/CosmosMacSignature.md +++ b/TroubleshootingGuides/CosmosMacSignature.md @@ -1,4 +1,4 @@ -## CosmosNotModified +## CosmosUnauthorized | | | | |---|---|---| @@ -12,20 +12,21 @@ If you received the following 401 error message: "The MAC signature found in the ## Troubleshooting steps -### Key was not properly rotated -The key was rotated and did not follow the [best practices](secure-access-to-data.md#key-rotation). This is usually the case. Cosmos DB account key rotation can take anywhere from a few seconds to possibly days depending on the Cosmos DB account size. +### 1. Key was not properly rotated. -#### Symptoms -401 MAC signature is seen shortly after a key rotation and eventually stops without any changes. + Symptom: 401 MAC signature is seen shortly after a key rotation and eventually stops without any changes. -### The key is misconfigured -The key is misconfigured on the application so the key does not match the account or entire key was not copied. + Cause: The key was rotated and did not follow the [best practices](secure-access-to-data.md#key-rotation). This is usually the case. 
Cosmos DB account key rotation can take anywhere from a few seconds to possibly days depending on the Cosmos DB account size. -#### Symptoms -401 MAC signature issue will be consistent and happens for all calls +### 2. The key is misconfigured -### Race condition with create container -There is a race condition with container creation. An application instance is trying to access the container before container creation is complete. The most common scenario for this if the application is running, and the container is deleted and recreated with the same name while the application is running. The SDK will attempt to use the new container, but the container creation is still in progress so it does not have the keys. + Symptoms: 401 MAC signature issue will be consistent and happens for all calls using that key -#### Symptoms -401 MAC signature issue is seen shortly after a container creation, and only occur until the container creation is completed. \ No newline at end of file + Cause: The key is misconfigured on the application so the key does not match the account or entire key was not copied. + + +### 3. Race condition with create container + + Symptoms: 401 MAC signature issue is seen shortly after a container creation, and only occur until the container creation is completed. + + Cause: There is a race condition with container creation. An application instance is trying to access the container before container creation is complete. The most common scenario for this if the application is running, and the container is deleted and recreated with the same name while the application is running. The SDK will attempt to use the new container, but the container creation is still in progress so it does not have the keys. 
\ No newline at end of file diff --git a/TroubleshootingGuides/CosmosNotFound.md b/TroubleshootingGuides/CosmosNotFound.md index 183a5597a0..14fceba69d 100644 --- a/TroubleshootingGuides/CosmosNotFound.md +++ b/TroubleshootingGuides/CosmosNotFound.md @@ -14,14 +14,29 @@ This status code represents that the resource no longer exists. The document does exists, but still returns a 404. -### Cause 1: Race condition -There is multiple SDK client instances and the read happened before the write. +### 1. Race condition + Cause: There is multiple SDK client instances and the read happened before the write. -### Solution -1. For session consistency the create item will return a session token that can be passed between SDK instances to guarantee that the read request is reading from a replica with that change. -2. Change the consistency level to a stronger level + Fix: + 1. For session consistency the create item will return a session token that can be passed between SDK instances to guarantee that the read request is reading from a replica with that change. + 2. Change the [consistency level](https://docs.microsoft.com/azure/cosmos-db/consistency-levels-choosing) to a [stronger level](https://docs.microsoft.com/azure/cosmos-db/consistency-levels-tradeoffs) -### Related documentation -* [Consistency levels](https://docs.microsoft.com/azure/cosmos-db/consistency-levels) -* [Choose the right consistency level](https://docs.microsoft.com/azure/cosmos-db/consistency-levels-choosing) -* [Consistency, availability, and performance tradeoffs](https://docs.microsoft.com/azure/cosmos-db/consistency-levels-tradeoffs) \ No newline at end of file +### 2. Invalid Partition Key and ID combination + Cause: The partition key and id combination are not valid. + + Fix: Fix the application logic that is causing the incorrect combination. + +### 3. TTL purge + Cause: The item had the [Time To Live (TTL)](https://docs.microsoft.com/azure/cosmos-db/time-to-live) property set. 
The item was purged because the time to live had expired. + + Fix: Change the Time To Live to prevent the item from getting purged. + +### 4. Lazy indexing + Cause: The [lazy indexing](https://docs.microsoft.com/azure/cosmos-db/index-policy#indexing-mode) has not caught up. + + Fix: Wait for the indexing to catch up or change the indexing policy + +### 5. Parent resource deleted + Cause: The database and/or container that the item exists in has been deleted. + + Fix: [Restore](https://docs.microsoft.com/azure/cosmos-db/online-backup-and-restore#backup-retention-period) the parent resource or recreate the resources. \ No newline at end of file diff --git a/TroubleshootingGuides/CosmosRequestHeaderTooLarge.md b/TroubleshootingGuides/CosmosRequestHeaderTooLarge.md index d56c3a0d02..9e41ae825a 100644 --- a/TroubleshootingGuides/CosmosRequestHeaderTooLarge.md +++ b/TroubleshootingGuides/CosmosRequestHeaderTooLarge.md @@ -1,31 +1,31 @@ -## CosmosNotModified +## CosmosRequestHeaderTooLarge | | | | |---|---|---| -|TypeName|CosmosNotModified| -|Status|304_0000| +|TypeName|CosmosRequestHeaderTooLarge| +|Status|400_0000| |Category|Service| ## Description -HTTP 401: The MAC signature found in the HTTP request is not the same as the computed signature -If you received the following 401 error message: "The MAC signature found in the HTTP request is not the same as the computed signature." it can be caused by the following scenarios. +The size of the header has grown to large and is exceeding the maximum allowed size. It's always recommended to use the latest SDK. Make sure to use at least version 3.x or 2.x, which adds header size tracing to the exception message. ## Troubleshooting steps -### Key was not properly rotated -The key was rotated and did not follow the [best practices](secure-access-to-data.md#key-rotation). This is usually the case. Cosmos DB account key rotation can take anywhere from a few seconds to possibly days depending on the Cosmos DB account size. +### 1. 
Session Token too large + Symptoms: The 400 bad request is happening on point operations where the continuation token is not being used. -#### Symptoms -401 MAC signature is seen shortly after a key rotation and eventually stops without any changes. + Cause: The session token grows as the number of partitions increase in the container. The numbers of partition increase as the amount of data increase or if the thoughput is increased. -### The key is misconfigured -The key is misconfigured on the application so the key does not match the account or entire key was not copied. + Temprorary mitigation: Restart the application will reset all the session token. This session token will eventually grow back to the previous size that causes the issue. -#### Symptoms -401 MAC signature issue will be consistent and happens for all calls + Fixes: + 1. Follow the performance tips and convert the application to Direct + TCP connection mode. Direct + TCP does not have the header size restriction like HTTP does which avoids this issue. Make sure to use SDK version greater than 2.9.3 which has a fix for query opertaions when the service interop is not available. + 2. If the application cannot be converted to Direct + TCP and the session token is the cause, then mitigation can be done by changing the client consistency level. The session token is only used for session consistency which is the default for Cosmos DB. Any other consistency level will not use the session token. -### Race condition with create container -There is a race condition with container creation. An application instance is trying to access the container before container creation is complete. The most common scenario for this if the application is running, and the container is deleted and recreated with the same name while the application is running. The SDK will attempt to use the new container, but the container creation is still in progress so it does not have the keys. 
-#### Symptoms -401 MAC signature issue is seen shortly after a container creation, and only occur until the container creation is completed. \ No newline at end of file +### 2. Continuation token too large + Symptoms: The 400 bad request is happening on query operations where the continuation token is being passed in. + Cause: The continuation token has grown to large. Different queries will have different continuation token sizes. + Fixes: + 1. Follow the performance tips and convert the application to Direct + TCP connection mode. Direct + TCP does not have the header size restriction like HTTP does which avoids this issue. + 2. If the application cannot be converted to Direct + TCP and the continuation token is the cause, then try setting the ResponseContinuationTokenLimitInKb option. The option can be found in the FeedOptions for v2 or the QueryRequestOptions in v3. \ No newline at end of file diff --git a/TroubleshootingGuides/CosmosRequestTimeoutClient.md b/TroubleshootingGuides/CosmosRequestTimeoutClient.md index b0a0e4b225..373dff1773 100644 --- a/TroubleshootingGuides/CosmosRequestTimeoutClient.md +++ b/TroubleshootingGuides/CosmosRequestTimeoutClient.md @@ -14,40 +14,32 @@ The SDK was not able to connect to the Azure Cosmos DB service. ## Troubleshooting steps These are the known causes for this issue. -### High CPU utilization -This is the most common cause. It is recommended to look at CPU utilization at 10 second intervals. If the interval is larger then CPU spikes can be missed by getting averaged in with lower values. +### 1. High CPU utilization (most common case) + Cause: For optimal latency it is recommended that CPU usage should be roughly 40%. It is recommended to look at CPU utilization at 10 second intervals. If the interval is larger then CPU spikes can be missed by getting averaged in with lower values. This is more common with cross partition queries where it might do multiple connections for a single request. 
-#### Solution: -The application should be scaled up/out. + Fix: The application should be scaled up/out. -### Socket / Port availability might be low -When running in Azure, clients using the .NET SDK can hit Azure SNAT (PAT) port exhaustion. +### 2. Socket / Port availability might be low + Cause: When running in Azure, clients using the .NET SDK can hit Azure SNAT (PAT) port exhaustion. -#### Solution: -Follow the CosmosSNATPortExhuastion guide. + Fix: Follow the CosmosSNATPortExhuastion guide. -### Creating multiple Client instances -This might lead to connection contention and timeout issues. +### 3. Creating multiple Client instances + Cause: This might lead to connection contention and timeout issues. -#### Solution: -Follow the [performance tips](https://docs.microsoft.com/azure/cosmos-db/performance-tips), and use a single CosmosClient instance across an entire process.| + Fix:Follow the [performance tips](https://docs.microsoft.com/azure/cosmos-db/performance-tips), and use a single CosmosClient instance across an entire process.| -### Hot partition key -Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. One partition is having all of it's resources consumed while other partitions go unused. Check portal metrics to see if the workload is encountering a hot [partition key](https://docs.microsoft.com/azure/cosmos-db/partition-data). This will cause the aggregate consumed throughput (RU/s) to be appear to be under the provisioned RUs, but a single partition consumed throughput (RU/s) will exceed the provisioned throughput +### 4. Hot partition key + Cause: Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. One partition is having all of it's resources consumed while other partitions go unused. Check portal metrics to see if the workload is encountering a hot [partition key](https://docs.microsoft.com/azure/cosmos-db/partition-data). 
This will cause the aggregate consumed throughput (RU/s) to be appear to be under the provisioned RUs, but a single partition consumed throughput (RU/s) will exceed the provisioned throughput -#### Solution: -The partition key should be changed to avoid the heavily used value. + Fix: The partition key should be changed to avoid the heavily used value. -### High degree of concurrency -The application is doing a high level of conccurrency which can lead to contention on the channel +### 5. High degree of concurrency + Cause: The application is doing a high level of conccurrency which can lead to contention on the channel -#### Solution: -Try to scale the application up/out. + Fix: Try to scale the application up/out. -### Large requests and/or responses -Large requests or responses can lead to head-of-line blocking on the channel and exacerbate contention, even with a relatively low degree of concurrency. +### 6. Large requests and/or responses + Cause: Large requests or responses can lead to head-of-line blocking on the channel and exacerbate contention, even with a relatively low degree of concurrency. -#### Solution: -Try to scale the application up/out. - -SDK logs can be captured through [Trace Listener](https://github.com/Azure/azure-cosmosdb-dotnet/blob/master/docs/documentdb-sdk_capture_etl.md) to get more details. This can cause a significant performance impact. + Fix: Try to scale the application up/out. \ No newline at end of file diff --git a/TroubleshootingGuides/CosmosRequestTimeoutService.md b/TroubleshootingGuides/CosmosRequestTimeoutService.md index 59e214627b..99630598e1 100644 --- a/TroubleshootingGuides/CosmosRequestTimeoutService.md +++ b/TroubleshootingGuides/CosmosRequestTimeoutService.md @@ -11,16 +11,12 @@ The SDK was able to connect to the Azure Cosmos DB service, but the request timed out. ## Troubleshooting steps -These are the known causes for this issue. -### Transient -It most likely a transient issue. 
Check if the failure rate is violating the Cosmos DB SLA. If it is violating the SLA please contact Azure support. The application should be able to handle transient +### 1. Check the portal metrics + Use the [Azure monitoring](https://docs.microsoft.com/azure/cosmos-db/monitor-cosmos-db) to check if the 408 request timeout was from the service. -#### Failure rate is within Cosmos DB SLA - The application should be able to handle transient failures. +### 2. Failure rate is within Cosmos DB SLA + The application should be able to handle transient failures and retry when necessary. -#### Failure rate is violating the Cosmos DB SLA -Please contact Azure support. - - -SDK logs can be captured through [Trace Listener](https://github.com/Azure/azure-cosmosdb-dotnet/blob/master/docs/documentdb-sdk_capture_etl.md) to get more details. This can cause a significant performance impact. +### 3. Failure rate is violating the Cosmos DB SLA + Please contact Azure support. \ No newline at end of file From eb162972086da2e47ef23454293f708b90df6148 Mon Sep 17 00:00:00 2001 From: Jake Willey Date: Tue, 12 May 2020 10:28:01 -0700 Subject: [PATCH 20/20] Fixed formatting --- TroubleshootingGuides/CosmosNotFound.md | 2 +- TroubleshootingGuides/CosmosRequestHeaderTooLarge.md | 4 +++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/TroubleshootingGuides/CosmosNotFound.md b/TroubleshootingGuides/CosmosNotFound.md index 14fceba69d..64f7001807 100644 --- a/TroubleshootingGuides/CosmosNotFound.md +++ b/TroubleshootingGuides/CosmosNotFound.md @@ -10,7 +10,7 @@ This status code represents that the resource no longer exists. -## Known issues +## Known causes The document does exists, but still returns a 404. 
diff --git a/TroubleshootingGuides/CosmosRequestHeaderTooLarge.md b/TroubleshootingGuides/CosmosRequestHeaderTooLarge.md index 9e41ae825a..35ad6a5d32 100644 --- a/TroubleshootingGuides/CosmosRequestHeaderTooLarge.md +++ b/TroubleshootingGuides/CosmosRequestHeaderTooLarge.md @@ -12,7 +12,7 @@ The size of the header has grown to large and is exceeding the maximum allowed s ## Troubleshooting steps ### 1. Session Token too large - Symptoms: The 400 bad request is happening on point operations where the continuation token is not being used. + Symptoms: The 400 bad request is happening on point operations where the continuation token is not being used. The exception started without making any changes to the application. Cause: The session token grows as the number of partitions increase in the container. The numbers of partition increase as the amount of data increase or if the thoughput is increased. @@ -25,7 +25,9 @@ The size of the header has grown to large and is exceeding the maximum allowed s ### 2. Continuation token too large Symptoms: The 400 bad request is happening on query operations where the continuation token is being passed in. + Cause: The continuation token has grown to large. Different queries will have different continuation token sizes. + Fixes: 1. Follow the performance tips and convert the application to Direct + TCP connection mode. Direct + TCP does not have the header size restriction like HTTP does which avoids this issue. 2. If the application cannot be converted to Direct + TCP and the continuation token is the cause, then try setting the ResponseContinuationTokenLimitInKb option. The option can be found in the FeedOptions for v2 or the QueryRequestOptions in v3. \ No newline at end of file