[Internal] Design Docs: Adds Design Document for Client Telemetry #3590

Merged on Jun 9, 2023 (22 commits).
70 changes: 69 additions & 1 deletion docs/observability.md
OtherLogic --> GetResponse(Get Response for the request)
SendResponse --> OperationCall

```
```

## Send telemetry from SDK to service (Private Preview)

### Introduction
The SDK sends aggregated telemetry data to Microsoft every 10 minutes. The following information is collected as part of this feature:
1. Cache latencies: currently only the collection cache is covered.
2. Client system usage (during an operation):
   * CPU usage
   * Memory usage
   * Thread starvation
   * Network connections opened (TCP connections only)
3. Operation Latencies and Request Units (RUs).
4. Network request latencies (sampled to the 10 slowest requests per replica).

> Note: We don't collect any PII data as part of this feature.

### Limitations
1. AAD Support is not available.

### Components

**Telemetry Job:** A background task that collects the data and sends it to a Microsoft service every 10 minutes.
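
The periodic behaviour described above can be sketched as follows. This is a minimal illustration in Python, not the SDK's implementation: the class name `TelemetryJob`, the `flush` callback, and the choice to swallow exceptions are all assumptions made for the example.

```python
import threading


class TelemetryJob:
    """Hypothetical background job: calls `flush` once per interval until stopped."""

    def __init__(self, flush, interval_seconds=600.0):
        self._flush = flush                  # callback that collects and sends data
        self._interval = interval_seconds    # 600 s = the 10 minutes from the doc
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop.wait(self._interval):
            try:
                self._flush()
            except Exception:
                # Assumption: a telemetry failure should not disturb the application.
                pass
```

Waiting on an `Event` instead of sleeping lets `stop()` end the loop immediately rather than after the remainder of the interval.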

**Collectors:** In-memory storage that keeps the telemetry data collected during an operation. There are three types of collectors:
* _Operational Data Collector_: Keeps operation-level latencies and request units.
* _Network Data Collector_: Keeps all metrics related to network (TCP) calls. It has its own sampler, which retains only the slowest TCP calls for each replica.
* _Cache Data Collector_: Keeps all cache call latencies. Currently only the collection cache is covered.
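
One way to picture the network collector's sampler is a bounded min-heap per replica that retains only the slowest calls. The sketch below is an assumption about how such a sampler could work; `SlowestRequestSampler`, `record`, and `sample` are hypothetical names, and the default of 10 mirrors the "10 slowest" figure in the introduction.

```python
import heapq
from collections import defaultdict


class SlowestRequestSampler:
    """Hypothetical sampler: keeps the N slowest latencies seen for each replica."""

    def __init__(self, max_per_replica=10):
        self._max = max_per_replica
        self._heaps = defaultdict(list)  # replica id -> min-heap of latencies

    def record(self, replica, latency_ms):
        heap = self._heaps[replica]
        if len(heap) < self._max:
            heapq.heappush(heap, latency_ms)
        elif latency_ms > heap[0]:
            # Evict the fastest retained call to make room for a slower one.
            heapq.heapreplace(heap, latency_ms)

    def sample(self, replica):
        # Slowest first; unknown replicas yield an empty list.
        return sorted(self._heaps.get(replica, []), reverse=True)
```

Because each heap never holds more than `max_per_replica` entries, memory use per replica stays bounded regardless of request volume.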

**Get VM Information**: Calls the [Azure Instance Metadata Service](https://learn.microsoft.com/azure/virtual-machines/instance-metadata-service?tabs=windows). If the application is not running on an Azure VM, this information is unavailable and a warning with the exception appears in the trace logs (if enabled).
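
As an illustration of this step, the sketch below queries the Azure Instance Metadata Service endpoint and falls back to `None` with a logged warning when the call fails, as it would off Azure. The function names, the 2-second timeout, and the injectable `fetch` parameter are assumptions made for the example. The endpoint address and the `Metadata: true` header come from the public IMDS documentation.

```python
import json
import logging
import urllib.request

# Public IMDS endpoint; the api-version chosen here is just one documented version.
IMDS_URL = "http://169.254.169.254/metadata/instance?api-version=2021-02-01"


def _fetch_imds(url, timeout):
    # IMDS requires the "Metadata: true" header and must bypass any proxy.
    request = urllib.request.Request(url, headers={"Metadata": "true"})
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def get_vm_information(fetch=_fetch_imds, timeout=2.0):
    """Return VM metadata as a dict, or None when it cannot be obtained."""
    try:
        return json.loads(fetch(IMDS_URL, timeout))
    except (OSError, ValueError) as error:
        # Network errors are OSError subclasses; malformed JSON raises ValueError.
        logging.warning("VM information unavailable: %r", error)
        return None
```

Passing `fetch` as a parameter keeps the function testable without network access.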

**Processor**: Responsible for fetching all the data, dividing it into small chunks (<2MB), and sending each chunk to the Microsoft service.
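
The chunking step can be sketched as below. The wire format is not specified in this document, so newline-delimited JSON is purely an assumption, as are the name `split_into_chunks` and the reading of "<2MB" as 2 * 1024 * 1024 bytes.

```python
import json

MAX_CHUNK_BYTES = 2 * 1024 * 1024  # assumed interpretation of the "<2MB" limit


def split_into_chunks(records, max_bytes=MAX_CHUNK_BYTES):
    """Serialize records as JSON lines and group them into chunks below max_bytes."""
    chunks, current, current_size = [], [], 0
    for record in records:
        line = json.dumps(record).encode("utf-8") + b"\n"
        if len(line) >= max_bytes:
            raise ValueError("a single record exceeds the chunk limit")
        if current and current_size + len(line) >= max_bytes:
            # Adding this record would reach the limit, so close the current chunk.
            chunks.append(b"".join(current))
            current, current_size = [], 0
        current.append(line)
        current_size += len(line)
    if current:
        chunks.append(b"".join(current))
    return chunks
```

Each returned chunk could then be sent as the body of one HTTP request.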

```mermaid
flowchart TD
subgraph TelemetryJob[Telemetry Background Job]
subgraph Storage[In Memory Storage or Collectors]
subgraph NetworkDataCollector[Network Data Collector]
TcpDatapoint(Network Request Datapoint) --> NetworkHistogram[(Histogram)]
DataSampler(Sampler)
end
subgraph DataCollector[Operational Data Collector]
OpsDatapoint(Operation Datapoint) --> OperationHistogram[(Histogram)]
end
subgraph CacheCollector[Cache Data Collector]
CacheDatapoint(Cache Request Datapoint) --> CacheHistogram[(Histogram)]
end
end
subgraph TelemetryTask[Telemetry Task Every 10 min]
CacheAccountInfo(Cached Account Properties) --> VMInfo
VMInfo(Get VM Information) --> CollectSystemUsage
CollectSystemUsage(Record System Usage Information) --> GetDataFromCollector
end
subgraph Processor
GetDataFromCollector(Fetch Data from Collectors) --> Serializer
Serializer(Serialize and divide the Payload) --> SendCTOverHTTP(Send Data over HTTP to Service)
end
Storage --> |Get Aggregated data|GetDataFromCollector
end
```

### Benefits
The telemetry data collected with this feature enabled allows us to identify and address potential issues. This improves the support experience and means some issues can be resolved before they impact your application.

### Impact of enabling this feature
* _Latency_: Customers should not see any impact on latency.
* _Total RPS_: The impact depends on, among other factors, the infrastructure hosting the application that uses the SDK, but it should not exceed 10%.
* _Any other impact_: The collectors need around 18MB of in-memory storage to hold the data. This size is constant: it does not grow, no matter how much data is collected.
* Benchmark Numbers: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Performance.Tests/Contracts/BenchmarkResults.json
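
To illustrate why histogram-based storage can stay constant in size, here is a minimal fixed-bucket histogram. It is a sketch under assumptions, not the data structure the SDK uses: `FixedBucketHistogram` and its bucket bounds are hypothetical.

```python
import bisect


class FixedBucketHistogram:
    """Hypothetical histogram whose memory use is fixed at construction time."""

    def __init__(self, upper_bounds):
        self._bounds = sorted(upper_bounds)
        # One counter per bucket, plus a final overflow bucket.
        self._counts = [0] * (len(self._bounds) + 1)

    def record(self, value):
        # A value lands in the first bucket whose upper bound is >= value.
        self._counts[bisect.bisect_left(self._bounds, value)] += 1

    def counts(self):
        return list(self._counts)
```

Recording a value only increments an existing counter, so the structure holds the same number of integers after one observation as after a million.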