- RFC Name: Response Time Observability
- RFC ID: 0035
- Start Date: 2017-10-23
- Owner: Mike Goldsmith
- Current Status: ACCEPTED
From a diagnostics point of view it is valuable to collect and analyse each operation an SDK performs to help identify problem areas. Each operation goes through a number of phases and tracking each of them both individually and as a whole allows in-depth interrogation.
In addition, it would also be useful to identify if and when an operation takes an unusually long time to complete. This analysis will help aid investigation by providing contextual information that is only available during the operation processing.
Tracing is the recording of details about discrete steps or phases of a request lifecycle, eg request encoding / decoding or dispatching to the server. These phases are timed independently and contain additional contextual information. A single request is typically made up of multiple tracing points.
OpenTracing is a standardised API to structure tracing information in a consistent and predictable manner. This document will not cover all the topics and history of tracing or OpenTracing (please refer to the above links for more complete documentation), and instead will only cover key topics and concepts.
Briefly: OpenTracing tracing data is a nested structure of Spans with durations, contextual information and relationships to other Spans. A structural example is below:
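(An illustrative sketch only; the span names follow the trace points defined later in this RFC, and the tag value is hypothetical.)

get (top-level span, tag couchbase.operation_id=0x21)
├── request_encoding (child span)
├── dispatch_to_server (child span)
└── response_decoding (child span)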
A good way to visualise the span structure is with an x-axis time graph like the following:
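(Durations below are illustrative, in microseconds.)

time →    0         500       1000      1500      2000
get       [===========================================]
  request_encoding   [====]
  dispatch_to_server        [===================]
  response_decoding                             [====]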
The key concepts of OpenTracing are described below:
| Type | Description |
| --- | --- |
| Tracer | Top-level structure that can be used to create a SpanBuilder. |
| SpanBuilder | Creates and starts Spans with contextual information. |
| SpanContext | Contextual information for a given Span. This information can be transported between processes / services. |
| Span | An in-process tracing structure; includes start timestamp, end timestamp and additional contextual information (referred to as tags and baggage). |
This RFC proposes that the Couchbase SDKs adopt the OpenTracing API to provide per-request tracing. However, because the OpenTracing API is just a way to construct and manage related, discrete timings, the Couchbase SDKs are required to provide their own implementations to achieve the desired behaviour.
NOTE: The OpenTracing API includes a mechanism to import and export a SpanContext across process boundaries. The initial implementation focuses on intra-process tracing only and does not require a Tracer to implement the Extract / Inject functionality. This may be added in the future to allow application developers to start a trace in their own application, with the Couchbase SDK importing the SpanContext so subsequent tracing information is collated together.
Each SDK is to create spans using the OpenTracing API, via the Tracer / SpanBuilder objects. Each SDK will have differing implementations and therefore idiomatic differences, so this RFC will not be prescriptive on how these are implemented. The OpenTracing project provides artifacts for many languages that supply the required API.
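As a non-prescriptive sketch, using the opentracing-api Java artifact (0.31-style start()), creating the spans for a KV get might look like the following; the opaque value is illustrative:

import io.opentracing.Span;
import io.opentracing.Tracer;

public final class TracingSketch {
    // Creates a top-level span for a KV get and a child span for the
    // dispatch phase, then finishes both once the response arrives.
    static void tracedGet(Tracer tracer) {
        Span operation = tracer.buildSpan("get")
                .withTag("couchbase.operation_id", "0x21") // opaque, hex formatted
                .start();

        Span dispatch = tracer.buildSpan("dispatch_to_server")
                .asChildOf(operation)
                .start();
        // ... write the request to the socket and wait for the response ...
        dispatch.finish();

        operation.finish();
    }
}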
Each operation or query needs to be uniquely identified both on the client and on the server, where appropriate. This will be used to tie logs from the client and server together; each service has different requirements to uniquely identify an operation. Each span should have a couchbase.operation_id tag with the unique identifier, as described below.
| Service | Description |
| --- | --- |
| KV | Operation opaque. NOTE: The opaque is not unique by itself, as some SDK implementations use internally incrementing numbers. As part of this RFC, a new Connection ID feature will be added to uniquely identify a client connection with a Couchbase Server; combined with the client's opaque, this gives a unique KV operation. |
| N1QL | Queries provide a client_context_id field. When the client_context_id field is empty, the SDK will auto-generate a UUID value. (Not sure if it's used in server logs?) |
| FTS | Not currently available - created MB-27696 to track. |
| Analytics | Queries provide a client_context_id field. When the client_context_id field is empty, the SDK will auto-generate a UUID value. (Not sure if it's used in server logs?) |
| View | Not currently available. |
NOTE: Both N1QL and Analytics queries allow the application to override client_context_id. Also, some SDKs implement a sequence generator that can clash between instances, and/or use an HTTP library that does not expose the local host:port.
NOTE: In the case of sub-doc operations, the top-level operation should be represented within the span hierarchy with the remapped sub-doc commands as children, eg MapRemove > Subdoc_Mutate.
Each span's operation name should represent the activity that is being traced. For example, a KV operation would have a top-level span with an operation_name of the operation type, eg GET, and a couchbase.operation_id tag containing the opaque.
The following is a list of trace points that each SDK should implement for each service (KV, View, N1QL, Search and Analytics), and each span's operation name should match the trace point name.
| Name | Description |
| --- | --- |
| request_queue (optional) | Request queue time represents the time a request spends waiting to be sent to the server. |
| request_encoding | Request encoding is when a request provides content to the server (eg UPSERT) and the value is converted from a native object into a common structure (eg JSON). NOTE: This can be omitted if the SDK does not perform request encoding or the request does not have a body. |
| dispatch_to_server | The time spent sending the request to the server and waiting for a response. This phase encompasses a number of layers, many of which are outside of the SDK's control, for example task scheduling and writing/reading to the network. |
| response_decoding | Response decoding is when a request retrieves content (eg GET) from the server and the response body is reconstructed into a native data type. NOTE: This can be omitted if the SDK does not perform response decoding or the response does not have a body. |
| response_queue (optional) | Response queue time represents the amount of time spent waiting for the application to be ready to handle the operation response. |
The following tags should be appended to all spans where appropriate. Additional tags can be appended per SDK, but the following are required.
| Tag | Description |
| --- | --- |
| couchbase.operation_id | The unique ID of the operation, as described above. |
| couchbase.document_key | The document key (KV operations only). NOTE: This tag should be omitted if log redaction is enabled within the SDK. |
| peer.service | The service type, one of: kv, view, n1ql, search, analytics. |
| local.address | The local socket hostname / IP and port, in the format {hostname}:{port}. To be added to dispatch spans when the local socket is available. |
| peer.address | The remote socket hostname / IP and port, in the format {hostname}:{port}. To be added to dispatch spans when the remote socket is available. |
| span.kind | "client" |
| db.type | "couchbase" |
| component | The client's identifier string (the 'a' property in the updated HELLO request), the same one that is shared with the server to identify the SDK. NOTE: This string does not require trimming to 200 characters. |
| peer.latency | The server duration with a precision suffix. The suffix is required to indicate precision because KV is recorded in microseconds and N1QL query metrics use milliseconds. For example: 123us, 32.12ms. |
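For illustration, a sketch of applying the required tags when starting a dispatch span, continuing the earlier Java sketch (the literal values are taken from examples in this RFC and are not meaningful):

// Assumes the same io.opentracing imports as the earlier sketch.
static Span startDispatchSpan(Tracer tracer, Span parent) {
    return tracer.buildSpan("dispatch_to_server")
            .asChildOf(parent)
            .withTag("couchbase.operation_id", "0x21")
            .withTag("peer.service", "kv")
            .withTag("local.address", "10.211.55.3:52450")
            .withTag("peer.address", "10.112.180.101:11210")
            .withTag("span.kind", "client")
            .withTag("db.type", "couchbase")
            .start();
}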
Previously, a KV operation was identified using a combination of the local FQDN or IP, port and opaque; however, this relies on both the client and server agreeing on network names and is complicated when some components may run under a local NAT, for instance Kubernetes pods.
To address this, the 'hello' request will be extended to be a JSON object with the following properties defined. The server will associate the given connection ID with the socket and use it when logging slow operations and tracing data.
| Name | Description |
| --- | --- |
| a (string) | The client agent string - this cannot be longer than 200 characters and must be trimmed if it exceeds 200. NOTE: This is the same value that was previously sent and includes details such as client name, version, etc. |
| i (string) | The connection ID, made up of two components separated by a forward slash. The first value is consistent for all connections in a given client instance, and the second value is per connection. Each part is a randomly generated uint64 number, formatted as a hex string and padded to 16 characters. |
An example key is below:
{
  "a":"couchbase-net-sdk/2.4.5.0 (clr/4.0.30319.42000) (os/Microsoft Windows NT 10.0.16299.0)",
  "i":"66388CF5BFCF7522/18CC8791579B567C"
}
NOTE: The indentation is for readability; the JSON sent to the server should be compressed to remove whitespace.
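A sketch of generating the connection ID and compact hello key under these rules (the agent string and class name are illustrative):

import java.util.concurrent.ThreadLocalRandom;

public final class HelloKey {
    // A random uint64 formatted as a zero-padded, 16 character hex string;
    // %X prints the two's-complement bits of a long, ie the unsigned value.
    static String randomHex64() {
        return String.format("%016X", ThreadLocalRandom.current().nextLong());
    }

    public static void main(String[] args) {
        String clientInstanceId = randomHex64(); // shared by all connections of this client instance
        String connectionId = randomHex64();     // generated per connection
        String agent = "couchbase-java-client/2.5.5"; // illustrative; trim to 200 chars if longer

        // Compact JSON with no whitespace, as the NOTE above requires.
        String helloKey = "{\"a\":\"" + agent + "\",\"i\":\""
                + clientInstanceId + "/" + connectionId + "\"}";
        System.out.println(helloKey);
    }
}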
NOTE: The HTTP services should also be extended to take the client / connection ID with the key cb-client-id. The support and adoption of this new header, and the updated logging in the server, is not expected to happen in the short term but should be implemented.
To help identify when an operation exceeds a reasonable threshold, each SDK will implement a ThresholdLoggingTracer that will track operations that exceed the threshold and log them periodically.
The threshold tracer receives completed spans and verifies if an operation has exceeded the given threshold for the operation type. Operations that exceed the threshold are periodically logged with a total count and a sample of the slowest ones.
The following are SDK configuration properties.
| Property Name | Description |
| --- | --- |
| OperationTracingEnabled | Boolean used to determine if tracing is enabled. Defaults to using the ThresholdLoggingTracer if enabled. When false, a Noop or similar tracing implementation should be used instead. Default value: true |
| OperationTracingServerDurationEnabled | Boolean used to instruct the SDK to try and retrieve duration metrics from the server for KV operations. KV - controls if the Framing Extras hello flag is issued during socket setup. |
| ThresholdLoggingTracerInterval | The interval between executions that processes the collected operation spans. Default value: 10000 (10 seconds). Expressed as milliseconds. |
| ThresholdLoggingTracerSampleSize | The maximum number of items to log per service. Default value: 10 |
| ThresholdLoggingTracerKVThreshold | The KV operation threshold. Default value: 500000 (500 milliseconds). Expressed as microseconds. |
| ThresholdLoggingTracerViewsThreshold | The View query operation threshold. Default value: 1000000 (1 second). Expressed as microseconds. |
| ThresholdLoggingTracerN1qlThreshold | The N1QL query operation threshold. Default value: 1000000 (1 second). Expressed as microseconds. |
| ThresholdLoggingTracerSearchThreshold | The FTS query operation threshold. Default value: 1000000 (1 second). Expressed as microseconds. |
| ThresholdLoggingTracerAnalyticsThreshold | The Analytics query operation threshold. Default value: 1000000 (1 second). Expressed as microseconds. |
| OrphanedResponseLoggingEnabled | Boolean used to determine if orphaned response logging is enabled. Default value: true |
| OrphanedResponseLoggingInterval | The interval used to flush orphaned response information to the log. Default value: 10000 (10 seconds). Expressed as milliseconds. |
| OrphanedResponseLoggingSampleSize | The number of sample orphaned responses to log additional information for per execution. Default value: 10 |
NOTE: The per-service operation and query floors are a guide for default values and how they are expressed. How the SDK receives these values can be whatever is idiomatic to the SDK.
Slow operation summaries are to be logged at the INFO, or equivalent, level.
This RFC does not go into how logging should be configured, as each SDK and environment is different and the standard SDK logging infrastructure / abstraction should be used. Also, some logging infrastructure implementations provide ways to send log entries to an alternative source, eg LogStash, so this RFC will not go into alternative storage locations at this time.
See each SDK's documentation on how to configure logging.
When the ThresholdLoggingTracer executes to process any collected spans, each span that is to be output should be transformed into the following structure.
| Property | Description |
| --- | --- |
| operation_name | The operation type that the application used, eg get, get_and_touch, upsert, etc. |
| last_operation_id | The last operation ID, eg KV opaque, N1QL context ID. NOTE: For KV operations, the opaque used for the operation ID should have a 0x prefix to indicate it's a hex value. |
| last_local_address | The local socket hostname / IP and port, separated by a colon. For example: 255.123.11.134:54321 |
| last_remote_address | The server hostname / IP and port, separated by a colon. For example: 10.112.170.101:11210 |
| last_local_id | The last connection ID used to send a packet to the server. For example: 66388CF5BFCF7522/18CC8791579B567C |
| total_us | The total time taken for the operation. Expressed as microseconds. |
| encode_us | The calculated sum of all encode sub-spans. Expressed as microseconds. |
| dispatch_us | The calculated sum of all dispatch sub-spans. Expressed as microseconds. |
| server_us | The calculated sum of all server duration sub-spans. Only present if server durations are enabled. Expressed as microseconds. |
| decode_us | The calculated sum of all decode sub-spans. Expressed as microseconds. |
| last_dispatch_us | The time taken for the last dispatch to the server. Expressed as microseconds. |
NOTE: Any of the above properties can be omitted if there is no appropriate value to be used.
An example output from the threshold logging tracer:
[
  {
    "service":"kv",
    "count":75,
    "top":[
      {
        "operation_name":"get",
        "last_operation_id":"0x21",
        "last_local_address":"10.211.55.3:52450",
        "last_remote_address":"10.112.180.101:11210",
        "last_local_id":"66388CF5BFCF7522/18CC8791579B567C",
        "total_us":18908,
        "encode_us":256,
        "dispatch_us":825,
        "decode_us":16,
        "server_us":14
      },
      {
        "operation_name":"set",
        "last_operation_id":"0x22",
        "last_local_address":"10.211.55.3:52450",
        "last_remote_address":"10.112.180.101:11210",
        "last_local_id":"66388CF5BFCF7522/18CC8791579B567C",
        "total_us":11468,
        "encode_us":3832,
        "dispatch_us":565,
        "decode_us":15
      },
      {
        "operation_name":"get",
        "last_operation_id":"0x23",
        "last_local_address":"10.211.55.3:52450",
        "last_remote_address":"10.112.180.101:11210",
        "last_local_id":"66388CF5BFCF7522/18CC8791579B567C",
        "total_us":2996,
        "encode_us":4,
        "dispatch_us":2829,
        "decode_us":8
      },
      {
        "operation_name":"set",
        "last_operation_id":"0x24",
        "last_local_address":"10.211.55.3:52450",
        "last_remote_address":"10.112.180.101:11210",
        "last_local_id":"66388CF5BFCF7522/18CC8791579B567C",
        "total_us":2777,
        "encode_us":15,
        "dispatch_us":2627,
        "decode_us":11
      },
      {
        "operation_name":"set",
        "last_operation_id":"0x25",
        "last_local_address":"10.211.55.3:52450",
        "last_remote_address":"10.112.180.101:11210",
        "last_local_id":"66388CF5BFCF7522/18CC8791579B567C",
        "total_us":1331,
        "encode_us":16,
        "dispatch_us":1206,
        "decode_us":5
      }
    ]
  }
]
The LogTracer simply logs each span as it finishes, using the configured logging infrastructure. This could be useful if a customer has configured a file monitoring process that can process and analyse tracing information.
Each SDK must either use or implement a default Noop style tracer for when tracing is disabled. Depending on the OpenTracing API library, this may already be provided.
This implementation will use minimal resources and effectively ignores tracing without altering code control paths. The NoopTracer should be used when the configuration disables tracing.
As part of the spans that dispatch a request to a server, it is desirable to track the server-side duration to help identify where a request's duration is spent (SDK, server or in-between). Most services provide a mechanism to retrieve the duration, as described below.
KV operation server duration timings are encoded into the response using "flexible framing extras" (https://github.com/couchbase/kv_engine/blob/master/docs/BinaryProtocol.md#response-header-with-flexible-framing-extras).
Server duration is enabled via Hello negotiation with the feature code (0x10) and, when enabled, the response packet has the following changes:
- Magic byte is 0x18 (instead of 0x81)
- Header byte index 2 indicates the framing extras length
- Framing extras is encoded as a series of variable-length FrameInfo structures and is returned directly after the header and before the regular extras bytes
Each FrameInfo has a descriptor byte followed by a variable number of bytes depending on the type of information.
| Bit | Description |
| --- | --- |
| 0-3 | Frame type descriptor |
| 4-7 | Length |
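A sketch of splitting the descriptor byte, assuming bits are numbered most-significant first as in the table above (so bits 0-3 are the high nibble):

// bits 0-3 (high nibble): frame type; bits 4-7 (low nibble): length
static int frameType(byte descriptor) {
    return (descriptor >> 4) & 0x0F;
}

static int frameLength(byte descriptor) {
    return descriptor & 0x0F;
}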
The initial design of framing extras defines one structure, the server duration. This is a two-byte floating-point number representing microseconds with variable precision, and it is encoded/decoded as follows:
encoded = (micros * 2) ^ (1.0 / 1.74)
decoded = (encoded ^ 1.74) / 2
Decoding examples:
C++
std::chrono::microseconds Tracer::decodeMicros(uint16_t encoded) const {
auto usecs = uint64_t(std::pow(encoded, 1.74) / 2);
return std::chrono::microseconds(usecs);
}
.NET (C#)
public static double Decode(ushort encoded)
{
return System.Math.Pow(encoded, 1.74) / 2;
}
An example unit test to prove the encoded values are being decoded as expected:
[TestCase((ushort) 0, 0.0)]
[TestCase((ushort) 1234, 119635.03533802561)]
[TestCase((ushort) 65535, 120125042.10125735)]
public void Can_Decode_Server_Duration(ushort encoded, double expected)
{
var decoded = Math.Pow(encoded, 1.74) / 2;
Assert.AreEqual(expected, decoded);
}
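For completeness, a JVM equivalent of the decode examples above (a sketch using the same formula, truncating to whole microseconds):

static long decodeServerDurationMicros(int encoded) {
    return (long) (Math.pow(encoded, 1.74) / 2);
}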
N1QL, FTS and Analytics queries are all dispatched over HTTP with both the request and response encoded as JSON. The query duration is returned in the response body JSON: in the metrics.elapsedTime property for N1QL and Analytics, and in the root property took for FTS. This value is expressed as microseconds, so it is the same precision as the other metrics.
Because the request duration forms part of the response body, it cannot be read independently of the response body. This is troublesome for streamed responses, as the server duration timings may not be readable until after all results have been read, which could be a long time depending on how the application processes them. To help distinguish the different phases, a streamed request should be made up of the following spans:
| Span Name | Description |
| --- | --- |
| request_encoding | The amount of time taken to prepare the JSON request body ready for dispatching to the server. |
| dispatch_to_server | The amount of time taken between sending the request to the server and when the response HTTP header has been read. The response body has not been processed yet. |
| stream_results | The total amount of time taken to stream all of the result rows. |
N1QL returns additional metrics related to query execution as part of the payload; these can be added as tags to the query span. A summary of the default properties is below:
| Property | Description |
| --- | --- |
| elapsedTime | The time taken between receiving the request and returning a response. This property is used for the server duration described above. Expressed as milliseconds. |
| executionTime | The time taken to execute the query, not taking into account preparing the response. Expressed as milliseconds. |
| resultCount | The number of results returned. |
| resultSize | The total size of the results, in bytes. |
N1QL also provides an enhanced profiling mode where additional details are recorded and returned in the response JSON. Profiling is enabled by setting a request property; the extra information described in the enhanced profiling output could be converted into span tags to provide in-depth analysis, but this is not required at this stage.
As an example, the available N1QL tags may look like the following. Because the profile details are returned in a multi-level JSON structure, each level must be concatenated using an underscore. Primitive types (integers, strings, booleans) map one-to-one with their tag name, and arrays should be concatenated with a double underscore '__'. Also note that any non-alphanumeric characters should be trimmed, eg "~version" becomes "version".
For example, given the below structure, it would create the following tags:
"profile": {
"phaseTimes": {
"authorize": "1.544104ms"
},
"phaseCounts": {
"fetch": 1,
"primaryScan": 1
},
"~version": [
"2.0.0-N1QL",
"5.1.0-1256-enterprise"
]
}
| Name | Value |
| --- | --- |
| profile_phaseTimes_authorize | "1.544104ms" |
| profile_phaseCounts_fetch | 1 |
| profile_phaseCounts_primaryScan | 1 |
| profile_version | "2.0.0-N1QL__5.1.0-1256-enterprise" |
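A sketch of the flattening rules just described; the JSON is represented as Maps/Lists to stay JSON-library agnostic, and the class and method names are hypothetical:

import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public final class ProfileTags {
    // Nested levels join with '_', array elements join with '__', and
    // non-alphanumeric characters are trimmed from key names.
    @SuppressWarnings("unchecked")
    static Map<String, String> flatten(String prefix, Map<String, Object> node) {
        Map<String, String> tags = new TreeMap<>();
        for (Map.Entry<String, Object> entry : node.entrySet()) {
            String key = prefix + "_" + entry.getKey().replaceAll("[^A-Za-z0-9]", "");
            Object value = entry.getValue();
            if (value instanceof Map) {
                tags.putAll(flatten(key, (Map<String, Object>) value));
            } else if (value instanceof List) {
                tags.put(key, ((List<?>) value).stream()
                        .map(Object::toString)
                        .collect(Collectors.joining("__")));
            } else {
                tags.put(key, String.valueOf(value));
            }
        }
        return tags;
    }
}

Calling flatten("profile", profile) over the structure above would yield exactly the four tags listed in the table.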
There is not currently a way to retrieve the server duration for a view query.
Currently, error handling may return messages that are generic in nature. Tracing information must be added to key messages, such as timeouts, to help correlate data between when a timeout happens and when the response is received from the server.
Consider this line of Java code:
JsonDocument fetched = bucket.get("u:king_arthur", 1, TimeUnit.MICROSECONDS);
It will always throw a RuntimeException, since the timeout specified is too short for even a kernel context switch, let alone to get network IO done. When catching the exception and printing the message, one gets "java.util.concurrent.TimeoutException". This does not help the user understand possible causes.
Tracing information will be appended to log messages, with the format:
{existing message} {compressed_json_object}
For example, the above exception log entry would look like:
java.util.concurrent.TimeoutException: {"s":"kv:get","i":"0x7b1","c":"002c2b0d250e6fc5/002c2b0c723e11c5","b":"default","l":"10.157.77.74:16584","r":"my.server.io:11210","t":2500000}
Where appropriate, Timeout exceptions should be extended to append the following information:
| Property | Description |
| --- | --- |
| Operation name | One of: kv, view, n1ql, fts, cbas. |
| Operation ID | The operation's unique identifier, as described in the trace information. |
| Local endpoint | The local hostname / IP and port, formatted with a colon. For example: localhost:12345 |
| Time observed before timeout | The amount of time observed before the timeout occurred. Expressed as microseconds. |
| Server last dispatched to | The last server the operation was dispatched to. Note this may be empty, as a server may not have been selected if the timeout is very low. |
Note that the Go SDK is an exception to this section, in that it does not log timeouts with additional data as the context cannot be carried along with a Go error. Go will rely on the threshold logger's information to provide correlatable information for orphan responses, and a future major release will provide contextual errors.
For response packets that arrive after their request may have timed out, it is important to be able to know whether the server duration was related, directly or indirectly. However, we also do not want to overwhelm the log when there may be a big burst of operations that have timed out and the responses come in delayed, putting the system into a pathological state.
For this, we have specific handling of orphaned responses.
NOTE: Orphan response logging is not related to threshold logging and works independently. It is possible to replace the Tracer implementation and still have orphan response logging enabled.
Orphan response -> A response received on a connection whose original requester (or requesters in the case of request deduplication) is no longer in scope.
We only maintain the key information from each orphan, utilising the Operation Context (described below) structure.
Prerequisites: generate random uint64 client instance and connection IDs.
In implementation, an SDK will change any timeout return messages to include the operation type, instance ID and opaque, along with the hostname and the timeout duration specified by the app. Separately, as network requests complete that are not associated with any operations in scope, aggregate them for logging, limiting the number to the top 10 by reported server duration on a per-ten-second basis.
This results in timed-out log messages from the application which can be further categorized into a few possible causes with further correlation:
Bucket 1: No orphan responses are received and the connection is eventually disrupted. This likely means network issues.
Bucket 2: Only some map to an orphan response. All orphan responses are from one server. The topmost orphan response has a large server duration and the subsequent responses are fast (or become fast) and have increasing opaques for this individual client log. This means one slow operation caused congestion per MB-10291 (or other causes).
Bucket 3: Only some map to an orphan response. Orphan responses are from multiple servers. The orphan responses have small server durations relative to the timeout value. This means general environmental issues at the client, the network or the servers, not severe enough to cause connection disruption, caused the issue.
Bucket 4: Only some map to an orphan response. Orphan responses are from one server. The orphan responses have small server durations relative to the timeout value. This means an environmental problem (e.g., THP being enabled) at the server in question is causing the issue.
Because there is no way to order the orphaned responses, the insertion order should be maintained up to the sample size.
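A sketch of the bounded sampling this implies; the class name is hypothetical, and scheduling the flush on the logging interval is left to the surrounding reporter:

import java.util.ArrayList;
import java.util.List;

final class OrphanSampler {
    private final int sampleSize;
    private final List<String> sample = new ArrayList<>(); // compact-JSON operation contexts, in insertion order
    private int totalCount;

    OrphanSampler(int sampleSize) {
        this.sampleSize = sampleSize;
    }

    // Record an orphaned response; only the first sampleSize contexts are kept,
    // but every orphan seen in the interval is counted.
    synchronized void record(String operationContext) {
        totalCount++;
        if (sample.size() < sampleSize) {
            sample.add(operationContext);
        }
    }

    // Called on each logging interval: builds the summary fragment used in the
    // example log output below, then resets for the next interval.
    synchronized String flush() {
        String summary = "{\"service\":\"kv\",\"count\":" + totalCount
                + ",\"top\":[" + String.join(",", sample) + "]}";
        sample.clear();
        totalCount = 0;
        return summary;
    }
}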
The following table describes the properties that will form a JSON object that will be appended after the existing timeout message:
| Property | Description |
| --- | --- |
| s (string) | Service type - one of: kv, view, n1ql, search, analytics. KV operations should be in the format kv:{operation_type}, eg kv:get, kv:upsert. |
| i (string) | Operation ID - the service-specific identifier for a given operation. KV uses the opaque (hex formatted, with a 0x prefix), eg "0x7b1". N1QL & Analytics use the context ID, which is a UUID. |
| c (string) | Connection ID (optional), eg KV would use the new connection ID described above. |
| b (string) | Bucket name (optional). |
| l (string) | Local endpoint host & port (optional). |
| r (string) | Remote endpoint host and port. |
| t (string) | Timeout value (optional). Should be present for timeout contexts. |
| d (string) | Server duration (optional). Should be present for orphaned contexts. |
A Java example would look like this:
java.util.concurrent.TimeoutException {"s":"kv:get","i":"0x7b1","c":"002c2b0d250e6fc5/002c2b0c723e11c5","b":"default","l":"10.157.77.74:16584","r":"my.server.io:11210","t":2500000}
Logged orphaned responses, here with a sample size of 2, every ten seconds while there are orphans to log:
2018-01-08 15:36:51,903 [16] WARN Couchbase.Core.KVNode - Orphaned responses observed: [{"service":"kv","count":2,"top":[{"s":"kv:get","i":"0x71b","c":"002c2b0d250e6fc5/002c2b0c723e11c5","l":"192.168.1.101:11210","r":"10.112.181.101:12110","d":123},{"s":"kv:upsert","i":"0x71c","c":"121345/13321/7612","l":"192.168.1.101:11210","r":"10.112.181.101:12110","d":43}]}]
Example Orphaned response output:
https://gist.github.com/MikeGoldsmith/147b85e960378a47bcb0169581952af1
Given the two output examples above, the log entries can be used to correlate operations to identify if a timeout was caused by server duration or was due to something else. The operation ID 0x0769020a and connection ID 002c2b0d250e6fc5/002c2b0c723e11c5 appear in both logs.
Orphaned operation summaries are to be logged at the WARN, or equivalent, level.
The CompositeTracer is a facade over a collection of sub-Tracer implementations; as a span finishes, it passes the span on to each inner Tracer. This would be useful to combine the functionality of a number of other Tracer implementations.
NOTE: This was moved to be a future enhancement because tracer implementations typically require custom Span implementations, as they (likely) include additional information, eg the ThresholdLoggingTracer described above includes a summary object. This means the generic ISpan interface is not as useful.
ThresholdLogReporter reporter = ThresholdLogReporter.builder()
.logInterval(10, TimeUnit.SECONDS)
.sampleSize(10)
.kvThreshold(500000, TimeUnit.MICROSECONDS)
.viewThreshold(1000000, TimeUnit.MICROSECONDS)
.n1qlThreshold(1000000, TimeUnit.MICROSECONDS)
.ftsThreshold(1000000, TimeUnit.MICROSECONDS)
.analyticsThreshold(1000000, TimeUnit.MICROSECONDS)
.build();
CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
.tracingEnabled(true)
.tracer(ThresholdLogTracer.create(reporter))
.build();
CouchbaseCluster cluster = CouchbaseCluster.create(env, "10.112.181.101");
Exception in thread "main" java.lang.RuntimeException: java.util.concurrent.TimeoutException: {"b":"travel-sample","r":"127.0.0.1:11210","s":"kv","c":"0053FF30C2825267/FFFFFFFFD34A9178","t":10000,"i":"0x1","l":"127.0.0.1:64351"}
cluster = Cluster('couchbase://10.112.181.101?enable_tracing=true')
cluster.authenticate(PasswordAuthenticator('Administrator', 'password'))
bucket = cluster.open_bucket('default')
# pending changes regarding queue flush size / log interval
bucket.TRACING_ORPHANED_QUEUE_FLUSH_INTERVAL = 5
bucket.TRACING_ORPHANED_QUEUE_SIZE = 10
bucket.TRACING_THRESHOLD_QUEUE_FLUSH_INTERVAL = 5
bucket.TRACING_THRESHOLD_QUEUE_SIZE = 10
bucket.TRACING_THRESHOLD_KV = 0.00001
bucket.TRACING_THRESHOLD_N1QL = 0.00001
bucket.TRACING_THRESHOLD_VIEW = 0.00001
bucket.TRACING_THRESHOLD_FTS = 0.00001
bucket.TRACING_THRESHOLD_ANALYTICS = 0.00001
On a timeout, the Python SDK will return identification information to the application, which may log it. It is returned in a dictionary listed under the "Tracing Output" entry. The keys of this dictionary represent the keys of timed-out operations, as most PYCBC operations have a multi-result option.
Jan 4 13:14:19 myhost02 ERROR __init__ P64321 T140101859555072 <Key=u'foobar', RC=0x17[Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout, Operational Error, Results=1, C Source=(src/multiresult.c,309), Tracing Output=
{"foobar": {"c": "000000003bdea42d/f5b76648029c27da", "b": "default", "i": 6846049684839775971, "l": "10.0.0.5:59955", "s": "kv:get", "r": "10.0.0.5:59940", "t": 1}, "foo": {"c": "000000003bdea42d/7913a3b7fcfc38ee", "b": "default", "i": 9314454879042790065, "l": "10.0.0.5:59957", "s": "kv:get", "r": "10.0.0.5:59942", "t": 1}, "bar": {"c": "000000003bdea42d/a29e6b704caea8f9", "b": "default", "i": 12610892345893574011, "l": "10.0.0.5:59958", "s": "kv:get", "r": "10.0.0.5:59943", "t": 1}}
< TO ADD >
//KV
{main}[cb,WARN] (server L:626 I:1735235899) Failing command with error LCB_ETIMEDOUT (0x17): {"b":"travel-sample","i":"00000000676d993b/222ce886773e17c8/1500","l":"127.0.0.1:63698","r":"127.0.0.1:11210","s":"kv:get","t":1}
< TO ADD >
See the LCB implementation above. The error returned will be similar to previous versions, but log lines will be generated by LCB as described by this RFC.
var tracer = new ThresholdLoggingTracer()
{
Interval = 10000,
SampleSize = 10,
KvThreshold = 500000,
ViewThreshold = 1000000,
N1qlThreshold = 1000000,
SearchThreshold = 1000000,
AnalyticsThreshold = 1000000
};
var config = new ClientConfiguration();
config.Tracer = tracer;
// KV
2018-05-16 10:42:11,733 [Worker#STA_NP] INFO Couchbase.IO.ConnectionBase - The operation has timed out. {"s":"kv","i":"0x0769020a","c":"002c2b0d250e6fc5/002c2b0c723e11c5","b":"default","l":"192.168.1.104:16874","r":"10.112.181.101:11210","t":250000}
// N1QL
2018-05-16 10:42:11,733 [Worker#STA_NP] INFO Couchbase.Query.QueryClient - The operation has timed out. {"s":"n1ql","i":"B639B0AC-1DA1-4013-BA42-127F066C165A","r":"10.112.181.101:8093"}
// Views
2018-05-16 10:42:11,733 [Worker#STA_NP] INFO Couchbase.Views.ViewClient - The operation has timed out. {"s":"view","b":"default","r":"10.112.181.101:8092"}
Note: All services other than KV do not have access to all the contextual information. This is either because we use the .NET HttpClient, which doesn't expose the local endpoint, or because the contextual information is not available at the same level where timeouts occur, eg the timeout is used to create a CancellationToken, but the raw value is not passed to the client.
I've raised the following tickets to try and improve on these:
- Views - https://issues.couchbase.com/browse/NCBC-1688
- Query - https://issues.couchbase.com/browse/NCBC-1687
- Search - https://issues.couchbase.com/browse/NCBC-1689
- Analytics - https://issues.couchbase.com/browse/NCBC-1690
tracer := gocb.ThresholdLoggingTracer{}
tracer.TickInterval = 30 * time.Second
tracer.KvFloor = 750 * time.Millisecond
tracer.QueryFloor = 2 * time.Second
tracer.ViewsFloor = 2 * time.Second
tracer.AnalyticsFloor = 2 * time.Second
tracer.SearchFloor = 2 * time.Second
tracer.SampleSize = 15
"operation has timed out"
As per note earlier in document: the Go SDK is an exception to this section, in that it does not log timeouts with additional data as the context is not able to be carried along with a Go error.
< TO ADD >
To ensure each SDK is implemented in the same way with the output being consistent, the following scenarios will be used to verify API, behaviour and output.
API:
- Default Threshold Logging Tracer settings
- Default Orphaned Response Logger settings
- Configure Threshold Logging Tracer
- Configure Orphaned Response Logger

Output:
- Threshold Logging Tracer
- Orphaned Response Logger
- Timeout message
To help identify when an operation is unusually slow, each SDK could implement a SlowOperationTracer that will keep a heuristic average duration of the top-level Spans. As a top-level span completes, its duration is then compared to the average plus a deviation percentage and if it is greater, it is logged.
For example, given a deviation percentage of 75%, the pseudo code for a top-level span being finished and reported to the SlowOperationTracer would look like the following:
completedSpans = { 1234, 3321, 12234, 44322, 54533 } // last five operation durations
deviationPercentage = 0.75 // 75%

if span.duration >= completedSpans.average * (1 + deviationPercentage)
begin
  logger.log(span)
end
Because average request durations are likely to be very different between service types (KV, N1QL, etc), there should either be a SlowOperationTracer per service, or a single instance that internally manages a per-service duration buffer. Each instance should be configured with the number of operation durations to remember for calculating the average, and the deviation percentage.
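A sketch of such a per-service buffer; the class name and field names are hypothetical:

import java.util.ArrayDeque;
import java.util.Deque;

final class SlowOperationDetector {
    private final Deque<Long> recentDurations = new ArrayDeque<>(); // rolling window of durations (us)
    private final int window;                 // how many durations to remember
    private final double deviationPercentage; // eg 0.75 for 75%

    SlowOperationDetector(int window, double deviationPercentage) {
        this.window = window;
        this.deviationPercentage = deviationPercentage;
    }

    // Returns true when the duration exceeds the rolling average by more than
    // the configured deviation, then folds the duration into the window.
    synchronized boolean isSlow(long durationUs) {
        double average = recentDurations.stream()
                .mapToLong(Long::longValue).average().orElse(durationUs);
        recentDurations.addLast(durationUs);
        if (recentDurations.size() > window) {
            recentDurations.removeFirst();
        }
        return durationUs >= average * (1 + deviationPercentage);
    }
}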
Spans are not logged as they finish, so the internal Span implementation should be extended to maintain a collection of sub-spans; when building a new Span using SpanBuilder.ChildOf, the new span is added to the parent's span collection. This ensures that when the top-level span is logged, it has a full reference to all sub-spans so a complete structure can be created. An example output JSON structure can be seen here.
If a top-level span has been identified as slow, it should be logged to the configured SDK logging infrastructure.
To prevent flooding logs, each SDK could maintain a heuristic average operation duration and two configuration values: slow operation percentile and slow operation floor.
Q: Is there a common way to modify the KV node logging of 'slow'? If so, what is that API and behavior?
A: The KV slow operation threshold is configured on the cluster and can be updated using the cluster management REST interface. A way for an SDK to programmatically configure the server's KV slow operation threshold may be added in the future.
Q: Is there a way to change the logging behavior of other services based on application definition of what is 'slow'?
A: There is both a client-level calculated slow operation threshold using a percentile and a slow operation floor; both are flexible and will only log trace information for what is unusual.
Q: How is trace logging of timed out operations handled? In particular, is there special handling for reading network responses that are for timed out operations?
A: Timed-out operations are returned to the application when the client timeout has been reached, which means the response packet may be received afterwards. To prevent having to maintain in-memory state for operations that have timed out, the SDK will log the trace information at the point the operation times out; then, when the response is received, another log entry is written with the same correlation ID and the server duration.
Q: Since the logging is JSON and logging with multiple lines makes for hard to consume logs, should it be guaranteed one line or have formatting?
A: Yes, the JSON logged should be in a compressed format where line breaks and whitespace are removed, to prevent consuming many lines in a log file and to ensure it's easily readable by another process analysing the log files.
Q: Should it be enabled by default? Recording all these timestamps will require a lot of requests to the system clock, and storing extra data per operation will increase memory consumption.
A: The PRD requirement was to be on by default; to comply with that, we will have it enabled as standard with the option to turn it off if wanted.
Q: Should we report directly to the tracer if no parent span is provided for an operation, or should that count as something we do not report?
A: Any span that does not have a parent should be treated as a "top-level" span, which should be sent to the tracer. If the span was not meant to be top-level, it should be regarded as a bug.
| Date | Description |
| --- | --- |
| 2017-12-18 | Add additional details for KV server duration encoding and response packet structure. |
| 2017-12-15 | Re-write and tidy up enhanced log messages section with regards to format and example. |
| 2018-01-02 | Clarify some property descriptions. Marked request encoding / response decoding as optional. Clarified this feature should be "on by default" and that no tracing timings should be recorded if EnableOperationTracing is false. Added Performance Concerns section. |
| 2018-01-04 | Accepted many suggestions and clarifications. Added two new optional trace points (queue_time and resolution_time). Added two new future tracing points. Updated KV duration encoding algorithm. |
| 2018-01-22 | Overhauled RFC with focus on adopting OpenTracing API and implementing spans. Added Zombie response handling. |
| 2018-01-29 | Add Client / Connection IDs for KV and HTTP services. Updated list of expected tags. Updated names of trace points to be more explicit. |
| 2018-01-30 | Added total_time_us to SpanSummary structure used for logging slow operations. |
| 2018-02-01 | Updated new Client / Connection ID format. Added example unit test to verify KV server duration decoding. |
| 2018-02-15 | Rename Slow Operation Tracer to Threshold Logging Tracer. Reword ThresholdLoggingTracer description to be less prescriptive of implementation. |
| 2018-02-19 | Updated table of contents to reflect heading and section changes. |
| 2018-02-20 | Updated peer.latency tag description to indicate time precision is required. Updated configuration names for threshold logging floors. |
| 2018-02-21 | Use defined peer.service tag instead of custom couchbase.service span tag. Update timeout message format. Add description paragraph to orphaned response section to describe how to correlate operations. |
| 2018-04-10 | Added last_operation_id and last_local_id span summary properties. Updated timeout log message format. |
| 2018-05-09 | Extended optional N1QL tags to include standard metrics returned in payload. Updated ThresholdLoggingTracer example JSON to reflect updated property names. |
| 2018-05-14 | Updated timeout messages to provide additional data in a JSON object and add table for properties. |
| 2018-05-17 | Update two references to threshold logging tracer interval from 10 seconds to 1 minute. |
| 2018-05-21 | Added verification scenarios / expected output. |
| 2018-05-22 | Revert Threshold Logging Tracer and Orphaned Response Logger default interval to 10 seconds. |
| 2018-05-23 | Added explicit note for expected logging level for orphaned / zombie responses. |
| 2018-05-24 | Rename ThresholdLoggingTracerQueryThreshold to ThresholdLoggingTracerN1qlThreshold. |
| 2018-05-28 | Add separate config options for Orphaned Response Logging. |
| 2018-06-06 | Update operation context JSON properties to separate operation ID and connection ID. |
| 2018-06-21 | Removed all references to zombies and replaced with orphans. |
| Language | Representative | Date |
| --- | --- | --- |
| C | Sergey Avseyev | 2018-06-25 |
| Go | Brett Lawson | 2018-06-25 |
| Java | Michael Nitschinger | 2018-06-21 |
| .NET | Mike Goldsmith | 2018-06-21 |
| Node.js | Brett Lawson | 2018-06-25 |
| PHP | Sergey Avseyev | 2018-06-25 |
| Python | Ellis Breen | 2018-06-25 |
- 13th September 2018:
- Added document key span tag for KV operations.