Meta

RFC Name: Response Time Observability
RFC ID: 0035
Start Date: 2017-10-23
Owner: Mike Goldsmith
Current Status: ACCEPTED

Summary

From a diagnostics point of view it is valuable to collect and analyse each operation an SDK performs to help identify problem areas. Each operation goes through a number of phases and tracking each of them both individually and as a whole allows in-depth interrogation.

In addition, it would also be useful to identify if and when an operation takes an unusually long time to complete. This analysis will help aid investigation by providing contextual information that is only available during the operation processing.

Description

Tracing

Tracing is recording details about discrete steps or phases of a request lifecycle, eg request encoding / decoding or dispatching to server. These phases are timed independently and contain additional contextual information. A single request is typically made up of multiple tracing points.

OpenTracing is a standardised API to structure tracing information in a consistent and predictable manner. This document will not cover all the topics and history of tracing or OpenTracing (please refer to above links for more complete documentation), and instead will only cover key topics and concepts.

Briefly; OpenTracing tracing data is a nested structure of Spans with durations, contextual information and relationships to other Spans. A structural example is below:

A good way to visualise the span structure is with an x-axis time graph like the following:

The key concepts OpenTracing are described below:

Type	Description
Tracer	Top level structure that can be used to create a SpanBuilder.
SpanBuilder	Creates and starts Spans with contextual information.
SpanContext	Contextual information for a given Span. This information can be transported between process / services.
Span	An in-process tracing structure; includes start timestamp, end timestamps and additional contextual information (referred to as tags and baggage).

This RFC proposes that the Couchbase SDKs adopt the OpenTracing API to provide per request tracing. However, because the OpenTracing API is just way to construct and manage related, discrete timings, the Couchbase SDKs are required to provide their own implementations to achieve the desired behaviour.

NOTE: The OpenTracing API includes a mechanism to import and export a SpanContext across process boundaries. The initial implementation focuses on intra-process tracing only and does not require a Tracer implement the Extract/Inject functionality. This may be added in the future to allow application developers to start a trace in their own application and the Couchbase SDK could Import the SpanContext so subsequent tracing information is collated together.

Spans

Each SDK is to create spans using the OpenTracing API and are created via the Tracer / SpanBuilder objects. Each SDK will have differing implementations and therefore idiomatic differences so this RFC will not be prescriptive on how these are implemented. The OpenTracing project provide artifacts for many languages that provide the required API.

Each operation or query needs to be uniquely identified both on the client and on the server, where appropriate. This will be used to tie logs from the client and server together and each service has different requirements to uniquely identify an operation. Each span should have an couchbase.operation_id tag with the unique identifier, as described below.

Service	Description
KV	Operation Opaque NOTE: Opaque is not unique by itself as some SDK implementations are internally incrementing numbers. As part of this RFC, a new Connection ID feature will be added to uniquely identify a client connection with a Couchbase Server and combined with the client's opaque, we have a unique KV operation.
N1QL	Queries provide a client_context_id field. When a client_context_id field is empty, the SDK will auto-generate a UUID value. (Not sure if it's used in server logs?)
FTS	Not currently available - created MB-27696 to track.
Analytics	Queries provide a client_context_id field. When a client_context_id field is empty, the SDK will auto-generate a UUID value. (Not sure if it's used in server logs?)
View	Not currently available

NOTE: Both N1QL and Analytics queries allow the application to override client_context_id. Also some SDKs implement a sequence generator that will clash between instances and/or it uses a HTTP library so doesn't know the local host:port.

NOTE: In the case of sub-doc operations, the top-level operation should be represented within the span hierarchy with remapped subdoc commands as children. eg MapRemove > Subdoc_Mutate.

Operation Name

Each span's operation name should represent the activity that is being traced. For example, a KV operation would have a top-level span with operation_name of the operation type, eg GET and a couchbase.operation_id tag with the opaque.

Trace Points

The following is a list of trace points that each SDK should implement for each service (KV, View, N1QL, Search and Analytics) and each span's operation name should match the trace point name.

Name	Description
request_queue (optional)	Request queue time represents the time a requests spends waiting to be sent to the server.
request_encoding	Request transcoding is when a request content provides content to the server (eg UPSERT) and converts the value from a native object into a common structure (eg JSON). NOTE: This can be omitted if the SDK does not perform request encoding or the request does not require have a body.
dispatch_to_server	The time spent sending the request to the server and waiting for a response. This phase encompasses a number of layers, many of which are outside of the SDKs control, for example; task scheduling and writing/reading to the network.
response_decoding	Response transcoding is when a request retrieves content (eg GET) from the server and reconstructs the response body into a native data type. NOTE: This can be omitted if the SDK does not perform response decoding or the response does not require have a body.
response_queue (optional)	The resolution duration represents the amount of time spent waiting for the application to be ready to handle the operation response.

Client / Connection IDs

Previously, a KV operation was identified using a combination of the local FQDN or IP, port and opaque, however this relies on both the client and server agreeing on network names and is complicated when some components may run under a local NAT, for instance, kubernetes pods.

To address this; the 'hello' request will be extended to be a JSON object with the following properties defined. The server will associate the given uuid to the socket and use it when logging slow operations and tracing data.

Name	Description
a (string)	The client agent string - this can not be longer than 200 characters and must be trimmed if exceeds 200. NOTE: This is the same value that was previously sent and includes details such as client name, version, etc.
i (string)	The connection ID is be made up of two components separated by a forward slash. The first value will be consistent for all connections in a given client instance, and the second value is per connection. Each part is a randomly generated uint64 number that is formatted as a hex string and padded to 16 characters.

An example key is below

{
  "a":"couchbase-net-sdk/2.4.5.0 (clr/4.0.30319.42000) (os/Microsoft Windows NT 10.0.16299.0)",
  "i":"66388CF5BFCF7522/18CC8791579B567C"
}

NOTE: The indentation is for visibility where the JSON that is sent to the server should be compressed to remove whitespace.

NOTE: The HTTP services should also be extended to take the client / connection ID with the key cb-client-id. The support and adoption of this new header and updated logged in the server is not expected to happen in the short-term but should be implemented.

Threshold Logging Tracer

To help identify when an operation is exceeds a reasonable threshold, each SDK will implement a ThresholdLoggingTracer that will track operations that exceed the threshold and log them periodically.

The threshold tracer receives completed spans and verifies if an operation has exceeded the given threshold for the operation type. Operations that exceed the threshold are periodically logged with a total count and a sample of the slowest ones.

The following are SDK configuration properties.

Property Name	Description
OperationTracingEnabled	Boolean used to determine tracing is enabled. Defaults to using the ThesholdLoggingTracer if enabled. When false a Noop or similar tracing implementation should be used instead. Default value: true
OperationTracingServerDurationEnabled	Boolean used to instruct the SDK to try and retrieve duration metrics from the server for KV operations. KV - Controls if the Framing Extras hello flag is issued during socket setup.
ThresholdLoggingTracerInterval	The interval between executions that processes the collected operation spans. Default value: 10000 (10 seconds) Expressed as milliseconds.
ThresholdLoggingTracerSampleSize	The maximum number of items to log per service. Default value: 10
ThresholdLoggingTracerKVThreshold	The KV operation operation threshold. Default Value: 500000 (500 milliseconds) Expressed as microseconds.
ThresholdLoggingTracerViewsThreshold	The View query operation threshold. Default Value: 1000000 (1 second) Expressed as microseconds.
ThresholdLoggingTracerN1qlThreshold	The N1QL query operation threshold. Default Value: 1000000 (1 second) Expressed as microseconds.
ThresholdLoggingTracerSearchThreshold	The FTS query operation threshold. Default Value: 1000000 (1 second) Expressed as microseconds.
ThresholdLoggingTracerAnalyticsThreshold	The Analytics query operation threshold. Default Value: 1000000 (1 second) Expressed as microseconds.
OrphanedResponseLoggingEnabled	Boolean used to determine if orphaned response logging is enabled. Default value: true
OrphanedResponseLoggingInterval	The interval used to flush orphaned response information to the log. Default value: 10000 (10 seconds) Expressed as milliseconds
OrphanedResponseLoggingSampleSize	The number of sample orphaned responses whose to log additional information for per execution. Default value: 10

NOTE: The per-service operation and query floors are a guide for default values and how they are expressed. How the SDK receives these values can be represented by whatever is idiomatic to the SDK.

Slow operation summaries are to be logged at the INFO, or equivalent, level.

This RFC does not go into how logging should be configured as each SDK and environment is different and the standard SDK logging infrastructure / abstraction should be used. Also, some logging infrastructure implementations provide ways to send log entries to an alternative source, eg LogStash, so this RFC will not go into alternatives storing locations at this time.

See each SDKs documentation on how to configure logging:

https://developer.couchbase.com/documentation/server/current/sdk/dotnet/collecting-information-and-logging.html

SpanSummary

When ThresholdLoggingTracer executes to process any collected spans, each of the spans that is to be output should be transformed into this structure.

Property	Description
operation_name	The operation name is operation type that the application used with the eg get, get_and_touch, upsert, etc
last_operation_id	The last operation ID. eg KV opaque, n1ql context ID. Note: for KV operations that use the opaque for the the operation ID should use the 0x prefix to indicate it's a hex value.
last_local_address	The local socket hostname / IP and port, separated by a colon. For example: 255.123.11.134:54321
last_remote_address	The server hostname / IP and port separated by a colon. For example: 10.112.170.101:11210
last_local_id	The last connection ID used to send a packet to the server. For example: 66388CF5BFCF7522/18CC8791579B567C
total_us	The total time taken for the operation. Expressed as microseconds.
encode_us	The calculated sum of all encode sub-spans. Expressed as microseconds.
dispatch_us	The calculated sum of all dispatch sub-spans. Expressed as microseconds.
server_us	The calculated sum of all server duration sub-spans. Only present if server durations are enabled. Expressed as microseconds.
decode_us	The calculated sum of all decode sub-spans. Expressed as microseconds.
last_dispatch_us	The time taken for the last dispatch to server. Expressed as microseconds.

NOTE: Any of the above properties can be omitted if there is no appropriate value to be used.

An example output from the threshold logging tracer:

[
   {
     "service":"kv",
     "count":75,
     "top":[
       {
         "operation_name":"get",
         "last_operation_id": "0x21",
         "last_local_address":"10.211.55.3:52450",
         "last_remote_address":"10.112.180.101:11210",
         "last_local_id": "66388CF5BFCF7522/18CC8791579B567C",
         "total_duration_us":18908,
         "encode_us":256,
         "dispatch_us":825,
         "decode_us":16,
         "server_duration_us":14
       },
       {
         "operaion_name":"set",
         "last_operation_id": "0x22",
         "last_local_address":"10.211.55.3:52450",
         "last_remote_address":"10.112.180.101:11210",
         "last_local_id": "66388CF5BFCF7522/18CC8791579B567C",
         "total_duration_us":11468,
         "encode_us":3832,
         "dispatch_us":565,
         "decode_us":15
       },
       {
         "operaion_name":"get",
         "last_operation_id": "0x23",
         "last_local_address":"10.211.55.3:52450",
         "last_remote_address":"10.112.180.101:11210",
         "last_local_id": "66388CF5BFCF7522/18CC8791579B567C",
         "total_duration_us":2996,
         "encode_us":4,
         "dispatch_us":2829,
         "decode_us":8
       },
       {
         "operaion_name":"set",
         "last_operation_id": "0x24",
         "last_local_address":"10.211.55.3:52450",
         "last_remote_address":"10.112.180.101:11210",
         "last_local_id": "66388CF5BFCF7522/18CC8791579B567C",
         "total_duration_us":2777,
         "encode_us":15,
         "dispatch_us":2627,
         "decode_us":11
       },
       {
         "operaion_name":"set",
         "last_operation_id": "0x25",
         "last_local_address":"10.211.55.3:52450",
         "last_remote_address":"10.112.180.101:11210",
         "last_local_id": "66388CF5BFCF7522/18CC8791579B567C",
         "total_duration_us":1331,
         "encode_us":16,
         "dispatch_us":1206,
         "decode_us":5
       }
     ]
   }
 ]

Log Tracer (Optional)

The LogTracer simply logs each span as it finished using the configured logging infrastructure. This could be useful if a customer has configured a file monitoring process that can process and analyse tracing information.

Noop Tracer

Each SDK must either use or implement a default Noop style tracer for when tracing is disabled. Depending on the OpenTracing API library, this may already be provided.

This implementation will use minimal resources and effectively ignores tracing without altering code control paths. The NoopTracer should be used when the configuration disables tracing.

Server Durations

As part of the spans that dispatch a request to a server, it is desirable to track the server-side duration to help identify where a requests duration is spent (SDK, Server or in-between) . Most services provide a mechanism to retrieve the duration, as described below.

KV - Flexible Framing Extras

KV operation server duration timings are encoded into the response using "flexible framing extras" (https://github.com/couchbase/kv_engine/blob/master/docs/BinaryProtocol.md#response-header-with-flexible-framing-extras).

Server duration is enabled via Hello negotiation with the feature code (0x10) and when enabled the response packet has the following changes:

Magic byte is 0x18 (instead of 0x81)
Header byte index 2 indicates the total extras length
Framing extras is encoded as a series of variable length FrameInfo structures and is is returned directly after the header and before the regular extras byte

FrameInfo

Each FrameInfo has a descriptor byte followed by a variable number of bytes depending on the type of information.

Bit	Description
0-3	Frame type descriptor
4-7	Length

The initial design of framing extras defines one structure, the server duration. This is a two byte floating number described in microseconds with a variable precision and is encoded/decoded as follows:

encoded = (micros * 2) ^ (1.0 / 1.74)
decoded = (encoded ^ 1.74) / 2

Decoding examples:

C

std::chrono::microseconds Tracer::decodeMicros(uint16_t encoded) const {
  auto usecs = uint64_t(std::pow(encoded, 1.74) / 2);
  return std::chrono::microseconds(usecs);
}

.NET (C#)

public static double Decode(ushort encoded)
{
  return System.Math.Pow(encoded, 1.74) / 2;
}

An example unit test to prove the encoded values are being decoded as expected:

[TestCase((ushort) 0, 0.0)]
[TestCase((ushort) 1234, 119635.03533802561)]
[TestCase((ushort) 65535, 120125042.10125735)]
public void Can_Decode_Server_Duration(ushort encoded, double expected)
{
  var decoded = Math.Pow(encoded, 1.74) / 2;
  Assert.AreEqual(expected, decoded);
}

N1QL, Analytics Query Server Duration

N1QL, FTS and Analytics queries are all dispatched over HTTP with both the request and response encoded as JSON. The query duration time is returned in the response body JSON in the metrics.elapsedTime property for N1QL and Analytics and the root property took for FTS. This value is expressed as microseconds, so is the same precision as other metrics.

Because the request durations form part of the response body it cannot be read independently of the response body. This is troublesome for streamed responses as the server duration timings may not be readable until after all results have been read, which could be a long time depending on how the application processes them. To help distinguish the different phases, a streamed request should be made up of the following spans:

Span Name	Description
request_encoding	The amount of time taken to prepare the JSON request body ready for dispatching to the server.
dispatch_to_server	The amount of time taken between sending the request to the server and when the response HTTP header has been read. The response body has not been processed yet.
stream_results	The total amount of time taken to stream all of the result rows.

N1QL Profiling Baggage (optional)

N1QL returns additional metrics related to a query execution as part of the payload and can be added as tags to the query span. A summary of the default properties is below:

Property	Description
elapsedTime	The time taken between receiving the request and returning a response. The property is used for the server-duration property described above. Expressed as milliseconds.
executionTime	The time taken to execute the query, not taking into account preparing the response. Expressed as milliseconds.
resultCount	The number of result sets.
resultSize	The number of rows.

N1QL also provides an enhanced profiling mode where additional details are recorded and returned in the response JSON. Profiling is enabled by setting a request property and the extra information described in the enhanced profiling could be converted into Span Tags to provide in-depth analysis but is not required at this stage.

As an example, the available N1QL tags may look like this. Because the profile details are returned in a multi-level JSON structure, each level must be concatenated together using an underscore. Primitive types (integers, strings, boolean) map easily in a one to one with their tag name and arrays should be concatenated together with a double underscore '__'. Also note any non alphanumeric characters should be trimmed, eg "~version" becomes "version".

For example, given the below structure, it would create the following tags:

"profile": {
  "phaseTimes": {
    "authorize": "1.544104ms"
  },
  "phaseCounts": {
    "fetch": 1,
    "primaryScan": 1
  },
  "~version": [
    "2.0.0-N1QL",
    "5.1.0-1256-enterprise"
  ]
}

Name	Value
profile_phaseTimes_authorise	"1.544104ms"
profile_phaseCounts_fetch	1
profile_phaseCounts_primaryScan	1
profile_version	"2.0.0-N1QL__5.1.0-1256-enterprise"

Views

There is not currently a way to retrieve the server duration for a view query.

Enhancements to Timeout Exceptions

Currently, error handling may return messages that are generic in nature. Tracing information must be added to key messages such as Timeouts to help correlate data between when a timeout happens and when the response is received from the server.

Consider this line of Java code:

JsonDocument fetched = bucket.get("u:king_arthur", 1, TimeUnit.MICROSECONDS);

It will always throw a RuntimeException, since the timeout specified is too short for even a kernel context switch, let alone to get network IO done. When catching the exception and printing the message, one gets "java.util.concurrent.TimeoutException". This does not help the user understand possible causes.

Tracing information will be appended log messages, with the format:

{existing message} {compresed_json_object}

For example, the above exception log entry would look like:

java.lang.concurrent.TimeoutException: {"s":"kv:get","i":"0x7b1","i":"002c2b0d250e6fc5/002c2b0c723e11c5","b":"default","l":"_10.157.77.74:16584"_,"r":"my.server.io:11210"," t":2500000}

Where appropriate, Timeout exceptions should be extended to append the following information:

Property	Description
Operation name	One of: kv, view, n1ql, fts, cbas
Operation ID	The operation's unique identifier, as described in the trace information.
Local endpoint	The local hostname / IP and port, formatted with a colon. For example: localhost:12345
Time observed before timeout	The amount if time observed before the timeout occurred. Expressed as microseconds.
Server last dispatched to	The last server the operation was dispatched to. Note this may be empty as a server may not have been selected if a timeout is that low.

Note that the Go SDK is an exception to this section, in that it does not log timeouts with additional data as the context is not able to be carried along with a Go error. Go will rely on the threshold loggers information to provide correlatable information for orphan responses and in a future major release will provide contextual errors.

Responses for Orphaned (timed-out) Requests

For packets that come in whose request may have timed out, it is important to be able to know if the server duration was related directly or indirectly. However, we also do not want to overwhelm the log when there may be a big burst of operations that have timed out and the responses come in delayed putting the system into a pathological state.

For this, we have specific handling of orphaned responses.

NOTE: Orphan response is not related to Threshold logging and works independently. It is possible to replace the Tracer implementation and still have orphan response logging enabled.

Terminology

Orphan response -> A response received on a connection whose original requester (or requesters in the case of request deduplication) is no longer in scope.

We only maintain the key information from each orphan, utilising the Operation Context (described below) structure.

Implementation Notes

Prerequisites: Generate a random uint64 client instance and connection IDs.

In implementation, an SDK will change any timeout return messages to include the operation type, instance ID and opaque along with the hostname and timeout duration specified by the app. Separately, as network requests complete that are not associated with any operations in scope, aggregate them for logging, limiting the number to the top 10 by reported server duration on a per ten second basis.

This results in timed out log messages from the application which can be further categorized into a few possible causes with further correlation:

Bucket 1: No orphan responses are received and the connection is eventually disrupted. This likely means network issues.

Bucket 2: Only some map up to a orphan response. All orphan responses are from one server. The topmost orphan response has a large server duration and the subsequent responses are fast (or become fast) and have increasing opaques for this individual client log. This means one slow operation caused congestion per MB-10291 (or other causes).

Bucket 3: Only some map up to a orphan response. orphan responses are from multiple servers. The orphan responses have small server durations relative to the timeout value. This means general environmental issues at the client, the network or at the servers not severe enough to cause connection disruption caused the issue.

Bucket 4: Only some map up to a orphan response. orphan responses are from one server. The orphan responses have small server durations relative to the timeout value. This means an environmental problem (e.g., THP being enabled) at the server in question is causing the issue.

Because there is no way to order the orphaned responses, the insertion order should be maintained up to the sample size.

Example Telemetry

The following table describes the properties that will form a JSON object that will be appended after the existing timeout message:

Operation Context Properties

Property	Description
s (string)	Service Type - one of: kv, view, n1ql, search, analytics KV operations should be in the format: kv:{operation_type} eg kv:get, kv:upsert
i (string)	Operation ID - the service specific identifier for a given operation KV use opaque (with a 0x prefix and hex formatted) eg "0x7b1" N1QL & Analytics use the context ID which is a good.
c (string)	Connection ID (optional) eg KV would use the new connection ID described above
b (string)	Bucket name (optional)
l (string)	Local endpoint host & port (optional)
r (string)	Remote endpoint host and port
t (string)	Timeout value (optional) Should be present for timeout contexts
d (string)	Server Duration (optional) Should be present for orphaned contexts

An Java example would look like this:

java.lang.concurrent.TimeoutException {"s":"kv:get","i":"0x7b1",c":"002c2b0d250e6fc5/002c2b0c723e11c5","b":"default","r":"_10.157.77.74:16584"_,"r":"my.server.io:11210"," t":2500000}

Logged orphaned responses, here with a sample size of 2, every ten seconds while there are orphans to log:

2018-01-08 15:36:51,903 [16] WARN Couchbase.Core.KVNode - Orphaned responses observed: [{"service":"kv","count":2,"top":[{"s":"kv:get","i":"0x71b","c":"002c2b0d250e6fc5/002c2b0c723e11c5","l":"192.168.1.101:11210","r":"10.112.181.101:12110","d":123},{"s":kv:upsert","i":"0x71c","c":"121345/13321/7612","l":"192.168.1.101:11210","r":"10.112.181.101:12110","d":43}]}]

Example Orphaned response output:

https://gist.github.com/MikeGoldsmith/147b85e960378a47bcb0169581952af1

Given the two output examples above, the log entries can be used to correlate operations to identify if a timeout was caused by server duration or was due to something else. The operation_id 0x_0769020a and connection ID_ 002c2b0d250e6fc5/002c2b0c723e11c5 appears in both logs.

Orphaned operation summaries are to be logged at the WARN, or equivalent, level.

Future Enhancements

Composite Tracer

The CompositeTracer is a vasard over a collection of sub-Tracer implementations and as a span finishes it passes the span onto each inner_ Tracer_. This would be useful to combine the functionality of a number of other Tracer implementations.

NOTE: This was moved to be a future enhancement because typically tracer implementations require custom Span implementations as they (likely) include additional information. eg the ThresholdLoggingTracer described above includes a summary object. This means the generic interface ISpan is not as useful.

Language Details

Example Java Configuration

ThresholdLogReporter reporter = ThresholdLogReporter.builder()
    .logInterval(10, TimeUnit.SECONDS)
    .sampleSize(10)
    .kvThreshold(500000, TimeUnit.MICROSECONDS)
    .viewThreshold(1000000, TimeUnit.MICROSECONDS)
    .n1qlThreshold(1000000, TimeUnit.MICROSECONDS)
    .ftsThreshold(1000000, TimeUnit.MICROSECONDS)
    .analyticsThreshold(1000000, TimeUnit.MICROSECONDS)
   .build();

CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
   .tracingEnabled(true)
   .tracer(ThresholdLogTracer.create(reporter))
   .build();

CouchbaseCluster cluster = CouchbaseCluster.create(env, "10.112.181.101");

Example Java Timeout Returns

Exception in thread "main" java.lang.RuntimeException: java.util.concurrent.TimeoutException: {"b":"travel-sample","r":"127.0.0.1:11210","s":"kv","c":"0053FF30C2825267/FFFFFFFFD34A9178","t":10000,"i":"0x1","l":"127.0.0.1:64351"}

Example Python Configuration

cluster = Cluster('couchbase://10.112.181.101?enable_tracing=true')
cluster.authenticate(PasswordAuthenticator('Administrator', 'password'))
bucket = cluster.open_bucket('default')

# pending changes regarding queue flush size / log interval
bucket.TRACING_ORPHANED_QUEUE_FLUSH_INTERVAL = 5
bucket.TRACING_ORPHANED_QUEUE_SIZE = 10
bucket.TRACING_THRESHOLD_QUEUE_FLUSH_INTERVAL = 5
bucket.TRACING_THRESHOLD_QUEUE_SIZE = 10
bucket.TRACING_THRESHOLD_KV = 0.00001
bucket.TRACING_THRESHOLD_N1QL= 0.00001
bucket.TRACING_THRESHOLD_VIEW = 0.00001
bucket.TRACING_THRESHOLD_FTS = 0.00001
bucket.TRACING_THRESHOLD_ANALYTICS = 0.00001

Example Python Timeout Returns

On a timeout, the Python SDK would return identification information to the application that may log it. It is returned in a dictionary listed on the "Tracing Output" entry. The keys of this dictionary represent the keys of timed-out operations, as most PYCBC operations have a multi-result option.

Jan 4 13:14:19 myhost02 ERROR __init__ P64321 T140101859555072 <Key=u'foobar', RC=0x17[Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout,  Operational Error, Results=1, C Source=(src/multiresult.c,309), Tracing Output=

 {"foobar": {"c": "000000003bdea42d/f5b76648029c27da", "b": "default", "i": 6846049684839775971, "l": "10.0.0.5:59955", "s": "kv:get", "r": "10.0.0.5:59940", "t": 1}, "foo": {"c": "000000003bdea42d/7913a3b7fcfc38ee", "b": "default", "i": 9314454879042790065, "l": "10.0.0.5:59957", "s": "kv:get", "r": "10.0.0.5:59942", "t": 1}, "bar": {"c": "000000003bdea42d/a29e6b704caea8f9", "b": "default", "i": 12610892345893574011, "l": "10.0.0.5:59958", "s": "kv:get", "r": "10.0.0.5:59943", "t": 1}}

Example PHP Configuration

< TO ADD >

Example PHP Timeout Returns

//KV

{main}[cb,WARN] (server L:626 I:1735235899) Failing command with error LCB_ETIMEDOUT (0x17): {"b":"travel-sample","i":"00000000676d993b/222ce886773e17c8/1500","l":"127.0.0.1:63698","r":"127.0.0.1:11210","s":"kv:get","t":1}

Example Node.js Configuration

< TO ADD >

Example Node.js Timeout Returns

See LCB implementation below. The error returned will be similar to previous versions, but log lines will be generated by LCB as described by this RFC.

Example .NET Configuration

var tracer = new ThresholdLoggingTracer()
{
    Interval = 10000,
    SampleSize = 10,
    KvThreshold = 500000,
    ViewThreshold = 1000000,
    N1qlThreshold = 1000000,
    SearchThreshold = 1000000,
    AnalyticsThreshold = 1000000
};

var config = new ClientConfiguration();
config.Tracer = tracer;

Example .NET Timeout Returns

// KV

2018-05-16 10:42:11,733 [Worker#STA_NP] INFO Couchbase.IO.ConnectionBase - The operation has timed out. {"s":"kv","i":"_0769020a","c":"002c2b0d250e6fc5/002c2b0c723e11c5_","b":"default","l":"192.168.1.104:16874","r":"10.112.181.101:11210","t":250000}

// N1QL

2018-05-16 10:42:11,733 [Worker#STA_NP] INFO Couchbase.Query.QueryClient - The operation has timed out. {"s":"n1ql","i":"B639B0AC-1DA1-4013-BA42-127F066C165A","r":"10.112.181.101:8093"}

// Views

2018-05-16 10:42:11,733 [Worker#STA_NP] INFO Couchbase.Views.ViewClient - The operation has timed out. {"s":"view","b":"default","r":"10.112.181.101:8092}

Note: All services other than KV do not have access to all the contextual information. This either because the we use the .NET HttpClient which doesn't expose local endpoint, or the contextual information is not available at the same level where timeouts occur. eg timeout is used to create a CancellationToken, but the raw value is not passed to the Client.

I've raised the following tickets to try and improve on these:

Views - https://issues.couchbase.com/browse/NCBC-1688
Query - https://issues.couchbase.com/browse/NCBC-1687
Search - https://issues.couchbase.com/browse/NCBC-1689
Analytics - https://issues.couchbase.com/browse/NCBC-1690

Example Go Configuration

tracer := gocb.ThresholdLoggingTracer{}
tracer.TickInterval = 30 * time.Second
tracer.KvFloor = 750 * time.Millisecond
tracer.QueryFloor = 2 * time.Second
tracer.ViewsFloor = 2 * time.Second
tracer.AnalyticsFloor = 2 * time.Second
tracer.SearchFloor = 2 * time.Second
tracer.SampleSize = 15

Example Go Timeout Returns

"operation has timed out"

As per note earlier in document: the Go SDK is an exception to this section, in that it does not log timeouts with additional data as the context is not able to be carried along with a Go error.

Example LCB Timeout Returns

< TO ADD >

SDK Verification Scenarios

To ensure each SDK is implemented in the same way with the output being consistent, the following scenarios will be used to verify API, behaviour and output.

API Default Threshold Logging Tracer settings Default Orphaned Response Logger settings Configure Threshold Logging Tracer Configure Orphaned Response Logger

Output Threshold Logging Tracer Orphaned Response Logger Timeout message

Previously Considered Tracer Solutions

Heuristic average

To help identify when an operation is unusually slow, each SDK could implement a SlowOperationTracer that will keep a heuristic average duration of the top-level Spans. As a top-level span completes, its duration is then compared to the average plus a deviation percentage and if it is greater, it is logged.

For example; given a deviation percentage of 75%, the pseudo code for a top-level span being finished and reported to the SlowOperationTracer would look like the following.

completedSpans = { 1234, 3321, 12234, 44322, 54533 } // last five operation durations
deviationPercentage = 0.75 // 75%
If span.duration >= completedSpans.average + completedSpans.average * deviationPercentage
begin
    logger.log(span)
end

Because the average request durations is likely to be very different between service types (KV, N1QL, etc), there should either be a SlowOperationTracer per service or a single instance by it internally manages a per-service duration buffer. Each instance should be configured with a number of operation durations to remember to calculate the average with and the deviation percentage.

Spans are not logged as they finish so the internal _Span _implementation should be extended to maintain a collection of sub-spans and when building a new Span using SpanBuilder.ChildOf, the new span is added to the parent's span collection. This ensures that when the top-level span is logged, it had a full reference to all sub-spans so a complete structure can be created. An example output JSON structure can be seen here.

If a top-level span has been identified as slow, it should be logged to the configured SDK logging infrastructure.

To prevent flooding logs, each SDK could maintain a heuristic average operation duration and two configuration values slow operation percentile and slow operation floor.

Questions

Q: Is there a common way to modify the KV node logging of 'slow'? If so, what is that API and behavior?

A: The KV slow operation threshold is configured on the cluster and can be updated using the cluster management REST interface. A way to programmatically configure the server's KV slow operation threshold by an SDK in the future.

Q: Is there a way to change the logging behavior of other services based on application definition of what is 'slow'?

A: There is both a client level calculated slow operation threshold using a percentile and a slow operation floor that are both flexible will only log trace information for what is unusual.

Q: How is trace logging of timed out operations handled? In particular, is there special handling for reading network responses that are for timed out operations?

A: Timed out operations are returned to the application when the client timeout has been reached, this means the response packet may be received afterwards. To prevent having to maintain state for in memory for operations that have timed out, the SDK will log the trace information at the point the operation timed out, and then the response is received, another log will be written with the same correlation ID and server duration.

Q: Since the logging is JSON and logging with multiple lines makes for hard to consume logs, should it be guaranteed one line or have formatting?

A: Yes, the JSON logged should be in a compressed format where line breaks and whitespace is removed to prevent consuming many lines in a log file and ensure it's easily readable by another process analysing the log files.

Q: Shоuld it be enabled by default? Recording all these timestamps will require a lot of requests to system clock. And storing extra data per operation will increase memory consumption.

A: The PRD requirement was to be on by default, to comply with that we will have it enabled as standard with the option to turn it off if wanted.

Q: Should we report directly to the tracer if no parent span is provided for an operation, or should that count as something we not reporting.

A: Any span that does not have a parent should be treated as a "top-level" span, which should be sent to the tracer. If the span is not meant to be, it should be regarded as a bug.

Change Log

Date	Description
2017-12-18	Add additional details for KV server duration encoding and response packet structure
2017-12-15	Re-write and tidy up enhanced log messages section with regards to format and example
2018-1-2	Clarify some properties descriptions Marked request encoding / response decoding as optional Clarified this feature should be "on by default" and that no tracing timings should be recorded if EnableOperationTracing is false Added Performance Concerns sections
2018-01-04	Accepted many suggestions and clarifications. Added two new optional trace points (queue_time and resolution_time). Added two new future tracing points. Updated KV duration encoding algorithm.
2018-01-22	Overhauled RFC with focus on adopting OpenTracing API and implementing spans Added Zombie response handling
2018-01-29	Add Client / connection IDs for KV and HTTP services Updated list of expected tags Updated names of trace points to be more explicit
2018-01-30	Added total_time_us to SpanSimmary structure used for logging slow operations.
2018-02-01	Updated new Client / Connection ID format Added example unit test to verify KV server duration decoding
2018-02-15	Rename Slow Operation Tracer to Threshold Logging Tracer Reword ThresholdLoggingTracer description to be less prescriptive of implementation
2018-02-19	Updated table of contents to reflect heading and section changes
2018-02-20	Updated peer.latency tag description to indicate time precision is required. Updated configuration names for threshold logging floors
2018-02-21	Use defined peer.service tag instead of custom couchbase.service span tag Update timeout message format Add description paragraph to orphaned response section to describe how to correlate operations
2018-04-10	Added last_operation_id and last_local_id span summary properties Updated Timeout log message format
2018-05-09	Extended optional N1QL tags to include standard metrics returned in payload. Updated ThresholdLoggingTracer example JSON to reflect updated property names.
2018-05-14	Updated timeout messages to provide additional data in a JSON object and add table for properties
2018-05-17	Update two references to threshold logging tracer interval from 10 seconds to 1 minute
2018-05-21	Added verification scenarios / expected output
2018-05-22	Revert Threshold Logging Tracer and Orphaned Response Logger default interval to 10 seconds.
2018-05-23	Added explicit note for expected logging level for Orphaned / zombie responses.
2018-05-24	Rename ThresholdLoggingTracerQueryThreshold to ThresholdLoggingTracerN1qlThreshold.
2018-05-28	Add separate config options for Orphaned Response Logging
2018-06-06	Update operation context JSON properties separate operation ID and connection ID.
2018-06-21	Removed all references to zombies and replaced with orphans.

Signoff

Language	Representative	Date
C	Sergey Avseyev	2018-06-25
Go	Brett Lawson	2018-06-25
Java	Michael Nitschinger	2018-06-21
.NET	Mike Goldsmith	2018-06-21
Node.js	Brett Lawson	2018-06-25
PHP	Sergey Avseyev	2018-06-25
Python	Ellis Breen	2018-06-25

Errata

13th September 2018:

Added document key span tag for KV operations.

Tag	Description
couchbase.operation_id	The unique ID of the operation, as described above.
couchbase.document_key	The document key (KV operations only). NOTE: This tag should be omitted if log redaction is enabled within the SDK.
peer.service	The service type, one of the following: kv, view, n1ql, search, analytics
local.address	The local socket hostname / IP and port, in the format: {hostname}:{port} To be added to dispatch spans when the local socket is available.
peer.address	The remote socket hostname / IP and port, in the format: {hostname}:{port} To be added to dispatch spans when the remote socket is available.
span.kind	"client"
db.type	"couchbase"
component	The client's identifier string (the 'a' property in the updated HELLO request), the same one that is shared with the server to identify the SDK. NOTE: This string does not require trimming to 200 characters.
peer.latency	The server duration with precision suffix. The suffix is required to indicate precision because KV is recorded in microseconds and N1QL query metrics use milliseconds. For example: 123us, 32.12ms

Files

0035-rto.md

Latest commit

History

0035-rto.md

File metadata and controls

Meta

Summary

Description

Tracing

Spans

Operation Name

Trace Points

Tags

Client / Connection IDs

Threshold Logging Tracer

SpanSummary

Log Tracer (Optional)

Noop Tracer

Server Durations

KV - Flexible Framing Extras

FrameInfo

N1QL, Analytics Query Server Duration

N1QL Profiling Baggage (optional)

Views

Enhancements to Timeout Exceptions

Responses for Orphaned (timed-out) Requests

Terminology

Implementation Notes

Example Telemetry

Operation Context Properties

Future Enhancements

Composite Tracer

Language Details

Example Java Configuration

Example Java Timeout Returns

Example Python Configuration

Example Python Timeout Returns

Example PHP Configuration

Example PHP Timeout Returns

Example Node.js Configuration

Example Node.js Timeout Returns

Example .NET Configuration

Example .NET Timeout Returns

Example Go Configuration

Example Go Timeout Returns

Example LCB Timeout Returns

SDK Verification Scenarios

Previously Considered Tracer Solutions

Heuristic average

Questions

Change Log

Signoff

Errata