diff --git a/changes/2025-09-12_introduce-metrics-interface/background.md b/changes/2025-09-12_introduce-metrics-interface/background.md
new file mode 100644
index 00000000..b9883bc5
--- /dev/null
+++ b/changes/2025-09-12_introduce-metrics-interface/background.md
@@ -0,0 +1,105 @@
+[//]: # "Copyright Amazon.com Inc. or its affiliates. All Rights Reserved."
+[//]: # "SPDX-License-Identifier: CC-BY-SA-4.0"
+
+# Interacting with a Metrics Interface in the AWS Encryption SDK family of products (Background)
+
+## Definitions
+
+### Conventions used in this document
+
+The key words
+"MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL"
+in this document are to be interpreted as described in
+[RFC 2119](https://tools.ietf.org/html/rfc2119).
+
+## Issues and Alternatives
+
+Crypto Tools (CT) publishes software libraries. The latest
+versions of these libraries have no logging or metrics publishing
+to either a local application or to an observability service like Amazon CloudWatch.
+
+Because these are client-side encryption libraries, emitting metrics must be done carefully so as
+to avoid accidentally [leaking](https://github.com/aws/aws-encryption-sdk-python/pull/105/files) any information related to the plaintext, which could lead to a
+loss of customer trust.
+
+A popular feature request has been for in-depth insights into CT libraries. Many customers
+ask for suggestions on how to reduce network calls to AWS Key Management Service (AWS KMS) and
+follow-up questions about cache performance.
+
+CT offers solutions to reduce network calls to AWS KMS through the Caching CMM and the AWS KMS Hierarchical Keyring.
+Today, there is no CT solution for customers to extract the performance metrics they are looking for.
+This can lead to frustrating debugging sessions and escalations that
+could have been resolved with additional information.
+
+Recent customer demand has prompted CT to re-evaluate client-side metrics to offer
+a better customer experience.
+
+### Issue 1: What will be the default behavior?
+
+As a provider of client-side encryption libraries, CT should be as cautious as possible.
+Customers of CT libraries should be in the driver's seat and determine for
+themselves if their application could benefit from emitting metrics.
+Making that decision for customers can erode customer trust.
+
+For CT to be comfortable with allowing metrics, CT must ensure that
+emitting them does not affect the availability of the consumer of the library.
+
+#### Opt-In (recommended)
+
+Because metrics are not emitted by default, existing customer workflows do not change.
+
+This allows customers to test how their applications behave when they start to emit
+metrics. Customers can then ask for updates to the implementations
+CT provides or customers can go and implement their own interfaces that are fine-tuned
+to their use cases.
+
+#### Always
+
+This option implies that CT guarantees that the availability of an application
+will not change. Perhaps a bold implication, but this is ultimately how the customer
+will feel: given no choice in the matter and opting not to upgrade.
+Going from never emitting metrics to always emitting them says to customers
+that their application, no matter its use case, will always benefit from metrics.
+Without letting customers make that choice, CT loses hard-earned customer trust.
+
+This also forces customers to make a choice: start collecting metrics and pick up
+additional updates CT provides, or get stuck on a version of the library that will
+become unsupported.
+
+Additionally, requiring customers to start emitting metrics
+almost certainly guarantees a breaking change across supported libraries.
+
+### Issue 2: Should Data Plane APIs fail if metrics fail to publish?
+
+#### No (recommended)
+
+Metrics publishing must not impact application availability.
+
+CT should allow for a fail-open approach when metrics fail to publish.
+This will prevent metric publishing issues from impacting the
+core functionality of the application.
+
+CT can consider this a two-way door: initially, CT does not attempt to retry
+publishing failed metrics, and can add this functionality later on.
+
+#### Yes
+
+This will become a problem for the libraries and will undoubtedly result
+in customer friction and falling adoption rates.
+Failing operations due to metrics not being published leaves the availability
+of the application to rest on the implementation of the metrics interface.
+This should not be the case; metrics should aid the customer application,
+not restrict it.
+
+### Issue 3: How will customers interact with the libraries to emit metrics?
+
+#### Provide an Interface
+
+Keeping in line with the rest of CT features, a well-defined interface with
+out-of-the-box implementations should satisfy the feature request.
+
+Out-of-the-box implementations should cover publishing metrics to an
+existing observability service like Amazon CloudWatch and to the local file system.
+These implementations should offer customers a guide to implementing their own
+if they wish to do so.
diff --git a/changes/2025-09-12_introduce-metrics-interface/change.md b/changes/2025-09-12_introduce-metrics-interface/change.md
new file mode 100644
index 00000000..94024b0b
--- /dev/null
+++ b/changes/2025-09-12_introduce-metrics-interface/change.md
@@ -0,0 +1,169 @@
+[//]: # "Copyright Amazon.com Inc. or its affiliates. All Rights Reserved."
+[//]: # "SPDX-License-Identifier: CC-BY-SA-4.0"
+
+# Adding a Metrics Interface
+
+**_NOTE: This document will be used to gain alignment on what
+this interface should look like and how it could be integrated with
+existing operations.
This document will not seek to specify
+a Metrics implementation or specify which metrics will be collected
+from impacted APIs or configurations._**
+
+## Affected APIs or Client Configurations
+
+This serves as a reference of the APIs and Client Configurations that this change affects.
+This list is not exhaustive. Any downstream consumer of any API or client configuration SHOULD
+also be updated as part of this proposed change.
+
+| API / Configuration |
+| --- |
+| [Encrypt](https://github.com/awslabs/aws-encryption-sdk-specification/blob/master/client-apis/encrypt.md) |
+| [Decrypt](https://github.com/awslabs/aws-encryption-sdk-specification/blob/master/client-apis/decrypt.md) |
+| [GetEncryptionMaterials](https://github.com/awslabs/aws-encryption-sdk-specification/blob/master/framework/cmm-interface.md#get-encryption-materials) |
+| [DecryptMaterials](https://github.com/awslabs/aws-encryption-sdk-specification/blob/master/framework/cmm-interface.md#decrypt-materials) |
+| [OnEncrypt](https://github.com/awslabs/aws-encryption-sdk-specification/blob/master/framework/keyring-interface.md#onencrypt) |
+| [OnDecrypt](https://github.com/awslabs/aws-encryption-sdk-specification/blob/master/framework/keyring-interface.md#ondecrypt) |
+| [DynamoDB Table Encryption Config](https://github.com/aws/aws-database-encryption-sdk-dynamodb/blob/main/specification/dynamodb-encryption-client/ddb-table-encryption-config.md) |
+
+## Affected Libraries
+
+| Library | Version Introduced | Implementation |
+| --- | --- | --- |
+| ESDK | T.B.D |
[ESDK.smithy](https://github.com/aws/aws-encryption-sdk/blob/mainline/AwsEncryptionSDK/dafny/AwsEncryptionSdk/Model/esdk.smithy) |
+| MPL | T.B.D | [material-provider.smithy](https://github.com/aws/aws-cryptographic-material-providers-library/blob/main/AwsCryptographicMaterialProviders/dafny/AwsCryptographicMaterialProviders/Model/material-provider.smithy) |
+| DB-ESDK | T.B.D | [DynamoDbEncryption.smithy](https://github.com/aws/aws-database-encryption-sdk-dynamodb/blob/main/DynamoDbEncryption/dafny/DynamoDbEncryption/Model/DynamoDbEncryption.smithy) |
+
+## Definitions
+
+### Conventions used in this document
+
+The key words
+"MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL"
+in this document are to be interpreted as described in
+[RFC 2119](https://tools.ietf.org/html/rfc2119).
+
+## Summary
+
+Existing users of Crypto Tools (CT) libraries do not have any insight into
+how the libraries behave in their application.
+This can lead to frustrating debugging sessions where users
+are required to have explicit tests to assert they are using a particular feature
+correctly, or, if they are using any of the KMS keyrings, to have
+Amazon CloudWatch open to verify their application is sending the values they expect.
+This can be seen as a best practice and users may find this a good exercise;
+however, CT's libraries do not make debugging an easy task.
+
+A feature which allows customers to get real-time telemetry of their application's
+integration with CT's libraries would be welcomed by users and will deliver on the
+"easy to use and hard to misuse" tenet.
+
+This change introduces a new interface that defines the operations that must be
+implemented in order to build a specification-compliant MetricsAgent.
+
+## Requirements
+
+The interface should have the following requirements.
+
+1. MUST be simple.
+1. MUST be extensible.
+
+The following is documented to signify its importance
+even though the interface is not able to make this guarantee.
+Every implementation of the proposed interface must
+ensure the following.
+
+1. MUST NOT block the application's execution thread.
+
+## Points of Integration
+
+To collect metrics across CT's library stack,
+multiple points of integration are needed.
+Generally, CT's libraries work as follows:
+
+_Note: Not every Client supports the Material Provider Library.
+The diagram below assumes it does, to help the mental model._
+
+```mermaid
+sequenceDiagram
+  participant Client
+  box Material Provider Library
+    participant CMM
+    participant Keyring
+  end
+  loop Content Encryption
+    Client->>Client: Content Encryption
+  end
+  Client<<->>CMM: GetEncryption/Decryption Materials
+  CMM<<->>Keyring: OnEncrypt/OnDecrypt
+```
+
+To optionally emit metrics from a top-level client,
+all customer-facing APIs MUST be changed to optionally accept
+a Metrics Agent. This will allow customers to define and supply one
+top-level Metrics Agent; this agent will get plumbed throughout CT's stack.
+
+For example, in the ESDK for Java this would look like:
+
+```java
+final AwsCrypto crypto = AwsCrypto.builder().build();
+// Create a Keyring
+final MaterialProviders matProv =
+    MaterialProviders.builder()
+        .MaterialProvidersConfig(MaterialProvidersConfig.builder().build())
+        .build();
+
+final IKeyring rawAesKeyring = matProv.CreateRawAesKeyring(keyringInput);
+final Map<String, String> encryptionContext =
+    Collections.singletonMap("ExampleContextKey", "ExampleContextValue");
+
+// Create a Metrics Agent
+final IMetricsAgent metrics = matProv.CreateSimpleMetricsAgent(metricsAgentInput);
+// 4.
Encrypt the data
+final CryptoResult<byte[], ?> encryptResult =
+    crypto.encryptData(rawAesKeyring, metrics, EXAMPLE_DATA, encryptionContext);
+final byte[] ciphertext = encryptResult.getResult();
+```
+
+This change will allow Crypto Tools to introduce a Metrics Agent in a
+non-breaking way, as the Metrics Agent will be an optional parameter
+at customer-facing API call sites.
+
+Currently, the ESDK client API models are defined [here](https://github.com/aws/aws-encryption-sdk/blob/mainline/AwsEncryptionSDK/dafny/AwsEncryptionSdk/Model/esdk.smithy#L60-L126).
+With this change, the client APIs would be modified as follows:
+
+```diff
+structure EncryptInput {
+  @required
+  plaintext: Blob,
+
+  encryptionContext: aws.cryptography.materialProviders#EncryptionContext,
+
+  // One of keyring or CMM are required
+  materialsManager: aws.cryptography.materialProviders#CryptographicMaterialsManagerReference,
+  keyring: aws.cryptography.materialProviders#KeyringReference,
+
+  algorithmSuiteId: aws.cryptography.materialProviders#ESDKAlgorithmSuiteId,
+
+  frameLength: FrameLength,
+
++ metricsAgent: aws.cryptography.materialProviders#MetricsAgentReference
+}
+...
+structure DecryptInput {
+  @required
+  ciphertext: Blob,
+
+  // One of keyring or CMM are required
+  materialsManager: aws.cryptography.materialProviders#CryptographicMaterialsManagerReference,
+  keyring: aws.cryptography.materialProviders#KeyringReference,
+  //= aws-encryption-sdk-specification/client-apis/keyring-interface.md#onencrypt
+  //= type=implication
+  //# The following inputs to this behavior MUST be OPTIONAL:
+  // (blank line for duvet)
+  //# - [Encryption Context](#encryption-context)
+  encryptionContext: aws.cryptography.materialProviders#EncryptionContext,
+
++ metricsAgent: aws.cryptography.materialProviders#MetricsAgentReference
+}
+```
diff --git a/framework/metrics-agent.md b/framework/metrics-agent.md
new file mode 100644
index 00000000..6235b433
--- /dev/null
+++ b/framework/metrics-agent.md
@@ -0,0 +1,247 @@
+[//]: # "Copyright Amazon.com Inc. or its affiliates. All Rights Reserved."
+[//]: # "SPDX-License-Identifier: CC-BY-SA-4.0"
+
+# Metrics Agent Interface
+
+_NOTE: Still in draft but in a state to receive feedback on 9-15-2025_
+
+## Version
+
+1.0.0
+
+### Changelog
+
+- 1.0.0
+  - Initial record
+
+## Implementations
+
+| Library | Version Introduced | Implementation |
+| --- | --- | --- |
+| ESDK | T.B.D | [ESDK.smithy](https://github.com/aws/aws-encryption-sdk/blob/mainline/AwsEncryptionSDK/dafny/AwsEncryptionSdk/Model/esdk.smithy) |
+| MPL | T.B.D | [material-provider.smithy](https://github.com/aws/aws-cryptographic-material-providers-library/blob/main/AwsCryptographicMaterialProviders/dafny/AwsCryptographicMaterialProviders/Model/material-provider.smithy) |
+| DB-ESDK | T.B.D |
[DynamoDbEncryption.smithy](https://github.com/aws/aws-database-encryption-sdk-dynamodb/blob/main/DynamoDbEncryption/dafny/DynamoDbEncryption/Model/DynamoDbEncryption.smithy) |
+
+## Overview
+
+The Metrics Agent defines operations that allow messages
+to be published to a destination.
+The Metrics Agent interface describes the interface that all
+Metrics Agents MUST implement.
+
+## Definitions
+
+### Conventions used in this document
+
+The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL"
+in this document are to be interpreted as described in [RFC 2119](https://tools.ietf.org/html/rfc2119).
+
+### label
+
+A label is a string that is used
+as an attribute name to aggregate
+measurements. A label can be used to
+add a dimension to the Metrics Agent.
+
+### date
+
+A date is a value in milliseconds since epoch.
+
+### duration
+
+A duration is a value in milliseconds.
+
+### count
+
+A count is a Long value.
+
+### value
+
+A value is a string that is used to attach
+context to a particular label.
+
+### transactionId
+
+A transactionId is a string that
+is used to coalesce multiple metric requests
+for a given client request.
+
+## Supported Metrics Agents
+
+Note: A user MAY create their own custom Metrics Agent.
+
+## Interface
+
+### Inputs
+
+The inputs to the MetricsAgent are groups of related fields, referred to as:
+
+- [AddDate Input](#adddate-input)
+- [AddTime Input](#addtime-input)
+- [AddCount Input](#addcount-input)
+- [AddProperty Input](#addproperty-input)
+
+#### AddDate Input
+
+This is the input to the [AddDate](#adddate) behavior.
+
+The add date input MUST include the following:
+
+- A [label](#label)
+- A [date](#date)
+
+The add date input MAY include the following:
+
+- A [transactionId](#transactionid)
+
+#### AddTime Input
+
+This is the input to the [AddTime](#addtime) behavior.
+
+The add time input MUST include the following:
+
+- A [label](#label)
+- A [duration](#duration)
+
+The add time input MAY include the following:
+
+- A [transactionId](#transactionid)
+
+#### AddCount Input
+
+This is the input to the [AddCount](#addcount) behavior.
+
+The add count input MUST include the following:
+
+- A [label](#label)
+- A [count](#count)
+
+The add count input MAY include the following:
+
+- A [transactionId](#transactionid)
+
+#### AddProperty Input
+
+This is the input to the [AddProperty](#addproperty) behavior.
+
+The add property input MUST include the following:
+
+- A [label](#label)
+- A [value](#value)
+
+The add property input MAY include the following:
+
+- A [transactionId](#transactionid)
+
+### Behaviors
+
+The MetricsAgent Interface MUST support the following behaviors:
+
+- [AddDate](#adddate)
+- [AddTime](#addtime)
+- [AddCount](#addcount)
+- [AddProperty](#addproperty)
+
+#### AddDate
+
+Used to record a specific date value under a given metric name.
+
+#### AddTime
+
+Used to aggregate a sum from multiple time values with the same metric name.
+
+#### AddCount
+
+Used to aggregate a sum from multiple count values with the same metric name.
+
+#### AddProperty
+
+Used to add context/metadata in the form of a key-value pair related to a Metrics reference.
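To make the difference between the four behaviors concrete, here is a minimal Java sketch, assuming an in-memory agent that sums `AddTime` and `AddCount` values per label and stores `AddDate` and `AddProperty` values as-is. The names `MetricsAgentSketch` and `InMemoryMetricsAgent` and the method signatures are hypothetical and are not part of any CT library; the generated interface may well look different.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical Java rendering of the four behaviors; not an API of any CT library.
interface MetricsAgentSketch {
  void addDate(String label, long epochMillis, String transactionId);
  void addTime(String label, long durationMillis, String transactionId);
  void addCount(String label, long count, String transactionId);
  void addProperty(String label, String value, String transactionId);
}

// In-memory implementation. transactionId is optional (may be null) and is ignored here.
final class InMemoryMetricsAgent implements MetricsAgentSketch {
  final Map<String, Long> dates = new ConcurrentHashMap<>();
  final Map<String, String> properties = new ConcurrentHashMap<>();
  private final Map<String, LongAdder> times = new ConcurrentHashMap<>();
  private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

  @Override
  public void addDate(String label, long epochMillis, String transactionId) {
    dates.put(label, epochMillis); // a single point in time; the latest write wins
  }

  @Override
  public void addTime(String label, long durationMillis, String transactionId) {
    times.computeIfAbsent(label, k -> new LongAdder()).add(durationMillis); // summed per label
  }

  @Override
  public void addCount(String label, long count, String transactionId) {
    counts.computeIfAbsent(label, k -> new LongAdder()).add(count); // summed per label
  }

  @Override
  public void addProperty(String label, String value, String transactionId) {
    properties.put(label, value); // key-value context, not aggregated
  }

  long totalTime(String label) {
    return sum(times, label);
  }

  long totalCount(String label) {
    return sum(counts, label);
  }

  private static long sum(Map<String, LongAdder> totals, String label) {
    LongAdder adder = totals.get(label);
    return adder == null ? 0L : adder.sum();
  }
}
```

A real agent would presumably also group entries by `transactionId`; the sketch leaves that out to keep the aggregation rules visible.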
+
+## Proposed Smithy Model
+
+```smithy
+use aws.polymorph#extendable
+
+@extendable
+resource MetricsAgent {
+  operations: [
+    AddDate,
+    AddTime,
+    AddCount,
+    AddProperty
+  ]
+}
+
+// Operations for different metric types
+operation AddDate {
+  input: AddDateInput,
+  output: AddOutput,
+  errors: [MetricsPutError]
+}
+
+operation AddTime {
+  input: AddTimeInput,
+  output: AddOutput,
+  errors: [MetricsPutError]
+}
+
+operation AddCount {
+  input: AddCountInput,
+  output: AddOutput,
+  errors: [MetricsPutError]
+}
+
+operation AddProperty {
+  input: AddPropertyInput,
+  output: AddOutput,
+  errors: [MetricsPutError]
+}
+
+// Input structures for each operation with flattened values
+structure AddDateInput {
+  @required
+  label: String,
+  @required
+  date: Timestamp,
+  transactionId: String
+}
+
+structure AddTimeInput {
+  @required
+  label: String,
+  @required
+  duration: Long, // Duration in milliseconds
+  transactionId: String
+}
+
+structure AddCountInput {
+  @required
+  label: String,
+  @required
+  count: Long,
+  transactionId: String
+}
+
+structure AddPropertyInput {
+  @required
+  label: String,
+  @required
+  value: String,
+  transactionId: String
+}
+
+// Common output structure
+structure AddOutput {}
+
+// Error structure
+@error("client")
+structure MetricsPutError {
+  @required
+  message: String
+}
+
+@aws.polymorph#reference(resource: MetricsAgent)
+structure MetricsAgentReference {}
+
+```
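The model above lets every operation return a `MetricsPutError`, while the requirements ask implementations not to block the application's execution thread and the background document recommends failing open. One way an implementation could reconcile these is sketched below in Java, assuming a single background worker with a bounded queue. The class `NonBlockingCountPublisher`, its `sink` callback, and the drop counter are hypothetical names for illustration only, and only `AddCount` is shown.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BiConsumer;

// Hypothetical sketch: publishes counts on a background thread and never throws to the caller.
final class NonBlockingCountPublisher {
  private final ThreadPoolExecutor worker;
  private final BiConsumer<String, Long> sink; // e.g. a file or service writer (assumed)
  private final AtomicLong dropped = new AtomicLong();

  NonBlockingCountPublisher(BiConsumer<String, Long> sink, int queueCapacity) {
    this.sink = sink;
    // One worker thread and a bounded queue; AbortPolicy rejects instead of blocking when full.
    this.worker = new ThreadPoolExecutor(
        1, 1, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(queueCapacity),
        new ThreadPoolExecutor.AbortPolicy());
  }

  // Returns immediately; failures are counted and otherwise ignored (fail-open).
  void addCount(String label, long count) {
    try {
      worker.execute(() -> {
        try {
          sink.accept(label, count);
        } catch (RuntimeException e) {
          dropped.incrementAndGet(); // publishing failed; the caller is not affected
        }
      });
    } catch (RejectedExecutionException e) {
      dropped.incrementAndGet(); // queue full or already shut down; drop the metric
    }
  }

  long droppedCount() {
    return dropped.get();
  }

  // Stops accepting new metrics and waits for queued ones; returns false on timeout or interrupt.
  boolean shutdownAndWait(long timeoutMillis) {
    worker.shutdown();
    try {
      return worker.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

Dropping on a full queue trades completeness of metrics for availability, which matches the "No (recommended)" answer to Issue 2. The worker here is a non-daemon thread, so a caller would need to invoke `shutdownAndWait` before the application exits.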