db.statement sanitization default behavior #3104

avzis · 2023-01-15T10:13:03Z

What are you trying to achieve?

According to the db spec:

The value may be sanitized to exclude sensitive information.

It does not strictly describes what should be the default behavior - sanitize the data, or not?

I am thinking that this approach might be risky, because of several cases:

Users might not be aware of the query collection, and might expose sensitive data.
In some instrumentations, db.statement was not collected in the beginning and was added (or will be added) in a later stage, when users are already using that instrumentation, which means that suddenly the Instrumentation might start exposing sensitive data without the user knowing about it.
There are in-consistencies between different packages and instrumentation, some are displaying sanitized data, and some are displaying the original data.

I am suggestion that the spec should clarify what should be the default behavior regarding sanitization.

Additional context.

This issue is talking about missing examples, which might also clarify the behavior

haddasbronfman · 2023-01-15T10:39:32Z

I would expect the default behavior would be not to present sensitive data.
It can be by using a default serializer function that omits the sensitive data, and it can be by a boolean configuration that the user should set if he wants to see his data.
This way we can make sure that he is aware of what he is doing.

(in js-contrib, redis instrumentation, has a default serializer that returns only the non-sensitive data:
https://github.com/open-telemetry/opentelemetry-js-contrib/blob/main/packages/opentelemetry-redis-common/src/index.ts#L57

and in mongo db we have the same behavior: (https://github.com/open-telemetry/opentelemetry-js-contrib/blob/main/plugins/node/opentelemetry-instrumentation-mongodb/src/instrumentation.ts#L606).

tombruijn · 2023-01-17T12:21:33Z

I agree that sanitizing the queries and other sensitive data by default would be a good idea. People can always choose to send the raw query if there's a config option for it.

In most cases sending unsanitized queries will likely leak PII data out to other systems. Something that could put companies and people in some legal trouble and open them up for security headaches.
It's also good that the people with access to the instrumented app data can't find user email addresses, passwords, addresses, phone numbers, etc. in traces.

Providing developers with config options to sanitize queries is good, but since it is not the default and configured by the user it may be less well tested and not cover all cases, or be forgotten when the instrumentation is added.

We make sure that anything we detect as an SQL queries is run through our own sql_lexer library to sanitize them before their send to our servers. But we can't do that for everything, like Redis queries, and it would be best to put the query sanitization in the instrumentation packages itself.

nemoshlag · 2023-01-22T15:50:18Z

I think it should be default sanitized and mentioned explicitly. Instead of removing db.statement attribute completely, can we set its value to "sanitized"? That way users can differentiate between empty statements/lack of instrumentation/sanitization.

tombruijn · 2023-01-24T14:41:13Z

I wouldn't say it needs to always remove the attribute to sanitize, or replace the value in its entirety. If the instrumentation package can store a sanitized version, that would still be helpful to any person who views the span data.

For example, a sanitized SQL query could look like this SELECT * FROM table WHERE email = "?", where the sanitized value is the question mark. That people can still recognize which queries are being performed, without all the email addresses, passwords and other things one wouldn't want to collect.

carlosalberto · 2023-01-24T17:40:31Z

Btw OTel Java does this at this moment. Maybe it would be helpful to replicate this behavior in other languages, sharing the parsing rules? See this Java-infused Lex definition: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation-api-semconv/src/main/jflex/SqlSanitizer.jflex

pellared · 2023-01-25T10:49:21Z

Related thing. I think that if the instrumentation library does not sanitize the db.statement then the attribute should not be added unless explicitly configured (it should be opt-in in that case). This is e.g. what https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/src/OpenTelemetry.Instrumentation.SqlClient does. The default options should be safe.

I know that some security engineers would even say that emitting db.statement should be always disabled by default as there are also "dynamic database structures" where e.g. table names represent usernames (but I think it goes to too far 😉 ).

Update the Redis instrumentation library to not change the default behavior for the Redis instrumentation. This can be enabled at a later time when the spec discussion about this topic has concluded. open-telemetry/opentelemetry-specification#3104

arielvalentin · 2023-02-01T01:07:58Z

+1 in favor of sanitizing by default or omitting if the instrumentation (or library) does not support it.

We offer options in the Ruby instrumentations of omit, include, obfuscate (probably should be sanitize) e.g. https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/main/instrumentation/pg/lib/opentelemetry/instrumentation/pg/instrumentation.rb#L28

It would be great to have standard option names and reduce the likelihood of accidentally exporting sensitive data.

* Add Redis instrumentation query sanitization Add a query sanitizer to the Redis instrumentation. This can be disabled with the `sanitize_query = False` config option. Given the query `SET key value`, the sanitized query becomes `SET ? ?`. Both the keys and values are sanitized, as both can contain PII data. The Redis queries are sanitized by default. This changes the default behavior of this instrumentation. Previously it reported unsanitized Redis queries. This was previously discussed in the previous implementation of this PR in PR #1571 Closes #1548 * Update Redis sanitize_query option documentation Changes suggested in #1572 (comment) * Remove uninstrument & instrument from test setup The Redis test that performs the tests with the default options, doesn't need to uninstrument and then instrument the instrumentor. This commit removes the unnecessary setup code. The setup code is already present at the top of the file. * Fix code style formatting * Update Redis functional tests - Update the sanitizer to also account for a max `db.statement` attribute value length. No longer than 1000 characters. - Update the functional tests to assume the queries are sanitized by default. - Add new tests that test the behavior with sanitization turned off. Only for the tests in the first test class. I don't think it's needed to duplicate this test for the clustered and async setup combinations. * Test Redis unsanitized queries by default Change the Redis functional tests so that they test the unsanitized query by default, and test the sanitized query results in the separate test functions. This is a partial revert of the previous commit 8d56c2f * Fix formatting issue in Redis utils * Disable Redis query sanitization by default Update the Redis instrumentation library to not change the default behavior for the Redis instrumentation. This can be enabled at a later time when the spec discussion about this topic has concluded. open-telemetry/opentelemetry-specification#3104 * Fix pylint issue Remove else statement. * Update changelog about Redis query sanitization default [ci skip] Co-authored-by: Srikanth Chekuri <[email protected]> * Fix potential error on Redis args being 0 Check the length of the args array and return an empty string if there are no args. That way it won't cause an IndexError if the args array is empty and it tries to fetch the first element, which should be the Redis command. --------- Co-authored-by: Srikanth Chekuri <[email protected]>

dubonzi · 2023-03-07T14:26:52Z

+1 in favor of sanitizing by default or omitting if the instrumentation (or library) does not support it.

It would be great to have standard option names and reduce the likelihood of accidentally exporting sensitive data.

Also +1 for sanitizing by default and ommiting if theres no support/implementation for it by the instrumentation.

The Go instrumentation for MongoDB doesn't sanitize the query, I had to warn another team at work about it because they deal with sensitive data and was about to migrate to open telemetry.

(see open-telemetry/opentelemetry-go-contrib#3519)

crebbo · 2023-03-17T15:08:43Z

This is a huge red flag for capturing db statements by default in tracing data, surely?

Trying to learn more about this behaviour currently, but I'm seeing our DynamoDB payload being captured in tracing span data (db.statement) when using the AWS ADOT collector for lambda out of the box (nodeJS).

Is there an easy way/environment var to stop this behaviour?

I'd previously seen Datadog was doing something similar with their python dd-trace, but they quickly patched this when I highlighted it.

Any pointers appreciated, will keep investigating

EDIT: I guess the best thing to do here is have custom processors which delete/hash known attributes that could contain sensitive data? An environment var to do this also would be nice, but can’t see that.

nemoshlag · 2023-03-26T12:38:14Z

+1 in favor of sanitizing by default or omitting if the instrumentation (or library) does not support it.

It would be great to have standard option names and reduce the likelihood of accidentally exporting sensitive data.

Also +1
I think this issue requires a decision since the current status where sensitive data can leak by default might prevent users from using otel

arminru · 2023-04-03T12:10:31Z

From the spec triage meeting: Accepting this one in retrospect as it looks like there's agreement to move forward as per the discussion on the PR.

avzis added the spec:trace Related to the specification/trace directory label Jan 15, 2023

github-actions bot assigned SergeyKanzhelev Jan 15, 2023

avzis mentioned this issue Jan 15, 2023

pymemcache query sanitization open-telemetry/opentelemetry-python-contrib#1576

Open

6 tasks

shalevr mentioned this issue Jan 16, 2023

Add Redis instrumentation query sanitization open-telemetry/opentelemetry-python-contrib#1572

Merged

8 tasks

avzis mentioned this issue Jan 22, 2023

Change db.statement to only be collected if there is sanitization #3127

Merged

nemoshlag mentioned this issue Jan 24, 2023

add elasticsearch db.statement sanitization open-telemetry/opentelemetry-python-contrib#1598

Merged

9 tasks

avzis mentioned this issue Feb 7, 2024

DB sanitization uniform format open-telemetry/semantic-conventions#717

Closed

nemoshlag mentioned this issue Feb 12, 2023

Add botocore db.statement sanitization for dynamoDB open-telemetry/opentelemetry-python-contrib#1662

Open

9 tasks

SergeyKanzhelev removed their assignment Feb 18, 2023

nemoshlag mentioned this issue Mar 1, 2023

Add sqlalchemy db.statement sanitization flag open-telemetry/opentelemetry-python-contrib#1701

Open

9 tasks

pellared mentioned this issue Mar 7, 2023

[otelmongo] Disable adding the mongo 'db.statement' tag by default open-telemetry/opentelemetry-go-contrib#3519

Merged

arminru added the [label deprecated] triaged-accepted [label deprecated] Issue triaged and accepted by OTel community, can proceed with creating a PR label Apr 3, 2023

reyang closed this as completed in #3127 Apr 6, 2023

Mpdreamz mentioned this issue Feb 7, 2024

Introduce common configuration option for sanitisation of db.statement. open-telemetry/semantic-conventions#705

Open

arielvalentin mentioned this issue Oct 4, 2023

feat!: obfuscation for mysql2, dalli and postgresql as default option for db_statement open-telemetry/opentelemetry-ruby-contrib#682

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

db.statement sanitization default behavior #3104

db.statement sanitization default behavior #3104

avzis commented Jan 15, 2023 •

edited

Loading

haddasbronfman commented Jan 15, 2023

tombruijn commented Jan 17, 2023

nemoshlag commented Jan 22, 2023

tombruijn commented Jan 24, 2023 •

edited

Loading

carlosalberto commented Jan 24, 2023

pellared commented Jan 25, 2023

arielvalentin commented Feb 1, 2023

dubonzi commented Mar 7, 2023 •

edited

Loading

crebbo commented Mar 17, 2023 •

edited

Loading

nemoshlag commented Mar 26, 2023

arminru commented Apr 3, 2023

db.statement sanitization default behavior #3104

db.statement sanitization default behavior #3104

Comments

avzis commented Jan 15, 2023 • edited Loading

haddasbronfman commented Jan 15, 2023

tombruijn commented Jan 17, 2023

nemoshlag commented Jan 22, 2023

tombruijn commented Jan 24, 2023 • edited Loading

carlosalberto commented Jan 24, 2023

pellared commented Jan 25, 2023

arielvalentin commented Feb 1, 2023

dubonzi commented Mar 7, 2023 • edited Loading

crebbo commented Mar 17, 2023 • edited Loading

nemoshlag commented Mar 26, 2023

arminru commented Apr 3, 2023

avzis commented Jan 15, 2023 •

edited

Loading

tombruijn commented Jan 24, 2023 •

edited

Loading

dubonzi commented Mar 7, 2023 •

edited

Loading

crebbo commented Mar 17, 2023 •

edited

Loading