Proposal for vector db semantic convention #1231

ezimuel · 2024-07-10T15:14:58Z

This is a proposal for a vector db semantic convention (see #936). I tried to expand the db semantic convention by adding some db.vector attributes, focusing on the basic needs of a general-purpose vector database.

I proposed the following experimental attributes (updated with the feedback from this PR):

db.search.similarity_metric: the metric used in the similarity search (e.g. cosine)
db.record.id: the ID of the record (e.g. the ID of the vector)
db.vector.field_name: the name of the field that holds the vector embedding
db.vector.dimension_count: the number of dimensions of the vector (e.g. 1536)
db.vector.query.top_k: the number (k) of most similar vectors returned by a query (e.g. similarity search)

The operations performed in a vector db, such as insert, update, search and delete, can be represented using the existing db.operation.name attribute.

For similarity search, we can use the db.query attributes, such as db.query.parameter.<key>.
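
As a rough illustration of how the attributes listed above could sit together on one span, here is a minimal sketch. The helper `vector_search_attributes` and its parameters are hypothetical and not part of the proposal; only the attribute keys come from the list above, and the example values (cosine, 1536) are the ones given there.

```python
from typing import Optional


def vector_search_attributes(
    similarity_metric: str,
    field_name: str,
    dimension_count: int,
    top_k: int,
    record_id: Optional[str] = None,
) -> dict:
    """Collect the proposed attributes for a similarity search span.

    Hypothetical helper: only the attribute keys are taken from the proposal.
    """
    attributes = {
        # Existing attribute, reused for vector operations.
        "db.operation.name": "search",
        # Proposed experimental attributes.
        "db.search.similarity_metric": similarity_metric,
        "db.vector.field_name": field_name,
        "db.vector.dimension_count": dimension_count,
        "db.vector.query.top_k": top_k,
    }
    # The record ID is only set when the operation targets a single record.
    if record_id is not None:
        attributes["db.record.id"] = record_id
    return attributes


attrs = vector_search_attributes("cosine", "embedding", 1536, 5)
```

The resulting dictionary could be passed to whatever span API the instrumentation uses; that wiring is left out on purpose.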

linux-foundation-easycla · 2024-07-10T15:15:03Z

The committers listed above are authorized under a signed CLA.

✅ login: MadVikingGod / name: Aaron Clawson (ae0e066)
✅ login: lmolkova / name: Liudmila Molkova (6db7ec5, d99ec10, fd0f2e7)
✅ login: ChrsMark / name: Christos Markou (61b0f2c, e5e0d9d, bc8a63c)
✅ login: MSNev / name: Nev (93d2cbe)
✅ login: maryliag / name: Marylia Gutierrez (ceae2ca, 03b67bf, 1c6bd00)
✅ login: dependabot[bot] (a5f8661, a10e75f, d996cd9)
✅ login: ezimuel / name: Enrico Zimuel (fa8ee30, a3330ff, da6649b, 7068720, 5e12a86, 53d82d4, 828bacc, e5ff387, 3b61784, 81dca47, ff03da1, 2357766, 523bcb9, 9feb74d)
✅ login: jsuereth / name: Josh Suereth (daa0a14, f411554)

lmolkova · 2024-07-15T20:42:16Z

model/registry/db.yaml

+        brief: >
+          The dimension of the vector.
+        examples: [3]
+      - id: model


this should be captured with gen_ai.request.model (and other gen-ai attributes) - it's ok to mix different attributes on the same telemetry item.

This is the model for the embedding and can be different from the gen_ai.request.model. Vector databases are not strictly tied to GenAI. There are many use cases where a vector db can be used without GenAI (e.g. semantic search).

could you provide some examples of databases that can compute embeddings?

How would database instrumentation know which model was used to create embeddings if it only stores and queries them?

Some vendors that provide in-database embedding:

Weaviate, supporting the following provider integrations;

Elasticsearch, using the Inference API;

PostgresML provides in-database embedding generation;

Vespa provides the embedding using the embed() function;

Supabase provides an embedding generator using edge functions;

I think many other vendors will add this feature. Being able to generate embeddings in the database simplifies the customer use cases.

Regarding the question of how database instrumentation knows which model was used to create an embedding: this information is specified when you create a collection or when you run a search (e.g. in PostgresML). Moreover, I think the instrumentation libraries can also leverage this model attribute, since they know the embedding model used to generate the vectors.

It seems all of them use some GenAI model (in-process or external) and provide an integration layer with it.

I'm not sure what benefit defining a new attribute for databases brings. If embeddings become a cross-domain concern, let's find a generic attribute name for the vectorization model that can be reused between DB and GenAI. For now I strongly recommend against adding a new attribute.

Ok, I agree that we can wait to see how this topic progresses.

lmolkova · 2024-07-15T20:43:13Z

model/registry/db.yaml

+        type: int
+        stability: experimental
+        brief: >
+          The dimension of the vector.


is it the number of dimensions? let's call it something like db.vector.dimension_count and also update the brief

@ezimuel you marked this as resolved, but it looks like you missed this change; it still shows dimension instead of dimension_count

Fixed in a3330ff

lmolkova · 2024-07-15T20:43:32Z

model/registry/db.yaml

+        type: string
+        stability: experimental
+        brief: >
+          The name field as of the vector (e.g. a field name).


do we need both - the name and the id?

The id is the identifier of the vector; the name is the name of the field that contains the vector. Many databases use both, but some use only the name of the field. Maybe we should use better naming here.

@lmolkova what do you think about this? Thanks.

Could you provide some examples of id and name?

Let's pick a few (2-3) popular databases and explain what id and name would mean in their context.

E.g. I look into MongoDB and I don't understand what id or name would be.

Or I look into Azure Search and don't understand what should go into id.

I look into pinecone and it does not talk about ids.

Also, perhaps by vector you mean index? Or if it's about individual vectors, then how the index would be represented?

@lmolkova Here are some examples:

MongoDB uses name for the field (i.e. path in search here and field-name in the definition here)

Azure Search uses name (here)

Pinecone uses id (here)

Qdrant uses id (Point ID, here)

Elasticsearch uses name (here)

Milvus uses id (here)

Chroma uses id (here)

pgvector for PostgreSQL uses name (i.e. the field name with type vector(x), here)

Redis uses name (i.e. the field name that will contain the vector values in fieldname_embedding, here)

By db.vector.name I mean the field name in the document that will be used for the embedding (most common for DBs like MongoDB, Azure Search, Redis, Elasticsearch, PostgreSQL), and by db.vector.id I mean the identifier of the vector (mostly used in native vector dbs like Pinecone, Qdrant, Chroma, Milvus).

Basically these attributes, name and ID, are used to identify the vector. Maybe a better name for db.vector.name would be db.vector.field_name?
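
To make the split concrete, the examples above can be sketched as a small lookup. This is only an illustration: the lowercase system labels and the function `identifying_attribute` are hypothetical, the grouping simply mirrors the examples listed in this comment, and the attribute names are the ones used at this point in the discussion.

```python
# Hypothetical system labels; the grouping follows the examples listed above.
FIELD_NAME_SYSTEMS = {"mongodb", "azure search", "elasticsearch", "postgresql", "redis"}
RECORD_ID_SYSTEMS = {"pinecone", "qdrant", "milvus", "chroma"}


def identifying_attribute(system: str) -> str:
    """Return which attribute would identify the vector for a given system."""
    key = system.lower()
    if key in FIELD_NAME_SYSTEMS:
        # The vector is addressed through the document field that stores it.
        return "db.vector.name"
    if key in RECORD_ID_SYSTEMS:
        # The vector is addressed through its own identifier.
        return "db.vector.id"
    raise ValueError(f"no example given for {system!r}")
```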

Thank you for the context!

name

I agree that db.vector.field_name would be more descriptive.
It could be difficult to collect, though: for cosmos, postgres and many other dbs it would require parsing queries (and specific parsing for vector search). I.e. it'd probably mean that very few generic DB instrumentations will do it.

Even the native instrumentation that we have in CosmosDB does not know whether the query supplied by the user is doing vector search.

So I wonder how critical this attribute is for observability purposes. Also, it'd be mostly static. Is it important enough to justify the additional cost of populating it on each span?

A typical semconv decision making hint is: if we're not sure we need it, let's not add it. Adding attributes to spans is easy, removing them is breaking and hard.

id

I still don't understand what's behind db.vector.id - it seems to be a generic record id and there is nothing vector-specific here.

I support adding generic attribute like db.record.id: string or db.record.ids: string[] (needs polishing)

The db.vector.id is the record id but specific for the vector. The db.record.id will work as well, since it's just an identifier for the record (i.e. vector).

I think we should add db.vector.field_name, since it is used in many databases to specify which field has been used for the embedding.

lmolkova · 2024-07-15T20:45:10Z

model/registry/db.yaml

+    brief: >
+      This group defines attributes for vector databases.
+    attributes:
+      - id: similarity


suggesting more explicit name like db.vector.search.similarity_metric

@ezimuel I think you also missed this change; it is marked as resolved without any updates :)

Fixed in a3330ff

BTW, who/how/when is going to populate it?

It seems it's only available at index creation time and not available at query or insert time. I.e. very rarely.

Correct, this is generally provided during index creation, but some databases also use it in the query (e.g. Qdrant).

I think we can also leverage the Conditionally Required level for some attributes that are not always available, like the db.vector.search.similarity_metric, WDYT?

@lmolkova just a reminder for this, thanks.

Thanks, it makes sense!

BTW I think the attribute should be db.search.similarity_metric or similar - similarity is not always based on vectors and we don't need to limit this attribute.

karthikscale3 · 2024-07-17T16:19:10Z

Thanks for creating this PR. A few additional attributes which we instrument today with our SDK that we have found useful are the following:

db.index
db.namespace
db.collection.name
db.top_k
db.query

Thoughts on the ones listed above? cc @lmolkova

ezimuel · 2024-07-17T18:58:39Z

@karthikscale3 regarding the attributes that you proposed, some already exist:

db.index can be represented using db.operation.name = index;
db.namespace already exists;
db.collection.name already exists;
db.query can be represented using db.query.text and also db.query.parameter;

I think the top_k proposal is a good idea:

db.vector.query.top_k to represent the k most similar vectors returned by a similarity search;

Moreover, I found the proposal of the OpenLLMetry project very interesting, especially the part regarding the attributes for vector db, here:

 # Vector DB
 VECTOR_DB_VENDOR = "db.system"
 VECTOR_DB_OPERATION = "db.operation"
 VECTOR_DB_QUERY_TOP_K = "db.vector.query.top_k"

ezimuel · 2024-07-17T19:13:37Z

@lmolkova I applied all the feedback, thanks for the review. @karthikscale3 I added the top_k attribute, thanks.

Summary of the changes:

removed vector in db.system;
removed the db.vector.embeddings;
renamed db.vector.dimension to db.vector.dimension_count;
added the db.vector.query.top_k as suggested by @karthikscale3;
removed the allow_custom_values: true in db.yaml, see last commit;

karthikscale3 · 2024-07-17T19:15:49Z

Yea that sounds good! And yes, my intention was to reuse the existing ones. Wasn't sure if we needed them redefined for the sake of vector dbs or not. But sounds like it's unnecessary.

karthikscale3 · 2024-07-17T19:17:02Z

Thank you! From my side, everything looks good. We discussed this PR in today's working group call and @nirga wanted to take a deeper look at it once again.

ezimuel · 2024-07-18T06:20:22Z

I fixed the merge issues. Thanks @trask

nirga

Thanks, that's a great start! I wonder if we want to add specific spans that use these attributes in this PR as well?

ezimuel · 2024-07-20T16:04:24Z

@nirga can you give me an example of a specific span? FYI, I'm going offline and I'll come back on August 4 for further discussion.

nirga · 2024-07-20T16:15:55Z

Sorry, nvm I think this is already covered as part of the DB semconv

maryliag · 2024-07-25T13:23:49Z

docs/attributes-registry/db.md

@@ -199,6 +200,28 @@ This group defines attributes for Elasticsearch.

 **[8]:** Many Elasticsearch url paths allow dynamic values. These SHOULD be recorded in span attributes in the format `db.elasticsearch.path_parts.<key>`, where `<key>` is the url path part name. The implementation SHOULD reference the [elasticsearch schema](https://raw.githubusercontent.com/elastic/elasticsearch-specification/main/output/schema/schema.json) in order to map the path part values to their names.

+## Db Vector Attributes


nit: Vector Database Attributes

maryliag · 2024-07-25T13:32:53Z

@ezimuel looks like you missed some of the changes you marked as resolved:

rename db.vector.dimension to db.vector.dimension_count: it is still showing db.vector.dimension
rename db.vector.similarity to db.vector.search.similarity_metric: it is still showing db.vector.similarity

lmolkova · 2024-07-26T03:28:17Z

model/registry/db.yaml

+        brief: >
+          The model used for the embedding.
+        examples: 'text-embedding-3-small'
+      - id: query.top_k


I think we should come up with a more common attribute, not specific to vector dbs.
Many databases allow limiting the number of returned rows:

JDBC has Statement.setMaxRows,

Mongo allows to set a limit

Suggesting db.query.max_returned_items. The actual returned count could be even better - db.query.item_count could mean items inserted or returned depending on the operation.

@lmolkova I see the similarity here, but I think db.vector.query.top_k is more specific from a semantic point of view and more closely related to vectors, since it specifies the top k results in order, starting from the most similar. In semantic search every result carries a similarity value, which we don't have in a standard database. The SQL LIMIT clause returns the first k results, but not necessarily in order; that depends on how you build the query (e.g. using ORDER BY).
I personally think we should keep db.vector.query.top_k and potentially add a db.query.limit (or db.query.max_returned_items as you suggested) in a separate PR.

Out of the few dbs I checked, they use limit in vector search

pgvector uses traditional limit - same with cosmos and other sql databases

mongo uses limit

qdrant uses limit

So we're saying that DB instrumentations will need to detect whether a query is related to vector search or not and, depending on this, populate top-k or limit. That's difficult or impossible, but most importantly it's inconsistent and depends on instrumentation capabilities.

I.e. instrumentations that don't have vector-db specifics and those that do will use different attributes for the same thing.

So, I'd still prefer db.query.limit or something similar (and it should be under the same condition as db.query.text - we cannot require instrumentations to do query parsing)

@lmolkova do you agree that top-k and limit are two different concepts, based on my previous comment? If they are I think we cannot use a single attribute (e.g. db.query.limit) to manage both.

@lmolkova just a reminder for this, thanks.

@ezimuel I see that databases use both terms to describe the same thing (see my comment above).

Let's say you have a postgres query like SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5; - the general-purpose DB instrumentation can report limit. If you make it understand vector search syntax, it may be able to use top_k instead, but that would be inconsistent and unfamiliar for those who use vector search in postgres.

@lmolkova I see your point, but I think top-k has a different meaning from limit. If you are using a relational database as a vector db, limit is fine, since you are building an SQL statement and you specify the order. But if you are using a native vector database (e.g. Qdrant), top-k is more relevant, since "top" implies an ordering based on a similarity metric.

I think we should add both:

db.query.limit

db.vector.query.top_k
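
For illustration only, here is a minimal sketch of what a general-purpose instrumentation could report for the Postgres query quoted above, without understanding vector search syntax. The function `extract_limit` and the regular expression are hypothetical; this is a naive text match, not real query parsing, and db.query.limit is only a name suggested in this thread.

```python
import re
from typing import Optional

# Naive pattern for a trailing "LIMIT <n>" clause; not a real SQL parser.
_LIMIT_RE = re.compile(r"\bLIMIT\s+(\d+)\b", re.IGNORECASE)


def extract_limit(query_text: str) -> Optional[int]:
    """Return the LIMIT value found in the query text, or None if absent."""
    match = _LIMIT_RE.search(query_text)
    return int(match.group(1)) if match else None


query = "SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;"
limit = extract_limit(query)

# A generic instrumentation would only see a limit here; it has no way to
# tell from this value alone that the query is a similarity search.
attributes = {"db.query.limit": limit} if limit is not None else {}
```

The same value would have to be recognised as top_k by a vector-aware instrumentation, which is the inconsistency being discussed.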

lmolkova · 2024-07-26T03:33:36Z

We need to reference new attributes in the database spans conventions (see https://github.com/open-telemetry/semantic-conventions/blob/main/model/trace/database.yaml), specifically on the conventions for the databases we have there which support vector search.

We should describe how new attributes apply to them.

ezimuel · 2024-08-20T09:16:11Z

@lmolkova I provided the following changes:

renamed db.vector.name to db.vector.field_name;
removed db.vector.id in favor of db.record.id.

I think the only missing point is about db.vector.query.top_k, see my last comment.
Thanks!

AlexanderWert · 2024-09-20T12:13:06Z

@ezimuel Can you please update this PR with the new semconv folder structure? Thanks

lmolkova · 2024-09-23T16:09:25Z

thanks for working on this @ezimuel !

Please make sure to update actual database semantic conventions and reference attributes you're adding under those that report them - #1231 (comment)

ezimuel · 2024-09-25T12:59:21Z

@AlexanderWert I updated the PR and applied the suggestions from @lmolkova. I made the following changes:

removed the db.vector.model as reported here;
moved db.vector.search.similarity_metric to db.search.similarity_metric, as suggested here;

The only open question is about limit vs top-k, see my last comment.

@lmolkova I didn't understand what I was supposed to do in this comment, since the link https://github.com/open-telemetry/semantic-conventions/blob/main/model/trace/database.yaml does not work. Can you clarify? Thanks.

Proposal for vector db semantic convention

fa8ee30

ezimuel requested review from a team July 10, 2024 15:14

github-actions bot assigned arminru Jul 10, 2024

ezimuel mentioned this pull request Jul 10, 2024

VectorDB Semantic Convention #936

Open

gregkalapos reviewed Jul 15, 2024

model/registry/db.yaml (outdated)

lmolkova reviewed Jul 15, 2024

Merge + applied feedbacks open-telemetry#1231

7068720

Removed allow_custom_values: true in db.yaml

5e12a86

karthikscale3 approved these changes Jul 17, 2024

trask reviewed Jul 17, 2024

docs/attributes-registry/db.md (outdated)

docs/database/dynamodb.md (outdated)

Fixed merge

3b61784

Merge branch 'main' into vector-db

828bacc

nirga reviewed Jul 20, 2024

maryliag reviewed Jul 25, 2024

lmolkova reviewed Jul 26, 2024

karthikscale3 mentioned this pull request Aug 4, 2024

REQUEST: New membership for karthikscale3 open-telemetry/community#2256

Closed

6 tasks

ezimuel added 2 commits August 5, 2024 09:25

Merge remote-tracking branch 'upstream/main' into vector-db

53d82d4

Updated dimension_count and similarity_metric

a3330ff

maryliag and others added 15 commits August 20, 2024 11:12

Db metrics pending requests (open-telemetry#1290)

ceae2ca

Co-authored-by: Liudmila Molkova <[email protected]>

Fix process.args_count attribute (open-telemetry#1331)

6db7ec5

Add k8s.volume.{name,type} attributes (open-telemetry#1251)

e5e0d9d

Signed-off-by: ChrsMark <[email protected]>

Add tests for rego policies (open-telemetry#1334)

ae0e066

Co-authored-by: Aaron Clawson <[email protected]> Co-authored-by: Liudmila Molkova <[email protected]>

add nodejs.eventloop.time metric (open-telemetry#1259)

03b67bf

Co-authored-by: Liudmila Molkova <[email protected]>

chore: Remove support for the event fields referencing/inheriting d…

93d2cbe

…efinitions from global attributes. (open-telemetry#1340)

Attempt to optimise attribute name collision checks. (open-telemetry#…

f411554

…1328) Co-authored-by: Liudmila Molkova <[email protected]>

(chore) Add dependabot config to keep tooling up to date. (open-telem…

daa0a14

…etry#1346) Co-authored-by: Armin Ruech <[email protected]>

Fix broken docker link (open-telemetry#1332)

bc8a63c

Signed-off-by: ChrsMark <[email protected]> Co-authored-by: Liudmila Molkova <[email protected]>

Bump markdownlint-cli from 0.31.0 to 0.41.0 (open-telemetry#1349)

a5f8661

Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Liudmila Molkova <[email protected]>

Bump gulp from 4.0.2 to 5.0.0 (open-telemetry#1348)

a10e75f

Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Fix link anchors (open-telemetry#1354)

fd0f2e7

chore: update ids (open-telemetry#1352)

1c6bd00

Co-authored-by: Liudmila Molkova <[email protected]>

Removed db.vector.id and added db.record.id, renamed db.vector.field_…

9feb74d

…name

ezimuel requested review from a team August 20, 2024 09:13

Merge branch 'main' into vector-db

2357766

Merge from upstream/main

81dca47

ezimuel requested review from a team as code owners September 25, 2024 11:36

Removed db.vector.model and moved db.vector.search.similarity_metric …

ff03da1

…in db.search.similarity_metric

Merge branch 'main' into vector-db

523bcb9
