POC: Tantivy documents as a trait #2071

ChillFish8 · 2023-06-05T14:38:14Z

Problem

Building on what #1352 describes, one of Tantivy's biggest limitations/pain points IMO is the fact that you must convert whatever document type you are using in your code, to a Tantivy Document type, which often involves re-allocating and a lot of extra code in order to walk through potentially nested objects (I.e. JSON objects) to be able to index it with tantivy.

Use cases

Often in high throughput situations you want to avoid needlessly allocating things, the current setup always requires allocating especially with nested JSON objects, in particular, Quickwit could be quite heavily affected by this change as it allows for borrowed keys and values from the original data.
Users with custom document types (i.e. lnx) can use the same type across the code base without creating a somewhat messy conversion step between tantivy documents and the documents used everywhere else in the code.

A limited solution

The solution to this issue initially could be thought up as a basic document trait which simply creates an
fn field_values(&self) -> impl Iterator<Item = (Field, Value)> method for accessing the document data. In theory with the use of GATs now we can even make the Value take borrowed data avoiding the allocation issue.

pub trait Document {
    fn field_values(&self) -> impl Iterator<Item = (Field, Value)>;
}

Problems with this approach

Although the above suggestion works for a basic setup, it doesn't really solve the issue since you still need a set of concrete tantivy values before you can index data, by which point the amount of effort it takes to convert to a Document is very small.

A fully flexible solution

To get around the issue of making document indexing more flexible and more powerful, we can extend how much of the system is represented by traits, in particular, we replace Document, Value, FieldValue, serde_json::Map<String, serde_json::Value>, serde_json::Value with a set of traits as described bellow:

`DocumentAccess` trait

Using GATs we can avoid a complicated set of lifetimes and unsafe, instead, we simply described the type the document uses for its values which can borrow data (Value<'a>), the owned version of this type which can be the same type potentially, but can also be different depending on application (OwnedValue) and finally, we describe the FieldsValuesIter which is just used to get around not be able to use anonymous types in the form of impl Iterator.

As you can see with the code below, we've replaced the FieldValue type with a simple tuple instead, technically this could be kept and made generic, but I don't think it's that useful to do so.

Compatibility

The original Document type has been kept, and simply implements the trait, meaning a user already using tantivy should not experience any direct conflict if they just keep using the original type.

/// The core trait representing a document within the index.
pub trait DocumentAccess: Send + Sync + 'static {
    /// The value of the field.
    type Value<'a>: DocValue<'a> + Clone
    where Self: 'a;
    /// The owned version of a value type.
    ///
    /// It's possible that this is the same type as the borrowed
    /// variant simply by using things like a `Cow`, but it may
    /// be beneficial to implement them as separate types for
    /// some use cases.
    type OwnedValue: ValueDeserialize + Debug;
    /// The iterator over all of the fields and values within the doc.
    type FieldsValuesIter<'a>: Iterator<Item = (Field, Self::Value<'a>)>
    where Self: 'a;

    /// Returns the number of fields within the document.
    fn len(&self) -> usize;

    /// Returns true if the document contains no fields.
    fn is_empty(&self) -> bool {
        self.len() == 0
    }

    /// Get an iterator iterating over all fields and values in a document.
    fn iter_fields_and_values(&self) -> Self::FieldsValuesIter<'_>;

    /// Create a new document from a given stream of fields.
    fn from_fields(fields: Vec<(Field, Self::OwnedValue)>) -> Self;

    /// Sort and groups the field_values by field.
    ///
    /// The result of this method is not cached and is
    /// computed on the fly when this method is called.
    fn get_sorted_field_values(&self) -> Vec<(Field, Vec<Self::Value<'_>>)> {
         // Code emitted for ease of reading
    }
}

`DocValue<'a>` trait

This trait is fairly simple in what it does, it simply defines the common methods on the old Document type as methods of the trait, and then has a generic JsonVisitor type which can be used to represent JSON data in more flexible ways and without allocation.

The original Value type implements this trait for compatibility and ergonomics, technically speaking if you wanted to do an approach similar to the first solution you can absolutely do this just by using the original tantivy types.

pub trait DocValue<'a>: Send + Sync + Debug {
    /// The visitor used to walk through the key-value pairs
    /// of the provided JSON object.
    type JsonVisitor: JsonVisitor<'a>;

    /// Get the string contents of the value if applicable.
    ///
    /// If the value is not a string, `None` should be returned.
    fn as_str(&self) -> Option<&str>;
    /// Get the facet contents of the value if applicable.
    ///
    /// If the value is not a string, `None` should be returned.
    fn as_facet(&self) -> Option<&Facet>;
    /// Get the u64 contents of the value if applicable.
    ///
    /// If the value is not a u64, `None` should be returned.
    fn as_u64(&self) -> Option<u64>;
    /// Get the i64 contents of the value if applicable.
    ///
    /// If the value is not a i64, `None` should be returned.
    fn as_i64(&self) -> Option<i64>;
    /// Get the f64 contents of the value if applicable.
    ///
    /// If the value is not a f64, `None` should be returned.
    fn as_f64(&self) -> Option<f64>;
    /// Get the date contents of the value if applicable.
    ///
    /// If the value is not a date, `None` should be returned.
    fn as_date(&self) -> Option<DateTime>;
    /// Get the bool contents of the value if applicable.
    ///
    /// If the value is not a boolean, `None` should be returned.
    fn as_bool(&self) -> Option<bool>;
    /// Get the IP addr contents of the value if applicable.
    ///
    /// If the value is not an IP addr, `None` should be returned.
    fn as_ip_addr(&self) -> Option<Ipv6Addr>;
    /// Get the bytes contents of the value if applicable.
    ///
    /// If the value is not bytes, `None` should be returned.
    fn as_bytes(&self) -> Option<&[u8]>;
    /// Get the pre-tokenized string contents of the value if applicable.
    ///
    /// If the value is not pre-tokenized string, `None` should be returned.
    fn as_tokenized_text(&self) -> Option<&PreTokenizedString>;
    /// Get the JSON contents of the value if applicable.
    ///
    /// If the value is not pre-tokenized string, `None` should be returned.
    fn as_json(&self) -> Option<Self::JsonVisitor>;
}

`JsonVisitor<'a>` trait

The JSON visitor effectively just replaces the Map<String, Value> with a trait that allows for walking through the object.

This also means that any type which implements this trait can also be serialized via serde_json.

/// A trait representing a JSON key-value object.
///
/// This allows for access to the JSON object without having it as a
/// concrete type.
pub trait JsonVisitor<'a> {
    /// The visitor for each value within the object.
    type ValueVisitor: JsonValueVisitor<'a>;

    #[inline]
    /// The size hint of the iterator length.
    fn size_hint(&self) -> usize {
        0
    }

    /// Get the next key-value pair in the object or return `None`
    /// if the element is empty.
    fn next_key_value(&mut self) -> Option<(&'a str, Self::ValueVisitor)>;
}

`JsonValueVisitor<'a>` trait

Similar to the JSON visitor it describes the behaviour of a serde_json::Value rather than just an object.

/// A trait representing a JSON value.
///
/// This allows for access to the JSON value without having it as a
/// concrete type.
///
/// This is largely a subset of the `Value<'a>` trait.
pub trait JsonValueVisitor<'a> {
    /// The iterator for walking through the element within the array.
    type ArrayIter: Iterator<Item = Self>;
    /// The visitor for walking through the key-value pairs within
    /// the object.
    type ObjectVisitor: JsonVisitor<'a>;

    /// Checks if the value is `null` or not.
    fn is_null(&self) -> bool;
    /// Get the string contents of the value if applicable.
    ///
    /// If the value is not a string, `None` should be returned.
    fn as_str(&self) -> Option<&str>;
    /// Get the i64 contents of the value if applicable.
    ///
    /// If the value is not a i64, `None` should be returned.
    fn as_i64(&self) -> Option<i64>;
    /// Get the u64 contents of the value if applicable.
    ///
    /// If the value is not a u64, `None` should be returned.
    fn as_u64(&self) -> Option<u64>;
    /// Get the f64 contents of the value if applicable.
    ///
    /// If the value is not a f64, `None` should be returned.
    fn as_f64(&self) -> Option<f64>;
    /// Get the bool contents of the value if applicable.
    ///
    /// If the value is not a boolean, `None` should be returned.
    fn as_bool(&self) -> Option<bool>;
    /// Get the array contents of the value if applicable.
    ///
    /// If the value is not an array, `None` should be returned.
    fn as_array(&self) -> Option<Self::ArrayIter>;
    /// Get the object contents of the value if applicable.
    ///
    /// If the value is not an object, `None` should be returned.
    fn as_object(&self) -> Option<Self::ObjectVisitor>;
}

Deserialization via `ValueDeserialize`

Originally I wanted to use something like serde::DeserializedOwned for this job, but it became obvious that what it gained in ease of compatibility with serde, it lost in terms of complexity when handling custom deserializer. A single simple trait became better for this purpose.

One thing that could be improved further is passing a JSON deserializer for the JSON values, at the moment we pass in a Map<String, serde_json::Value> which is fairly limited at the moment and not the most ergonomic thing in the world.

Compatibility and generic methods

One of the things the trait approach requires is a set of generic where documents to be handled, which gives us two ways to handle it:

We can add generics everywhere on things like Index, etc... so it's one document type for the whole index, with a default type keeping compatibility with existing code. The problem is it causes everything to require generics (See: Add document trait ChillFish8/tantivy#2 and the thousands of lines changed, which isn't even finished!)
We add generics only to the methods which require them (i.e. when adding or fetching docs), making it potentially break some existing projects if the type cannot be inferred (although this is fairly unlikely), which is this PR.

* Fix windows build

src/core/json_utils.rs

src/query/more_like_this/query.rs

src/schema/document.rs

PSeitz · 2023-08-24T15:54:31Z

While I don't have any strong opinions on this, I do think it's potentially even more confusing if we go "Ah yes you can serialize with a customs value but you can't deserialize back to a custom value, you have to use the concrete Document type" and also you do lose a little bit of efficiency having to re-construct your types from the Document type rather than walking through with the deserializer, which I think can become more apparent at higher QPS.

We don't need to serialize with a custom value, but just provide an API over the data.
The same API can be provided back via a Document (maybe also deserialize directly from the Document format is possible).
If the user serializes the value, it would also move things like format handling to the user.
I tend to think that this is not worth the added complexity. I don't have data on the cost of deserialization for high QPS cases.

ChillFish8 · 2023-08-25T22:02:20Z

We don't need to serialize with a custom value, but just provide an API over the data.
The same API can be provided back via a Document (maybe also deserialize directly from the Document format is possible).
If the user serializes the value, it would also move things like format handling to the user.
I tend to think that this is not worth the added complexity. I don't have data on the cost of deserialization for high QPS cases.

I see your point, but I don't think removing it is adding a huge amount of additional complexity, it makes the code a bit more sparse and abstracted, but the way we serialize and deserialize data is exactly the same, I just re-wrote it from a somewhat dense block of code on the Document type to avoid make it work well with custom types when indexing.

The only thing I can see happening is if we provide custom documents for indexing but not retrieval, the question will inevitably come up as "Why?" which I can understand, also it can be quite convenient deserializing into a custom type but it's not necessarily the end of the world I suppose.

PSeitz · 2023-08-28T06:40:14Z

I see your point, but I don't think removing it is adding a huge amount of additional complexity, it makes the code a bit more sparse and abstracted, but the way we serialize and deserialize data is exactly the same, I just re-wrote it from a somewhat dense block of code on the Document type to avoid make it work well with custom types when indexing.

The only thing I can see happening is if we provide custom documents for indexing but not retrieval, the question will inevitably come up as "Why?" which I can understand, also it can be quite convenient deserializing into a custom type but it's not necessarily the end of the world I suppose.

I had another look and misunderstood it before. There's no custom serialization format, right?
In that case I think it's fine.

Overall the PR looks good, nice job!

ChillFish8 · 2023-08-28T11:09:27Z

I had another look and misunderstood it before. There's no custom serialization format, right?
In that case I think it's fine.

Format is effectively the exact same as it was before, just with the additional codes to handle collections and objects, but it is not user defined.

fulmicoton · 2023-09-14T07:14:24Z

src/schema/document/mod.rs

+    #[inline]
+    /// If the Value is a pre-tokenized string, returns the associated string. Returns None
+    /// otherwise.
+    fn as_tokenized_text(&self) -> Option<&'a PreTokenizedString> {


as_pretokenized_text...

fulmicoton · 2023-09-14T07:16:15Z

@PSeitz You can merge whenever you see fit.

# Conflicts: # Cargo.toml # examples/warmer.rs # src/aggregation/bucket/histogram/date_histogram.rs # src/core/index.rs # src/directory/mmap_directory.rs # src/functional_test.rs # src/indexer/index_writer.rs # src/indexer/segment_writer.rs # src/lib.rs # src/query/boolean_query/boolean_query.rs # src/query/boolean_query/mod.rs # src/query/disjunction_max_query.rs # src/query/fuzzy_query.rs # src/query/more_like_this/more_like_this.rs # src/query/range_query/range_query.rs # src/query/regex_query.rs # src/query/term_query/term_query.rs # tests/failpoints/mod.rs

PSeitz · 2023-10-01T16:41:45Z

test_json_indexing fails because "complexobject": {"field.with.dot": 1} is indexed as u64 and then queried with i64.
The current order is i64,u64,f64. I'm not sure what's the best way to handle this when the type is user provided. Maybe it should not be user provided.

PSeitz · 2023-10-02T03:53:53Z

src/schema/value.rs

+                if let Some(val) = number.as_u64() {
+                    Self::U64(val)
+                } else if let Some(val) = number.as_i64() {
+                    Self::I64(val)


Suggested change

if let Some(val) = number.as_u64() {

Self::U64(val)

} else if let Some(val) = number.as_i64() {

Self::I64(val)

if let Some(val) = number.as_i64() {

Self::I64(val)

} else if let Some(val) = number.as_u64() {

Self::U64(val)

PSeitz · 2023-10-02T03:58:34Z

src/schema/value.rs

            }
+            serde_json::Value::String(val) => Self::Str(val),


if can_be_rfc3339_date_time(&text) { match OffsetDateTime::parse(&text, &Rfc3339) { Ok(dt) => { let dt_utc = dt.to_offset(time::UtcOffset::UTC); Self::Date(DateTime::from_utc(dt_utc)) } Err(_) => Self::Str(text), } } else { Self::Str(text) }

fn can_be_rfc3339_date_time(text: &str) -> bool { if let Some(&first_byte) = text.as_bytes().get(0) { if first_byte >= b'0' && first_byte <= b'9' { return true; } } false }

PSeitz · 2023-10-02T04:13:24Z

src/query/fuzzy_query.rs

            r#"{
            "attributes": {
-                "aa": "japan"
+                "as": "japan"


Suggested change

"as": "japan"

"aa": "japan"

add Binary prefix to binary de/serialization

PSeitz · 2023-10-02T08:02:12Z

I had an extra pass and fixed the remaining issues. Thanks for the really nice PR!

softhuafei · 2024-05-08T07:06:21Z

Thank you for your work. I have a question. The current document serialization interface BinaryDocumentSerializer::serialize_doconly serializes stored fields. Is there any other interface that allows us to completely serialize and deserialize the document?

PSeitz · 2024-05-08T07:59:54Z

No, but I think you could add a stored bytes field and put your custom serialized doc there

softhuafei · 2024-05-08T08:50:42Z

No, but I think you could add a stored bytes field and put your custom serialized doc there

Thank you for your answer. Is there any example of custom document and related serialization and deserialization?

PSeitz · 2024-05-08T09:05:07Z

There is no custom document serialization and deserialization, having your real doc nested in a bytes field would be a workaround.

softhuafei · 2024-05-08T09:58:36Z

Sorry, my previous statement was not clear enough. I want to be able to fully serialize/deserialize TantivyDocument, just like the old version of document:

impl BinarySerializable for Document {
    fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
        let field_values = self.field_values();
        VInt(field_values.len() as u64).serialize(writer)?;
        for field_value in field_values {
            field_value.serialize(writer)?;
        }
        Ok(())
    }

    fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
        let num_field_values = VInt::deserialize(reader)?.val() as usize;
        let field_values = (0..num_field_values)
            .map(|_| FieldValue::deserialize(reader))
            .collect::<io::Result<Vec<FieldValue>>>()?;
        Ok(Document::from(field_values))
    }
}

Should I modify BinaryDocumentSerializer and add a new method to it, such as serialize_entire_doc, which behaves like serialize_doc, except that it will serialize all fields?

PSeitz · 2024-05-08T10:28:59Z

TantivyDocument has serde support. Would that work for you?

#[derive(Clone, Debug, serde::Serialize, serde::Deserialize, Default)]
pub struct TantivyDocument {
    field_values: Vec<FieldValue>,
}

ChillFish8 and others added 14 commits June 3, 2023 18:38

fix windows build (#1)

a044195

* Fix windows build

Add doc traits

fd65797

Add field value iter

f5e570d

Add value and serialization

82ccece

Adjust order

6bfa500

Fix bug

49b5414

Correct type

08eaa17

Fix generic bugs

18a6680

Reformat code

843ce14

Add generic to index writer which I forgot about

acbda6a

Fix missing generics on single segment writer

eb8ada7

Add missing type export

597e8c5

Add default methods for convenience

9e73ada

Cleanup

f2ecb61

PSeitz reviewed Jun 8, 2023

View reviewed changes

src/core/json_utils.rs Outdated Show resolved Hide resolved

PSeitz reviewed Jun 8, 2023

View reviewed changes

src/query/more_like_this/query.rs Outdated Show resolved Hide resolved

PSeitz reviewed Jun 8, 2023

View reviewed changes

src/schema/document.rs Outdated Show resolved Hide resolved

Fix more-like-this query to use standard types

1e61736

trinity-1686a mentioned this pull request Jun 9, 2023

Indexing an array of small pieces of text is much slower than indexing a big string #1654

Open

ChillFish8 and others added 11 commits June 9, 2023 15:42

Update API and fix tests

7b6293a

Merge branch 'quickwit-oss:main' into main

73a344c

Add doc traits

da6f81d

Add field value iter

baf4be2

Add value and serialization

91dbe3f

Adjust order

2b75e2e

Fix bug

e958f3a

Correct type

ab6fbde

Rebase main and fix conflicts

c2bc724

Reformat code

e81c4b8

Merge upstream

a7bfd43

PSeitz approved these changes Sep 13, 2023

View reviewed changes

fulmicoton reviewed Sep 14, 2023

View reviewed changes

ChillFish8 added 7 commits September 27, 2023 22:29

Resolve comments

43ac334

Fix build

9b6b94a

Fix more build errors

b0d61f1

Fix more build errors

52db5ad

Fix the tests I missed

41f594d

Fix examples

e26dbbb

PSeitz reviewed Oct 2, 2023

View reviewed changes

PSeitz added 4 commits October 2, 2023 13:26

fix numerical order, serialize PreTok Str

45d5222

fix coverage

b6a06c2

rename Document to TantivyDocument, rename DocumentAccess to Document

af535f4

add Binary prefix to binary de/serialization

fix coverage

841d9ec

PSeitz merged commit 1c7c6fd into quickwit-oss:main Oct 2, 2023
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

POC: Tantivy documents as a trait #2071

POC: Tantivy documents as a trait #2071

ChillFish8 commented Jun 5, 2023 •

edited

Loading

PSeitz commented Aug 24, 2023

ChillFish8 commented Aug 25, 2023

PSeitz commented Aug 28, 2023

ChillFish8 commented Aug 28, 2023

fulmicoton Sep 14, 2023

fulmicoton commented Sep 14, 2023

PSeitz commented Oct 1, 2023

PSeitz Oct 2, 2023

PSeitz Oct 2, 2023 •

edited

Loading

PSeitz Oct 2, 2023

PSeitz commented Oct 2, 2023

softhuafei commented May 8, 2024

PSeitz commented May 8, 2024

softhuafei commented May 8, 2024

PSeitz commented May 8, 2024

softhuafei commented May 8, 2024

PSeitz commented May 8, 2024

POC: Tantivy documents as a trait #2071

POC: Tantivy documents as a trait #2071

Conversation

ChillFish8 commented Jun 5, 2023 • edited Loading

Problem

Use cases

A limited solution

Problems with this approach

A fully flexible solution

DocumentAccess trait

Compatibility

DocValue<'a> trait

JsonVisitor<'a> trait

JsonValueVisitor<'a> trait

Deserialization via ValueDeserialize

Compatibility and generic methods

PSeitz commented Aug 24, 2023

ChillFish8 commented Aug 25, 2023

PSeitz commented Aug 28, 2023

ChillFish8 commented Aug 28, 2023

fulmicoton Sep 14, 2023

Choose a reason for hiding this comment

fulmicoton commented Sep 14, 2023

PSeitz commented Oct 1, 2023

PSeitz Oct 2, 2023

Choose a reason for hiding this comment

PSeitz Oct 2, 2023 • edited Loading

Choose a reason for hiding this comment

PSeitz Oct 2, 2023

Choose a reason for hiding this comment

PSeitz commented Oct 2, 2023

softhuafei commented May 8, 2024

PSeitz commented May 8, 2024

softhuafei commented May 8, 2024

PSeitz commented May 8, 2024

softhuafei commented May 8, 2024

PSeitz commented May 8, 2024

ChillFish8 commented Jun 5, 2023 •

edited

Loading

`DocumentAccess` trait

`DocValue<'a>` trait

`JsonVisitor<'a>` trait

`JsonValueVisitor<'a>` trait

Deserialization via `ValueDeserialize`

PSeitz Oct 2, 2023 •

edited

Loading