Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC: Tantivy documents as a trait #2071

Merged
merged 69 commits into from
Oct 2, 2023

Conversation

ChillFish8
Copy link
Collaborator

@ChillFish8 ChillFish8 commented Jun 5, 2023

Problem

Building on what #1352 describes, one of Tantivy's biggest limitations/pain points IMO is the fact that you must convert whatever document type you are using in your code, to a Tantivy Document type, which often involves re-allocating and a lot of extra code in order to walk through potentially nested objects (I.e. JSON objects) to be able to index it with tantivy.

Use cases

  • Often in high throughput situations you want to avoid needlessly allocating things, the current setup always requires allocating especially with nested JSON objects, in particular, Quickwit could be quite heavily affected by this change as it allows for borrowed keys and values from the original data.
  • Users with custom document types (i.e. lnx) can use the same type across the code base without creating a somewhat messy conversion step between tantivy documents and the documents used everywhere else in the code.

A limited solution

The solution to this issue initially could be thought up as a basic document trait which simply creates an
fn field_values(&self) -> impl Iterator<Item = (Field, Value)> method for accessing the document data. In theory with the use of GATs now we can even make the Value take borrowed data avoiding the allocation issue.

pub trait Document {
    fn field_values(&self) -> impl Iterator<Item = (Field, Value)>;
}

Problems with this approach

Although the above suggestion works for a basic setup, it doesn't really solve the issue since you still need a set of concrete tantivy values before you can index data, by which point the amount of effort it takes to convert to a Document is very small.

A fully flexible solution

To get around the issue of making document indexing more flexible and more powerful, we can extend how much of the system is represented by traits, in particular, we replace Document, Value, FieldValue, serde_json::Map<String, serde_json::Value>, serde_json::Value with a set of traits as described bellow:

DocumentAccess trait

Using GATs we can avoid a complicated set of lifetimes and unsafe, instead, we simply described the type the document uses for its values which can borrow data (Value<'a>), the owned version of this type which can be the same type potentially, but can also be different depending on application (OwnedValue) and finally, we describe the FieldsValuesIter which is just used to get around not be able to use anonymous types in the form of impl Iterator.

As you can see with the code below, we've replaced the FieldValue type with a simple tuple instead, technically this could be kept and made generic, but I don't think it's that useful to do so.

Compatibility

The original Document type has been kept, and simply implements the trait, meaning a user already using tantivy should not experience any direct conflict if they just keep using the original type.

/// The core trait representing a document within the index.
pub trait DocumentAccess: Send + Sync + 'static {
    /// The value of the field.
    type Value<'a>: DocValue<'a> + Clone
    where Self: 'a;
    /// The owned version of a value type.
    ///
    /// It's possible that this is the same type as the borrowed
    /// variant simply by using things like a `Cow`, but it may
    /// be beneficial to implement them as separate types for
    /// some use cases.
    type OwnedValue: ValueDeserialize + Debug;
    /// The iterator over all of the fields and values within the doc.
    type FieldsValuesIter<'a>: Iterator<Item = (Field, Self::Value<'a>)>
    where Self: 'a;

    /// Returns the number of fields within the document.
    fn len(&self) -> usize;

    /// Returns true if the document contains no fields.
    fn is_empty(&self) -> bool {
        self.len() == 0
    }

    /// Get an iterator iterating over all fields and values in a document.
    fn iter_fields_and_values(&self) -> Self::FieldsValuesIter<'_>;

    /// Create a new document from a given stream of fields.
    fn from_fields(fields: Vec<(Field, Self::OwnedValue)>) -> Self;

    /// Sort and groups the field_values by field.
    ///
    /// The result of this method is not cached and is
    /// computed on the fly when this method is called.
    fn get_sorted_field_values(&self) -> Vec<(Field, Vec<Self::Value<'_>>)> {
         // Code emitted for ease of reading
    }
}

DocValue<'a> trait

This trait is fairly simple in what it does, it simply defines the common methods on the old Document type as methods of the trait, and then has a generic JsonVisitor type which can be used to represent JSON data in more flexible ways and without allocation.

The original Value type implements this trait for compatibility and ergonomics, technically speaking if you wanted to do an approach similar to the first solution you can absolutely do this just by using the original tantivy types.

pub trait DocValue<'a>: Send + Sync + Debug {
    /// The visitor used to walk through the key-value pairs
    /// of the provided JSON object.
    type JsonVisitor: JsonVisitor<'a>;

    /// Get the string contents of the value if applicable.
    ///
    /// If the value is not a string, `None` should be returned.
    fn as_str(&self) -> Option<&str>;
    /// Get the facet contents of the value if applicable.
    ///
    /// If the value is not a string, `None` should be returned.
    fn as_facet(&self) -> Option<&Facet>;
    /// Get the u64 contents of the value if applicable.
    ///
    /// If the value is not a u64, `None` should be returned.
    fn as_u64(&self) -> Option<u64>;
    /// Get the i64 contents of the value if applicable.
    ///
    /// If the value is not a i64, `None` should be returned.
    fn as_i64(&self) -> Option<i64>;
    /// Get the f64 contents of the value if applicable.
    ///
    /// If the value is not a f64, `None` should be returned.
    fn as_f64(&self) -> Option<f64>;
    /// Get the date contents of the value if applicable.
    ///
    /// If the value is not a date, `None` should be returned.
    fn as_date(&self) -> Option<DateTime>;
    /// Get the bool contents of the value if applicable.
    ///
    /// If the value is not a boolean, `None` should be returned.
    fn as_bool(&self) -> Option<bool>;
    /// Get the IP addr contents of the value if applicable.
    ///
    /// If the value is not an IP addr, `None` should be returned.
    fn as_ip_addr(&self) -> Option<Ipv6Addr>;
    /// Get the bytes contents of the value if applicable.
    ///
    /// If the value is not bytes, `None` should be returned.
    fn as_bytes(&self) -> Option<&[u8]>;
    /// Get the pre-tokenized string contents of the value if applicable.
    ///
    /// If the value is not pre-tokenized string, `None` should be returned.
    fn as_tokenized_text(&self) -> Option<&PreTokenizedString>;
    /// Get the JSON contents of the value if applicable.
    ///
    /// If the value is not pre-tokenized string, `None` should be returned.
    fn as_json(&self) -> Option<Self::JsonVisitor>;
}

JsonVisitor<'a> trait

The JSON visitor effectively just replaces the Map<String, Value> with a trait that allows for walking through the object.

This also means that any type which implements this trait can also be serialized via serde_json.

/// A trait representing a JSON key-value object.
///
/// This allows for access to the JSON object without having it as a
/// concrete type.
pub trait JsonVisitor<'a> {
    /// The visitor for each value within the object.
    type ValueVisitor: JsonValueVisitor<'a>;

    #[inline]
    /// The size hint of the iterator length.
    fn size_hint(&self) -> usize {
        0
    }

    /// Get the next key-value pair in the object or return `None`
    /// if the element is empty.
    fn next_key_value(&mut self) -> Option<(&'a str, Self::ValueVisitor)>;
}

JsonValueVisitor<'a> trait

Similar to the JSON visitor it describes the behaviour of a serde_json::Value rather than just an object.

/// A trait representing a JSON value.
///
/// This allows for access to the JSON value without having it as a
/// concrete type.
///
/// This is largely a subset of the `Value<'a>` trait.
pub trait JsonValueVisitor<'a> {
    /// The iterator for walking through the element within the array.
    type ArrayIter: Iterator<Item = Self>;
    /// The visitor for walking through the key-value pairs within
    /// the object.
    type ObjectVisitor: JsonVisitor<'a>;

    /// Checks if the value is `null` or not.
    fn is_null(&self) -> bool;
    /// Get the string contents of the value if applicable.
    ///
    /// If the value is not a string, `None` should be returned.
    fn as_str(&self) -> Option<&str>;
    /// Get the i64 contents of the value if applicable.
    ///
    /// If the value is not a i64, `None` should be returned.
    fn as_i64(&self) -> Option<i64>;
    /// Get the u64 contents of the value if applicable.
    ///
    /// If the value is not a u64, `None` should be returned.
    fn as_u64(&self) -> Option<u64>;
    /// Get the f64 contents of the value if applicable.
    ///
    /// If the value is not a f64, `None` should be returned.
    fn as_f64(&self) -> Option<f64>;
    /// Get the bool contents of the value if applicable.
    ///
    /// If the value is not a boolean, `None` should be returned.
    fn as_bool(&self) -> Option<bool>;
    /// Get the array contents of the value if applicable.
    ///
    /// If the value is not an array, `None` should be returned.
    fn as_array(&self) -> Option<Self::ArrayIter>;
    /// Get the object contents of the value if applicable.
    ///
    /// If the value is not an object, `None` should be returned.
    fn as_object(&self) -> Option<Self::ObjectVisitor>;
}

Deserialization via ValueDeserialize

Originally I wanted to use something like serde::DeserializedOwned for this job, but it became obvious that what it gained in ease of compatibility with serde, it lost in terms of complexity when handling custom deserializer. A single simple trait became better for this purpose.

One thing that could be improved further is passing a JSON deserializer for the JSON values, at the moment we pass in a Map<String, serde_json::Value> which is fairly limited at the moment and not the most ergonomic thing in the world.

Compatibility and generic methods

One of the things the trait approach requires is a set of generic where documents to be handled, which gives us two ways to handle it:

  • We can add generics everywhere on things like Index, etc... so it's one document type for the whole index, with a default type keeping compatibility with existing code. The problem is it causes everything to require generics (See: Add document trait ChillFish8/tantivy#2 and the thousands of lines changed, which isn't even finished!)
  • We add generics only to the methods which require them (i.e. when adding or fetching docs), making it potentially break some existing projects if the type cannot be inferred (although this is fairly unlikely), which is this PR.

src/core/json_utils.rs Outdated Show resolved Hide resolved
src/schema/document.rs Outdated Show resolved Hide resolved
@PSeitz
Copy link
Contributor

PSeitz commented Aug 24, 2023

While I don't have any strong opinions on this, I do think it's potentially even more confusing if we go "Ah yes you can serialize with a customs value but you can't deserialize back to a custom value, you have to use the concrete Document type" and also you do lose a little bit of efficiency having to re-construct your types from the Document type rather than walking through with the deserializer, which I think can become more apparent at higher QPS.

We don't need to serialize with a custom value, but just provide an API over the data.
The same API can be provided back via a Document (maybe also deserialize directly from the Document format is possible).
If the user serializes the value, it would also move things like format handling to the user.
I tend to think that this is not worth the added complexity. I don't have data on the cost of deserialization for high QPS cases.

@ChillFish8
Copy link
Collaborator Author

We don't need to serialize with a custom value, but just provide an API over the data.
The same API can be provided back via a Document (maybe also deserialize directly from the Document format is possible).
If the user serializes the value, it would also move things like format handling to the user.
I tend to think that this is not worth the added complexity. I don't have data on the cost of deserialization for high QPS cases.

I see your point, but I don't think removing it is adding a huge amount of additional complexity, it makes the code a bit more sparse and abstracted, but the way we serialize and deserialize data is exactly the same, I just re-wrote it from a somewhat dense block of code on the Document type to avoid make it work well with custom types when indexing.

The only thing I can see happening is if we provide custom documents for indexing but not retrieval, the question will inevitably come up as "Why?" which I can understand, also it can be quite convenient deserializing into a custom type but it's not necessarily the end of the world I suppose.

@PSeitz
Copy link
Contributor

PSeitz commented Aug 28, 2023

I see your point, but I don't think removing it is adding a huge amount of additional complexity, it makes the code a bit more sparse and abstracted, but the way we serialize and deserialize data is exactly the same, I just re-wrote it from a somewhat dense block of code on the Document type to avoid make it work well with custom types when indexing.

The only thing I can see happening is if we provide custom documents for indexing but not retrieval, the question will inevitably come up as "Why?" which I can understand, also it can be quite convenient deserializing into a custom type but it's not necessarily the end of the world I suppose.

I had another look and misunderstood it before. There's no custom serialization format, right?
In that case I think it's fine.

Overall the PR looks good, nice job!

@ChillFish8
Copy link
Collaborator Author

I had another look and misunderstood it before. There's no custom serialization format, right?
In that case I think it's fine.

Format is effectively the exact same as it was before, just with the additional codes to handle collections and objects, but it is not user defined.

#[inline]
/// If the Value is a pre-tokenized string, returns the associated string. Returns None
/// otherwise.
fn as_tokenized_text(&self) -> Option<&'a PreTokenizedString> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as_pretokenized_text...

@fulmicoton
Copy link
Collaborator

@PSeitz You can merge whenever you see fit.

# Conflicts:
#	Cargo.toml
#	examples/warmer.rs
#	src/aggregation/bucket/histogram/date_histogram.rs
#	src/core/index.rs
#	src/directory/mmap_directory.rs
#	src/functional_test.rs
#	src/indexer/index_writer.rs
#	src/indexer/segment_writer.rs
#	src/lib.rs
#	src/query/boolean_query/boolean_query.rs
#	src/query/boolean_query/mod.rs
#	src/query/disjunction_max_query.rs
#	src/query/fuzzy_query.rs
#	src/query/more_like_this/more_like_this.rs
#	src/query/range_query/range_query.rs
#	src/query/regex_query.rs
#	src/query/term_query/term_query.rs
#	tests/failpoints/mod.rs
@PSeitz
Copy link
Contributor

PSeitz commented Oct 1, 2023

test_json_indexing fails because "complexobject": {"field.with.dot": 1} is indexed as u64 and then queried with i64.
The current order is i64,u64,f64. I'm not sure what's the best way to handle this when the type is user provided. Maybe it should not be user provided.

Comment on lines 337 to 340
if let Some(val) = number.as_u64() {
Self::U64(val)
} else if let Some(val) = number.as_i64() {
Self::I64(val)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if let Some(val) = number.as_u64() {
Self::U64(val)
} else if let Some(val) = number.as_i64() {
Self::I64(val)
if let Some(val) = number.as_i64() {
Self::I64(val)
} else if let Some(val) = number.as_u64() {
Self::U64(val)

}
serde_json::Value::String(val) => Self::Str(val),
Copy link
Contributor

@PSeitz PSeitz Oct 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

                if can_be_rfc3339_date_time(&text) {
                    match OffsetDateTime::parse(&text, &Rfc3339) {
                        Ok(dt) => {
                            let dt_utc = dt.to_offset(time::UtcOffset::UTC);
                            Self::Date(DateTime::from_utc(dt_utc))
                        }
                        Err(_) => Self::Str(text),
                    }
                } else {
                    Self::Str(text)
                }
fn can_be_rfc3339_date_time(text: &str) -> bool {
    if let Some(&first_byte) = text.as_bytes().get(0) {
        if first_byte >= b'0' && first_byte <= b'9' {
            return true;
        }
    }

    false
}

r#"{
"attributes": {
"aa": "japan"
"as": "japan"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"as": "japan"
"aa": "japan"

@PSeitz PSeitz merged commit 1c7c6fd into quickwit-oss:main Oct 2, 2023
4 checks passed
@PSeitz
Copy link
Contributor

PSeitz commented Oct 2, 2023

I had an extra pass and fixed the remaining issues. Thanks for the really nice PR!

@softhuafei
Copy link

Thank you for your work. I have a question. The current document serialization interface BinaryDocumentSerializer::serialize_doconly serializes stored fields. Is there any other interface that allows us to completely serialize and deserialize the document?

@PSeitz
Copy link
Contributor

PSeitz commented May 8, 2024

No, but I think you could add a stored bytes field and put your custom serialized doc there

@softhuafei
Copy link

No, but I think you could add a stored bytes field and put your custom serialized doc there

Thank you for your answer. Is there any example of custom document and related serialization and deserialization?

@PSeitz
Copy link
Contributor

PSeitz commented May 8, 2024

There is no custom document serialization and deserialization, having your real doc nested in a bytes field would be a workaround.

@softhuafei
Copy link

Sorry, my previous statement was not clear enough. I want to be able to fully serialize/deserialize TantivyDocument, just like the old version of document:

impl BinarySerializable for Document {
    fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
        let field_values = self.field_values();
        VInt(field_values.len() as u64).serialize(writer)?;
        for field_value in field_values {
            field_value.serialize(writer)?;
        }
        Ok(())
    }

    fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
        let num_field_values = VInt::deserialize(reader)?.val() as usize;
        let field_values = (0..num_field_values)
            .map(|_| FieldValue::deserialize(reader))
            .collect::<io::Result<Vec<FieldValue>>>()?;
        Ok(Document::from(field_values))
    }
}

Should I modify BinaryDocumentSerializer and add a new method to it, such as serialize_entire_doc, which behaves like serialize_doc, except that it will serialize all fields?

@PSeitz
Copy link
Contributor

PSeitz commented May 8, 2024

TantivyDocument has serde support. Would that work for you?

#[derive(Clone, Debug, serde::Serialize, serde::Deserialize, Default)]
pub struct TantivyDocument {
    field_values: Vec<FieldValue>,
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants