Sort Index/Docids By Field #1026
Conversation
- add sort info to IndexSettings
- generate docid mapping for sorted field (only fastfield)
- remap singlevalue fastfield
- move docid mapping to serialization step (less intermediate data for mapping)
- add support for docid mapping in multivalue fastfield
- add docid mapping for both directions: old->new (used in postings) and new->old (used in fast fields)
- handle mapping in postings recorder
- warn instead of info for MAX_TOKEN_LEN
- handle index sort in docstore by saving all the docs in a temp docstore file (SegmentComponent::TempStore); on serialization, the docid mapping is used to create a docstore in the correct order by reading the old docstore
- add docstore sort tests
- refactor tests
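The two docid mappings mentioned above can be sketched in a few lines. This is a minimal, self-contained illustration, not the PR's actual types: the function name and the plain `Vec<u32>` representation are assumptions. Sorting docids by their fast field value yields the new->old permutation, and inverting it yields old->new:

```rust
// Build both docid mappings from per-doc sort values (illustrative sketch).
fn doc_id_mappings(sort_values: &[u64]) -> (Vec<u32>, Vec<u32>) {
    // new->old: position `new` holds the old docid that lands at `new`.
    let mut new_to_old: Vec<u32> = (0..sort_values.len() as u32).collect();
    new_to_old.sort_by_key(|&old| sort_values[old as usize]);
    // old->new: the inverse permutation.
    let mut old_to_new = vec![0u32; new_to_old.len()];
    for (new, &old) in new_to_old.iter().enumerate() {
        old_to_new[old as usize] = new as u32;
    }
    (new_to_old, old_to_new)
}

fn main() {
    // Docs 0..=2 with fast field values 30, 10, 20.
    let (new_to_old, old_to_new) = doc_id_mappings(&[30, 10, 20]);
    println!("new->old: {:?}, old->new: {:?}", new_to_old, old_to_new);
}
```

The postings writer needs old->new (it walks docs in old order and emits new ids), while a fast field reader needs new->old (given a new docid, find the stored value).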
src/core/index.rs
Outdated
///
/// let mut schema_builder = Schema::builder();
/// let id_field = schema_builder.add_text_field("id", STRING);
/// let title_field = schema_builder.add_text_field("title", TEXT);
/// let body_field = schema_builder.add_text_field("body", TEXT);
/// let schema = schema_builder.build();
/// let settings = IndexSettings::default();
/// let settings = IndexSettings{sort_by_field: IndexSortByField{field:"title".to_string(), order:Order::Asc}};
this example is not possible currently and should be changed (or implemented)
src/postings/recorder.rs
Outdated
@@ -278,7 +279,8 @@ impl Recorder for TfAndPositionRecorder {
}
if let Some(doc_id_map) = doc_id_map {
    // this simple variant to remap may consume too much memory
    doc_id_and_positions.push((doc_id_map.get_new_doc_id(doc), buffer_positions.to_vec()));
Can you remind me why lending is not an option for doc_id_and_positions?
lending of buffer_positions? this is just a temp vec which gets cleared on every loop
- fix type
- rename test file
- add type
- Fix posting list merge issue - ensure serializer always gets monotonically increasing doc ids
- handle sorting and merging for facets field
- fix deserialization
- update changelog
- forward error on merge failed
- cache store readers, to utilize lru cache (4x faster performance, due to less decompress calls on the block)
- unset flag on deserialization and after finalize of a segment
- set flag when creating new instances
// not just stacked. The field serializer expects monotonically increasing
// docids, so we collect and sort them first, before writing.
//
// I think this is not strictly necessary, it would be possible to
That sounds very expensive. Did you test it on a dataset?
merging was 2x slower with sorting on a dataset I tested, but I have a change in the pipeline, which will increase merge speed in both cases :).
Currently the overhead of the sorting is the same as in recorder.rs
0.9% of the merge time is spent in the vector push of the posting, on the webserver dataset (tokenized url field with positions)
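The collect-and-sort step discussed above can be sketched as follows; this is a hedged, self-contained illustration (the function name, the `(docid, tf)` pair layout, and the `old_to_new` slice are assumptions, not the PR's code). After remapping, docids are no longer monotonic, so pairs are buffered and sorted before being handed to a serializer that requires increasing docids:

```rust
// Remap docids through the sort permutation, then restore monotonic order
// so a doc-id-ordered serializer can consume the pairs (illustrative sketch).
fn remap_and_sort(docs_and_tfs: &[(u32, u32)], old_to_new: &[u32]) -> Vec<(u32, u32)> {
    let mut remapped: Vec<(u32, u32)> = docs_and_tfs
        .iter()
        .map(|&(doc, tf)| (old_to_new[doc as usize], tf))
        .collect();
    // Restore the monotonically increasing docid order the serializer expects.
    remapped.sort_by_key(|&(doc, _)| doc);
    remapped
}

fn main() {
    let old_to_new = [2, 0, 1]; // docid permutation produced by the index sort
    let sorted = remap_and_sort(&[(0, 3), (1, 1), (2, 5)], &old_to_new);
    println!("{:?}", sorted);
}
```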
let term_freq = u32_it.next().unwrap_or(self.current_tf);
serializer.write_doc(doc as u32, term_freq, &[][..]);
if let Some(doc_id_map) = doc_id_map {
    let mut doc_id_and_tf = vec![];
why was lending not possible here?
the bufferlender returns u32, but we need (u32, u32). I could extend buffer lender though.
yes. Or push them two by two.
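The "push them two by two" suggestion above can be sketched like this; names are illustrative, not the PR's actual code. Instead of a `Vec<(u32, u32)>`, a lent `Vec<u32>` holds docid/tf pairs interleaved and is read back in chunks of two:

```rust
// Interleave (docid, tf) pairs into a flat u32 buffer and read them back
// pairwise -- a sketch of reusing a u32 buffer lender for pair data.
fn interleave_and_read(pairs: &[(u32, u32)]) -> Vec<(u32, u32)> {
    // Stand-in for a buffer obtained from the buffer lender.
    let mut buffer: Vec<u32> = Vec::new();
    for &(doc, tf) in pairs {
        buffer.push(doc);
        buffer.push(tf);
    }
    // chunks_exact(2) guarantees we only see complete pairs.
    buffer.chunks_exact(2).map(|c| (c[0], c[1])).collect()
}

fn main() {
    println!("{:?}", interleave_and_read(&[(4, 2), (7, 1)]));
}
```

This avoids allocating a fresh pair vector per term at the cost of a slightly less type-safe buffer layout.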
) {
    let (buffer_u8, buffer_positions) = buffer_lender.lend_all();
    self.stack.read_to_end(heap, buffer_u8);
    let mut u32_it = VInt32Reader::new(&buffer_u8[..]);
    let mut doc_id_and_positions = vec![];
why was lending not possible here?
this is of type (u32, Vec&lt;u32&gt;); we can't lend the Vec, since it needs to be owned.
I should also note that I couldn't find any recorder serialize methods in the perf profile.
closes #1014