Sort Index/Docids By Field #1026
Conversation
- add sort info to IndexSettings
- generate docid mapping for sorted field (only fastfield)
- remap singlevalue fastfield
- move docid mapping to serialization step (less intermediate data for mapping)
- add support for docid mapping in multivalue fastfield
- add docid mapping for both directions: old->new (used in postings) and new->old (used in fast fields)
- handle mapping in postings recorder
- warn instead of info for MAX_TOKEN_LEN
- handle index sort in docstore by saving all the docs in a temp docstore file (SegmentComponent::TempStore); on serialization, the docid mapping is used to create a docstore in the correct order by reading the old docstore
- add docstore sort tests
- refactor tests
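The two docid mappings mentioned above can be sketched in a few lines. This is a minimal, self-contained illustration, not the PR's actual types: the function name and the plain `Vec<u32>` representation are assumptions. Sorting docids by their fast field value yields the new->old permutation, and inverting it yields old->new:

```rust
// Build both docid mappings from per-doc sort values (illustrative sketch).
fn doc_id_mappings(sort_values: &[u64]) -> (Vec<u32>, Vec<u32>) {
    // new->old: position `new` holds the old docid that lands at `new`.
    let mut new_to_old: Vec<u32> = (0..sort_values.len() as u32).collect();
    new_to_old.sort_by_key(|&old| sort_values[old as usize]);
    // old->new: the inverse permutation.
    let mut old_to_new = vec![0u32; new_to_old.len()];
    for (new, &old) in new_to_old.iter().enumerate() {
        old_to_new[old as usize] = new as u32;
    }
    (new_to_old, old_to_new)
}

fn main() {
    // Docs 0..=2 with fast field values 30, 10, 20.
    let (new_to_old, old_to_new) = doc_id_mappings(&[30, 10, 20]);
    println!("new->old: {:?}, old->new: {:?}", new_to_old, old_to_new);
}
```

The postings writer needs old->new (it walks docs in old order and emits new ids), while a fast field reader needs new->old (given a new docid, find the stored value).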
src/core/index.rs
Outdated
///
/// let mut schema_builder = Schema::builder();
/// let id_field = schema_builder.add_text_field("id", STRING);
/// let title_field = schema_builder.add_text_field("title", TEXT);
/// let body_field = schema_builder.add_text_field("body", TEXT);
/// let schema = schema_builder.build();
/// let settings = IndexSettings::default();
/// let settings = IndexSettings{sort_by_field: IndexSortByField{field:"title".to_string(), order:Order::Asc}};
this example is not possible currently and should be changed (or implemented)
src/postings/recorder.rs
Outdated
@@ -278,7 +279,8 @@ impl Recorder for TfAndPositionRecorder {
}
if let Some(doc_id_map) = doc_id_map {
    // this simple variant to remap may consume too much memory
    doc_id_and_positions.push((doc_id_map.get_new_doc_id(doc), buffer_positions.to_vec()));
Can you remind me why lending is not an option for doc_id_and_positions?
lending of buffer_positions? this is just a temp vec which gets cleared on every loop
- fix type
- rename test file
- add type
- Fix posting list merge issue - ensure serializer always gets monotonically increasing doc ids
- handle sorting and merging for facets field
- fix deserialization
- update changelog
- forward error on merge failed
- cache store readers, to utilize lru cache (4x faster performance, due to less decompress calls on the block)
- unset flag on deserialization and after finalize of a segment
- set flag when creating new instances
// not just stacked. The field serializer expects monotonically increasing
// docids, so we collect and sort them first, before writing.
//
// I think this is not strictly necessary, it would be possible to
That sounds very expensive. Did you test it on a dataset?
merging was 2x slower with sorting on a dataset I tested, but I have a change in the pipeline, which will increase merge speed in both cases :).
Currently the overhead of the sorting is the same as in recorder.rs
0.9% of the merge time is spent in the vector push of the posting, on the webserver dataset (tokenized url field with positions)
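The collect-and-sort step discussed above can be sketched as follows; this is a hedged, self-contained illustration (the function name, the `(docid, tf)` pair layout, and the `old_to_new` slice are assumptions, not the PR's code). After remapping, docids are no longer monotonic, so pairs are buffered and sorted before being handed to a serializer that requires increasing docids:

```rust
// Remap docids through the sort permutation, then restore monotonic order
// so a doc-id-ordered serializer can consume the pairs (illustrative sketch).
fn remap_and_sort(docs_and_tfs: &[(u32, u32)], old_to_new: &[u32]) -> Vec<(u32, u32)> {
    let mut remapped: Vec<(u32, u32)> = docs_and_tfs
        .iter()
        .map(|&(doc, tf)| (old_to_new[doc as usize], tf))
        .collect();
    // Restore the monotonically increasing docid order the serializer expects.
    remapped.sort_by_key(|&(doc, _)| doc);
    remapped
}

fn main() {
    let old_to_new = [2, 0, 1]; // docid permutation produced by the index sort
    let sorted = remap_and_sort(&[(0, 3), (1, 1), (2, 5)], &old_to_new);
    println!("{:?}", sorted);
}
```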
let term_freq = u32_it.next().unwrap_or(self.current_tf);
serializer.write_doc(doc as u32, term_freq, &[][..]);
if let Some(doc_id_map) = doc_id_map {
    let mut doc_id_and_tf = vec![];
why was lending not possible here?
the bufferlender returns u32, but we need (u32, u32). I could extend buffer lender though.
yes. Or push them two by two.
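The "push them two by two" suggestion above can be sketched like this; names are illustrative, not the PR's actual code. Instead of a `Vec<(u32, u32)>`, a lent `Vec<u32>` holds docid/tf pairs interleaved and is read back in chunks of two:

```rust
// Interleave (docid, tf) pairs into a flat u32 buffer and read them back
// pairwise -- a sketch of reusing a u32 buffer lender for pair data.
fn interleave_and_read(pairs: &[(u32, u32)]) -> Vec<(u32, u32)> {
    // Stand-in for a buffer obtained from the buffer lender.
    let mut buffer: Vec<u32> = Vec::new();
    for &(doc, tf) in pairs {
        buffer.push(doc);
        buffer.push(tf);
    }
    // chunks_exact(2) guarantees we only see complete pairs.
    buffer.chunks_exact(2).map(|c| (c[0], c[1])).collect()
}

fn main() {
    println!("{:?}", interleave_and_read(&[(4, 2), (7, 1)]));
}
```

This avoids allocating a fresh pair vector per term at the cost of a slightly less type-safe buffer layout.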
) {
    let (buffer_u8, buffer_positions) = buffer_lender.lend_all();
    self.stack.read_to_end(heap, buffer_u8);
    let mut u32_it = VInt32Reader::new(&buffer_u8[..]);
    let mut doc_id_and_positions = vec![];
why was lending not possible here?
this is of type (u32, Vec&lt;u32&gt;); we can't lend the Vec, since it needs to be owned.
I should also note that I couldn't find any recorder serialize methods in the perf profile.
closes #1014