-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Taxonomy index should use DocValues not StoredFields [LUCENE-9450] #10490
Comments
Michael McCandless (@mikemccand) (migrated from JIRA) +1, thanks @gautamworah96! It is really silly that the taxonomy index uses stored fields today and must do a number of stored field lookups for each query to resolve taxonomy ordinals back to human presentable facet labels. At search time, after pulling the Did you see an exception in tests when you tried your patch? The default Also, at indexing time, it looks like you are no longer indexing the Maybe post some of the failing tests if those two above fixes still don't work? Thanks for tackling this! |
Gautam Worah (@gautamworah96) (migrated from JIRA) Thanks for the feedback @mikemccand ! Unfortunately, there were no errors at the line where I called However, the docs for the
|
Gautam Worah (@gautamworah96) (migrated from JIRA) Thanks for reviewing the PR. Based on feedback received from the previous revision I've posted a new PR (revision 2). (Copied from a comment on GitHub) The new PR has the following changes:
TEST: I've not added any new tests because this PR changes a low level implementation detail and the current tests already cover this (we have a simple
|
Gautam Worah (@gautamworah96) (migrated from JIRA) I've added a new PR revision with the style modifications requested in the previous review. I am now trying to get the benchmark scripts running |
Gautam Worah (@gautamworah96) (migrated from JIRA) Benchmarks output:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff
Respell 389.13 (11.9%) 341.06 (17.2%) -12.4% ( -37% - 18%)
HighSloppyPhrase 354.72 (5.8%) 344.69 (4.6%) -2.8% ( -12% - 8%)
MedSloppyPhrase 754.17 (3.7%) 734.47 (9.1%) -2.6% ( -14% - 10%)
LowPhrase 422.80 (4.3%) 413.10 (7.4%) -2.3% ( -13% - 9%)
OrHighMed 666.99 (4.2%) 654.89 (6.0%) -1.8% ( -11% - 8%)
Prefix3 501.95 (7.1%) 494.34 (6.9%) -1.5% ( -14% - 13%)
MedSpanNear 924.07 (4.4%) 914.55 (5.6%) -1.0% ( -10% - 9%)
HighIntervalsOrdered 913.74 (2.0%) 905.72 (2.2%) -0.9% ( -4% - 3%)
MedTerm 3841.01 (3.0%) 3811.27 (3.2%) -0.8% ( -6% - 5%)
MedPhrase 912.31 (2.3%) 906.87 (2.7%) -0.6% ( -5% - 4%)
LowSpanNear 769.51 (13.1%) 766.22 (12.4%) -0.4% ( -22% - 28%)
BrowseDayOfYearSSDVFacets 2028.19 (2.8%) 2022.17 (2.2%) -0.3% ( -5% - 4%)
IntNRQ 1446.93 (2.4%) 1442.66 (1.6%) -0.3% ( -4% - 3%)
HighTermMonthSort 2631.06 (2.7%) 2628.64 (4.9%) -0.1% ( -7% - 7%)
AndHighLow 2713.80 (3.0%) 2711.42 (3.4%) -0.1% ( -6% - 6%)
HighTerm 2785.67 (2.7%) 2783.94 (3.6%) -0.1% ( -6% - 6%)
HighSpanNear 595.30 (11.4%) 595.48 (10.0%) 0.0% ( -19% - 24%)
BrowseMonthSSDVFacets 2367.18 (2.2%) 2369.20 (2.1%) 0.1% ( -4% - 4%)
AndHighHigh 885.75 (3.4%) 887.16 (3.9%) 0.2% ( -6% - 7%)
Wildcard 730.34 (11.7%) 732.30 (12.2%) 0.3% ( -21% - 27%)
HighPhrase 655.83 (3.5%) 658.13 (2.4%) 0.4% ( -5% - 6%)
HighTermDayOfYearSort 1724.90 (6.1%) 1731.23 (5.8%) 0.4% ( -10% - 13%)
LowTerm 4271.57 (2.7%) 4290.84 (3.9%) 0.5% ( -5% - 7%)
PKLookup 243.00 (2.9%) 244.62 (1.3%) 0.7% ( -3% - 5%)
LowSloppyPhrase 1702.96 (2.8%) 1718.94 (3.4%) 0.9% ( -5% - 7%)
Fuzzy1 398.68 (7.4%) 403.04 (4.8%) 1.1% ( -10% - 14%)
OrHighHigh 411.49 (9.1%) 417.82 (5.8%) 1.5% ( -12% - 18%)
OrHighLow 707.11 (4.2%) 718.78 (6.0%) 1.7% ( -8% - 12%)
AndHighMed 1110.43 (4.5%) 1134.33 (3.4%) 2.2% ( -5% - 10%)
Fuzzy2 46.38 (22.4%) 48.94 (19.1%) 5.5% ( -29% - 60%)
BrowseDateTaxoFacets 2996.90 (4.3%) 3212.31 (5.2%) 7.2% ( -2% - 17%)
BrowseDayOfYearTaxoFacets 2594.45 (2.8%) 2785.41 (3.7%) 7.4% ( 0% - 14%)
BrowseMonthTaxoFacets 2920.16 (3.7%) 3139.41 (3.9%) 7.5% ( 0% - 15%) My localrun.py script
|
Michael McCandless (@mikemccand) (migrated from JIRA) Whoa, that is good news! It looks like the change is net/net a speedup pure faceting tasks! Those However, those QPS numbers are silly high and not really trustworthy. Which Also, since you had to benchmark two indices (since this change impacts the index), can you add |
Gautam Worah (@gautamworah96) (migrated from JIRA) Benchmark command with the same localrun.py script: python src/python/localrun.py -source wikimediumall -forceMerge=True Instance details: EC2 c4.8xlarge CPU (/proc/cpuinfo):
Memory:
Java:
=== Results:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff
OrNotHighHigh 532.96 (6.0%) 518.91 (8.1%) -2.6% ( -15% - 12%)
IntNRQ 63.90 (2.8%) 62.51 (3.7%) -2.2% ( -8% - 4%)
LowTerm 1354.18 (4.9%) 1325.24 (7.1%) -2.1% ( -13% - 10%)
PKLookup 130.95 (2.3%) 128.48 (3.1%) -1.9% ( -7% - 3%)
Fuzzy1 49.16 (6.3%) 48.28 (9.8%) -1.8% ( -16% - 15%)
OrHighNotMed 518.20 (4.8%) 509.19 (5.4%) -1.7% ( -11% - 8%)
Wildcard 53.04 (2.1%) 52.14 (3.6%) -1.7% ( -7% - 4%)
OrHighLow 241.89 (3.8%) 237.82 (4.7%) -1.7% ( -9% - 7%)
MedTerm 1002.39 (5.2%) 985.94 (5.8%) -1.6% ( -12% - 9%)
AndHighLow 453.14 (3.3%) 445.71 (5.9%) -1.6% ( -10% - 7%)
HighTermMonthSort 22.03 (10.5%) 21.70 (10.3%) -1.5% ( -20% - 21%)
MedPhrase 228.73 (3.2%) 225.69 (4.6%) -1.3% ( -8% - 6%)
OrHighNotLow 461.58 (4.8%) 456.17 (6.9%) -1.2% ( -12% - 11%)
HighTermTitleBDVSort 64.10 (10.5%) 63.35 (9.7%) -1.2% ( -19% - 21%)
LowPhrase 134.07 (2.1%) 132.72 (2.8%) -1.0% ( -5% - 3%)
OrHighMed 56.83 (2.9%) 56.44 (3.7%) -0.7% ( -7% - 6%)
Prefix3 31.60 (2.1%) 31.42 (2.4%) -0.6% ( -4% - 4%)
AndHighMed 61.02 (2.7%) 60.72 (3.2%) -0.5% ( -6% - 5%)
HighPhrase 46.05 (3.2%) 45.84 (4.7%) -0.5% ( -8% - 7%)
LowSpanNear 40.71 (2.4%) 40.54 (2.2%) -0.4% ( -4% - 4%)
Fuzzy2 37.09 (7.5%) 36.94 (6.0%) -0.4% ( -12% - 14%)
Respell 28.71 (1.5%) 28.60 (2.0%) -0.4% ( -3% - 3%)
OrNotHighMed 375.44 (2.9%) 374.10 (5.0%) -0.4% ( -8% - 7%)
AndHighHigh 32.52 (3.2%) 32.46 (3.3%) -0.2% ( -6% - 6%)
HighSpanNear 9.05 (3.3%) 9.03 (3.6%) -0.2% ( -6% - 6%)
MedSloppyPhrase 61.05 (1.8%) 61.03 (2.5%) -0.0% ( -4% - 4%)
BrowseDayOfYearSSDVFacets 2.76 (0.8%) 2.75 (1.0%) -0.0% ( -1% - 1%)
HighTerm 917.59 (4.6%) 917.50 (6.4%) -0.0% ( -10% - 11%)
HighIntervalsOrdered 4.32 (2.5%) 4.32 (2.8%) 0.0% ( -5% - 5%)
BrowseMonthSSDVFacets 2.94 (0.6%) 2.95 (0.6%) 0.1% ( -1% - 1%)
OrHighHigh 6.22 (3.0%) 6.24 (3.3%) 0.2% ( -5% - 6%)
OrHighNotHigh 425.45 (5.3%) 426.34 (6.8%) 0.2% ( -11% - 13%)
LowSloppyPhrase 7.30 (2.6%) 7.32 (2.5%) 0.4% ( -4% - 5%)
HighSloppyPhrase 7.52 (2.1%) 7.55 (2.8%) 0.4% ( -4% - 5%)
MedSpanNear 11.22 (3.1%) 11.26 (3.0%) 0.4% ( -5% - 6%)
OrNotHighLow 367.16 (3.6%) 368.76 (4.8%) 0.4% ( -7% - 9%)
HighTermDayOfYearSort 35.05 (8.3%) 35.47 (9.1%) 1.2% ( -14% - 20%)
BrowseMonthTaxoFacets 1.39 (2.6%) 1.44 (7.3%) 4.0% ( -5% - 14%)
BrowseDateTaxoFacets 1.21 (2.4%) 1.26 (6.7%) 4.2% ( -4% - 13%)
BrowseDayOfYearTaxoFacets 1.20 (2.8%) 1.26 (6.9%) 4.7% ( -4% - 14%)
|
Michael McCandless (@mikemccand) (migrated from JIRA) Thanks @gautamworah96 – that is good news! ~4-5% speedup in faceting for match all docs query! That is very exciting. I'll review the PR – I think this is close. |
ASF subversion and git services (migrated from JIRA) Commit 3f8f84f in lucene-solr's branch refs/heads/master from Gautam Worah LUCENE-9450 Switch to BinaryDocValues instead of stored fields in Lucene's facet implementation, yielding ~4-5% red-line QPS gain in pure faceting benchmarks (#1733) |
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA) Hello @gautamworah96! I wanted to ask opinion what to do with the test org.apache.lucene.facet.taxonomy.directory.TestBackwardsCompatibility::testCreateNewTaxonomy. Because of #10374 it is no longer possible to modify fields data structures even for older indices. Thus we can't add a new |
Gautam Worah (@gautamworah96) (migrated from JIRA) Hi @mayya-sharipova , I had looked at that test briefly when it was disabled. So, for now, I could modify the test so that it checks that the previously indexed StoredFields based field, can be read (and not modified). But the larger concern, is that with this #10374 change, a user won't be able to add docs with the new Lucene 9.0 version without reindexing (if they have taxonomy fields that were previously indexed with an older Lucene version).
|
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA) @gautamworah96 That's indeed a concern. The workaround would be to add a binary doc values field in version 8.x, force merge to a single segment, so that a FieldInfo for $full_path$ contains doc values as well, and then upgrade to 9.0. We don't do data structures consistency checks for older indices on individual docs , just on a segment level. Do you think it is a viable workaround?
|
Gautam Worah (@gautamworah96) (migrated from JIRA) That makes sense to me. Naive question: If the user has 4 old StoredFields based segments and 1 new BinaryDocValues based segment (like the way you just described), does force merging them to 1 produce a single BinaryDocValues based segment? Also, I think, most Lucene users would have a system for reindexing documents as compared to force merging segments to one, so I guess the recommended approach for them would be to just reindex. -I see that in the older Edit: Found the MIGRATE.md file. If yes, we should definitely mention the change in field type for the taxonomy index and some migration instructions. I can add those instructions
|
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA) I am not familiar how taxonomy works, so you adding migration instructions would be very helpful. Answering your question: > If the user has 4 old StoredFields based segments and 1 new BinaryDocValues based segment (like the way you just described), does force merging them to 1 produce a single BinaryDocValues based segment? We have This check in v 9.0 is for each for each individual field, so if you transition 4 fields from stored fields to binary doc values, a separate transition needs to be done for each field.
> reindexing documents as compared to force merging segments to one, so I guess the recommended approach for them would be to just reindex.
Reindexing should work as well, if it not an expensive operation for users.
|
Michael McCandless (@mikemccand) (migrated from JIRA) Catching up here... So yeah as things stand, we force users of taxonomy faceting to fully reindex on upgrade to 9.0, which is not a great option, but I really love that Lucene is getting stricter enforcement of consistent field types with #10374. @gautamworah96 could we instead index the new |
Adrien Grand (@jpountz) (migrated from JIRA) For the record, another approach could work that would consist of checking the created version of the index, and index a stored field on indices <9.0 and a BinaryDocValuesField on indices >= 9.0. This is similar to what we did when we changed the way that length normalization factors are encoded in 7.0. |
Michael McCandless (@mikemccand) (migrated from JIRA) Ahh that is a compelling option too @jpountz. So users would not see the benefits of the BDV switch until they fully re-indexed (creating a new index). That is maybe a good option too. @gautamworah96 what do you think? |
Gautam Worah (@gautamworah96) (migrated from JIRA) @mikemccand I tried the approach of indexing the new BinaryDocValues field with a different name. Currently, we index the StringField with the internal The side-effect of this is that users might have some categories in the |
Adrien Grand (@jpountz) (migrated from JIRA) Thinking a bit more about this, the approach of looking at the index created version means users won't benefit from this change on their existing indices and requires a bit more work in the short term, but it's also a bit cleaner and will help us get done with this change earlier? Scenario 1: use a different field name for binary doc values
Scenario 2: use index created version
|
Gautam Worah (@gautamworah96) (migrated from JIRA) @jpountz I think your points make sense. Version based checks are definitely cleaner and can act as self documentation on what changed. I had not thought about the migration process when I was thinking about the index using a different name approach. In the meantime, I had already submitted a PR with approach 1 yesterday. I'll revise it with version based checks |
Michael McCandless (@mikemccand) (migrated from JIRA) OK, +1 for Scenario 2 (use index created version)! |
Gautam Worah (@gautamworah96) (migrated from JIRA) Argh. Took some time to figure out the change that added the getIndexCreatedVersionMajor() API. However, a call like Something like |
Gautam Worah (@gautamworah96) (migrated from JIRA) Posted a new PR revision that implements the |
Adrien Grand (@jpountz) (migrated from JIRA) Closing after the 9.0.0 release |
The taxonomy index that maps binning labels to ordinals was created before Lucene added BinaryDocValues.
I've attached a WIP patch (does not pass tests currently)
Issue suggested by @mikemccand
Migrated from LUCENE-9450 by Gautam Worah (@gautamworah96), resolved Jan 16 2021
Attachments: LUCENE-9450-localrun.py-v1, wip_taxonomy_patch
Pull requests: apache/lucene-solr#1733
The text was updated successfully, but these errors were encountered: