Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Taxonomy index should use DocValues not StoredFields [LUCENE-9450] #10490

Closed
asfimport opened this issue Aug 9, 2020 · 24 comments
Closed

Taxonomy index should use DocValues not StoredFields [LUCENE-9450] #10490

asfimport opened this issue Aug 9, 2020 · 24 comments

Comments

@asfimport
Copy link

The taxonomy index that maps binning labels to ordinals was created before Lucene added BinaryDocValues.

I've attached a WIP patch (does not pass tests currently)

Issue suggested by @mikemccand


Migrated from LUCENE-9450 by Gautam Worah (@gautamworah96), resolved Jan 16 2021
Attachments: LUCENE-9450-localrun.py-v1, wip_taxonomy_patch
Pull requests: apache/lucene-solr#1733

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

+1, thanks @gautamworah96!  It is really silly that the taxonomy index uses stored fields today and must do a number of stored field lookups for each query to resolve taxonomy ordinals back to human presentable facet labels.

At search time, after pulling the BinaryDocValues, you need to .advanceExact to that docid, confirm (maybe, assert?) that method returns true, then pull the .binaryValue().

Did you see an exception in tests when you tried your patch?  The default Codec should throw an exception if you try to pull a .binaryValue() without first calling .advancExact() I hope.

Also, at indexing time, it looks like you are no longer indexing the StringField, but I think you must keep indexing it, but change the Field.Store.YES to Field.Store.NO.  This field is also stored in the inverted index and is what allows us to do the label -> ordinal lookup, I think.

Maybe post some of the failing tests if those two above fixes still don't work?  Thanks for tackling this!

@asfimport
Copy link
Author

Gautam Worah (@gautamworah96) (migrated from JIRA)

Thanks for the feedback @mikemccand !

Unfortunately, there were no errors at the line where I called binaryValue() (without calling .advanceExact())

However, the docs for the .binaryValue() function do say that


<font color="#808080">It is illegal to call this method after {</font><font color="#808080">@link </font><font color="#808080">#advanceExact(int)}</font><font color="#808080">\* returned {</font><font color="#808080">@code </font><font color="#808080">false}.</font>

<font color="#808080">```</font>

I've submitted a Revision 1 [PR](https://github.com/apache/lucene-solr/pull/1733) with the following changes:
- Added a call to ``advanceExact()`` before calling ``.binaryValue()`` and an `assert` to check that the field exists in the index

- Re-added the ``StringField`` with the ``Field.Store.YES`` changed to ``Field.Store.NO``.

- I've not added new tests at the moment. Trying to get the existing ones to work first.

 

 

 

 

@asfimport
Copy link
Author

Gautam Worah (@gautamworah96) (migrated from JIRA)

Thanks for reviewing the PR.

Based on feedback received from the previous revision I've posted a new PR (revision 2).

(Copied from a comment on GitHub)

The new PR has the following changes:

  1. Use ordinal instead of catIDInteger (IntelliJ says that boxing is anyways not needed, perhaps we can remove?)
  2. Use the correct values instance that has advanced to the correct docId
  3. Use ReaderUtil to get to the leaf and then use LeafReader instead of using the higher level MultiDocValues call

TEST:
ant test in the lucene-solr/lucene/facet directory (passes successfully)
ant precommit

I've not added any new tests because this PR changes a low level implementation detail and the current tests already cover this (we have a simple assert that checks that .advanceExact returns true)

 

 

 

 

 

@asfimport
Copy link
Author

Gautam Worah (@gautamworah96) (migrated from JIRA)

I've added a new PR revision with the style modifications requested in the previous review.

I am now trying to get the benchmark scripts running

@asfimport
Copy link
Author

Gautam Worah (@gautamworah96) (migrated from JIRA)

Benchmarks output:

 

                    TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff
                 Respell      389.13     (11.9%)      341.06     (17.2%)  -12.4% ( -37% -   18%)
        HighSloppyPhrase      354.72      (5.8%)      344.69      (4.6%)   -2.8% ( -12% -    8%)
         MedSloppyPhrase      754.17      (3.7%)      734.47      (9.1%)   -2.6% ( -14% -   10%)
               LowPhrase      422.80      (4.3%)      413.10      (7.4%)   -2.3% ( -13% -    9%)
               OrHighMed      666.99      (4.2%)      654.89      (6.0%)   -1.8% ( -11% -    8%)
                 Prefix3      501.95      (7.1%)      494.34      (6.9%)   -1.5% ( -14% -   13%)
             MedSpanNear      924.07      (4.4%)      914.55      (5.6%)   -1.0% ( -10% -    9%)
    HighIntervalsOrdered      913.74      (2.0%)      905.72      (2.2%)   -0.9% (  -4% -    3%)
                 MedTerm     3841.01      (3.0%)     3811.27      (3.2%)   -0.8% (  -6% -    5%)
               MedPhrase      912.31      (2.3%)      906.87      (2.7%)   -0.6% (  -5% -    4%)
             LowSpanNear      769.51     (13.1%)      766.22     (12.4%)   -0.4% ( -22% -   28%)
BrowseDayOfYearSSDVFacets     2028.19      (2.8%)     2022.17      (2.2%)   -0.3% (  -5% -    4%)
                  IntNRQ     1446.93      (2.4%)     1442.66      (1.6%)   -0.3% (  -4% -    3%)
       HighTermMonthSort     2631.06      (2.7%)     2628.64      (4.9%)   -0.1% (  -7% -    7%)
              AndHighLow     2713.80      (3.0%)     2711.42      (3.4%)   -0.1% (  -6% -    6%)
                HighTerm     2785.67      (2.7%)     2783.94      (3.6%)   -0.1% (  -6% -    6%)
            HighSpanNear      595.30     (11.4%)      595.48     (10.0%)    0.0% ( -19% -   24%)
   BrowseMonthSSDVFacets     2367.18      (2.2%)     2369.20      (2.1%)    0.1% (  -4% -    4%)
             AndHighHigh      885.75      (3.4%)      887.16      (3.9%)    0.2% (  -6% -    7%)
                Wildcard      730.34     (11.7%)      732.30     (12.2%)    0.3% ( -21% -   27%)
              HighPhrase      655.83      (3.5%)      658.13      (2.4%)    0.4% (  -5% -    6%)
   HighTermDayOfYearSort     1724.90      (6.1%)     1731.23      (5.8%)    0.4% ( -10% -   13%)
                 LowTerm     4271.57      (2.7%)     4290.84      (3.9%)    0.5% (  -5% -    7%)
                PKLookup      243.00      (2.9%)      244.62      (1.3%)    0.7% (  -3% -    5%)
         LowSloppyPhrase     1702.96      (2.8%)     1718.94      (3.4%)    0.9% (  -5% -    7%)
                  Fuzzy1      398.68      (7.4%)      403.04      (4.8%)    1.1% ( -10% -   14%)
              OrHighHigh      411.49      (9.1%)      417.82      (5.8%)    1.5% ( -12% -   18%)
               OrHighLow      707.11      (4.2%)      718.78      (6.0%)    1.7% (  -8% -   12%)
              AndHighMed     1110.43      (4.5%)     1134.33      (3.4%)    2.2% (  -5% -   10%)
                  Fuzzy2       46.38     (22.4%)       48.94     (19.1%)    5.5% ( -29% -   60%)
    BrowseDateTaxoFacets     2996.90      (4.3%)     3212.31      (5.2%)    7.2% (  -2% -   17%)
BrowseDayOfYearTaxoFacets     2594.45      (2.8%)     2785.41      (3.7%)    7.4% (   0% -   14%)
   BrowseMonthTaxoFacets     2920.16      (3.7%)     3139.41      (3.9%)    7.5% (   0% -   15%)

My localrun.py script   

 

 

 

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Whoa, that is good news!  It looks like the change is net/net a speedup pure faceting tasks!  Those Browse* tasks are essentially a MatchAllDocsQuery counting facets for all hits.

However, those QPS numbers are silly high and not really trustworthy.  Which -source did you use?  Can you run with -source wikimediumall?  That will index all ~33.3M documents.

Also, since you had to benchmark two indices (since this change impacts the index), can you add forceMerge = True to your two indices?  Otherwise each index might have different segment geometries making the query performance not very comparable.  With forceMerge = True they will both merge down to one segment, making the QPS numbers at least comparable if not perhaps realistic in a production setting.

@asfimport
Copy link
Author

Gautam Worah (@gautamworah96) (migrated from JIRA)

Benchmark command with the same localrun.py script:

 python src/python/localrun.py -source wikimediumall -forceMerge=True

Instance details:

EC2 c4.8xlarge

CPU (/proc/cpuinfo):

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 36
On-line CPU(s) list: 0-35
Thread(s) per core: 2

Memory:

MemTotal: 61833408 kB
MemFree: 1585956 kB
MemAvailable: 53673248 kB

Java:

openjdk version "11.0.8" 2020-07-14 LTS
OpenJDK Runtime Environment (build 11.0.8+10-LTS)
OpenJDK 64-Bit Server VM (build 11.0.8+10-LTS, mixed mode)

 

===

 Results:

 

                    TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff
           OrNotHighHigh      532.96      (6.0%)      518.91      (8.1%)   -2.6% ( -15% -   12%)
                  IntNRQ       63.90      (2.8%)       62.51      (3.7%)   -2.2% (  -8% -    4%)
                 LowTerm     1354.18      (4.9%)     1325.24      (7.1%)   -2.1% ( -13% -   10%)
                PKLookup      130.95      (2.3%)      128.48      (3.1%)   -1.9% (  -7% -    3%)
                  Fuzzy1       49.16      (6.3%)       48.28      (9.8%)   -1.8% ( -16% -   15%)
            OrHighNotMed      518.20      (4.8%)      509.19      (5.4%)   -1.7% ( -11% -    8%)
                Wildcard       53.04      (2.1%)       52.14      (3.6%)   -1.7% (  -7% -    4%)
               OrHighLow      241.89      (3.8%)      237.82      (4.7%)   -1.7% (  -9% -    7%)
                 MedTerm     1002.39      (5.2%)      985.94      (5.8%)   -1.6% ( -12% -    9%)
              AndHighLow      453.14      (3.3%)      445.71      (5.9%)   -1.6% ( -10% -    7%)
       HighTermMonthSort       22.03     (10.5%)       21.70     (10.3%)   -1.5% ( -20% -   21%)
               MedPhrase      228.73      (3.2%)      225.69      (4.6%)   -1.3% (  -8% -    6%)
            OrHighNotLow      461.58      (4.8%)      456.17      (6.9%)   -1.2% ( -12% -   11%)
    HighTermTitleBDVSort       64.10     (10.5%)       63.35      (9.7%)   -1.2% ( -19% -   21%)
               LowPhrase      134.07      (2.1%)      132.72      (2.8%)   -1.0% (  -5% -    3%)
               OrHighMed       56.83      (2.9%)       56.44      (3.7%)   -0.7% (  -7% -    6%)
                 Prefix3       31.60      (2.1%)       31.42      (2.4%)   -0.6% (  -4% -    4%)
              AndHighMed       61.02      (2.7%)       60.72      (3.2%)   -0.5% (  -6% -    5%)
              HighPhrase       46.05      (3.2%)       45.84      (4.7%)   -0.5% (  -8% -    7%)
             LowSpanNear       40.71      (2.4%)       40.54      (2.2%)   -0.4% (  -4% -    4%)
                  Fuzzy2       37.09      (7.5%)       36.94      (6.0%)   -0.4% ( -12% -   14%)
                 Respell       28.71      (1.5%)       28.60      (2.0%)   -0.4% (  -3% -    3%)
            OrNotHighMed      375.44      (2.9%)      374.10      (5.0%)   -0.4% (  -8% -    7%)
             AndHighHigh       32.52      (3.2%)       32.46      (3.3%)   -0.2% (  -6% -    6%)
            HighSpanNear        9.05      (3.3%)        9.03      (3.6%)   -0.2% (  -6% -    6%)
         MedSloppyPhrase       61.05      (1.8%)       61.03      (2.5%)   -0.0% (  -4% -    4%)
BrowseDayOfYearSSDVFacets        2.76      (0.8%)        2.75      (1.0%)   -0.0% (  -1% -    1%)
                HighTerm      917.59      (4.6%)      917.50      (6.4%)   -0.0% ( -10% -   11%)
    HighIntervalsOrdered        4.32      (2.5%)        4.32      (2.8%)    0.0% (  -5% -    5%)
   BrowseMonthSSDVFacets        2.94      (0.6%)        2.95      (0.6%)    0.1% (  -1% -    1%)
              OrHighHigh        6.22      (3.0%)        6.24      (3.3%)    0.2% (  -5% -    6%)
           OrHighNotHigh      425.45      (5.3%)      426.34      (6.8%)    0.2% ( -11% -   13%)
         LowSloppyPhrase        7.30      (2.6%)        7.32      (2.5%)    0.4% (  -4% -    5%)
        HighSloppyPhrase        7.52      (2.1%)        7.55      (2.8%)    0.4% (  -4% -    5%)
             MedSpanNear       11.22      (3.1%)       11.26      (3.0%)    0.4% (  -5% -    6%)
            OrNotHighLow      367.16      (3.6%)      368.76      (4.8%)    0.4% (  -7% -    9%)
   HighTermDayOfYearSort       35.05      (8.3%)       35.47      (9.1%)    1.2% ( -14% -   20%)
   BrowseMonthTaxoFacets        1.39      (2.6%)        1.44      (7.3%)    4.0% (  -5% -   14%)
    BrowseDateTaxoFacets        1.21      (2.4%)        1.26      (6.7%)    4.2% (  -4% -   13%)
BrowseDayOfYearTaxoFacets        1.20      (2.8%)        1.26      (6.9%)    4.7% (  -4% -   14%)

 

 

 

 

 

 

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks @gautamworah96 – that is good news!  ~4-5% speedup in faceting for match all docs query!  That is very exciting.  I'll review the PR – I think this is close.

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit 3f8f84f in lucene-solr's branch refs/heads/master from Gautam Worah
https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3f8f84f

LUCENE-9450 Switch to BinaryDocValues instead of stored fields in Lucene's facet implementation, yielding ~4-5% red-line QPS gain in pure faceting benchmarks (#1733)

@asfimport
Copy link
Author

asfimport commented Jul 16, 2021

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

Hello @gautamworah96!  I wanted to ask opinion what to do with the test org.apache.lucene.facet.taxonomy.directory.TestBackwardsCompatibility::testCreateNewTaxonomy.  Because of #10374 it is no longer possible to modify fields data structures even for older indices. Thus we can't add a new $full_path$ field indexed with BinaryDocValues, because in previous segments of the older index this field was indexed without doc values. So unfortunately, I can't see any way to enable this test. 

@asfimport
Copy link
Author

asfimport commented Jul 16, 2021

Gautam Worah (@gautamworah96) (migrated from JIRA)

Hi @mayya-sharipova , I had looked at that test briefly when it was disabled.

So, for now, I could modify the test so that it checks that the previously indexed StoredFields based field, can be read (and not modified).

But the larger concern, is that with this #10374 change, a user won't be able to add docs with the new Lucene 9.0 version without reindexing (if they have taxonomy fields that were previously indexed with an older Lucene version).

 

 

@asfimport
Copy link
Author

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

@gautamworah96  That's indeed a concern. The workaround  would be to add a binary doc values field in version 8.x, force merge to a single segment, so that a FieldInfo for $full_path$  contains doc values as well, and then upgrade to 9.0.  We don't do data structures consistency checks  for older indices on individual docs , just on a segment level.

Do you think it is a viable workaround?

 

 

@asfimport
Copy link
Author

Gautam Worah (@gautamworah96) (migrated from JIRA)

That makes sense to me.

Naive question: If the user has 4 old StoredFields based segments and 1 new BinaryDocValues based segment (like the way you just described), does force merging them to 1 produce a single BinaryDocValues based segment?

Also, I think, most Lucene users would have a system for reindexing documents as compared to force merging segments to one, so I guess the recommended approach for them would be to just reindex.

-I see that in the older lucene-solr repo, we had a migrate.txt file with instructions to how to migrate to newer Lucene versions, do we plan to produce something similar for the transition to Lucene 9.0.-

Edit: Found the MIGRATE.md file.

If yes, we should definitely mention the change in field type for the taxonomy index and some migration instructions. I can add those instructions

 

@asfimport
Copy link
Author

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

I am not familiar how taxonomy works, so you adding migration instructions would be very helpful.

Answering your question:

>  If the user has 4 old StoredFields based segments and 1 new BinaryDocValues based segment (like the way you just described), does force merging them to 1 produce a single BinaryDocValues based segment?

We have FieldInfo class for each field with a distinct name, and this class contains information how this field is indexed (for example with BinaryDocValues).  From v 9.0, there is a check that in each segment of the index each field is indexed the same way, you can't have a field indexed with BinaryDocValues in one segment, and without BinaryDocValues in another segment.  In 8.x you can still have this option, and when you merge segments to a single segment, the resulting segment will record this field as indexed with BinaryDocValues. 

This check in v 9.0 is for each for each individual field, so if you transition 4 fields from stored fields to binary doc values, a separate transition needs to be done for each field. 

 

> reindexing documents as compared to force merging segments to one, so I guess the recommended approach for them would be to just reindex.

 

Reindexing should work as well, if it not an expensive operation for users. 

 

@asfimport
Copy link
Author

asfimport commented Jul 20, 2021

Michael McCandless (@mikemccand) (migrated from JIRA)

Catching up here...

So yeah as things stand, we force users of taxonomy faceting to fully reindex on upgrade to 9.0, which is not a great option, but I really love that Lucene is getting stricter enforcement of consistent field types with #10374.

@gautamworah96 could we instead index the new BinaryDocValues form into a different Lucene field? Then we could have graceful back compat (no reindexing required) while also migrating to the more efficient BinaryDocValues representation? It's OK to make this change pre-9.0 since we are allowed to make breaking changes to Lucene's nightly/SNAPSHOT builds. But if this approach can work, we should make this a blocker for 9.0.

@asfimport
Copy link
Author

Adrien Grand (@jpountz) (migrated from JIRA)

For the record, another approach could work that would consist of checking the created version of the index, and index a stored field on indices <9.0 and a BinaryDocValuesField on indices >= 9.0. This is similar to what we did when we changed the way that length normalization factors are encoded in 7.0.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Ahh that is a compelling option too @jpountz.

So users would not see the benefits of the BDV switch until they fully re-indexed (creating a new index).  That is maybe a good option too.

@gautamworah96 what do you think?

@asfimport
Copy link
Author

Gautam Worah (@gautamworah96) (migrated from JIRA)

@mikemccand I tried the approach of indexing the new BinaryDocValues field with a different name. Currently, we index the StringField with the internal $full_path$ name, but indexing the BinaryDocValues field with a different name and then trying to pull out (leafReader.getBinaryDocValues()) the BDV runs successfully (passes the back compat test, and users won't need to reindex).

The side-effect of this is that users might have some categories in the $full_path$ Lucene field and the newer (v9.0) categories in the $full_path_binary$ Lucene field. However, this field addition logic is entirely hidden from the users so we should be fine..

@asfimport
Copy link
Author

Adrien Grand (@jpountz) (migrated from JIRA)

Thinking a bit more about this, the approach of looking at the index created version means users won't benefit from this change on their existing indices and requires a bit more work in the short term, but it's also a bit cleaner and will help us get done with this change earlier?

Scenario 1: use a different field name for binary doc values

  • Lucene 9: writes binary doc values under field name $full_path_binary$, reads either stored fields for $full_path$ (Lucene 8 compatibility) or binary doc values for $full_path_binary$ (Lucene 9 compatibility).
  • Lucene 10: writes binary doc values under field name $full_path$, reads either binary doc values for $full_path_binary$ (Lucene 9 compatibility) or binary doc values for $full_path$ (Lucene 10 compatibility).
  • Lucene 11: migration is complete, Lucene always use binary doc values with field name $full_path$.

Scenario 2: use index created version

  • Lucene 9: reads/writes binary doc values or stored fields depending on the index created version.
  • Lucene 10: migration is complete, Lucene always use binary doc values with field name $full_path$.

@asfimport
Copy link
Author

Gautam Worah (@gautamworah96) (migrated from JIRA)

@jpountz I think your points make sense. Version based checks are definitely cleaner and can act as self documentation on what changed.

I had not thought about the migration process when I was thinking about the index using a different name approach.

In the meantime, I had already submitted a PR with approach 1 yesterday. I'll revise it with version based checks

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, +1 for Scenario 2 (use index created version)!

@asfimport
Copy link
Author

Gautam Worah (@gautamworah96) (migrated from JIRA)

Argh. Took some time to figure out the change that added the getIndexCreatedVersionMajor() API.

However, a call like <font color="#871094">indexWriter</font>.getConfig().getIndexCreatedVersionMajor() still returns the version as 9 for the older index.

Something like SegmentInfos</font>.readLatestCommit(<font color="#871094">indexWriter</font>.getDirectory()).getIndexCreatedVersionMajor() also says the version is 9, where it should've reported 8. Needs some more debugging..

@asfimport
Copy link
Author

Gautam Worah (@gautamworah96) (migrated from JIRA)

Posted a new PR revision that implements the use index created version approach

@asfimport
Copy link
Author

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.0.0 release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant