LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version #220

gautamworah96 · 2021-07-21T08:38:37Z

Category documents added in the Lucene 9.0 taxonomy index use a BDV field with a different name

Using BDV fields with a different "$full_path_binary$" name
ensures that the earlier "$full_path$" StringField does not have the same name as the
BDV field and hence they don't violate the field type consistency check
(LUCENE-9334).

This commit also enables the back-compat check that was disabled
earlier.

https://issues.apache.org/jira/browse/LUCENE-9450

Solution

There were two proposed solutions in the JIRA ticket:

1. Add the BDV field with a different name.
   When we added the BDV field with the same Consts.FULL name, it caused a java.lang.IllegalArgumentException: cannot change field "$full_path$" from doc values type=NONE to inconsistent doc values type=BINARY error, because the current logic checks all fields with the same name across segments and ensures that they use the same doc values type.

   Adding the BDV field with a different name ensures that the check does not trip. We are careful to use the same new name when retrieving values in the DirectoryTaxonomyReader.

2. Perform a check on the index version when we try to add a BDV field. If the index is pre-9.0, we add only the StringField and use only that field when reading the value from the index. If the index is newer (>= 9.0), we add and read the value from a BDV field.

This PR implements the approach described in option 1.
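The field type consistency check, and the reason a different field name sidesteps it, can be illustrated with a toy model. This is a minimal sketch in plain Java, not Lucene code: the class, the enum, and the `register`/`accepts` methods are hypothetical names made up for illustration, and only the field names and the error wording come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a per-field-name doc values type consistency check (not Lucene code). */
final class FieldTypeConsistencySketch {

  /** Hypothetical stand-in for the doc values type recorded for a field name. */
  enum DocValuesKind { NONE, BINARY }

  // Field name -> doc values kind first seen for that name.
  private final Map<String, DocValuesKind> seen = new HashMap<>();

  /** Registers a field; rejects a name already seen with a different doc values kind. */
  void register(String name, DocValuesKind kind) {
    DocValuesKind previous = seen.putIfAbsent(name, kind);
    if (previous != null && previous != kind) {
      throw new IllegalArgumentException(
          "cannot change field \"" + name + "\" from doc values type=" + previous
              + " to inconsistent doc values type=" + kind);
    }
  }

  /** Returns true if the field can be registered without tripping the check. */
  static boolean accepts(FieldTypeConsistencySketch schema, String name, DocValuesKind kind) {
    try {
      schema.register(name, kind);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    FieldTypeConsistencySketch schema = new FieldTypeConsistencySketch();
    // Older documents: "$full_path$" carries no doc values.
    schema.register("$full_path$", DocValuesKind.NONE);
    // Reusing the same name with binary doc values trips the check.
    System.out.println(accepts(schema, "$full_path$", DocValuesKind.BINARY)); // prints false
    // A distinct name is tracked separately, so the check passes.
    System.out.println(accepts(schema, "$full_path_binary$", DocValuesKind.BINARY)); // prints true
  }
}
```

The model keeps one entry per field name, which is why the two names never conflict; the reader side then has to look up the new name consistently.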

Tests

Enabled the back-compat test in TestBackwardsCompatibility.testCreateNewTaxonomy

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.


gautamworah96 · 2021-07-26T19:30:35Z

Changes in the new b9cbc4c commit:

The SegmentInfos.readLatestCommit(dir).getMinSegmentLuceneVersion() call was returning 9 as the version because the older zip file in mainline used the Lucene 8.6 codec while its major version was still recorded as 9. This happened because the main branch of the repo had already set the major version to 9 at the time of the 8.6 release. I reconstructed the 8.10 taxonomy index from the branch_8x branch, which correctly set the major version to 8 for those older segments.
Use a version-based check for storing BDV fields or StringFields.

I think the new commit might be slower than the previous $full_path_binary$ option during indexing because it checks the Lucene version of the last commit every time we add a new category.

Finally, I think there should be a cleaner way of knowing whether the index has at least one commit. I use the indexWriter.getLiveCommitData().iterator().hasNext() call, but maybe there is a better way.

Side questions that need more thought:

What is the use of the LiveIndexWriterConfig.createdVersionMajor param? Instead of initializing it to the latest version, maybe we can assign it the min back-compat version of the index (when the LiveIndexWriterConfig class is initialized).
Can we fix the DirectoryTaxonomyWriter.indexEpoch variable to hold the accurate index epoch of the taxonomy index?
The current logic for indexEpoch assigns 1 even if the index is completely fresh. It also saves 1 as the value when the index has just 1 commit.

mikemccand

I love how simple this is! I left a couple comments.

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java


mikemccand · 2021-07-27T12:32:35Z

Also, to be clear, even though the opening comment says the PR implemented option 1, it has now iterated onto option 2 (switching based on the index created version metadata).
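The version-based switch (option 2) can be sketched as follows. This is a minimal illustration in plain Java under stated assumptions, not the DirectoryTaxonomyWriter code: the class, enum, and method names are hypothetical, the two constructor arguments stand in for whatever the writer reads from the last commit, and the decision is stored in a final field so it is not re-evaluated for every new category.

```java
/** Toy model of choosing the full-path field kind from the index's major version (not Lucene code). */
final class FullPathFieldChoiceSketch {

  /** Hypothetical labels for the two storage options discussed in the PR. */
  enum FieldKind { STRING_FIELD, BINARY_DOC_VALUES }

  /** Cut-over major version, taken from the ">= 9" wording in the PR. */
  static final int BDV_MAJOR_VERSION = 9;

  // Decided once at construction time instead of on every added category.
  private final boolean useOlderFormat;

  /**
   * @param hasCommit whether the index already has at least one commit
   * @param lastCommitMajor major version recorded for the last commit; ignored for a fresh index
   */
  FullPathFieldChoiceSketch(boolean hasCommit, int lastCommitMajor) {
    this.useOlderFormat = hasCommit && lastCommitMajor < BDV_MAJOR_VERSION;
  }

  /** Returns the field kind to use when adding (and later reading) a category path. */
  FieldKind fieldKindForNewCategory() {
    return useOlderFormat ? FieldKind.STRING_FIELD : FieldKind.BINARY_DOC_VALUES;
  }

  public static void main(String[] args) {
    // Existing index written by an 8.x release: keep the StringField.
    System.out.println(new FullPathFieldChoiceSketch(true, 8).fieldKindForNewCategory());
    // Fresh index with no commit yet: use the BDV field.
    System.out.println(new FullPathFieldChoiceSketch(false, 0).fieldKindForNewCategory());
    // Existing index written by 9.x: use the BDV field.
    System.out.println(new FullPathFieldChoiceSketch(true, 9).fieldKindForNewCategory());
  }
}
```

Because the choice is fixed per index rather than per segment, every document in a given taxonomy index ends up with the same representation.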

mikemccand

I left a few small comments! I think this is close!

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java

...ne/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java

jpountz · 2021-07-28T12:59:13Z

> What is the use of the LiveIndexWriterConfig.createdVersionMajor

It's very expert. It's necessary if you have multiple workers creating indices that you then want to merge together using IndexWriter#addIndexes. addIndexes requires that all indices have the same major version, so if you are doing a rolling upgrade on your workers to a new Lucene major, this helps ensure that all indices are created in a way that they can be merged eventually.
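The rolling-upgrade scenario can be illustrated with a toy model. This is a sketch in plain Java with no Lucene calls: `canMerge` and the class name are hypothetical, and it only models the stated rule that all indices passed to addIndexes must share the same major version, not the actual checks inside IndexWriter.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the "same created major version" rule for merging indices (not Lucene code). */
final class SameMajorVersionSketch {

  /** Returns true when every incoming index was created with the target's major version. */
  static boolean canMerge(int targetCreatedMajor, List<Integer> incomingCreatedMajors) {
    for (int major : incomingCreatedMajors) {
      if (major != targetCreatedMajor) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Mid-upgrade, unpinned: one worker already creates major-9 indices, another still creates major-8.
    System.out.println(canMerge(9, Arrays.asList(8, 9))); // prints false
    // Pinned: upgraded workers keep creating indices with major version 8 until the rollout ends.
    System.out.println(canMerge(8, Arrays.asList(8, 8))); // prints true
  }
}
```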

mikemccand

OK, this looks great @gautamworah96 -- thanks! I'll review and push soon.

I like this approach to back-compat (using the index created version) -- it gives a more consistent index than trying to blend in, segment by segment, the new changes.

mikemccand · 2021-07-29T17:15:54Z

OK I just merged this via git command-line, but apparently GitHub hasn't noticed. Thanks @gautamworah96 !

Gautam Worah added 2 commits July 21, 2021 01:08

Use BDV or a StoredField based on the Lucene version that has created the last index commit

b9cbc4c

If the Lucene version was < 9, then use a StringField; otherwise, if the index is fresh or was built using a version >= 9, use a BDV field.

gautamworah96 force-pushed the 9450-fix-consistency branch from 2ea7f26 to b9cbc4c Compare July 26, 2021 18:52

mikemccand reviewed Jul 26, 2021


Gautam Worah added 2 commits July 26, 2021 15:21

Move the version check to a final variable that is initialized in the constructor

3881fcb

Fix minor logic

c2c3696

gautamworah96 changed the title ~~LUCENE-9450: Use BinaryDocValue fields with a different name in the taxonomy index~~ LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version Jul 26, 2021

mikemccand reviewed Jul 27, 2021


gautamworah96 requested a review from mikemccand July 27, 2021 17:47

PR fixes 1. Change negation to 2. Move statement inside if condition

b7bd713

Simplify some code

24fc3e3

mikemccand approved these changes Jul 28, 2021


mikemccand closed this Jul 29, 2021
