Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version #220

Closed
wants to merge 6 commits into from

Conversation

gautamworah96
Copy link
Contributor

Category documents added in the Lucene 9.0 taxonomy index use a BDV field with a different name

Using BDV fields with a different "$full_path_binary$" name
ensures that the earlier "$full_path$" StringField does not have the same name as the
BDV field and hence they don't violate the field type consistency check
(LUCENE-9334).

This commit also enables the back-compat check that was disabled
earlier.

https://issues.apache.org/jira/browse/LUCENE-9450

Solution

There were two proposed solutions in the JIRA ticket:

  1. Add the BDV field with a different name.
    When we were adding the BDV field with the same Consts.FULL name, it was causing a java.lang.IllegalArgumentException: cannot change field "$full_path$" from doc values type=NONE to inconsistent doc values type=BINARY error because the current logic checks all fields with the same name across segments and ensures that they use the same BinaryDocValues field TYPE.

Adding the BDV field with a different name ensures that the check does not trip. We are careful here to use the same new name when trying to retrieve values in the DirectoryTaxonomyReader

  1. Perform a check on the index version when we try to add a BDV field. If the index is pre 9.0 we only add the StringField and use only that field when trying to read the value from the index. If the index is newer (>=9.0), we add and read the value from a BDV field.

This PR implements the approach described in step 1.

Tests

Enabled the back-compat test in TestBackwardsCompatibility.testCreateNewTaxonomy

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

Gautam Worah added 2 commits July 21, 2021 01:08
BDV field with a different name

Using BDV fields with a different "$full_path_binary$" name
ensures that the earlier "$full_path$" StringField does not have the same name as the
BDV field and hence they don't violate the field type consistency check
(LUCENE-9334).

This commit also enables the back-compat check that was disabled
earlier.
the last index commit

If the Lucene version was < 9 then use a StringField or else
if the index is fresh or if the index is was built using a
version >= 9, then use a BDV field.
@gautamworah96
Copy link
Contributor Author

Changes in the new b9cbc4c commit:

  1. The reason why the SegmentInfos.readLatestCommit(dir).getMinSegmentLuceneVersion() call was returning 9 as the version, was that the older zip file in the mainline was using the Lucene 8.6 Codec but the major version variable was still assigned as 9. This was because the main branch in the repo (during the 8.6 release) had already set the major version as 9. I reconstructed the 8.10 taxonomy index from the branch_8x branch and that correctly set the major version as 8 for those older segments.
  2. Use a version based check for storing BDV fields or StringFields

I think the new commit might be slower that the previous $full_path_binary$ option during indexing because it checks the Lucene version of the last commit everytime we add a new category.

Finally, I think there should be a cleaner way of knowing if the index has atleast one commit or no. I use the indexWriter.getLiveCommitData().iterator().hasNext() call but maybe there is a better way..

Side questions that need more thought:

  1. What is the use of the LiveIndexWriterConfig.createdVersionMajor param. I think instead of initializing it to the latest version, maybe we can assign the value of the min back compat version of the index to it (when the LiveIndexWriterConfig class is initialized).
  2. Can we fix the DirectoryTaxonomyWriter.indexEpoch variable to hold the accurate index epoch of the taxonomy index.
    The current logic for indexEpoch assigns 1 even if the index is completely fresh. It also saves 1 as the value when the index has just 1 commit.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love how simple this is! I left a couple comments.

@gautamworah96 gautamworah96 changed the title LUCENE-9450: Use BinaryDocValue fields with a different name in the taxonomy index LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version Jul 26, 2021
@mikemccand
Copy link
Member

Also, to be clear, even though the opening comment says the PR implemented option 1, it has now iterated onto option 2 (switching based on the index created version metadata).

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few small comments! I think this is close!

@jpountz
Copy link
Contributor

jpountz commented Jul 28, 2021

What is the use of the LiveIndexWriterConfig.createdVersionMajor

It's very expert. It's necessary if you have multiple workers creating indices that you then want to merge together using IndexWriter#addIndexes. addIndexes requires that all indices have the same major version, so if you are doing a rolling upgrade on your workers to a new Lucene major, this helps ensure that all indices are created in a way that they can be merged eventually.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this looks great @gautamworah96 -- thanks! I'll review and push soon.

I like this approach to back-compat (using the index created version) -- it gives a more consistent index than trying to blend in, segment by segment, the new changes.

@mikemccand
Copy link
Member

OK I just merged this via git command-line, but apparently GitHub hasn't noticed. Thanks @gautamworah96 !

@mikemccand mikemccand closed this Jul 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants