Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9959: Rename TermVectorsReader to TermVectorsReaderBase #205

Conversation

zacharymorn
Copy link
Contributor

Description

Rename TermVectorsReader to TermVectorsReaderBase in preparation for introducing TermVectors and new TermVectorsReader abstractions. Please see #180 (comment) for details.

Please note that relevant variable names were currently not renamed, in order to keep the changes minimal and reduce the need to undo the renaming once TermVectorsReader is introduced back.

Tests

Passed existing tests.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

…paration for introducing TermVectors and new TermVectorsReader abstractions
Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also okay with introducing the new/simpler TermVectorsReader abstraction if it would reduce this PR diff a lot. Up to you.

@@ -57,7 +57,7 @@
final NormsProducer normsProducer;

final StoredFieldsReader fieldsReaderOrig;
final TermVectorsReader termVectorsReaderOrig;
final TermVectorsReaderBase termVectorsReaderOrig;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the subsequent PR that re-introduces "TermVectorsReader", I assume you will change some of the references of TermVectorsReaderBase back to TermVectorsReader, such as right here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the subsequent PR that re-introduces "TermVectorsReader", I assume you will change some of the references of TermVectorsReaderBase back to TermVectorsReader

Yes I'm planning to do that!

such as right here?

For this particular one though, later in the code its close method is called via:

void decRef() throws IOException {
if (ref.decrementAndGet() == 0) {
try (Closeable finalizer = this::notifyCoreClosedListeners) {
IOUtils.close(
termVectorsLocal,
fieldsReaderLocal,
fields,
termVectorsReaderOrig,

So it may still need to use TermVectorsReaderBase there, as that class still implements Closeable per #180 (comment) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps then TermVectorsReader should also implement Closeable? (to be done in another PR)

I called out the line above because it is not harmonious with the other "*Reader" suffixed classes declared next to it.

Copy link
Contributor Author

@zacharymorn zacharymorn Jul 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this indeed looks inconsistent.

I think the main thing here is, the overall understanding I got from #180 (comment) & #180 (comment) (this comment specifically asked not to move up Closeable from the original TermVectorsReader) is that, the new TermVectors / TermVectorsReader abstraction is supposed to be an index API, while the original TermVectorsReader / now TermVectorsReaderBase is a codec API, which has clone/getMergeInstance/checkIntegrity methods, and also implements Closeable (and all other *Reader/Producer codec classes also implement Closeable as well). If we were to move Closeable from TermVectorsReaderBase to the new TermVectorsReader, wouldn't that break the separation there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh; I'm not sure what to do then. The non-harmoneous aspect is a code smell but if it's very limited (not widespread) then it's not a big deal I guess. Otherwise it's a stronger indicator there is misalignment. I'm not really sure why there needs to be both TermVectorsReader & TermVectorsReaderBase any way but surely you & Adrien do so I leave that to your judgement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe another idea is leave TermVectorsReader be. In the other PR you have, rename TermVectors to TermVectorsSupplier? Just a thought for your consideration; I have no strong opinion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure why there needs to be both TermVectorsReader & TermVectorsReaderBase any way but surely you & Adrien do so I leave that to your judgement.

The main reason that we are adding a new index level API for TermVectors (TermVectors in the other PR) is that we would like to provide a new way of TV access from IndexReader like the following

TermVectors termVectors = reader.getTermVectorsReader(); // would clone the underlying TermVectorsReader to be 
 thread-safe, and GC-able once usage is finished

for (int docId = 0; docId < numDocs; docId++) {
  Fields vectors = termVectors.get(docId);
}

to replace the old way of access in the form of

for (int docId = 0; docId < numDocs; docId++) {
  Fields vectors = reader.getTermVectors(docId); // using thread-local to cache
}

as per #137 (comment) . From my understanding, it's a bit like separating out the existing index level API IndexReader#getTermVectors into a new dedicated class to handle TV (and we do plan to deprecate the existing API once the new one is ready), and hence some of the internal index / codec classes organization details are now exposed via the indirection and inheritance relationships, and we need a bridge between them (by having codec class TermVectrosReaderBase to inherit the new index class TermVectorsReader).

Not sure if @jpountz or @rmuir have any further suggestion on this? If there's no strong objection to the current approach in this PR and the other one #180, we can potentially use https://issues.apache.org/jira/browse/LUCENE-10018 to further enhance on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just left a comment back on #180 to suggest merging that PR as-is and why.

@zacharymorn
Copy link
Contributor Author

I'm also okay with introducing the new/simpler TermVectorsReader abstraction if it would reduce this PR diff a lot. Up to you.

Thanks @dsmiley for the review and approval! I've thought about having the new TermVectorsReader here as well, but I'm leaning more towards introducing it in the other PR after some digging, as I see so far the majority of TermVectorsReaderBase usage here involves using the existing codec APIs checkIntegrity/clone/getMergeInstance, so the newer TermVectorsReader class (that may not have any method yet) may not be usable directly. In addition, introducing it here may also lead to some git merge conflicts with the other PR later as well I feel, as TermVectors was already created there as the super class of TermVectorsReader. So I think saving it for the other PR may be an easier approach overall?

@zacharymorn zacharymorn self-assigned this Jul 7, 2021
@rmuir
Copy link
Member

rmuir commented Jul 19, 2021

-1 to these renames that solve no api problems.

@jpountz
Copy link
Contributor

jpountz commented Jul 20, 2021

I prefer the current API that's been merged as part of #180 than the proposed API where Fields, TermVectors and TermVectorsReader get renamed to TermVectors, TermVectorsReader and TermVectorsReaderBase. Removing usage of Fields doesn't help if we're introducing a new TermVectorsReaderBase abstraction instead?

@dsmiley
Copy link
Contributor

dsmiley commented Jul 20, 2021

FWIW, what I care about most in this is that Fields.java is relegated to an internal only thing associated with the terms index, for those that build PostingsFormats 'n such. I find it confusing to have this Fields class used in different ways across the codebase -- the PostingsFormat related Fields, and the term vector Fields. I care less about what we name some of the things, or wether a new public class is needed somewhere.

@zacharymorn
Copy link
Contributor Author

Sorry I have been away for a few days. I should have closed this PR after the merge of #180, as we (@dsmiley and I) also agreed earlier that the renaming may not be a good approach for the problem we are trying to solve here, sorry for the delay there as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants