Backport Faiss-based vector format to 10.x#14843
Backport Faiss-based vector format to 10.x#14843mikemccand merged 4 commits intoapache:branch_10xfrom
Conversation
# Conflicts: # lucene/CHANGES.txt
|
Also included #14847 in this PR |
|
Did you try running the new monster Faiss test too @kaivalnp? |
|
@mikemccand it succeeded a few times, but I ran into a failure after multiple runs: Command (I'm on JDK21): gradlew :lucene:sandbox:test --tests "org.apache.lucene.sandbox.codecs.faiss.TestFaissKnnVectorsFormat.testLargeVectorData" -Ptests.heapsize=16g -Ptests.jvms=12 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1" -Ptests.seed=6FE1439CE22DB5DA -Ptests.useSecurityManager=true -Ptests.monster=true -Ptests.gui=false -Ptests.file.encoding=UTF-8 -Ptests.vectorsize=256 -Ptests.forceintegervectors=true -Ptests.faiss.run=trueError: The codec footer check seems to fail on the KnnVectorsFormat format =
new Lucene99FlatVectorsFormat(FlatVectorScorerUtil.getLucene99FlatVectorsScorer());and the test still fails, so it seems unrelated / pre-existing? FYI I see the same failure on |
Egads, OK, this does look like a pre-existing issue. Can you open a spinoff issue so we can get to the bottom of that? If I'll try to review this PR soon! Thank you for working out the backport. |
mikemccand
left a comment
There was a problem hiding this comment.
This looks great @kaivalnp. It looks like a rote backport, except for the sandbox mrjar addition. Since main is happy with the Faiss codec, and it's baked ~3 weeks now, I'll merge this in a day or so. (It is Prime Day over here, kinda busy for Amazon Search team heh).
|
I am not fully happy with this, so I would have given a -1 for the merge. Now it is too late! I am very busy at moment so I wasn't aber to review the code earlier! IMHO, we should have waited and better release main branch with this as Lucene 11. I'd suggest (see my talk @ buzzwords) to release Lucene 11 a few months after the release of Java 25 (e.g., November 2025). We have good features like this already. There are also some code issues (like catch Throwable and not rethrowing RuntimeException|Error unmodified and instead all is wrapped in unspecifid RuntimeException.... - this is a no-go in Lucene code and also discouraged by security researchers). Please add followup PRs and cleanup the exception handling. If you catch "Throwable" alsways add a In addition, all code triggering the restricted warning should be in a single class (@rmuir already complained with TODOs). So my personal opinion: Revert revert revert. |
| configure(project(":lucene:core")) { | ||
| configure([ | ||
| project(":lucene:core"), | ||
| project(":lucene:sandbox"), |
There was a problem hiding this comment.
unfortunately this causes also the vector API be replicated in the sandbox module.... I don't like that. It also adds useless JAR files again to the code.
There was a problem hiding this comment.
Makes sense, I'll try to reuse the apijar from lucene.core
| * | ||
| * @lucene.experimental | ||
| */ | ||
| public class FaissKnnVectorsFormatProvider extends KnnVectorsFormat { |
There was a problem hiding this comment.
I think this class should be named FaissKnnVectorsFormat. It is not a provider pattern. In MMapDir this is different, there it is a provider for IndexInputs.
Instead the FaissKnnVectorsFormat in the java21 part should be package private (it is now public) and have a different name, e.g. 'FaissKnnVectorsFormatImpl'
There was a problem hiding this comment.
Makes sense, I've addressed this in the refactor on main
|
|
||
| private static KnnVectorsFormat lookup(Object... args) { | ||
| try { | ||
| MethodHandles.Lookup lookup = MethodHandles.lookup(); |
There was a problem hiding this comment.
we have a private lookup here, so we can lookup a package private impl.
| * | ||
| * @lucene.experimental | ||
| */ | ||
| public final class FaissKnnVectorsFormat extends KnnVectorsFormat { |
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormatProvider |
Oh, sorry, thanks @uschindler -- I will simply revert and we can iterate some more! This is good feedback, and also applies to the But this is all in sandbox :) It's OK to be a little ... sandy! I will revert shortly. |
|
OK I reverted -- let's iterate some more -- @kaivalnp can you open a followon PR maybe? Thank you! Maybe start with @uschindler's feedback on |
There was a problem hiding this comment.
Think we should not duplicate that file.
Either refer to it from the other module (seems most non-invasive) and especially compatible with main branch or move the code that accesses foreign APIs to main branch. In ideal world only the LibFaissC should access native APIs and add abstractions for all code to get rid of MemorySegment. Then we could place a provider to get the LibFaissC fclass with tha API abstractions from the core's org.apache.lucene.internal package (which is only exported to sandbox but not to public) and the KNN vector codec would reside in the sandbox package and only call methods which use public Java classes.
I would prefer also on main branch to put all restricted code in LibFaissC only.
This would also solve @rmuir's complaints about duoing "restricted" chnages to MemorySegments.
There was a problem hiding this comment.
only the LibFaissC should access native APIs and add abstractions for all code to get rid of MemorySegment
also solve @rmuir's complaints about duoing "restricted" chnages to MemorySegments
Makes sense, I've addressed these in the refactor!
| try { | ||
| return (T) handle.invokeWithArguments(args); | ||
| } catch (Throwable e) { | ||
| throw new RuntimeException(e); |
There was a problem hiding this comment.
The MethodHandle is to native code so in general it can only throw errors regarding type addaption. So please change this to:
} catch (Error|RuntimeException e) {
throw e;
} catch (Throwable e) {
throw new AssertionError("Should not throw checked exceptions", e);
}
Please keep in mind that this is the slowest way to call a methodhandle. If those "calls" are done in loops affectimg performance, you should really really use invokeExact with exact types. For that you have to repeat the error handling code over and over, but it is better than the current code.
Have you looked at jextract which can generate a stub class from a header file?
There was a problem hiding this comment.
jextractwhich can generate a stub class from a header file?
Yes, this would have been ideal! https://github.com/openjdk/jextract
But the C API of Faiss makes use of forward declarations which aren't supported:
__stddef_max_align_t.h:22:15: warning: Skipping __clang_max_align_nonce2 (type LongDouble is not supported)
Index_c.h:21:1: warning: Skipping FaissRangeSearchResult_H (type Declared(FaissRangeSearchResult_H) is not supported)
Index_c.h:21:1: warning: Skipping FaissRangeSearchResult (type Declared(FaissRangeSearchResult_H) is not supported)
(output when I tried to generate classes for some C headers)
| allocateArrayFunctionName, | ||
| MethodType.methodType(MemorySegment.class, MemoryLayout.class, long.class)); | ||
| } catch (IllegalAccessException | NoSuchMethodException e) { | ||
| throw new RuntimeException(e); |
There was a problem hiding this comment.
Please throw LinkageError here
There was a problem hiding this comment.
The better way to solve this would be to add the Java 21 code in some static-only class to wrap calls and add another sourceset for Java 22 where you reimplement the same class with Java 22+ code. Of course we need an extract of apijar for java 22 then.
The MR-JAR magic will automatically load the correct implementation based on runtime Java version.
There was a problem hiding this comment.
I'd prefer to add a jdk-22 extract of all APIs in core package. But then we should find a way to reuse the apijar files in sandbox and not have duplicates.
| throw new LinkageError("FaissKnnVectorsFormat is missing from JAR file", e); | ||
| } catch (IllegalAccessException | NoSuchMethodException e) { | ||
| throw new LinkageError("FaissKnnVectorsFormat is missing correctly typed constructor", e); | ||
| } catch (Throwable t) { |
There was a problem hiding this comment.
Please check how it is done here:
lucene/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java
Lines 399 to 420 in e610d0e
As said before carefully differentiate between Exception types.
There was a problem hiding this comment.
In the original code we have a separate try/ctach around the invoke call, this is unsafe as it may catch too many exceptions from unrelated places.
So please add a try-catch around only the call.
An please use a correctly typed method handle..... This is too lazy as not statically typed.
There was a problem hiding this comment.
I wozuld only code the multi-arg ctor here statically and delegate from no-arg ctor of this class by calling this(DEFAULT.....)
There was a problem hiding this comment.
Addressed in the refactor!
I do really like this idea. See related thread as an example struggle with just a one-line PR from yesterday: #14916 (comment) In the |
Well, getting there, getting there. ;) |
|
Let's start a discussion on dev list about releasing and new features like this. |
| import org.apache.lucene.index.SegmentWriteState; | ||
|
|
||
| /** | ||
| * Provides a Faiss-based vector format, see corresponding classes in {@code java21/} for docs! |
There was a problem hiding this comment.
the docs of private classes in java21 are not visible in javadocs. So need to stay here.
|
Thanks a lot for the review @uschindler, it was super helpful! I've segregated all "unsafe" code outside Would appreciate a review! |
+1, it seems like the time is right-ish for 11.0 release! Thanks @uschindler and @rmuir and @kaivalnp. |
Description
Backport #14178 to 10.x
Summary of changes
libfaiss_c.soshared library from tests, enforced by the security managerI also ran the Faiss tests offline with JDK 21, 22, 23, 24 using:
..and they all passed!
Diff of relevant files
https://gist.github.com/kaivalnp/77fb274ebd3de22c500396e80bb5df3c