
Enable Faiss-based vector format to index larger number of vectors in a single segment #14847

Merged
mikemccand merged 7 commits into apache:main from kaivalnp:faiss-larger-vectors
Jul 5, 2025
Conversation

@kaivalnp
Contributor

@kaivalnp kaivalnp commented Jun 25, 2025

Description

I was trying to index a large number of vectors in a single segment, and ran into an error because of the way we copy vectors to native memory, before calling Faiss to create an index:

Caused by: java.lang.IllegalStateException: Segment is too large to wrap as ByteBuffer. Size: 3276800000
        at java.base/jdk.internal.foreign.AbstractMemorySegmentImpl.checkArraySize(AbstractMemorySegmentImpl.java:374)
        at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:314)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:158)

This limitation was hit because we use a ByteBuffer (backed by native memory) to copy vectors from the heap, and a ByteBuffer has a 2 GB size limit

As a fix, I've changed it to use MemorySegment specific functions to copy vectors (also moving away from these byte buffers in other places, and using more appropriate IO methods)
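For context, a minimal sketch of the idea behind the cutover (this is not the actual LibFaissC code; the class and method names are hypothetical, and it assumes JDK 22+ where java.lang.foreign is final): ByteBuffer offsets are ints, so wrapping a native segment larger than ~2 GB via MemorySegment.asByteBuffer() fails, while MemorySegment.copy takes long offsets and copies directly from a heap array into native memory:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class NativeCopyDemo {
    /**
     * Copies a heap float[] into freshly allocated native memory and reads
     * one element back. MemorySegment.copy takes a long destination offset,
     * so (unlike ByteBuffer-based copying) it is not capped at 2 GB.
     */
    public static float roundTrip(float[] src, int idx) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment dst = arena.allocate((long) src.length * Float.BYTES);
            // Bulk copy straight from the heap array into native memory
            MemorySegment.copy(src, 0, dst, ValueLayout.JAVA_FLOAT, 0L, src.length);
            return dst.getAtIndex(ValueLayout.JAVA_FLOAT, idx);
        }
    }

    public static void main(String[] args) {
        float[] vectors = {1f, 2f, 3f, 4f};
        System.out.println(roundTrip(vectors, 2)); // prints 3.0
    }
}
```

The same copy succeeds for total sizes well past Integer.MAX_VALUE bytes, which is where the asByteBuffer() path threw IllegalStateException.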

With these changes, we no longer see the above error and are able to build and search an index. Also ran benchmarks for a case where this limit was not hit to check for performance impact:

Baseline (on main):

    type  recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
   faiss   0.997        1.855   1.819        0.981  100000   100      50       32        200         no     31.07       3218.44           32.76             1         3152.11      1562.500     1562.500       HNSW

Candidate (on this PR):

    type  recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
   faiss   0.998        1.817   1.794        0.987  100000   100      50       32        200         no     29.57       3381.46           33.20             1         3152.11      1562.500     1562.500       HNSW

..and indexing / search performance is largely unchanged

Edit: Related to #14178

… a single segment

- Moves away from a ByteBuffer (with a 2 GB limit) to direct copying of vectors to native memory
- Also simplifies some other off-heap memory IO usages
@github-actions
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

Contributor

@msokolov msokolov left a comment


This LGTM - one question: do we have unit tests covering this?


@kaivalnp
Contributor Author

kaivalnp commented Jun 26, 2025

@msokolov I wasn't sure about attempting to index a large amount of vector data, given that it'll take up a few GB of RAM. I've added a test for now, please let me know if I should keep it (or how to test it better). Perhaps having the test is fine, because we run Faiss tests (and only those) in a separate GH action?

The test fails deterministically when added to main:

   >     java.lang.IllegalStateException: Segment is too large to wrap as ByteBuffer. Size: 2149576700
   >         at __randomizedtesting.SeedInfo.seed([1B557576B3F191C9:6F03E13EF7CEF63A]:0)
   >         at java.base/jdk.internal.foreign.AbstractMemorySegmentImpl.checkArraySize(AbstractMemorySegmentImpl.java:374)
   >         at java.base/jdk.internal.foreign.AbstractMemorySegmentImpl.asByteBuffer(AbstractMemorySegmentImpl.java:199)
   >         at org.apache.lucene.sandbox.codecs.faiss.LibFaissC.createIndex(LibFaissC.java:224)

@msokolov
Contributor

Sorry, I was too vague - I didn't mean we should be testing the > 2GB case! I just wanted to make sure we had unit test coverage for these classes at all, because I'm not familiar with this part of the codebase

- Also modify the test to make backporting easier
@kaivalnp
Contributor Author

> we had unit test coverage for these classes at all

Yes, we have a test class that runs all tests in the BaseKnnVectorsFormatTestCase

We had to modify / disable a few because the format only supports float vectors and a few similarity functions..

We run these tests on each PR / commit via GH actions, see sample run from this PR, which ran:

> Task :lucene:sandbox:test
:lucene:sandbox:test (SUCCESS): 53 test(s), 8 skipped

> I didn't mean we should be testing the > 2GB case

I kind of like that we have this test, can we just mark it as "monster" so that we don't run it locally / from GH actions?
Also refactored a bit to make backporting easier..

I was able to run it using:

./gradlew -p lucene/sandbox -Dtests.faiss.run=true test --tests "org.apache.lucene.sandbox.codecs.faiss.*" -Dtests.monster=true -Dtests.heapsize=16g

..where it took a (relatively) long time to run:

:lucene:sandbox:test (SUCCESS): 53 test(s), 8 skipped
The slowest tests during this run:
  14.64s TestFaissKnnVectorsFormat.testLargeVectorData (:lucene:sandbox)
The slowest suites during this run:
  16.63s TestFaissKnnVectorsFormat (:lucene:sandbox)

Also, running it on main gives the same error as above


@mikemccand
Member

> I kind of like that we have this test, can we just mark it as "monster" so that we don't run it locally / from GH actions?

+1, this is exactly why we have the monster annotation!

Member

@mikemccand mikemccand left a comment


Thanks @kaivalnp -- this looks like a rote cutover from the legacy ByteBuffer to MemorySegment. Thank you for adding the new monster test and confirming it passes!

Do we have any tests that check for memory leaks? E.g. a test that creates Faiss HNSW graph, and then opens/closes it thousands of times? I don't think we should block this awesome change for these tests ... we can separately pursue.

@kaivalnp
Contributor Author

Thanks @mikemccand!

Do we have any tests that check for memory leaks?

I don't think we have tests today, so I opened #14875 to track it -- plus the broader question of how to safely use the new format!


@kaivalnp
Contributor Author

kaivalnp commented Jul 3, 2025

@mikemccand I stumbled upon a way to allocate a long[] in native memory using a specific byte order (LITTLE_ENDIAN) -- which we use in a filtered search (i.e. if an explicit filter is provided, or the segment has deletes)

With this, I think we've moved away from all ByteBuffer usages to copy bytes to native memory in LibFaissC
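A minimal sketch of that allocation pattern (hypothetical names, not the actual LibFaissC code; assumes JDK 22+): SegmentAllocator.allocateFrom accepts an element layout, so giving it JAVA_LONG with an explicit LITTLE_ENDIAN order copies a long[] into native memory in that byte order regardless of the platform:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;

public class LittleEndianLongsDemo {
    /**
     * Copies a long[] (e.g. a filter bitset) into native memory, laying each
     * long out in explicit little-endian byte order via the element layout.
     */
    public static MemorySegment toNativeLE(Arena arena, long[] bits) {
        return arena.allocateFrom(
            ValueLayout.JAVA_LONG.withOrder(ByteOrder.LITTLE_ENDIAN), bits);
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = toNativeLE(arena, new long[] {0x0102030405060708L});
            // Little-endian: least significant byte comes first
            System.out.println(seg.get(ValueLayout.JAVA_BYTE, 0)); // prints 8
        }
    }
}
```

This replaces the older pattern of wrapping native memory in a ByteBuffer just to call order(ByteOrder.LITTLE_ENDIAN) before writing longs.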

Edit: Also posting a benchmark run to check that we didn't change any behavior

main:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  selectivity  filterType  vec_disk(MB)  vec_RAM(MB)  indexType
 0.702        1.893   1.765        0.932  100000   100      50       64        250         no      8.89      11251.13           10.70             1          637.45         0.10  pre-filter       292.969      292.969       HNSW

This PR:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  selectivity  filterType  vec_disk(MB)  vec_RAM(MB)  indexType
 0.702        1.851   1.763        0.952  100000   100      50       64        250         no      7.99      12514.08           10.40             1          637.45         0.10  pre-filter       292.969      292.969       HNSW

There is no tangible difference in performance (seems to be within range of noise)..

@mikemccand
Member

Thanks @kaivalnp -- I'll merge this one soon. Let's remember to also backport this to 10.x?

@mikemccand
Member

Could you also add an entry in CHANGES.txt? I think it's important to show that this Faiss based KNN Lucene codec format can handle large KNN indices...

@github-actions github-actions bot added this to the 11.0.0 milestone Jul 5, 2025
@kaivalnp
Contributor Author

kaivalnp commented Jul 5, 2025

> entry in CHANGES.txt

Thanks @mikemccand, I thought it was a follow-up to the original PR adding the codec, and may not need a separate entry -- but I've added one under "Bug Fixes" now..

I'll update the backport PR once this is merged!

@mikemccand mikemccand merged commit bba7aee into apache:main Jul 5, 2025
8 checks passed
@kaivalnp kaivalnp deleted the faiss-larger-vectors branch July 5, 2025 19:25