Closed
@@ -1136,21 +1136,21 @@ private void buildGraphFromScratchWithRetry(final GraphBuildCallback graphCallba
// If pages have corrupted entries (e.g., old-format tombstones), the parser may miss many vectors.
// In that case, fall back to scanning documents directly to rebuild the vector list.
boolean documentScanPerformed = false;
- final String typeName = getTypeName();
- if (typeName != null && !ridToLatestVector.isEmpty()) {
+ if (metadata.associatedBucketId != -1 && !ridToLatestVector.isEmpty()) {

Contributor

medium

The condition !ridToLatestVector.isEmpty() prevents the fallback mechanism from triggering if the page parser fails to recover any vectors at all (e.g., due to severe corruption). Since the docCount check already handles the case where the bucket is empty, this extra check is unnecessary and prevents recovery in cases of total page corruption.

Suggested change
- if (metadata.associatedBucketId != -1 && !ridToLatestVector.isEmpty()) {
+ if (metadata.associatedBucketId != -1) {
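
The claim that the docCount check already covers the empty-bucket case can be checked with the integer arithmetic alone. The following is a minimal sketch, assuming the threshold expression from the diff (`size < docCount * 8 / 10`); the class and method names are hypothetical and not part of ArcadeDB.

```java
// Hypothetical helper that isolates the threshold expression used in the diff.
final class FallbackHeuristic {
  private FallbackHeuristic() {
  }

  /** True when the parsed vectors cover less than 80% of the documents (integer division). */
  static boolean shouldFallBack(final int parsedVectors, final long docCount) {
    return parsedVectors < docCount * 8 / 10;
  }

  public static void main(final String[] args) {
    System.out.println(shouldFallBack(0, 0));      // false: empty bucket, nothing to recover
    System.out.println(shouldFallBack(0, 1000));   // true: total page corruption still triggers the scan
    System.out.println(shouldFallBack(900, 1000)); // false: parser recovered enough vectors
    System.out.println(shouldFallBack(0, 1));      // false: 1 * 8 / 10 truncates to 0
  }
}
```

With an empty map and an empty bucket the comparison is `0 < 0`, so dropping the `isEmpty()` guard does not cause a spurious scan. The last case shows a side effect of integer division: a bucket holding a single document never triggers the fallback.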

try {
- final long docCount = database.countType(typeName, false);
+ final com.arcadedb.engine.Bucket bucket = database.getSchema().getBucketById(metadata.associatedBucketId);
+ final long docCount = database.countBucket(bucket.getName());
if (ridToLatestVector.size() < docCount * 8 / 10) {
Comment on lines +1141 to 1143

Contributor

medium

Using database.countBucket(bucket.getName()) is less efficient than calling bucket.count() directly on the bucket object. Also, there is a risk of NullPointerException if bucket is null. While the surrounding try-catch block handles exceptions, a null check is preferred for clarity and to avoid unnecessary exception overhead.

Suggested change
- final com.arcadedb.engine.Bucket bucket = database.getSchema().getBucketById(metadata.associatedBucketId);
- final long docCount = database.countBucket(bucket.getName());
- if (ridToLatestVector.size() < docCount * 8 / 10) {
+ final com.arcadedb.engine.Bucket bucket = database.getSchema().getBucketById(metadata.associatedBucketId);
+ final long docCount = bucket != null ? bucket.count() : 0;
+ if (bucket != null && ridToLatestVector.size() < docCount * 8 / 10) {

Comment on lines +1139 to 1143

Copilot AI Apr 3, 2026

The change scopes the fallback cross-check/scan to metadata.associatedBucketId, which fixes the multi-bucket type behavior described in #3722, but there’s no regression test covering the scenario where a type has multiple buckets and the vector index is bound to only one bucket. Please add a test that creates a type with 2+ buckets, builds a bucket-scoped LSM vector index, triggers the page-parser-missed-vectors fallback, and asserts only records from the associated bucket are scanned/used (records from the other bucket must not affect the docCount heuristic or be added to ridToLatestVector).
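
The shape of the requested regression test can be sketched without the database. The following is a plain-Java model, not an ArcadeDB test: record ids are hypothetical strings of the form "bucketId:position", vectors are placeholders, and every class and method name is an assumption made for illustration. It shows why a type-wide count misfires for a multi-bucket type and which assertions a real test would likely need.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the scenario; none of these names are ArcadeDB APIs.
final class BucketScopedFallbackSketch {
  private BucketScopedFallbackSketch() {
  }

  /** Builds hypothetical record ids of the form "bucketId:position". */
  static List<String> ridsInBucket(final int bucketId, final int count) {
    final List<String> rids = new ArrayList<>(count);
    for (int position = 0; position < count; position++)
      rids.add(bucketId + ":" + position);
    return rids;
  }

  /** Same 80% threshold expression as the diff. */
  static boolean shouldFallBack(final int parsedVectors, final long docCount) {
    return parsedVectors < docCount * 8 / 10;
  }

  /** Simulates a page parser that recovered only the first `recovered` ids. */
  static Map<String, float[]> parse(final List<String> rids, final int recovered) {
    final Map<String, float[]> parsed = new HashMap<>();
    for (final String rid : rids.subList(0, recovered))
      parsed.put(rid, new float[] { 0f });
    return parsed;
  }

  /** Adds a placeholder vector for every scanned id the parser missed; returns how many were added. */
  static int recoverMissing(final Map<String, float[]> parsed, final List<String> scanned) {
    int recovered = 0;
    for (final String rid : scanned) {
      if (!parsed.containsKey(rid)) {
        parsed.put(rid, new float[] { 0f });
        recovered++;
      }
    }
    return recovered;
  }

  public static void main(final String[] args) {
    final List<String> indexed = ridsInBucket(1, 100); // bucket the index is bound to
    final List<String> other = ridsInBucket(2, 900);   // sibling bucket of the same type
    final long typeCount = indexed.size() + other.size();

    final Map<String, float[]> healthy = parse(indexed, 95);
    System.out.println(shouldFallBack(healthy.size(), typeCount));      // true: type-wide count misfires
    System.out.println(shouldFallBack(healthy.size(), indexed.size())); // false: bucket-scoped count is correct

    final Map<String, float[]> damaged = parse(indexed, 50);
    if (shouldFallBack(damaged.size(), indexed.size()))
      System.out.println(recoverMissing(damaged, indexed)); // 50: only ids from bucket 1 are added
  }
}
```

A real test would replace the string ids and the simulated parser with a type that has two or more buckets, an index bound to one of them, and a way to force the page parser to miss vectors; the two assertions (the heuristic uses the bucket count, and no record from the sibling bucket ends up in the recovered map) carry over unchanged.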

LogManager.instance().log(this, Level.WARNING,
"Page-parsed vectors (%d) significantly less than document count (%d) for index %s. "
+ "Falling back to document scan to recover missing vectors.",
ridToLatestVector.size(), docCount, indexName);

- // Scan all documents to find vectors missing from the page-parsed set
+ // Scan all documents in the bucket to find vectors missing from the page-parsed set
final String vectorProp =
metadata.propertyNames != null && !metadata.propertyNames.isEmpty() ? metadata.propertyNames.getFirst() :
"vector";
- database.scanType(typeName, false, record -> {
+ database.scanBucket(bucket.getName(), record -> {
final Document doc = (Document) record;
final RID rid = doc.getIdentity();
if (!ridToLatestVector.containsKey(rid)) {