@devin-ai-integration

MDD-53: Add build_with_compression API for higher quality graph construction

Summary

This PR implements MDD-53, adding new Vamana index building APIs that accept uncompressed VectorDataLoader data together with runtime compression parameters. The key improvement is that the graph is built from uncompressed data for higher quality, and the data is then compressed for efficient storage.

Previous workflow (required manual compression):

# Step 1: Manual compression
# (assumes `scalar` refers to svs's scalar-quantization module)
data = svs.data.SimpleData.load(data_path)
compressed = scalar.SQDataset.compress(data)

# Step 2: Build with compressed data (lower graph quality)
index = svs.Vamana.build(parameters, compressed, distance, num_threads)

New simplified workflow:

# Single step: builds with uncompressed, stores compressed
index = svs.Vamana.build_with_compression(
    parameters, 
    data_loader,  # uncompressed VectorDataLoader
    svs.CompressionType.ScalarInt8,  # compression applied after graph building
    distance, 
    num_threads
)

Changes:

  • C++ Core: Added auto_build_with_compression() in index.h that loads uncompressed data → builds graph with VamanaBuilder → compresses data → creates final VamanaIndex with pre-built graph and compressed data
  • Orchestrator: Added Vamana::build_with_compression() static method following existing API patterns
  • Compression Types: New CompressionType enum (None, ScalarInt8) and CompressionParameters struct in include/svs/core/compression.h
  • Python Bindings: Full dispatcher-based bindings with runtime type selection for query/data types
  • Testing: Integration test (compression_build.cpp) validates recall quality across the L2, IP, and Cosine distance metrics (see the side-by-side sketch after this list)
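
To make the quality comparison concrete, the two workflows above can be run side by side on the same dataset. This is a hedged sketch reusing the calls already shown; parameters, distance, and the data path are placeholders, the old_index/new_index names are illustrative, and the `scalar` import is assumed as in the first snippet:

# Old workflow: graph edges are chosen from compressed (lossy) distances
data = svs.data.SimpleData.load("path/to/data")
compressed = scalar.SQDataset.compress(data)
old_index = svs.Vamana.build(parameters, compressed, distance, num_threads)

# New workflow: graph edges are chosen from exact float32 distances;
# the data is compressed to ScalarInt8 only after the graph is final
new_index = svs.Vamana.build_with_compression(
    parameters,
    svs.VectorDataLoader("path/to/data"),
    svs.CompressionType.ScalarInt8,
    distance,
    num_threads,
)

# Searching the same queries on both indexes should show equal or
# higher recall for new_index, since its graph was built on exact distances.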

Review & Testing Checklist for Human

⚠️ CRITICAL - This code has NOT been compiled or tested locally because build commands are missing from the repo setup. Please verify thoroughly:

  • CI passes all checks - Build, lint, and all integration tests must pass (especially the new compression_build test)
  • Verify graph building logic - Review auto_build_with_compression() in index.h lines 1071-1138: confirm it correctly builds the graph with uncompressed data before compressing
  • Test the Python API end-to-end - Run a simple Python script using build_with_compression() with ScalarInt8 compression to verify:
    • API is exposed correctly
    • Runtime type dispatch works
    • Index can search and returns correct results
  • Memory usage sanity check - The implementation briefly keeps both the uncompressed (for building) and compressed (for storage) data in memory simultaneously. For large datasets, verify this doesn't cause OOM (see the back-of-envelope estimate after this list)
  • Review dispatcher registration - Check that the Python binding dispatcher setup (lines 141-176 in vamana.cpp) matches existing patterns
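
For the memory sanity check, here is a quick back-of-envelope estimate of the transient peak while both copies are alive. The 4-byte float32 and 1-byte int8 widths are the standard element sizes; the dataset shape below is a made-up example, not something measured from this PR:

num_vectors = 100_000_000  # hypothetical dataset size
dims = 128

float32_bytes = num_vectors * dims * 4  # uncompressed copy used for building
int8_bytes = num_vectors * dims * 1     # ScalarInt8 copy kept for storage
peak_gib = (float32_bytes + int8_bytes) / 2**30
print(f"Transient peak: ~{peak_gib:.1f} GiB")  # ~59.6 GiB in this example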

Test Plan Recommendation

# Quick validation script
import svs

# Build with new API
index = svs.Vamana.build_with_compression(
    svs.VamanaBuildParameters(),
    svs.VectorDataLoader("path/to/data"),
    svs.CompressionType.ScalarInt8,
    svs.DistanceType.L2,
    num_threads=4
)

# Verify search works
queries = svs.data.SimpleData.load("queries.fvecs")
indices, distances = index.search(queries, 10)  # search returns (indices, distances)
print(f"Search completed for {indices.shape[0]} queries")

Notes

  • Backward compatibility preserved - All existing APIs remain unchanged
  • Extension system integration - Relies on the existing VamanaBuildAdaptor to handle compressed data in the final index (not explicitly tested, but should work based on existing patterns)
  • Future extensibility - CompressionParameters struct designed to support additional compression types beyond ScalarInt8
  • Session info: Implemented by @milind-cognition, Devin run

MDD-53: Add build_with_compression API for higher quality graph construction

- Add CompressionType enum and CompressionParameters struct
- Add auto_build_with_compression() that builds with uncompressed data then compresses
- Add Vamana::build_with_compression() orchestrator method
- Add Python bindings with dispatcher for runtime type selection
- Add integration tests for build with compression
- Builds graphs on uncompressed data for better quality, stores compressed for efficiency

Co-Authored-By: [email protected] <[email protected]>
@devin-ai-integration (Author)

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring
