Can you extend a relational database system to support storing and querying over vectors?
- Fork this project
- Understand the source code
- Write your improvements
- Run experiments
- Write a report (details are given below)
- Push to GitLab and open a merge request
- Start the server
- Load the provided SIFT1M dataset
- Stop the server to flush all the changes
- Restart the server
- Run the provided SIFT benchmark
- Check the benchmark result
Note: Your improvement will be evaluated with the provided properties (1M items, each with a 128-dimensional vector embedding), on the same machine, and under the same time limit (30 minutes for loading the testbed plus running the benchmark, excluding recall calculation).
The new workload includes `insert`. Please be sure to reload the testbed each time you run a benchmark.
- We will only use `HeuristicQueryPlanner` for our vector search operations.
- Our naive implementation sorts all the vectors and returns the top-k closest records to the client.
- You can easily beat our performance by implementing any indexing algorithm for the vector search. Note that you still have to consider correctness because we will measure recall.
- Make sure `TablePlanner` calls your index, if you choose to implement one.
- You can look into `org.vanilladb.core.sql.distfn.EuclideanFn` to implement SIMD. Note: Our benchmark will only use `EuclideanFn`. You may choose not to implement SIMD for `CosineFn`.
- Make sure you run JDK 17 with the `jdk.incubator.vector` package to enable SIMD in Java (the default JDK 17 in VS Code does not contain this package).
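
As a point of reference for the hints above, here is a minimal sketch of exact brute-force top-k search that keeps a bounded max-heap instead of sorting every vector. It is not taken from the VanillaDB code base: `TopKSketch`, `Candidate`, `squaredDistance`, and `topK` are hypothetical names. The vectors are held in an in-memory `float[][]` purely for illustration, whereas the project itself has to read records through the storage engine.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

/** Hypothetical stand-alone helper; not part of VanillaDB. */
class TopKSketch {

    /** A record id paired with its distance to the query. */
    static final class Candidate {
        final int id;
        final float dist;

        Candidate(int id, float dist) {
            this.id = id;
            this.dist = dist;
        }
    }

    /** Squared Euclidean distance; it orders vectors the same way as the true distance. */
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the ids (array positions) of the k vectors closest to the query, nearest first. */
    static int[] topK(float[][] vectors, float[] query, int k) {
        if (k <= 0) {
            return new int[0];
        }
        // Max-heap on distance: the root is the worst of the best k seen so far.
        PriorityQueue<Candidate> heap =
                new PriorityQueue<>((x, y) -> Float.compare(y.dist, x.dist));
        for (int id = 0; id < vectors.length; id++) {
            float dist = squaredDistance(vectors[id], query);
            if (heap.size() < k) {
                heap.add(new Candidate(id, dist));
            } else if (dist < heap.peek().dist) {
                heap.poll(); // evict the current worst candidate
                heap.add(new Candidate(id, dist));
            }
        }
        // The heap yields the farthest candidate first, so fill the result from the back.
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll().id;
        }
        return result;
    }

    public static void main(String[] args) {
        float[][] vectors = {{0f, 0f}, {3f, 4f}, {1f, 1f}, {10f, 10f}};
        int[] nearest = topK(vectors, new float[] {0f, 0f}, 2);
        System.out.println(Arrays.toString(nearest)); // prints [0, 2]
    }
}
```

This still scans every vector, so it only reduces the sorting cost from O(n log n) to O(n log k). An index avoids the full scan, at the price of possibly missing some true neighbors, which is why recall is measured.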
Based on the workload we provide, show the following:
- Throughput
- Recall
Compare the performance of the unmodified source code with the performance of your modified version.
You can then choose the parameter settings that best demonstrate your improvements.
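
Throughput and recall, the two metrics listed above, can be computed as in the sketch below. It assumes the common definitions: recall is the fraction of ground-truth neighbors that appear in the returned result, and throughput is the number of committed transactions per second. The provided benchmark may compute them differently, and `MetricsSketch` and its method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for report figures; not part of the provided benchmark. */
class MetricsSketch {

    /** recall = |returned ∩ groundTruth| / |groundTruth| */
    static double recall(int[] returned, int[] groundTruth) {
        if (groundTruth.length == 0) {
            return 0.0;
        }
        Set<Integer> truth = new HashSet<>();
        for (int id : groundTruth) {
            truth.add(id);
        }
        int hits = 0;
        for (int id : returned) {
            // remove() makes sure a duplicated id in the result is counted only once
            if (truth.remove(id)) {
                hits++;
            }
        }
        return (double) hits / groundTruth.length;
    }

    /** Committed transactions per second over the measured interval. */
    static double throughput(long committedTxns, double seconds) {
        return committedTxns / seconds;
    }

    /** How many times faster the modified version is than the baseline. */
    static double speedup(double baselineThroughput, double modifiedThroughput) {
        return modifiedThroughput / baselineThroughput;
    }

    public static void main(String[] args) {
        int[] returned = {1, 2, 3, 9};
        int[] groundTruth = {1, 2, 3, 4};
        System.out.println(recall(returned, groundTruth)); // prints 0.75
    }
}
```

Reporting recall next to throughput for each parameter setting makes the trade-off visible: a setting that doubles throughput but halves recall is not necessarily an improvement.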
- Briefly explain what you do
  - How you implement your indexes
  - How you implement SIMD
  - Other improvements you made to speed up the search
- Experiments
  - Your experiment environment (a list of your hardware components, your operating system)
    - e.g. Intel Core i5-3470 CPU @ 3.2GHz, 16 GB RAM, 128 GB SSD, CentOS 7
  - Based on the workload we provide:
    - Show your improvement using graphs
  - Your benchmark parameters
  - Analysis of the results of the experiments
Note: There is no strict limit on the length of your report. Generally, a 2-3 page report with some figures and tables is fine. Remember to include all your group members' student IDs.
The procedure is as follows:
- Fork the final project
- Clone the repository you forked
- Finish your work and write the report
- Commit your work and push it to GitLab.
  - Name your report `[Team Number]_final_project_report.pdf`
    - e.g. `team1_final_project_report.pdf`
- Open a merge request to the original repository.
  - Source branch: Your working branch
  - Target branch: The branch with your team number (e.g. `team-1`)
  - Title: `Team-X Submission` (e.g. `Team-1 Submission`)
Note: Only one submission for each team.
If we find that you copied someone else's code, you will get 0 points for this assignment.
Your database is not allowed to keep all of your data in memory.
Modifying our benchmark is not allowed.
Submit your work before 2024/06/16 (Sun) 23:59:59.