Skip to content

In-browser, multi-lingual vector embedding and search

License

Notifications You must be signed in to change notification settings

kyr0/vectorstore

Repository files navigation

vectorstore

Pure JavaScript implementation of a vector store with similarity search. Runs locally, in Node/Bun/Deno and soon even in your browser. Supports various embedding models, default to nomic-embed-text-v1. Open-source, fast and cost-free.

Features

  • ✅ Search for text similarities, locally, without API key, free of charge
  • ✅ Best in class performance; better than OpenAI (see "The Science" section)
  • ✅ Downloads the model automatically, caches it, executes offline afterwards
  • ✅ Runs Node.js and (soon) in the browser (large download though ~500 MB)
  • ✅ Uses the open-source nomic-embed-text-v1 text embedding model, 8192 token context window
  • ✅ Benchmarked: ~1 GiB memory usage at runtime
  • ✅ Fast! Inference < 0.05 sec. on average (per document)
  • ✅ Available as a simple API
  • ✅ Tree-shakable and side-effect free
  • ✅ Runs on Windows, Mac, Linux, CI tested
  • ✅ First class TypeScript support
  • ✅ Well tested (soon to be... ;-)

Roadmap

  • HNSW (Hierarchical Navigable Small World graphs) based vector search with complexity scaling O(log(N))
    • in best case, based on WebAssembly, using SMID vector instruction set
  • alternativem WebGPU backend to run expensive vector operations (making use of wasm_webgpu)

Initial results

Demo/Setup

npm install
npm run demo

The Science

If you came here to understand the math behind the scenes, please head on to: https://towardsdatascience.com/text-embeddings-comprehensive-guide-afd97fce8fb5 where Mariya Mansurova wrote an excellent article on Text Embeddings.

Now let's dive deeper into metrics and open-source models: https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05

This is why I decided to use nomic-embed-text-v1. (Nomic-Embed): The model was designed by Nomic, and claims better performances than OpenAI Ada-002 and text-embedding-3-small while being only 0.55GB in size. Interestingly, the model is the first to be fully reproducible and auditable (open data and open-source training code).

https://huggingface.co/nomic-ai/nomic-embed-text-v1

Example usage (API, as a library)

Setup

  • yarn: yarn add vectorstore
  • npm: npm install vectorstore

ESM

import { createDocument, search, type Document } from "vectorstore";
// This text corpus is a collection of documents in different languages, each describing the ocean.
// They share the meaning, but the words and even the symbols used to describe it are different.
// However, using vector embeddings, we can compare the documents and find similarities,
// which allows for cross-lingual search - a search that is made for humans, not machines.
// This quality is, for an open-source model, a major breakthrough.
// Combined with vector embedding search, everyone has access to local, powerful text search now.
// And the best news: It's fast, it's available, it's possible, now, and for free!
const myDocuments = [
  {
    text: "Exploring the depths of the ocean reveals a world beyond imagination.",
    metaData: { id: 1, language: "English" },
  },
  {
    text: "La exploración de las profundidades del océano revela un mundo más allá de la imaginación.",
    metaData: { id: 2, language: "Spanish" },
  },
  {
    text: "探索海洋的深处揭示了一个超乎想象的世界。",
    metaData: { id: 3, language: "Chinese" },
  },
  {
    text: "L'exploration des profondeurs de l'océan révèle un monde au-delà de l'imagination.",
    metaData: { id: 4, language: "French" },
  },
  {
    text: "Die Erforschung der Tiefen des Ozeans offenbart eine Welt jenseits der Vorstellungskraft.",
    metaData: { id: 5, language: "German" },
  },
];

console.log("Text corpus:", myDocuments);

const haystack: Array<Document> = [];

// vectorization (text to embeddings)
for (const doc of myDocuments) {
  haystack.push(await createDocument(doc.text, doc.metaData));
}

// vecotrization of the search string (which doesn't share much text similarity, BUT MEANING)
const needle = await createDocument(
  "Unveiling the mysteries beneath the sea",
);

console.log("Searching for:", "Unveiling the mysteries beneath the sea");

// running cosine similarity search in vector space
const searchResults = search(haystack, needle);

// displaying search results
console.log(
  searchResults.map((result) => ({
    score: result.score,
    id: result.doc.metadata.id,
    language: result.doc.metadata.language, // include language in the result for better context
  })),
);

/** Prints something like this:
 * Searching for: Unveiling the mysteries beneath the sea
  [
    { score: 0.6855015563968822, id: 1, language: 'English' },
    { score: 0.5687096727474149, id: 4, language: 'French' },
    { score: 0.5426440067625005, id: 2, language: 'Spanish' },
    { score: 0.4697886145316811, id: 3, language: 'Chinese' },
    { score: 0.34714563173592217, id: 5, language: 'German' }
  ]
  benchmarked: elapsed secs 0.292
  benchmarked: total memory usage was: 1073.42 MiB
 */

return searchResults;

You can run this exact code as a demo when checking out this repository using git clone, run npm i followed by npm run demo

CommonJS

const { createDocument, search } = require('vectorstore')

// same API like ESM variant

About

In-browser, multi-lingual vector embedding and search

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published