Add: toBinary for JavaScript
ashvardanian committed Apr 8, 2024
1 parent 508e7a0 commit 1f1fd3a
Showing 2 changed files with 54 additions and 9 deletions.
35 changes: 28 additions & 7 deletions README.md
@@ -1,12 +1,14 @@
# SimSIMD 📏

_Computing dot-products, similarity measures, and distances between low- and high-dimensional vectors is ubiquitous in Machine Learning, Scientific Computing, Geo-Spatial Analysis, and Information Retrieval.
![SimSIMD banner](https://github.com/ashvardanian/ashvardanian/blob/master/repositories/SimSIMD.png?raw=true)

Computing dot-products, similarity measures, and distances between low- and high-dimensional vectors is ubiquitous in Machine Learning, Scientific Computing, Geo-Spatial Analysis, and Information Retrieval.
These algorithms generally have linear complexity in time, constant complexity in space, and are data-parallel.
In other words, they are easily parallelizable and vectorizable, and are often available in packages like BLAS and LAPACK, as well as in the higher-level `numpy` and `scipy` Python libraries.
Ironically, even with decades of evolution in compilers and numerical computing, [most libraries can be 3-200x slower than hardware potential][benchmarks] even on the most popular hardware, like 64-bit x86 and Arm CPUs.
SimSIMD attempts to fill that gap.
1️⃣ SimSIMD functions are practically as fast as `memcpy`.
2️⃣ SimSIMD [compiles to more platforms than NumPy (105 vs 35)][compatibility] and has more backends than most BLAS implementations._
2️⃣ SimSIMD [compiles to more platforms than NumPy (105 vs 35)][compatibility] and has more backends than most BLAS implementations.

[benchmarks]: https://ashvardanian.com/posts/simsimd-faster-scipy
[compatibility]: https://pypi.org/project/simsimd/#files
Expand Down Expand Up @@ -400,8 +402,7 @@ To install, choose one of the following options depending on your environment:
- `pnpm add simsimd`
- `bun install simsimd`

The package is distributed with prebuilt binaries for Node.js v10 and above for Linux (x86_64, arm64), macOS (x86_64, arm64), and Windows (i386, x86_64).
If your platform is not supported, you can build the package from the source via `npm run build`.
The package is distributed with prebuilt binaries, but if your platform is not supported, you can build the package from the source via `npm run build`.
This will automatically happen unless you install the package with the `--ignore-scripts` flag or use Bun.
After you install it, you will be able to call the SimSIMD functions on various `TypedArray` variants:

@@ -415,14 +416,34 @@ const distance = sqeuclidean(vectorA, vectorB);
console.log('Squared Euclidean Distance:', distance);
```
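
For intuition about what such a call computes, here is a minimal pure-JavaScript sketch of a squared Euclidean distance. The name `sqeuclideanReference` is hypothetical and is not part of the package; the native kernels are vectorized and will not look like this. It does illustrate the linear-time, constant-space shape of the computation: one pass over the data and a single accumulator.

```js
// Hypothetical reference implementation, for illustration only.
// It is not SimSIMD's native kernel.
function sqeuclideanReference(a, b) {
  if (a.length !== b.length) {
    throw new RangeError('Vectors must have the same length');
  }
  let sum = 0; // single accumulator: constant extra space
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return sum;
}

const pointA = new Float32Array([1.0, 2.0, 3.0]);
const pointB = new Float32Array([4.0, 5.0, 6.0]);
console.log(sqeuclideanReference(pointA, pointB)); // 27
```

A scalar loop like this is handy as a correctness check against the native result on small inputs.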

Other numeric types and precision levels are supported as well:
Other numeric types and precision levels are supported as well.
For double-precision floating-point numbers, use `Float64Array`:

```js
const vectorA = new Float64Array([1.0, 2.0, 3.0]);
const vectorB = new Float64Array([4.0, 5.0, 6.0]);

const distance = cosine(vectorA, vectorB);
console.log('Cosine Distance:', distance);
```
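
As a rough sketch of the quantity involved, the snippet below computes a cosine distance in plain JavaScript, assuming the common convention of one minus the cosine similarity. The helper `cosineDistanceReference` and its handling of zero-length vectors are assumptions made for illustration, not the package's API.

```js
// Hypothetical reference implementation, assuming distance = 1 - similarity.
function cosineDistanceReference(a, b) {
  if (a.length !== b.length) {
    throw new RangeError('Vectors must have the same length');
  }
  let ab = 0, a2 = 0, b2 = 0;
  for (let i = 0; i < a.length; i++) {
    ab += a[i] * b[i]; // dot product
    a2 += a[i] * a[i]; // squared norm of a
    b2 += b[i] * b[i]; // squared norm of b
  }
  // Assumption: treat a zero-norm vector as maximally distant.
  if (a2 === 0 || b2 === 0) return 1;
  return 1 - ab / Math.sqrt(a2 * b2);
}

const first = new Float64Array([1.0, 2.0, 3.0]);
const second = new Float64Array([4.0, 5.0, 6.0]);
console.log(cosineDistanceReference(first, second)); // approximately 0.0254
```

Under this convention, nearly parallel vectors give a value close to `0`, and orthogonal vectors give `1`.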

When doing machine learning and vector search with high-dimensional vectors, you may want to quantize them to 8-bit integers.
For example, you can scale values from the $[-1, 1]$ range to the $[-100, 100]$ range and then cast them to `Int8Array`, whose signed elements can hold the negative half of that range:

```js
const quantizedVectorA = new Int8Array(vectorA.map(v => (v * 100)));
const quantizedVectorB = new Int8Array(vectorB.map(v => (v * 100)));
const distance = cosine(quantizedVectorA, quantizedVectorB);
```
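
A slightly more defensive version of that projection might round and clamp each value before storing it. The sketch below uses a hypothetical helper, `quantizeToInt8`, and a default scale of `100` chosen to match the range above; neither is part of the package. Signed storage matters here because an unsigned 8-bit array would wrap negative values around.

```js
// Hypothetical helper, for illustration only.
// Scales each value, rounds to the nearest integer, and clamps to the int8 range.
function quantizeToInt8(vector, scale = 100) {
  const quantized = new Int8Array(vector.length);
  for (let i = 0; i < vector.length; i++) {
    const scaled = Math.round(vector[i] * scale);
    quantized[i] = Math.max(-128, Math.min(127, scaled));
  }
  return quantized;
}

const normalized = new Float32Array([0.5, -0.25, 1.0, -1.0]);
console.log(quantizeToInt8(normalized)); // Int8Array [50, -25, 100, -100]

// Unsigned storage wraps negative values modulo 256:
console.log(new Uint8Array([-25])[0]); // 231
```

Clamping only matters for inputs outside the expected $[-1, 1]$ range, where an unclamped cast would otherwise wrap silently.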

A more extreme quantization case would be to use binary vectors.
You can map all positive values to `1` and all negative values and zero to `0`, packing eight values into a single byte.
After that, Hamming and Jaccard distances can be computed.

```js
const { toBinary, hamming } = require('simsimd');

const binaryVectorA = toBinary(vectorA);
const binaryVectorB = toBinary(vectorB);
const distance = hamming(binaryVectorA, binaryVectorB);
```
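
To make the packing concrete, here is a self-contained sketch that mirrors the most-significant-bit-first packing used by the `toBinary` function added in this commit and then compares two packed vectors bit by bit. The helpers `packBits`, `popcount8`, `hammingReference`, and `jaccardReference` are hypothetical names used only for illustration. Whether the native `hamming` and `jaccard` return exactly these quantities is an assumption here, as is the choice to return `0` when both vectors are all zeros.

```js
// Hypothetical helpers, for illustration only.

// Packs positive values as 1-bits, most significant bit first.
function packBits(vector) {
  const packed = new Uint8Array(Math.ceil(vector.length / 8));
  for (let i = 0; i < vector.length; i++) {
    if (vector[i] > 0) packed[i >> 3] |= 1 << (7 - (i % 8));
  }
  return packed;
}

// Counts set bits in one byte by clearing the lowest set bit each step.
function popcount8(byte) {
  let count = 0;
  while (byte) {
    byte &= byte - 1;
    count++;
  }
  return count;
}

// Number of bit positions where the two packed vectors differ.
function hammingReference(a, b) {
  let differing = 0;
  for (let i = 0; i < a.length; i++) differing += popcount8(a[i] ^ b[i]);
  return differing;
}

// One minus the ratio of shared 1-bits to all 1-bits.
function jaccardReference(a, b) {
  let intersection = 0, union = 0;
  for (let i = 0; i < a.length; i++) {
    intersection += popcount8(a[i] & b[i]);
    union += popcount8(a[i] | b[i]);
  }
  return union === 0 ? 0 : 1 - intersection / union; // assumption for empty sets
}

const bitsA = packBits(new Float32Array([1, -1, 0.5, 0, 2, -3, 4, 5, 6]));
const bitsB = packBits(new Float32Array([1, 1, -0.5, 0, 2, 3, -4, 5, 6]));
console.log(bitsA);                          // Uint8Array [171, 128]
console.log(hammingReference(bitsA, bitsB)); // 4
console.log(jaccardReference(bitsA, bitsB)); // 0.5
```

Nine input values occupy two bytes, and the seven unused trailing bits of the second byte remain zero, so they do not contribute to either distance.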

## Using SimSIMD in C
28 changes: 26 additions & 2 deletions javascript/simsimd.ts
@@ -102,6 +102,27 @@ export const jensenshannon = (a: Float64Array | Float32Array, b: Float64Array |
  return compiled.jensenshannon(a, b);
};

/**
 * Quantizes a numeric vector into a binary vector (1 for positive values, 0 for non-positive values) and packs the result into a Uint8Array, where each element represents 8 binary values from the original vector.
 * Bits are packed most-significant-bit first, and any unused trailing bits of the last byte stay zero.
 * This function is useful for preparing data for bitwise distance or similarity computations, such as Hamming or Jaccard distances.
 *
 * @param {Float32Array | Float64Array | Int8Array} vector The floating-point or 8-bit integer vector to be quantized and packed.
 * @returns {Uint8Array} A Uint8Array where each byte represents 8 binary quantized values from the input vector.
 */
export const toBinary = (vector: Float32Array | Float64Array | Int8Array): Uint8Array => {
  const byteLength = Math.ceil(vector.length / 8);
  const packedVector = new Uint8Array(byteLength);

  for (let i = 0; i < vector.length; i++) {
    if (vector[i] > 0) {
      const byteIndex = Math.floor(i / 8);
      const bitPosition = 7 - (i % 8);
      packedVector[byteIndex] |= (1 << bitPosition);
    }
  }

  return packedVector;
};
export default {
  dot,
  inner,
@@ -111,10 +132,13 @@ export default {
  jaccard,
  kullbackleibler,
  jensenshannon,
  toBinary,
};

// utility functions to help find native builds

/**
 * @brief Finds the directory where the native build of the simsimd module is located.
 * @param {string} dir - The directory to start the search from.
 */
function getBuildDir(dir: string) {
  if (existsSync(path.join(dir, "build"))) return dir;
  if (existsSync(path.join(dir, "prebuilds"))) return dir;