llama-node

Large Language Model LLaMA on node.js

This project is at an early stage. The Node.js API may change in the future, so use it with caution.

Picture generated by Stable Diffusion.


Introduction

This is a Node.js client library for LLaMA (and LLaMA-based) LLMs, built on top of llama-rs and llm-chain-llama-sys, which generates bindings for llama.cpp. It uses napi-rs to pass messages over a channel between the Node.js thread and the llama thread.

From v0.0.21, both llama-rs and llama.cpp backends are supported!

Currently supported platforms:

darwin-x64
darwin-arm64
linux-x64-gnu
linux-x64-musl
win32-x64-msvc

Node.js version: >= 16
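
To check up front whether the list above covers the current machine, the platform names can be turned into a small lookup. This is only a sketch: `platformTriple` and `hasPrebuiltBinary` are hypothetical helper names, not part of llama-node, and the caller is assumed to know which libc a Linux system uses.

```typescript
// Platform triples copied from the supported-platform list above.
const SUPPORTED_TRIPLES: ReadonlySet<string> = new Set([
    "darwin-x64",
    "darwin-arm64",
    "linux-x64-gnu",
    "linux-x64-musl",
    "win32-x64-msvc",
]);

// Hypothetical helper: builds a triple such as "linux-x64-gnu".
// `libc` only matters on Linux; Windows builds are assumed to target MSVC.
function platformTriple(
    platform: string,
    arch: string,
    libc: "gnu" | "musl" = "gnu"
): string {
    if (platform === "linux") return `${platform}-${arch}-${libc}`;
    if (platform === "win32") return `${platform}-${arch}-msvc`;
    return `${platform}-${arch}`;
}

// Hypothetical helper: true when the list above names this platform.
function hasPrebuiltBinary(
    platform: string,
    arch: string,
    libc: "gnu" | "musl" = "gnu"
): boolean {
    return SUPPORTED_TRIPLES.has(platformTriple(platform, arch, libc));
}
```

In Node.js the first two arguments would typically be `process.platform` and `process.arch`.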

I do not have the hardware to test 13B or larger models, but I have tested the LLaMA 7B model with both ggml llama and ggml alpaca weights.

Install

Install main package

npm install llama-node

Install llama-rs backend

npm install @llama-node/core

Install llama.cpp backend

npm install @llama-node/llama-cpp

Getting the weights

llama-node uses llama-rs under the hood and uses a model format derived from llama.cpp. Because the model released by Meta may be used for research purposes only, this project does not provide model downloads. If you have obtained the original .pth model, please read the document "Getting the weights" and use the conversion tool provided by llama-rs.

Model versioning

The llama.cpp community currently uses three model file formats:

GGML: legacy format, the oldest ggml tensor file format
GGMF: also a legacy format, newer than GGML but older than GGJT
GGJT: mmap-able format

The llama-rs backend currently supports only GGML and GGMF models, and the llama.cpp backend supports only GGJT models.
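
Since each backend accepts only certain formats, it can help to inspect a model file before loading it. The sketch below reads the first four bytes as a little-endian magic number and maps it to a format and backend. The magic values are an assumption based on my reading of the llama.cpp loader from this period, so verify them against the version you use. `detectModelFormat` and `backendFor` are hypothetical helpers, not part of llama-node.

```typescript
type ModelFormat = "GGML" | "GGMF" | "GGJT" | "unknown";

// Assumed magic numbers (not stated in this README); check your llama.cpp version.
const MAGIC_TO_FORMAT: ReadonlyMap<number, ModelFormat> = new Map<number, ModelFormat>([
    [0x67676d6c, "GGML"], // ASCII "ggml"
    [0x67676d66, "GGMF"], // ASCII "ggmf"
    [0x67676a74, "GGJT"], // ASCII "ggjt"
]);

// Reads the first four bytes as a little-endian uint32 and maps it to a format.
function detectModelFormat(header: Uint8Array): ModelFormat {
    if (header.byteLength < 4) return "unknown";
    const view = new DataView(header.buffer, header.byteOffset, header.byteLength);
    return MAGIC_TO_FORMAT.get(view.getUint32(0, true)) ?? "unknown";
}

// Backend choice follows the compatibility statement above.
function backendFor(format: ModelFormat): "llama-rs" | "llama.cpp" | undefined {
    if (format === "GGML" || format === "GGMF") return "llama-rs";
    if (format === "GGJT") return "llama.cpp";
    return undefined;
}
```

The `header` argument only needs the first four bytes of the model file, so there is no need to read the whole file into memory.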

Usage (llama.cpp backend)

The current version supports only one inference session per LLama instance at a time.

To run multiple inference sessions concurrently, create multiple LLama instances.
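
One way to respect this limit is to keep several instances in a pool and hand each one to at most one caller at a time. The class below is a minimal sketch of that idea. `InstancePool` is a hypothetical helper, not part of llama-node, and it is generic so that it does not assume anything about the `LLama` type.

```typescript
// Hypothetical helper, not part of llama-node: hands out each instance to at
// most one caller at a time, matching the one-session-per-instance limit.
class InstancePool<T> {
    private readonly idle: T[];

    constructor(instances: readonly T[]) {
        this.idle = [...instances];
    }

    // Returns an idle instance, or undefined when all are busy.
    acquire(): T | undefined {
        return this.idle.pop();
    }

    // Returns an instance to the pool once its session has finished.
    release(instance: T): void {
        this.idle.push(instance);
    }

    // Number of instances that are currently free.
    get available(): number {
        return this.idle.length;
    }
}
```

With llama-node, the pool would hold several `LLama` instances that have each been loaded with the same config. Each instance loads the model separately, so expect memory use to grow with the pool size.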

Inference

import { LLama } from "llama-node";
import { LLamaCpp, LoadConfig } from "llama-node/dist/llm/llama-cpp.js";
import path from "path";

const model = path.resolve(process.cwd(), "./ggml-vicuna-7b-4bit-rev1.bin");

const llama = new LLama(LLamaCpp);

const config: LoadConfig = {
    path: model,
    enableLogging: true,
    nCtx: 1024,
    nParts: -1,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
};

llama.load(config);

const template = `How are you`;

const prompt = `### Human:

${template}

### Assistant:`;

llama.createCompletion(
    {
        nThreads: 4,
        nTokPredict: 2048,
        topK: 40,
        topP: 0.1,
        temp: 0.2,
        repeatPenalty: 1,
        stopSequence: "### Human",
        prompt,
    },
    (response) => {
        process.stdout.write(response.token);
    }
);
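
The callback above writes each token straight to stdout. If the full text is needed as a string instead, the tokens can be collected as they arrive. This is a sketch under one assumption: the response object carries a string `token` field, as the example above suggests. `createTokenCollector` is a hypothetical helper, not part of llama-node.

```typescript
// Shape assumed from the callback in the example above; only `token` is used.
interface TokenResponse {
    token: string;
}

// Hypothetical helper: returns a callback to pass to createCompletion and a
// getter for the text accumulated so far.
function createTokenCollector(): {
    onToken: (response: TokenResponse) => void;
    text: () => string;
} {
    const parts: string[] = [];
    return {
        onToken: (response) => {
            parts.push(response.token);
        },
        text: () => parts.join(""),
    };
}
```

The `onToken` function would be passed as the second argument to `createCompletion`. How to detect that generation has finished is not shown in this README, so `text()` only reports what has arrived so far.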

Tokenize

import { LLama } from "llama-node";
import { LLamaCpp, LoadConfig } from "llama-node/dist/llm/llama-cpp.js";
import path from "path";

const model = path.resolve(process.cwd(), "./ggml-vicuna-7b-4bit-rev1.bin");

const llama = new LLama(LLamaCpp);

const config: LoadConfig = {
    path: model,
    enableLogging: true,
    nCtx: 1024,
    nParts: -1,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
};

llama.load(config);

const content = "how are you?";

llama.tokenize({ content, nCtx: 2048 }).then(console.log);
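
A token count is useful for checking how much of the context window a prompt uses before generation starts. The sketch below assumes that prompt tokens and generated tokens share the same `nCtx` window, and that the tokenize result can be reduced to a count. Both are assumptions, and `remainingTokenBudget` and `clampPredict` are hypothetical helpers, not part of llama-node.

```typescript
// Hypothetical helper: tokens left in the context window after the prompt.
function remainingTokenBudget(promptTokenCount: number, nCtx: number): number {
    return Math.max(0, nCtx - promptTokenCount);
}

// Hypothetical helper: caps a requested prediction length at the remaining budget.
function clampPredict(
    requested: number,
    promptTokenCount: number,
    nCtx: number
): number {
    return Math.min(requested, remainingTokenBudget(promptTokenCount, nCtx));
}
```

For example, with `nCtx: 1024` and a 24-token prompt, a requested `nTokPredict` of 2048 would be capped at 1000 under these assumptions.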

Embedding

import { LLama } from "llama-node";
import { LLamaCpp, LoadConfig } from "llama-node/dist/llm/llama-cpp.js";
import path from "path";

const model = path.resolve(process.cwd(), "./ggml-vicuna-7b-4bit-rev1.bin");

const llama = new LLama(LLamaCpp);

const config: LoadConfig = {
    path: model,
    enableLogging: true,
    nCtx: 1024,
    nParts: -1,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: true, // enable embedding mode so getEmbedding can return vectors
};

llama.load(config);

const prompt = `Who is the president of the United States?`;

const params = {
    nThreads: 4,
    nTokPredict: 2048,
    topK: 40,
    topP: 0.1,
    temp: 0.2,
    repeatPenalty: 1,
    prompt,
};

llama.getEmbedding(params).then(console.log);

Usage (llama-rs backend)

The current version supports only one inference session per LLama instance at a time.

To run multiple inference sessions concurrently, create multiple LLama instances.

Inference

import { LLama } from "llama-node";
import { LLamaRS } from "llama-node/dist/llm/llama-rs.js";
import path from "path";

const model = path.resolve(process.cwd(), "./ggml-alpaca-7b-q4.bin");

const llama = new LLama(LLamaRS);

llama.load({ path: model });

const template = `how are you`;

const prompt = `Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:

${template}

### Response:`;

llama.createCompletion(
    {
        prompt,
        numPredict: 128,
        temp: 0.2,
        topP: 1,
        topK: 40,
        repeatPenalty: 1,
        repeatLastN: 64,
        seed: 0,
        feedPrompt: true,
    },
    (response) => {
        process.stdout.write(response.token);
    }
);
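
The prompt in the example above follows the Alpaca instruction template. When several instructions are sent, the template can be factored into a small function. `buildAlpacaPrompt` is a hypothetical helper, not part of llama-node. It reproduces the string from the example and trims surrounding whitespace from the instruction.

```typescript
// Builds the Alpaca-style prompt used in the example above.
function buildAlpacaPrompt(instruction: string): string {
    return [
        "Below is an instruction that describes a task. Write a response that appropriately completes the request.",
        "",
        "### Instruction:",
        "",
        instruction.trim(),
        "",
        "### Response:",
    ].join("\n");
}
```

The result can be passed as the `prompt` field of the `createCompletion` parameters.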

Tokenize

Get the tokenization result from LLaMA.

import { LLama } from "llama-node";
import { LLamaRS } from "llama-node/dist/llm/llama-rs.js";
import path from "path";

const model = path.resolve(process.cwd(), "./ggml-alpaca-7b-q4.bin");

const llama = new LLama(LLamaRS);

llama.load({ path: model });

const content = "how are you?";

llama.tokenize(content).then(console.log);

Embedding

This is a preview version; the embedding end token may change in the future. Do not use it in production!

import { LLama } from "llama-node";
import { LLamaRS } from "llama-node/dist/llm/llama-rs.js";
import path from "path";
import fs from "fs";

const model = path.resolve(process.cwd(), "./ggml-alpaca-7b-q4.bin");

const llama = new LLama(LLamaRS);

llama.load({ path: model });

const getWordEmbeddings = async (prompt: string, file: string) => {
    const data = await llama.getEmbedding({
        prompt,
        numPredict: 128,
        temp: 0.2,
        topP: 1,
        topK: 40,
        repeatPenalty: 1,
        repeatLastN: 64,
        seed: 0,
    });

    console.log(prompt, data);

    await fs.promises.writeFile(
        path.resolve(process.cwd(), file),
        JSON.stringify(data)
    );
};

const run = async () => {
    const dog1 = `My favourite animal is the dog`;
    await getWordEmbeddings(dog1, "./example/semantic-compare/dog1.json");

    const dog2 = `I have just adopted a cute dog`;
    await getWordEmbeddings(dog2, "./example/semantic-compare/dog2.json");

    const cat1 = `My favourite animal is the cat`;
    await getWordEmbeddings(cat1, "./example/semantic-compare/cat1.json");
};

run();
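
The example above saves the embeddings to JSON files in a folder named semantic-compare. A common way to compare such vectors is cosine similarity. The function below is a minimal sketch of that measure, assuming each saved embedding is a plain array of numbers. `cosineSimilarity` is not part of llama-node.

```typescript
// Cosine similarity between two equal-length vectors, in the range [-1, 1].
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
    if (a.length !== b.length || a.length === 0) {
        throw new Error("Vectors must be non-empty and of equal length");
    }
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA === 0 || normB === 0) {
        throw new Error("Cosine similarity is undefined for a zero vector");
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

After parsing dog1.json, dog2.json, and cat1.json, the vectors could be compared pairwise with this function; values closer to 1 indicate more similar directions.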

Performance related

We provide prebuilt binaries for linux-x64, win32-x64, apple-x64, and apple-silicon. For other platforms, install a Rust environment before installing the npm package so that the binary can be built locally.

Because cross-compilation is complex, it is hard to prebuild a binary that fits every platform with the best performance.

If you run into poor performance, I strongly suggest compiling manually. Otherwise you will have to wait for a better precompiled native binding. I am investigating how to produce a build matrix that covers multiple platforms.

Manual compilation (from node_modules)

The following steps let you compile the binary with the best quality for your platform.

Prerequisite: install Rust.
In the node_modules/@llama-node/core folder, run
```
npm run build
```

Manual compilation (from source)

The following steps let you compile the binary with the best quality for your platform.

Prerequisite: install Rust.
In the root folder, run
```
npm install && npm run build
```
In the packages/core folder, run
```
npm run build
```
You can then use the dist folder under the root folder.

Future plan

prompt extensions
more platforms and cross-compilation (performance related)
tweak the embedding API and make the end token configurable
CLI and interactive mode
support more open-source models, as planned by llama-rs (rustformers/llm#85, rustformers/llm#75)
support for more backends (e.g. rwkv)
