Skip to content

Commit cf6a166

Browse files
committed
implement blingfire tokenizer
1 parent 8bf6234 commit cf6a166

File tree

8 files changed

+470
-1
lines changed

8 files changed

+470
-1
lines changed
Lines changed: 141 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,141 @@
1+
# BlingFire Tokenizer
2+
3+
This directory contains a TypeScript wrapper for the [BlingFire](https://github.com/microsoft/BlingFire) tokenization library using WebAssembly.
4+
5+
## Overview
6+
7+
BlingFire is a lightning-fast tokenizer developed by Microsoft that provides high-quality sentence and word segmentation. This implementation uses the WebAssembly build of BlingFire to enable fast tokenization in TypeScript/JavaScript environments.
8+
9+
## Files
10+
11+
### Auto-generated Files
12+
13+
The following files are **auto-generated** and should not be manually edited:
14+
15+
- **`blingfire.ts`** - JavaScript code generated by Emscripten, with the following header added:
16+
```typescript
17+
// auto generated file
18+
/* eslint-disable */
19+
// @ts-ignore
20+
```
21+
This file is copied directly from the BlingFire build output.
22+
23+
- **`blingfire.wasm`** - The compiled WebAssembly binary, copied 1:1 from the BlingFire build output to the resources folder.
24+
25+
### Source Files
26+
27+
- **`blingfire_wrapper.js`** - Wrapper functions for the BlingFire WASM module. Based on the wrapper provided by BlingFire's WASM implementation, with slight adaptations for this project.
28+
29+
- **`index.ts`** - Main entry point that implements the `SentenceTokenizer` interface following the LiveKit agents pattern.
30+
31+
## Building BlingFire WASM
32+
33+
To regenerate the `blingfire.ts` and `blingfire.wasm` files:
34+
35+
### 1. Clone the BlingFire Repository
36+
37+
```bash
38+
git clone https://github.com/microsoft/BlingFire.git
39+
cd BlingFire
40+
```
41+
42+
### 2. Follow Initial Setup
43+
44+
Follow the instructions at: https://github.com/microsoft/BlingFire/blob/master/wasm/readme.md
45+
46+
### 3. Modify the Makefile
47+
48+
Change the Makefile to run the following `em++` command:
49+
50+
```bash
51+
em++ ../blingfiretools/blingfiretokdll/blingfiretokdll.cpp \
52+
../blingfiretools/blingfiretokdll/*.cxx \
53+
../blingfireclient.library/src/*.cpp \
54+
-s WASM=1 \
55+
-s EXPORTED_FUNCTIONS="[_GetBlingFireTokVersion, _TextToSentences, _TextToWords, _TextToIds, _SetModel, _FreeModel, _WordHyphenationWithModel, _malloc, _free]" \
56+
-s "EXPORTED_RUNTIME_METHODS=['lengthBytesUTF8', 'stackAlloc', 'stringToUTF8', 'UTF8ToString', 'cwrap']" \
57+
-s ALLOW_MEMORY_GROWTH=1 \
58+
-s DISABLE_EXCEPTION_CATCHING=0 \
59+
-s MODULARIZE=1 \
60+
-s EXPORT_ES6 \
61+
-I ../blingfireclient.library/inc/ \
62+
-I ../blingfirecompile.library/inc/ \
63+
-DHAVE_ICONV_LIB \
64+
-DHAVE_NO_SPECSTRINGS \
65+
-D_VERBOSE \
66+
-DBLING_FIRE_NOAP \
67+
-DBLING_FIRE_NOWINDOWS \
68+
-DNDEBUG \
69+
-O3 \
70+
--std=c++11 \
71+
-o blingfire.js
72+
```
73+
74+
**Key Changes:**
75+
- Added `-s MODULARIZE=1` - Makes the output a module that can be imported
76+
- Added `-s EXPORT_ES6` - Exports as ES6 module
77+
- Fixed `_malloc` and `_free` exports in `EXPORTED_FUNCTIONS`
78+
79+
### 4. Copy Files to LiveKit
80+
81+
After building, copy the generated files:
82+
83+
```bash
84+
# From the BlingFire wasm build directory
85+
cp blingfire.js /path/to/livekit/agents-js/agents/src/tokenize/blingfire/blingfire.ts
86+
cp blingfire.wasm /path/to/livekit/agents-js/agents/src/tokenize/blingfire/blingfire.wasm
87+
```
88+
89+
Then add the header comments to `blingfire.ts`:
90+
91+
```typescript
92+
// auto generated file
93+
/* eslint-disable */
94+
// @ts-ignore
95+
```
96+
97+
## Usage
98+
99+
```typescript
100+
import { tokenizer } from '@livekit/agents';
101+
102+
// Create a tokenizer instance
103+
const tokenizer = new tokenizer.blingfire.SentenceTokenizer({
104+
minSentenceLength: 20,
105+
streamContextLength: 10,
106+
});
107+
108+
// Batch tokenization
109+
const sentences = tokenizer.tokenize('This is a sentence. And another one.');
110+
console.log(sentences);
111+
// Output: ['This is a sentence. And another one.']
112+
113+
// Stream tokenization
114+
const stream = tokenizer.stream();
115+
stream.pushText('This is the first sentence. ');
116+
stream.pushText('This is the second sentence.');
117+
stream.endInput();
118+
119+
for await (const { token, segmentId } of stream) {
120+
console.log(token);
121+
}
122+
```
123+
124+
## Configuration Options
125+
126+
- **`minSentenceLength`** (default: 20) - Minimum length for buffered sentences
127+
- **`streamContextLength`** (default: 10) - Minimum context length for stream processing
128+
129+
## Features
130+
131+
- Lightning-fast sentence tokenization using BlingFire
132+
- Support for batch and streaming tokenization
133+
- Handles abbreviations (Dr., Mr., etc.) correctly
134+
- Supports numbers with decimals
135+
- Multi-language support (Latin, CJK characters, etc.)
136+
- Compatible with the LiveKit agents tokenizer interface
137+
138+
## License
139+
140+
BlingFire is licensed under the MIT License by Microsoft Corporation.
141+
See: https://github.com/microsoft/BlingFire/blob/master/LICENSE
Lines changed: 177 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,177 @@
1+
// SPDX-FileCopyrightText: 2024 LiveKit, Inc.
2+
//
3+
// SPDX-License-Identifier: Apache-2.0
4+
import { describe, expect, it } from 'vitest';
5+
import { SentenceTokenizer } from './index.js';
6+
7+
const TEXT =
8+
'Hi! ' +
9+
'LiveKit is a platform for live audio and video applications and services. \n\n' +
10+
'R.T.C stands for Real-Time Communication... again R.T.C. ' +
11+
'Mr. Theo is testing the sentence tokenizer. ' +
12+
'\nThis is a test. Another test. ' +
13+
'A short sentence.\n' +
14+
'A longer sentence that is longer than the previous sentence. ' +
15+
'f(x) = x * 2.54 + 42. ' +
16+
'Hey!\n Hi! Hello! ' +
17+
'\n\n' +
18+
'This is a sentence. 这是一个中文句子。これは日本語の文章です。' +
19+
'你好!LiveKit是一个直播音频和视频应用程序和服务的平台。' +
20+
'\nThis is a sentence contains consecutive spaces.';
21+
22+
// BlingFire may split sentences differently than the basic tokenizer
23+
// These are the expected results when using BlingFire with minSentenceLength=20
24+
const EXPECTED_MIN_20 = [
25+
'Hi! LiveKit is a platform for live audio and video applications and services.',
26+
'R.T.C stands for Real-Time Communication... again R.T.C. Mr. Theo is testing the sentence tokenizer.',
27+
'This is a test. Another test.',
28+
'A short sentence. A longer sentence that is longer than the previous sentence. f(x) = x * 2.54 + 42.',
29+
'Hey! Hi! Hello! This is a sentence.',
30+
'这是一个中文句子。これは日本語の文章です。',
31+
'你好!LiveKit是一个直播音频和视频应用程序和服务的平台。',
32+
'This is a sentence contains consecutive spaces.',
33+
];
34+
35+
const SIMPLE_TEXT = 'This is a sentence. This is another sentence. And a third one.';
36+
37+
describe('blingfire tokenizer', () => {
38+
describe('SentenceTokenizer', () => {
39+
const tokenizer = new SentenceTokenizer();
40+
41+
it('should tokenize simple sentences correctly', () => {
42+
const result = tokenizer.tokenize(SIMPLE_TEXT);
43+
expect(result).toBeDefined();
44+
expect(result.length).toBeGreaterThan(0);
45+
// BlingFire should split the text into sentences
46+
expect(result.some((s) => s.includes('This is a sentence'))).toBeTruthy();
47+
});
48+
49+
it('should tokenize complex text correctly', () => {
50+
const result = tokenizer.tokenize(TEXT);
51+
expect(result).toBeDefined();
52+
expect(result.length).toBeGreaterThan(0);
53+
// Verify we get similar structure to expected
54+
expect(result.length).toBe(EXPECTED_MIN_20.length);
55+
});
56+
57+
it('should handle empty string', () => {
58+
const result = tokenizer.tokenize('');
59+
expect(result).toEqual([]);
60+
});
61+
62+
it('should handle single sentence', () => {
63+
const result = tokenizer.tokenize('This is a single sentence.');
64+
expect(result).toBeDefined();
65+
expect(result.length).toBeGreaterThan(0);
66+
});
67+
68+
it('should respect minSentenceLength option', () => {
69+
const tokenizerMin50 = new SentenceTokenizer({ minSentenceLength: 50 });
70+
const result = tokenizerMin50.tokenize(TEXT);
71+
expect(result).toBeDefined();
72+
// All tokens except possibly the last should be >= 50 chars
73+
result.slice(0, -1).forEach((token) => {
74+
expect(token.length).toBeGreaterThanOrEqual(50);
75+
});
76+
});
77+
78+
it('should stream tokenize sentences correctly', async () => {
79+
const pattern = [1, 2, 4];
80+
let text = TEXT;
81+
const chunks = [];
82+
const patternIter = Array(Math.ceil(text.length / pattern.reduce((sum, num) => sum + num, 0)))
83+
.fill(pattern)
84+
.flat()
85+
[Symbol.iterator]();
86+
87+
// @ts-ignore
88+
for (const size of patternIter) {
89+
if (!text) break;
90+
chunks.push(text.slice(undefined, size));
91+
text = text.slice(size);
92+
}
93+
94+
const stream = tokenizer.stream();
95+
for (const chunk of chunks) {
96+
stream.pushText(chunk);
97+
}
98+
stream.endInput();
99+
100+
const tokens = [];
101+
for await (const value of stream) {
102+
tokens.push(value.token);
103+
}
104+
105+
expect(tokens).toBeDefined();
106+
expect(tokens.length).toBeGreaterThan(0);
107+
// Should produce the same number of tokens as batch mode
108+
expect(tokens.length).toBe(EXPECTED_MIN_20.length);
109+
});
110+
111+
it('should handle flush correctly', async () => {
112+
const stream = tokenizer.stream();
113+
stream.pushText('This is the first part. ');
114+
stream.flush();
115+
stream.pushText('This is the second part.');
116+
stream.endInput();
117+
118+
const tokens = [];
119+
for await (const value of stream) {
120+
tokens.push(value.token);
121+
}
122+
123+
expect(tokens.length).toBeGreaterThan(0);
124+
});
125+
126+
it('should handle multiple pushText calls', async () => {
127+
const stream = tokenizer.stream();
128+
stream.pushText('First sentence. ');
129+
stream.pushText('Second sentence. ');
130+
stream.pushText('Third sentence.');
131+
stream.endInput();
132+
133+
const tokens = [];
134+
for await (const value of stream) {
135+
tokens.push(value.token);
136+
}
137+
138+
expect(tokens.length).toBeGreaterThan(0);
139+
});
140+
141+
it('should handle abbreviations correctly', () => {
142+
const text = 'Dr. Smith went to Washington D.C. yesterday. It was nice.';
143+
const result = tokenizer.tokenize(text);
144+
expect(result).toBeDefined();
145+
expect(result.length).toBeGreaterThan(0);
146+
});
147+
148+
it('should handle numbers with decimals', () => {
149+
const text = 'The value is 3.14159. Another value is 2.71828.';
150+
const result = tokenizer.tokenize(text);
151+
expect(result).toBeDefined();
152+
expect(result.some((s) => s.includes('3.14159'))).toBeTruthy();
153+
});
154+
155+
it('should provide segment IDs in stream mode', async () => {
156+
const stream = tokenizer.stream();
157+
stream.pushText('First sentence. ');
158+
stream.flush();
159+
stream.pushText('Second sentence after flush.');
160+
stream.endInput();
161+
162+
const tokens = [];
163+
for await (const value of stream) {
164+
tokens.push(value);
165+
expect(value.segmentId).toBeDefined();
166+
expect(typeof value.segmentId).toBe('string');
167+
}
168+
169+
// Tokens from different segments should have different segment IDs
170+
if (tokens.length > 1) {
171+
const segmentIds = new Set(tokens.map((t) => t.segmentId));
172+
// After flush, we should have at least 2 different segment IDs
173+
expect(segmentIds.size).toBeGreaterThanOrEqual(1);
174+
}
175+
});
176+
});
177+
});

agents/src/tokenize/blingfire/blingfire.ts

Lines changed: 5 additions & 0 deletions
Large diffs are not rendered by default.
684 KB
Binary file not shown.
Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
import createModule from './blingfire.js';
2+
3+
const Module = (await createModule()) as any;
4+
5+
// breaks to sentences, takes a JS string and returns a JS string
6+
export function TextToSentences(s: string): string | null {
7+
const len = Module['lengthBytesUTF8'](s);
8+
9+
if (!len) {
10+
return null;
11+
}
12+
13+
const inUtf8 = Module['_malloc'](len + 1); // if we don't do +1 this library won't copy the last character
14+
Module['stringToUTF8'](s, inUtf8, len + 1); // since it always also needs a space for a 0-char
15+
16+
const MaxOutLength = (len << 1) + 1; // worst case every character is a token
17+
const outUtf8 = Module['_malloc'](MaxOutLength);
18+
19+
try {
20+
const actualLen = Module['_TextToSentences'](inUtf8, len, outUtf8, MaxOutLength);
21+
if (0 > actualLen || actualLen > MaxOutLength) {
22+
return null;
23+
}
24+
} finally {
25+
if (inUtf8 != 0) {
26+
Module['_free'](inUtf8);
27+
}
28+
29+
if (outUtf8 != 0) {
30+
Module['_free'](outUtf8);
31+
}
32+
}
33+
34+
return Module['UTF8ToString'](outUtf8);
35+
}

0 commit comments

Comments
 (0)