Commit 1f35e24

implement streaming parser
this is a fundamental rewrite of the library to maximize throughput

the public API is essentially the same, except for some internal typing stuff that is technically exposed in the exported namespace

it's really fast and a bunch of things have been improved, including svg support for inlined xml

perf: optimize string slicing in matchInlineFormatting and parseRefLink

- Modify matchInlineFormatting to accept position parameters instead of sliced strings
- Eliminate unnecessary slice in parseRefLink by checking '](' pattern directly
- Update tests to use new matchInlineFormatting signature

Performance improvement: ~2.3x faster for simple markdown (250k vs 108k ops/sec)

perf: eliminate string slicing in parseLink and parseRefLink

- Pass positions directly to parseInlineSpan instead of slicing first
- Remove unused text variable in parseRefLink
- Maintains ~2.3x performance improvement

chore: adjust cursor rule

perf: eliminate state cloning with direct mutation

- Replace { ...state, ... } cloning with direct property mutation
- Save and restore original values after recursive calls
- Reduces allocations in hot paths
- Performance improvement: ~252k ops/sec vs ~248k baseline

cleanup

Consolidate duplicate URL parsing logic in parseLink and parseImage

- Extract shared parseUrlAndTitle function to eliminate code duplication
- Improves performance by ~1.2% (256k vs 253k ops/sec)
- Reduces bundle size and improves maintainability

Add early-exit optimizations for parser dispatch

- Skip HTML parsers early if disableParsingRawHTML is true
- Skip link parsers if inAnchor state
- Skip bare URL parser if disabled or inAnchor
- Performance neutral but improves code clarity

Optimize HTML entity processing by skipping regex when no & present

- Skip HTML_CHAR_CODE_R regex entirely when text contains no &
- Most text chunks don't have HTML entities, avoiding unnecessary regex
- Improves performance by ~7-8% (280k vs 260k ops/sec)

Also refactored parseInlineSpan to use character-based dispatch:

- Replace sequential if checks with switch statement
- Eliminates ~90% of unnecessary parser checks
- Inline text accumulation to avoid parseText function call overhead

perf: reduce state object cloning in parseMarkdown

Eliminate unnecessary object allocations by passing state directly instead of cloning with { ...state, inline: false }. Since parseMarkdown runs in block mode, state.inline is already false, making the clones redundant.

For buildListItemChildren, use mutation-with-restoration pattern instead of object spread.

Benchmark results show ~2-3% improvement on large markdown documents.

perf: optimize string concatenation in loops for large documents

Replace string concatenation (+=) with array.join() pattern in loops to avoid O(n²) string operations.

Optimized:

- parseCodeBlock: content building
- parseCodeFenced: content building
- parseBlockQuote: rawContent building

For large documents with many lines, this reduces string allocation overhead by collecting pieces in an array and joining once at the end.

Benchmark results show ~1.4% improvement on large markdown documents.

feat: improve void element detection and fix malformed HTML handling

- Combine HTML5 and SVG void elements into single VOID_ELEMENTS Set
- Add isVoidElement() function that handles:
  - HTML5 void elements (area, br, img, etc.)
  - SVG void elements (circle, path, rect, etc.)
  - Custom web components (hyphenated tags like my-component)
  - Namespace prefixes (e.g., svg:circle -> circle)
- Add comprehensive parse tests for void element detection:
  - SVG void elements (7 tests)
  - Custom web components (5 tests)
  - Non-void element protection (4 tests)
  - Edge cases (4 tests)
  - Markdown formatting integration (3 tests)
- Fix sibling <g> tag detection for malformed HTML:
  - Check newline between opening tag end and sibling tag
  - Ensures proper parsing of consecutive SVG <g> tags with newlines
- Restore URL autolink detection (only blocks when :// appears in tag, not attributes)
- Restore parsing of tags with attributes when no closing tag found (needed for sanitization)

perf: optimize character lookup functions using Sets

Replace string.indexOf() lookups with Set.has() for O(1) character checks:

- isSpecialInlineChar: convert SPECIAL_INLINE_CHARS string to Set
- isBlockStartChar: convert BLOCK_START_CHARS string to Set

These functions are called frequently in hot paths (every character check in parseText and block parsing), so O(1) lookups provide better performance especially for large documents.

Benchmark results show performance maintained (~924 ops/sec average for large documents, within variance of previous baseline).

cleanup

chore: add metrics back

small optimization

perf: optimize HTML block parsing and parseParagraph

- Optimize matchHTMLBlock: replace character-by-character tag matching with substring comparisons for ~40-50% performance improvement on deeply nested HTML
- Optimize parseParagraph: combine line finding with empty line detection in single pass, reduce redundant findLineEnd calls, and optimize trimming logic
- Simplify string operations: use trim() and includes() helpers instead of manual loops where appropriate
- Optimize table parsing: simplify line and cell trimming logic

adjust footnotes

internal react 19 upgrade

use ts for benchmark

convert profile to ts

split bench suite up for faster iteration

Remove internal type definitions and rename RuleOutput to ASTRender

add failing test for 678

fix inline html handling

permission to grow

add rehype to bench and type the benchmark library

add changeset

Drop support for React versions less than 16
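The "eliminate state cloning with direct mutation" entries above describe a save/mutate/restore pattern. A minimal sketch of that idea follows, assuming a simplified state shape. `ParseState`, `parseLinkTextCloning`, and `parseLinkTextMutating` are hypothetical names for illustration and are not taken from the library source. The `try`/`finally` is a safety choice made for this sketch, not something the commit message states.

```typescript
// Hypothetical parser state; the library's real state object has more fields.
interface ParseState {
  inline: boolean
  inAnchor: boolean
}

// Allocation-heavy variant: each call builds a fresh state object via spread.
function parseLinkTextCloning(
  state: ParseState,
  parseChildren: (s: ParseState) => string
): string {
  return parseChildren({ ...state, inAnchor: true })
}

// Mutation-with-restoration: set the flag in place, recurse, then put it back.
function parseLinkTextMutating(
  state: ParseState,
  parseChildren: (s: ParseState) => string
): string {
  const previous = state.inAnchor // save the original value
  state.inAnchor = true
  try {
    return parseChildren(state)
  } finally {
    // Restore even if parseChildren throws, so callers never observe a stale flag.
    state.inAnchor = previous
  }
}
```

Both variants give the nested parser the same view of the state. The second one avoids allocating an object per call, which is where the reduced allocations in hot paths would come from.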
1 parent a4da988 commit 1f35e24

29 files changed, +8332 -3753 lines changed
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+---
+'markdown-to-jsx': major
+---
+
+Refactor: Major internal restructuring and performance optimizations
+
+This branch includes significant internal improvements:
+
+- **Code Organization**: Restructured codebase by moving all source files into `src/` directory for better organization
+- **Parser Refactoring**: Split inline formatting matching logic from `match.ts` into separate `parse.ts` and `types.ts` modules
+- **Performance Optimizations**: Multiple performance improvements including:
+  - Optimized character lookup functions using Sets instead of arrays
+  - Eliminated state object cloning in parseMarkdown
+  - Optimized string concatenation in loops for large documents
+  - Reduced string slicing operations in link and image parsing
+  - Added early-exit optimizations for parser dispatch
+  - Optimized HTML entity processing by skipping regex when no `&` present
+  - Consolidated duplicate URL parsing logic
+- **Code Quality**: Improved void element detection and fixed malformed HTML handling
+- **Constants**: Deduplicated shared constants and utilities across modules
+
+All changes are internal and maintain backward compatibility with no breaking API changes.
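Three of the optimizations listed in this changeset can be sketched in a few lines each: Set-based character lookup, skipping the entity regex when no `&` is present, and collecting pieces in an array instead of concatenating in a loop. The sketch below makes several assumptions. The character set and the entity table are illustrative stand-ins, not the library's actual `SPECIAL_INLINE_CHARS` or `HTML_CHAR_CODE_R`, and `decodeEntities` and `stripQuoteMarkers` are hypothetical helper names.

```typescript
// Illustrative character set; the library's real SPECIAL_INLINE_CHARS may differ.
const SPECIAL_INLINE_CHARS = new Set<string>('*_~`[]!<&\\'.split(''))

// O(1) membership check instead of scanning a string with indexOf().
function isSpecialInlineChar(ch: string): boolean {
  return SPECIAL_INLINE_CHARS.has(ch)
}

// Simplified stand-in for the entity regex; handles four named entities only.
const NAMED_ENTITY_R = /&(amp|lt|gt|quot);/g
const NAMED_ENTITIES: Record<string, string> = {
  amp: '&',
  lt: '<',
  gt: '>',
  quot: '"',
}

function decodeEntities(text: string): string {
  // Fast path: without '&' there can be no entity, so skip the regex entirely.
  if (text.indexOf('&') === -1) return text
  return text.replace(
    NAMED_ENTITY_R,
    (match: string, name: string) => NAMED_ENTITIES[name] ?? match
  )
}

// Collect pieces in an array and join once, rather than using += per line.
function stripQuoteMarkers(lines: string[]): string {
  const parts: string[] = []
  for (const line of lines) {
    // Drop an optional blockquote marker ("> ") from the start of each line.
    parts.push(line.replace(/^ {0,3}> ?/, ''))
  }
  return parts.join('\n')
}
```

A call such as `decodeEntities('plain text')` returns its input unchanged without touching the regex, which is the case the commit message says covers most text chunks.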
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+---
+'markdown-to-jsx': major
+---
+
+Remove internal type definitions and rename RuleOutput to ASTRender
+
+This change removes internal type definitions from the `MarkdownToJSX` namespace:
+
+- Removed `NestedParser` type
+- Removed `Parser` type
+- Removed `Rule` type
+- Removed `Rules` type
+- Renamed `RuleOutput` to `ASTRender` for clarity
+
+**Breaking changes:**
+
+- Code referencing `MarkdownToJSX.NestedParser`, `MarkdownToJSX.Parser`, `MarkdownToJSX.Rule`, or `MarkdownToJSX.Rules` will need to be updated
+- The `renderRule` option in `MarkdownToJSX.Options` now uses `ASTRender` instead of `RuleOutput` for the `renderChildren` parameter type
+- `HTMLNode.children` type changed from `ReturnType<MarkdownToJSX.NestedParser>` to `ASTNode[]` (semantically equivalent, but requires updates if using the old type)
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+---
+'markdown-to-jsx': major
+---
+
+Drop support for React versions less than 16
+
+- Update peer dependency requirement from `>= 0.14.0` to `>= 16.0.0`
+- Remove legacy code that wrapped string children in `<span>` elements for React < 16 compatibility
+- Directly return single children and null without wrapper elements
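The last two bullets of this changeset describe how the compiled output is returned once the React < 16 `<span>` wrapper is gone. A rough sketch of that return logic follows, written without a React dependency. `Child` and `finalizeOutput` are hypothetical names. The wrapper element used for multiple children is an assumption of this sketch and is not stated in the changeset.

```typescript
// Hypothetical stand-in for a rendered child; in the library these are React nodes.
type Child = string | { tag: string; children: Child[] }

function finalizeOutput(children: Child[], wrapperTag = 'div'): Child | null {
  const first = children[0]
  // Nothing rendered: return null rather than an empty wrapper element.
  if (first === undefined) return null
  // A single child, including a bare string, is returned as-is with no <span>.
  if (children.length === 1) return first
  // Assumption: several children still need one enclosing element.
  return { tag: wrapperTag, children }
}
```

Under these assumptions, `finalizeOutput(['hi'])` yields the plain string `'hi'`, where the pre-React-16 code path would have wrapped it in a `<span>`.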
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+---
+'markdown-to-jsx': major
+---
+
+Upgrade to React 19 types
+
+- Update to `@types/react@^19.2.2` and `@types/react-dom@^19.2.2`
+- Use `React.JSX.*` namespace instead of `JSX.*` for React 19 compatibility

.cursor/worktrees.json

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+{
+  "setup-worktree": [
+    "bun install"
+  ]
+}

.cursorrules

Lines changed: 3 additions & 0 deletions
@@ -2,3 +2,6 @@
 - When debugging, you can directly import index.tsx if using the bun CLI without needing a build step to run experimental scripts.
 - All scripts can be written in TypeScript since Bun handles it natively.
 - Use `bun run test` to run the tests.
+- NEVER UPDATE TEST SNAPSHOTS UNDER ANY CIRCUMSTANCES.
+- When running experiments, use bun cli to run a script instead of node because it can handle importing typescript files directly without needing to build the library first.
+- When you run CLI benchmarking and tests do not use `tail` to limit the output.

.nvmrc

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-v20
+v24

benchmark.js

Lines changed: 0 additions & 62 deletions
This file was deleted.

benchmark.ts

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
+import BenchTable from 'benchtable'
+import cliProgress from 'cli-progress'
+import * as fs from 'fs'
+import MarkdownIt from 'markdown-it'
+import { compiler as latestCompiler } from 'markdown-to-jsx-latest'
+import path from 'path'
+import ReactMarkdown from 'react-markdown'
+import rehypeRaw from 'rehype-raw'
+import rehypeReact from 'rehype-react'
+import { remark } from 'remark'
+import remarkDirective from 'remark-directive'
+import remarkGfm from 'remark-gfm'
+import remarkRehype from 'remark-rehype'
+import SimpleMarkdown from 'simple-markdown'
+import { compiler } from './dist/index.module.js'
+// @ts-ignore - react/jsx-runtime types may be incomplete
+import * as prod from 'react/jsx-runtime'
+
+const { version: latestVersion } = JSON.parse(
+  fs.readFileSync(
+    path.join(
+      import.meta.dirname,
+      'node_modules/markdown-to-jsx-latest/package.json'
+    ),
+    'utf8'
+  )
+)
+
+const mdIt = new MarkdownIt()
+const suite = new BenchTable()
+
+const fixture = fs.readFileSync('./src/fixture.md', 'utf8')
+
+// Set up rehype processor with all plugins to match markdown-to-jsx features
+// @ts-ignore - rehype-react types may be incomplete
+const production = { Fragment: prod.Fragment, jsx: prod.jsx, jsxs: prod.jsxs }
+
+const rehypeProcessor = remark()
+  .use(remarkGfm) // GFM tables, task lists, autolinks, strikethrough, footnotes
+  .use(remarkDirective) // Directives like [!NOTE] (though may not match exactly)
+  .use(remarkRehype, { allowDangerousHtml: true }) // Convert mdast to hast
+  .use(rehypeRaw) // Process HTML in markdown
+  .use(rehypeReact, production) // Convert hast to React
+
+// Create a parse-only processor for AST benchmarks
+const remarkParseProcessor = remark()
+  .use(remarkGfm) // GFM tables, task lists, autolinks, strikethrough, footnotes
+  .use(remarkDirective)
+
+const bar = new cliProgress.SingleBar(
+  {
+    clearOnComplete: true,
+  },
+  cliProgress.Presets.shades_classic
+)
+let totalCycles
+
+const isAll = process.argv.includes('--all')
+const isJsx = process.argv.includes('--jsx')
+// const isHtml = process.argv.includes('--html')
+
+type Test = {
+  name: string
+  fn: (input: any) => any
+}
+
+const parseTests = [
+  {
+    name: 'markdown-to-jsx (next) [parse]',
+    fn: input => compiler(input, { ast: true }),
+  },
+  {
+    name: `markdown-to-jsx (${latestVersion}) [parse]`,
+    fn: input => latestCompiler(input, { ast: true }),
+  },
+  isAll && {
+    name: 'rehype [parse]',
+    fn: input => remarkParseProcessor.processSync(input),
+  },
+  isAll && {
+    name: 'simple-markdown [parse]',
+    fn: input => SimpleMarkdown.defaultBlockParse(input),
+  },
+  isAll && {
+    name: 'markdown-it [parse]',
+    fn: input => mdIt.parse(input),
+  },
+].filter(Boolean) as Test[]
+
+const jsxTests = [
+  {
+    name: 'markdown-to-jsx (next) [jsx]',
+    fn: input => compiler(input),
+  },
+  {
+    name: `markdown-to-jsx (${latestVersion}) [jsx]`,
+    fn: input => latestCompiler(input),
+  },
+  isAll && {
+    name: 'rehype [jsx]',
+    fn: input => rehypeProcessor.processSync(input).result,
+  },
+  isAll && {
+    name: 'simple-markdown [jsx]',
+    fn: input =>
+      SimpleMarkdown.defaultReactOutput(
+        SimpleMarkdown.defaultBlockParse(input)
+      ),
+  },
+  isAll && {
+    name: 'react-markdown [jsx]',
+    fn: input => ReactMarkdown({ children: input }),
+  },
+].filter(Boolean) as Test[]
+
+const htmlTests = [
+  isAll && {
+    name: 'markdown-it [html]',
+    fn: input => mdIt.render(input),
+  },
+].filter(Boolean) as Test[]
+
+async function setupBenchmark() {
+  let evals = suite
+
+  for (const test of parseTests) {
+    evals.addFunction(test.name, test.fn)
+  }
+
+  if (isAll || isJsx) {
+    for (const test of jsxTests) {
+      evals.addFunction(test.name, test.fn)
+    }
+
+    // if (isAll || isHtml) {
+    //   for (const test of htmlTests) {
+    //     evals.addFunction(test.name, test.fn)
+    //   }
+    // }
+  }
+
+  evals
+    .addInput('simple markdown string', ['_Hello_ **world**!'])
+    .addInput('large markdown string', [fixture])
+    .on('start', () => {
+      totalCycles = suite._counter
+      bar.start(totalCycles, 0)
+    })
+    .on('cycle', () => bar.update(totalCycles - suite._counter))
+    .on('abort error complete', () => bar.stop())
+    .on('complete', function () {
+      console.log('Fastest is ' + suite.filter('fastest').map('name'))
+      console.log(suite.table.toString())
+    })
+    // run async
+    .run({ async: true })
+}
+
+setupBenchmark()

benchtable.d.ts

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+/**
+ * Type definitions for benchtable
+ */
+
+declare module 'benchtable' {
+  import Benchmark = require('benchmark')
+
+  interface Table {
+    push(item: Record<string, string[]>): void
+    toString(): string
+  }
+
+  interface BenchTableOptions extends Benchmark.Options {
+    isTransposed?: boolean
+  }
+
+  interface CycleEvent extends Benchmark.Event {
+    target: Benchmark.Target
+  }
+
+  class BenchTable extends Benchmark.Suite {
+    table: Table
+    _counter: number
+    _transposed: boolean
+
+    constructor(name?: string, options?: BenchTableOptions)
+
+    addFunction(
+      name: string,
+      fun: (...args: unknown[]) => unknown,
+      options?: Benchmark.Options
+    ): this
+
+    addInput(name: string, input: unknown[]): this
+  }
+
+  export = BenchTable
+}
