Improve document preprocessing#1530

Open
thecrypticace wants to merge 2 commits into feat/document-no-async-find from feat/document-scan-fast
@thecrypticace thecrypticace commented Dec 31, 2025

In the language server we do a lot of scanning of documents for:

  • Embedded languages (e.g. HTML <style> blocks are CSS, <script> tags are JS)
  • Class lists
  • Function calls in CSS
  • Complete or partial at-rules in CSS
  • etc…

Additionally, the user can provide custom regexes to target arbitrary text as class lists.

We don't want to detect any of these inside comments, so we preprocess documents by replacing comments with spaces. Additionally, in JS, we replace regex literals with spaces, as we don't want something like /<style>/ to accidentally be detected as the start of an embedded language.

This process takes a small amount of time and memory and can be complicated to do correctly. Here I've replaced the existing scanner/parser with a UTF-16 code unit based version (e.g. String#charCodeAt) that is more correct than regexes (JS has no support for recursive patterns), uses less memory, and is up to ~8x faster in my benchmarks.

The implementation here isn't perfect either but it is a bit better.
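
For readers unfamiliar with the approach, here is a minimal sketch of a code unit based comment scanner. It is not the code in this PR: `blankComments` is a hypothetical name, and the sketch only assumes JS-style `//` and `/* */` comments plus single- and double-quoted strings. It does not handle regex literals or template literals. The point it illustrates is that blanked regions keep their length and their newlines, so offsets and line numbers in the processed document still match the original.

```typescript
// Character codes the scanner compares against (UTF-16 code units).
const SLASH = 0x2f
const STAR = 0x2a
const BACKSLASH = 0x5c
const NEWLINE = 0x0a
const DOUBLE_QUOTE = 0x22
const SINGLE_QUOTE = 0x27

// Hypothetical helper, not the PR's implementation: replaces `//` and
// `/* */` comments with spaces while keeping line breaks in place.
function blankComments(input: string): string {
  const out: string[] = []
  let copiedUpTo = 0 // start of the text that has not been copied to `out` yet
  let i = 0

  while (i < input.length) {
    const code = input.charCodeAt(i)

    // Skip quoted strings so that "//" inside a string is not a comment.
    if (code === DOUBLE_QUOTE || code === SINGLE_QUOTE) {
      i++
      while (i < input.length) {
        const c = input.charCodeAt(i)
        if (c === BACKSLASH) {
          i += 2 // skip the escaped character
          continue
        }
        i++
        // Stop at the closing quote, or at a newline for unterminated strings.
        if (c === code || c === NEWLINE) break
      }
      continue
    }

    if (code === SLASH) {
      const next = input.charCodeAt(i + 1) // NaN when past the end
      let end = -1

      if (next === SLASH) {
        // Line comment: runs up to (not including) the next newline.
        end = input.indexOf('\n', i)
        if (end === -1) end = input.length
      } else if (next === STAR) {
        // Block comment: runs through the closing `*/`, or to the end if unterminated.
        const close = input.indexOf('*/', i + 2)
        end = close === -1 ? input.length : close + 2
      }

      if (end !== -1) {
        out.push(input.slice(copiedUpTo, i))
        // Blank every code unit except line breaks, preserving length.
        out.push(input.slice(i, end).replace(/[^\r\n]/g, ' '))
        copiedUpTo = end
        i = end
        continue
      }
    }

    i++
  }

  out.push(input.slice(copiedUpTo))
  return out.join('')
}
```

Because the output has the same length as the input, any range found in the processed text can be used directly against the original document.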

This is both faster and uses less memory — especially on large documents
@thecrypticace thecrypticace force-pushed the feat/document-scan-fast branch from c1b2c98 to fef9459 Compare January 1, 2026 01:41