
fix(ingestion): increase bufferSize from 256KB to 2MB for large files#216

Closed
JasonOA888 wants to merge 1 commit into
abhigyanpatwari:mainfrom
JasonOA888:fix/issue-198-buffer-size-limit

Conversation

@JasonOA888

Contributor

Fixes #198

Problem

Files between 256KB and 512KB crash tree-sitter with 'Invalid argument'.

The file size filter (MAX_FILE_SIZE = 512KB) lets these files through, but the parse buffer (bufferSize: 256KB) is too small to hold them.

Fix

Increased bufferSize from 1024*256 (256KB) to 1024*1024*2 (2MB) in each of the files listed below.

This allows parsing of files up to MAX_FILE_SIZE (512KB) without errors.

Files Changed

  • call-processor.ts
  • heritage-processor.ts
  • import-processor.ts
  • parsing-processor.ts
  • workers/parse-worker.ts
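
To make the mismatch concrete, here is a minimal sketch of the two limits described above. Only MAX_FILE_SIZE and the numeric values come from this PR; the names OLD_BUFFER_SIZE, NEW_BUFFER_SIZE, and fitsInBuffer are hypothetical, and the check assumes (as the description states) that parsing fails once the content is larger than bufferSize.

```typescript
// Sizes from the PR description; names other than MAX_FILE_SIZE are hypothetical.
const KB = 1024;
const MAX_FILE_SIZE = 512 * KB;          // file size filter
const OLD_BUFFER_SIZE = 1024 * 256;      // 256KB, before this PR
const NEW_BUFFER_SIZE = 1024 * 1024 * 2; // 2MB, after this PR

// Assumption: a parse can succeed only if the content fits in the buffer.
function fitsInBuffer(contentBytes: number, bufferSize: number): boolean {
  return contentBytes <= bufferSize;
}

const sample = 300 * KB; // passes the filter, but is larger than the old buffer
console.log(sample <= MAX_FILE_SIZE);                      // true
console.log(fitsInBuffer(sample, OLD_BUFFER_SIZE));        // false
console.log(fitsInBuffer(MAX_FILE_SIZE, NEW_BUFFER_SIZE)); // true
```

With the 2MB buffer, every file the filter accepts fits with room to spare.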

@abhigyanpatwari

Owner

🟢 GitNexus Blast Radius: LOW

| Metric | Count |
| --- | --- |
| Changed symbols | 5 |
| Direct dependents (d=1) | 2 |
| Indirect (d=2) | 1 |
| Transitive (d=3) | 0 |
| Flows impacted | 0 |
| Total affected | 3 |

Changed: processCalls, processHeritage, processImports, processParsingSequential, processFileGroup
Flows hit: None


Generated by GitNexus — code intelligence powered by knowledge graphs

@vercel

vercel Bot commented Mar 8, 2026


@JasonOA888 is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

@magyargergo

Collaborator

Can you please resolve the merge conflicts and add unit and integration tests to cover your changes? 🙏

@reversTeam reversTeam left a comment

Nice fix, @JasonOA888. Bumping the tree-sitter parse buffer from 256KB to 2MB across all five ingestion call sites is a solid change for handling larger source files.

The change is consistent across all processors (call-processor, heritage-processor, import-processor, parsing-processor, parse-worker), and all sites already have try/catch with continue to skip unparseable files, so there's no risk of unhandled errors from the larger allocation.
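
The skip-on-failure pattern mentioned here can be sketched roughly as follows. This is an illustration only: ParseFn, SourceFile, and parseAll are hypothetical names, and the parser is passed in as a plain function so the sketch does not depend on the real tree-sitter bindings.

```typescript
// Hypothetical shapes; the real processors work with tree-sitter parsers and trees.
type ParseFn = (content: string, options: { bufferSize: number }) => unknown;

interface SourceFile {
  path: string;
  content: string;
}

const BUFFER_SIZE = 1024 * 1024 * 2; // 2MB, as in this PR

// Parse each file, skipping any file whose parse throws.
function parseAll(files: SourceFile[], parse: ParseFn): Map<string, unknown> {
  const trees = new Map<string, unknown>();
  for (const file of files) {
    try {
      trees.set(file.path, parse(file.content, { bufferSize: BUFFER_SIZE }));
    } catch {
      continue; // unparseable file: skip it and move on to the next one
    }
  }
  return trees;
}
```

A file that throws is simply absent from the result, so one bad input does not abort the whole ingestion run.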

The 2MB buffer is reasonable — tree-sitter allocates this lazily so it won't waste memory for small files. Good improvement for repos with large generated or vendored files.

LGTM.

@magyargergo

Collaborator

I think we should make this configurable, mainly because parsing a 1MB+ file will be quite heavy and will also produce a very large graph for a single file.

Could you please do some manual testing and measure the slowdown and the growth in KuzuDB size?

Thanks

@magyargergo magyargergo left a comment

Collaborator

I need to see some manual testing as I asked before.

magyargergo added a commit that referenced this pull request Mar 10, 2026
Merges fixes from PRs #163, #170, #178, #216, #227, #234 into a single
coherent changeset with shared modules and deduplication.

Phase 0 — Pre-merge consolidation:
- Extract isNodeExported to shared export-detection.ts module
- Extract TREE_SITTER_BUFFER_SIZE to shared constants.ts with adaptive sizing
- Consolidate FUNCTION_NODE_TYPES, extractFunctionName, isBuiltInOrNoise
  from duplicated call-processor.ts and parse-worker.ts into shared utils.ts
- Add query compilation smoke tests for all 12 languages

Language fixes:
- fix(c/cpp): isExported checks static linkage instead of returning false
- fix(c/cpp): .h files parsed as C++ (tree-sitter-cpp is superset of C)
- fix(c/cpp): expanded entry point patterns (~30 new for C, ~18 for C++)
- fix(cpp): add typedef, union, macro, prototype, inline method queries
- fix(c#): isExported scans sibling modifiers instead of parent walk
- fix(c#): heritage queries use correct base_list AST structure
- fix(c#): add framework detection, import resolution, entry point scoring
- fix(rust): isExported scans sibling visibility_modifier in declaration
- fix(builtins): remove open/read/write/close (real C POSIX syscalls)
- fix(buffer): adaptive bufferSize (2x fileSize, 512KB-32MB range)
- feat(ts/js): add call_expression query patterns for const assignments

Deduplication:
- call-processor.ts: -226 lines (uses shared utils)
- parse-worker.ts: -320 lines (uses shared utils)
- parsing-processor.ts: -156 lines (uses shared export-detection)
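
The adaptive sizing rule in the fix(buffer) item (2x fileSize, clamped to 512KB-32MB) might look something like the sketch below. The names MIN_BUFFER_SIZE, MAX_BUFFER_SIZE, and adaptiveBufferSize are hypothetical, and using the content length in bytes as "fileSize" is an assumption; the commit message gives only the multiplier and the range.

```typescript
// Bounds from the commit message; constant and function names are hypothetical.
const MIN_BUFFER_SIZE = 512 * 1024;       // 512KB floor
const MAX_BUFFER_SIZE = 32 * 1024 * 1024; // 32MB ceiling

// Twice the content length, clamped to [MIN_BUFFER_SIZE, MAX_BUFFER_SIZE].
function adaptiveBufferSize(contentLength: number): number {
  return Math.min(Math.max(contentLength * 2, MIN_BUFFER_SIZE), MAX_BUFFER_SIZE);
}

console.log(adaptiveBufferSize(10 * 1024));        // 524288 (floor applies)
console.log(adaptiveBufferSize(1024 * 1024));      // 2097152 (2x a 1MB file)
console.log(adaptiveBufferSize(20 * 1024 * 1024)); // 33554432 (ceiling applies)
```

Small files share the 512KB floor, and only files larger than 256KB get a bigger buffer.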
@magyargergo

Collaborator

Superseded by #237 which consolidates this PR along with #163, #170, #178, #227, #234 into a single coherent branch with all C/C++/C#/Rust fixes, shared modules, and comprehensive tests.

@magyargergo

Collaborator

Closing in favor of #237. Thank you for your contribution — your language support improvements are fully incorporated in the consolidated PR!

terrylica pushed a commit to terrylica/GitNexus that referenced this pull request Mar 10, 2026
abhigyanpatwari#237)

* fix: consolidate C/C++/C#/Rust language support from 6 overlapping PRs

Merges fixes from PRs abhigyanpatwari#163, abhigyanpatwari#170, abhigyanpatwari#178, abhigyanpatwari#216, abhigyanpatwari#227, abhigyanpatwari#234 into a single
coherent changeset with shared modules and deduplication.

Phase 0 — Pre-merge consolidation:
- Extract isNodeExported to shared export-detection.ts module
- Extract TREE_SITTER_BUFFER_SIZE to shared constants.ts with adaptive sizing
- Consolidate FUNCTION_NODE_TYPES, extractFunctionName, isBuiltInOrNoise
  from duplicated call-processor.ts and parse-worker.ts into shared utils.ts
- Add query compilation smoke tests for all 12 languages

Language fixes:
- fix(c/cpp): isExported checks static linkage instead of returning false
- fix(c/cpp): .h files parsed as C++ (tree-sitter-cpp is superset of C)
- fix(c/cpp): expanded entry point patterns (~30 new for C, ~18 for C++)
- fix(cpp): add typedef, union, macro, prototype, inline method queries
- fix(c#): isExported scans sibling modifiers instead of parent walk
- fix(c#): heritage queries use correct base_list AST structure
- fix(c#): add framework detection, import resolution, entry point scoring
- fix(rust): isExported scans sibling visibility_modifier in declaration
- fix(builtins): remove open/read/write/close (real C POSIX syscalls)
- fix(buffer): adaptive bufferSize (2x fileSize, 512KB-32MB range)
- feat(ts/js): add call_expression query patterns for const assignments

Deduplication:
- call-processor.ts: -226 lines (uses shared utils)
- parse-worker.ts: -320 lines (uses shared utils)
- parsing-processor.ts: -156 lines (uses shared export-detection)

* perf: fix review findings — hoist Sets, deduplicate DEFINITION_CAPTURE_KEYS

- Hoist CSHARP_DECL_TYPES and RUST_DECL_TYPES to module-level constants
  in export-detection.ts (was allocating new Set on every isNodeExported call)
- Extract DEFINITION_CAPTURE_KEYS and getDefinitionNodeFromCaptures to
  shared utils.ts (was duplicated in parsing-processor.ts and parse-worker.ts)
- Pre-compute merged entry point patterns to avoid per-call array spread
  in calculateEntryPointScore
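
As a rough illustration of the hoisting change, the sketch below builds the lookup Set once at module load instead of on every call. The set contents and the helper isRustDeclaration are hypothetical stand-ins; the real RUST_DECL_TYPES and isNodeExported live in export-detection.ts and are not shown in this PR.

```typescript
// Built once when the module loads. Contents are illustrative, not the real list.
const RUST_DECL_TYPES: ReadonlySet<string> = new Set([
  "function_item",
  "struct_item",
  "union_item",
]);

// Hypothetical helper: each call is a Set lookup with no per-call allocation.
// Before hoisting, the equivalent code would run `new Set([...])` inside the function.
function isRustDeclaration(nodeType: string): boolean {
  return RUST_DECL_TYPES.has(nodeType);
}

console.log(isRustDeclaration("struct_item")); // true
console.log(isRustDeclaration("identifier"));  // false
```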

* test: add C, C++, and Tree-sitter buffer size tests

* fix: C/C++/Rust review findings + comprehensive test coverage (+72 tests)

Source fixes:
- Add Rust built-in noise (unwrap, clone, into, collect, panic, etc.)
- C++ anonymous namespace → internal linkage (not exported)
- Replace .text regex with storage_class_specifier child scan (perf)
- Raise file skip threshold from 512KB to 32MB (TREE_SITTER_MAX_BUFFER)
- Export TREE_SITTER_MAX_BUFFER from constants.ts
- Add C++ double pointer query patterns to CPP_QUERIES
- Add C#: record_struct, record_class, file_scoped_namespace to decl types
- Add Rust: union_item to visibility scanning set

Tests (214 → 286):
- ingestion-utils: +24 (Rust/C# noise, pointer/ref/destructor extraction, buffer)
- parsing: +36 (real AST C/C++ static/namespace, Rust/C#/Java/PHP/Swift edge cases)
- tree-sitter-languages: +12 (query accuracy for C/C++/C#/Rust captures)
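
One way to read the raised skip threshold: if the filter and the buffer ceiling are the same value, any file that is not skipped gets a buffer at least as large as itself. The sketch below checks that property under stated assumptions; shouldSkipFile, bufferFor, and MIN_BUFFER are hypothetical names, and only TREE_SITTER_MAX_BUFFER and the numeric bounds come from the commit message.

```typescript
const MIN_BUFFER = 512 * 1024;                    // 512KB floor (hypothetical name)
const TREE_SITTER_MAX_BUFFER = 32 * 1024 * 1024;  // 32MB, per the commit message

// Hypothetical filter: anything larger than the maximum buffer is skipped.
function shouldSkipFile(sizeBytes: number): boolean {
  return sizeBytes > TREE_SITTER_MAX_BUFFER;
}

// Hypothetical sizing: 2x the file size, clamped to the 512KB-32MB range.
function bufferFor(sizeBytes: number): number {
  return Math.min(Math.max(sizeBytes * 2, MIN_BUFFER), TREE_SITTER_MAX_BUFFER);
}

// Every accepted size should fit in its own buffer, including the boundary case.
for (const size of [0, 300 * 1024, 20 * 1024 * 1024, TREE_SITTER_MAX_BUFFER]) {
  if (!shouldSkipFile(size) && bufferFor(size) < size) {
    throw new Error(`buffer too small for ${size} bytes`);
  }
}
```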
motolese pushed a commit to motolese/datamoto-gitnexus that referenced this pull request Apr 23, 2026
abhigyanpatwari#237)


Development

Successfully merging this pull request may close these issues.

bufferSize too small for large files — tree-sitter crashes with "Invalid argument" on files >256KB

4 participants