fix(ingestion): increase bufferSize from 256KB to 2MB for large files #216
JasonOA888 wants to merge 1 commit
Fixes abhigyanpatwari#198
Problem: Files between 256KB and 512KB crash tree-sitter with 'Invalid argument' because bufferSize (256KB) is too small.
Fix: Increased bufferSize from 1024*256 (256KB) to 1024*1024*2 (2MB). This allows parsing of files up to MAX_FILE_SIZE (512KB) without errors.
🟢 GitNexus Blast Radius: LOW
Can you please resolve the merge conflicts and add unit and integration tests to cover your changes? 🙏
reversTeam left a comment
Nice fix, @JasonOA888. Bumping the tree-sitter parse buffer from 256KB to 2MB across all five ingestion call sites is a solid change for handling larger source files.
The change is consistent across all processors (call-processor, heritage-processor, import-processor, parsing-processor, parse-worker), and all sites already have try/catch with continue to skip unparseable files, so there's no risk of unhandled errors from the larger allocation.
The 2MB buffer is reasonable — tree-sitter allocates this lazily so it won't waste memory for small files. Good improvement for repos with large generated or vendored files.
LGTM.
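For readers following along, the skip-on-failure pattern mentioned in the review can be sketched as below. This is a minimal illustration and not the project's code: `ParserLike`, `fakeParser`, `SourceFile`, and `parseAll` are hypothetical names, and the stand-in parser simply throws when the source is longer than `bufferSize` to imitate the reported 'Invalid argument' failure.

```typescript
// Hypothetical stand-in for a tree-sitter parser (assumption, not the real API surface).
interface ParserLike {
  parse(source: string, oldTree: null, options: { bufferSize: number }): { rootType: string };
}

// Imitates the reported failure: throws when the source does not fit in the buffer.
const fakeParser: ParserLike = {
  parse(source, _oldTree, options) {
    if (source.length > options.bufferSize) {
      throw new Error('Invalid argument');
    }
    return { rootType: 'program' };
  },
};

interface SourceFile {
  path: string;
  content: string;
}

// Parses every file and returns the paths that parsed successfully.
// A file that throws is skipped with `continue`, so one bad file does not abort the run.
function parseAll(files: SourceFile[], parser: ParserLike, bufferSize: number): string[] {
  const parsedPaths: string[] = [];
  for (const file of files) {
    try {
      parser.parse(file.content, null, { bufferSize });
    } catch {
      continue; // unparseable file: skip it and move on
    }
    parsedPaths.push(file.path);
  }
  return parsedPaths;
}

const sampleFiles: SourceFile[] = [
  { path: 'small.ts', content: 'const x = 1;' },
  { path: 'large.ts', content: 'a'.repeat(300 * 1024) },
];

console.log(parseAll(sampleFiles, fakeParser, 1024 * 256));      // old buffer size
console.log(parseAll(sampleFiles, fakeParser, 1024 * 1024 * 2)); // proposed buffer size
```

With the old size the large file is silently skipped rather than crashing the run; with the proposed size both files go through.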
I think we should make this configurable, mainly because allowing a 1MB+ file to be parsed will result in some quite heavy operations. It will also produce quite a big graph for a single file. Could you please do some manual testing and measure the slowdown and the growth in KuzuDB size? Thanks
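One possible shape for a configurable buffer size is sketched below. Everything here is an assumption for discussion: `resolveBufferSize` is a hypothetical helper, the raw value could come from any config source, and using the old 256KB value as a floor is just one choice.

```typescript
// Default proposed in this PR (2MB).
const DEFAULT_BUFFER_SIZE = 1024 * 1024 * 2;
// Floor for user-supplied values; reusing the old 256KB value is an assumption.
const MIN_BUFFER_SIZE = 1024 * 256;

// Hypothetical helper: turns a raw config string into a usable buffer size in bytes.
// Missing, non-integer, or too-small values fall back to the default.
function resolveBufferSize(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === '') {
    return DEFAULT_BUFFER_SIZE;
  }
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed < MIN_BUFFER_SIZE) {
    return DEFAULT_BUFFER_SIZE;
  }
  return parsed;
}

console.log(resolveBufferSize(undefined));  // falls back to the default
console.log(resolveBufferSize('1048576'));  // accepted as 1MB
console.log(resolveBufferSize('not-a-number'));
```

Falling back to the default on bad input keeps ingestion running; rejecting the value with an error would be an equally reasonable design.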
magyargergo left a comment
I need to see some manual testing as I asked before.
Merges fixes from PRs #163, #170, #178, #216, #227, #234 into a single coherent changeset with shared modules and deduplication.

Phase 0 - Pre-merge consolidation:
- Extract isNodeExported to shared export-detection.ts module
- Extract TREE_SITTER_BUFFER_SIZE to shared constants.ts with adaptive sizing
- Consolidate FUNCTION_NODE_TYPES, extractFunctionName, isBuiltInOrNoise from duplicated call-processor.ts and parse-worker.ts into shared utils.ts
- Add query compilation smoke tests for all 12 languages

Language fixes:
- fix(c/cpp): isExported checks static linkage instead of returning false
- fix(c/cpp): .h files parsed as C++ (tree-sitter-cpp is superset of C)
- fix(c/cpp): expanded entry point patterns (~30 new for C, ~18 for C++)
- fix(cpp): add typedef, union, macro, prototype, inline method queries
- fix(c#): isExported scans sibling modifiers instead of parent walk
- fix(c#): heritage queries use correct base_list AST structure
- fix(c#): add framework detection, import resolution, entry point scoring
- fix(rust): isExported scans sibling visibility_modifier in declaration
- fix(builtins): remove open/read/write/close (real C POSIX syscalls)
- fix(buffer): adaptive bufferSize (2x fileSize, 512KB-32MB range)
- feat(ts/js): add call_expression query patterns for const assignments

Deduplication:
- call-processor.ts: -226 lines (uses shared utils)
- parse-worker.ts: -320 lines (uses shared utils)
- parsing-processor.ts: -156 lines (uses shared export-detection)
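The "adaptive bufferSize (2x fileSize, 512KB-32MB range)" item can be read as a simple clamp. The sketch below is one interpretation of that one-line description, not the code from the consolidated changeset: `MIN_BUFFER`, `MAX_BUFFER`, and `getAdaptiveBufferSize` are hypothetical names.

```typescript
// Bounds taken from the commit message; the constant names are hypothetical.
const MIN_BUFFER = 512 * 1024;       // 512KB lower bound
const MAX_BUFFER = 32 * 1024 * 1024; // 32MB upper bound

// Hypothetical helper: twice the file size, clamped to [MIN_BUFFER, MAX_BUFFER].
function getAdaptiveBufferSize(fileSizeBytes: number): number {
  return Math.min(MAX_BUFFER, Math.max(MIN_BUFFER, fileSizeBytes * 2));
}

const exampleSizes = [100 * 1024, 300 * 1024, 1024 * 1024, 20 * 1024 * 1024];
for (const size of exampleSizes) {
  console.log(`${size} bytes -> buffer of ${getAdaptiveBufferSize(size)} bytes`);
}
```

Under this reading, small files keep the 512KB floor, mid-sized files get twice their size, and anything above 16MB is capped at 32MB.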
Closing in favor of #237. Thank you for your contribution; your language support improvements are fully incorporated in the consolidated PR!
abhigyanpatwari#237)

* fix: consolidate C/C++/C#/Rust language support from 6 overlapping PRs

Merges fixes from PRs abhigyanpatwari#163, abhigyanpatwari#170, abhigyanpatwari#178, abhigyanpatwari#216, abhigyanpatwari#227, abhigyanpatwari#234 into a single coherent changeset with shared modules and deduplication.

Phase 0 - Pre-merge consolidation:
- Extract isNodeExported to shared export-detection.ts module
- Extract TREE_SITTER_BUFFER_SIZE to shared constants.ts with adaptive sizing
- Consolidate FUNCTION_NODE_TYPES, extractFunctionName, isBuiltInOrNoise from duplicated call-processor.ts and parse-worker.ts into shared utils.ts
- Add query compilation smoke tests for all 12 languages

Language fixes:
- fix(c/cpp): isExported checks static linkage instead of returning false
- fix(c/cpp): .h files parsed as C++ (tree-sitter-cpp is superset of C)
- fix(c/cpp): expanded entry point patterns (~30 new for C, ~18 for C++)
- fix(cpp): add typedef, union, macro, prototype, inline method queries
- fix(c#): isExported scans sibling modifiers instead of parent walk
- fix(c#): heritage queries use correct base_list AST structure
- fix(c#): add framework detection, import resolution, entry point scoring
- fix(rust): isExported scans sibling visibility_modifier in declaration
- fix(builtins): remove open/read/write/close (real C POSIX syscalls)
- fix(buffer): adaptive bufferSize (2x fileSize, 512KB-32MB range)
- feat(ts/js): add call_expression query patterns for const assignments

Deduplication:
- call-processor.ts: -226 lines (uses shared utils)
- parse-worker.ts: -320 lines (uses shared utils)
- parsing-processor.ts: -156 lines (uses shared export-detection)

* perf: fix review findings - hoist Sets, deduplicate DEFINITION_CAPTURE_KEYS

- Hoist CSHARP_DECL_TYPES and RUST_DECL_TYPES to module-level constants in export-detection.ts (was allocating new Set on every isNodeExported call)
- Extract DEFINITION_CAPTURE_KEYS and getDefinitionNodeFromCaptures to shared utils.ts (was duplicated in parsing-processor.ts and parse-worker.ts)
- Pre-compute merged entry point patterns to avoid per-call array spread in calculateEntryPointScore

* test: add C, C++, and Tree-sitter buffer size tests

* fix: C/C++/Rust review findings + comprehensive test coverage (+72 tests)

Source fixes:
- Add Rust built-in noise (unwrap, clone, into, collect, panic, etc.)
- C++ anonymous namespace → internal linkage (not exported)
- Replace .text regex with storage_class_specifier child scan (perf)
- Raise file skip threshold from 512KB to 32MB (TREE_SITTER_MAX_BUFFER)
- Export TREE_SITTER_MAX_BUFFER from constants.ts
- Add C++ double pointer query patterns to CPP_QUERIES
- Add C#: record_struct, record_class, file_scoped_namespace to decl types
- Add Rust: union_item to visibility scanning set

Tests (214 → 286):
- ingestion-utils: +24 (Rust/C# noise, pointer/ref/destructor extraction, buffer)
- parsing: +36 (real AST C/C++ static/namespace, Rust/C#/Java/PHP/Swift edge cases)
- tree-sitter-languages: +12 (query accuracy for C/C++/C#/Rust captures)
Fixes #198

Problem
Files between 256KB and 512KB crash tree-sitter with 'Invalid argument' because bufferSize (256KB) is too small.
The file size filter (MAX_FILE_SIZE = 512KB) allows them through, but the parse buffer (bufferSize: 256KB) is too small.

Fix
Increased bufferSize from 1024*256 (256KB) to 1024*1024*2 (2MB).
This allows parsing of files up to MAX_FILE_SIZE (512KB) without errors.

Files Changed
- call-processor.ts
- heritage-processor.ts
- import-processor.ts
- parsing-processor.ts
- workers/parse-worker.ts
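
The gap between the size filter and the parse buffer can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions: `fallsInGap` is a hypothetical helper, and the exact boundary comparisons (inclusive or exclusive) used by the real filter and by tree-sitter are not confirmed here.

```typescript
// Values quoted in the PR description.
const MAX_FILE_SIZE = 512 * 1024;         // files larger than this are filtered out
const OLD_BUFFER_SIZE = 1024 * 256;       // 256KB
const NEW_BUFFER_SIZE = 1024 * 1024 * 2;  // 2MB

// Hypothetical helper: true when a file passes the size filter
// but is larger than the parse buffer (boundary handling is an assumption).
function fallsInGap(fileSizeBytes: number, bufferSize: number): boolean {
  return fileSizeBytes <= MAX_FILE_SIZE && fileSizeBytes > bufferSize;
}

const candidateSizes = [100 * 1024, 300 * 1024, 512 * 1024];
for (const size of candidateSizes) {
  console.log(size, fallsInGap(size, OLD_BUFFER_SIZE), fallsInGap(size, NEW_BUFFER_SIZE));
}
```

With the old 256KB buffer, the 300KB and 512KB cases fall in the gap; with the 2MB buffer no file that passes the filter does.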