refactor: split global BUILT_IN_NAMES into per-language provider fields #523

Merged
magyargergo merged 13 commits into abhigyanpatwari:main from mrwogu:refactor/per-language-noise-filter on Mar 26, 2026

mrwogu commented Mar 26, 2026

Contributor

Closes #522

Summary

  • Add builtInNames?: ReadonlySet<string> field to LanguageProviderConfig
  • Move all built-in entries from the global BUILT_IN_NAMES set into per-language provider definitions (languages/*.ts)
  • Rewrite isBuiltInOrNoise(name, provider) to check provider.builtInNames instead of a global set
  • Update all 3 call sites (parse-worker.ts, call-processor.ts, type-env.ts) to pass their existing provider

Each language defines its own noise entries inline, following the same pattern as exportChecker, typeConfig, and importResolver. Languages without built-in noise (Java, Go) simply omit the field.

Cross-language pollution fixed: serialize was previously filtered globally (PHP section) and suppressed a legitimate user.serialize() call in Java. The Java heritage test now correctly expects 3 CALLS edges instead of 2.
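
A minimal sketch of the provider-aware check described above, assuming a simplified `LanguageProviderConfig` with only an `id` and the optional `builtInNames` field. The real config has more fields, and the PHP entries other than `serialize` are illustrative rather than taken from the PR:

```typescript
// Hypothetical, simplified shape; the real LanguageProviderConfig has more fields.
interface LanguageProviderConfig {
  readonly id: string;
  // Optional: languages without built-in noise simply omit the field.
  readonly builtInNames?: ReadonlySet<string>;
}

// Illustrative PHP entries; only `serialize` is confirmed by the PR description.
const phpProvider: LanguageProviderConfig = {
  id: "php",
  builtInNames: new Set(["serialize", "unserialize", "var_dump"]),
};

// Java defines no built-in noise, so the field is absent.
const javaProvider: LanguageProviderConfig = { id: "java" };

// Provider-aware check: only the calling language's own set is consulted.
function isBuiltInOrNoise(name: string, provider: LanguageProviderConfig): boolean {
  return provider.builtInNames?.has(name) ?? false;
}

const filteredInPhp = isBuiltInOrNoise("serialize", phpProvider);   // true
const filteredInJava = isBuiltInOrNoise("serialize", javaProvider); // false
```

With a single global set, both lookups would have returned true, which is how the Java `user.serialize()` call lost its CALLS edge.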

Per-language commits

One commit per language for easy review:

  • Infrastructure: isBuiltInOrNoise signature change + call site updates
  • JS/TS, Python, Kotlin, C/C++, C#, PHP, Swift, Rust, Ruby, Dart
  • Tests: updated existing tests + 15 new cross-language isolation tests

Test plan

  • tsc --noEmit clean
  • 118/119 test files pass (1 pre-existing flaky lbug lock)
  • 4121 tests pass
  • New noise-filter.test.ts verifies cross-language isolation (e.g., console filtered for JS but not Python)
  • Java heritage test updated: serialize() correctly unfiltered
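
The isolation tests could take roughly the following table-driven shape. This is a sketch, not the contents of noise-filter.test.ts: the `Provider` type, the `isBuiltIn` helper, and the tiny name sets are assumptions, and it collects failures in an array instead of relying on any particular test runner:

```typescript
type Lang = "javascript" | "python" | "java";
type Provider = { readonly id: Lang; readonly builtInNames?: ReadonlySet<string> };

// Tiny illustrative sets; the real per-language lists are much longer.
const providers: Record<Lang, Provider> = {
  javascript: { id: "javascript", builtInNames: new Set(["console"]) },
  python: { id: "python", builtInNames: new Set(["print"]) },
  java: { id: "java" }, // no builtInNames field at all
};

const isBuiltIn = (name: string, p: Provider): boolean =>
  p.builtInNames?.has(name) ?? false;

// Each case: [called name, language, expected filter result].
const cases: ReadonlyArray<[string, Lang, boolean]> = [
  ["console", "javascript", true],
  ["console", "python", false],
  ["print", "python", true],
  ["print", "javascript", false],
  ["serialize", "java", false],
];

// Returns a description of every case whose result differs from the expectation.
function runIsolationCases(): string[] {
  const failures: string[] = [];
  for (const [name, lang, expected] of cases) {
    const actual = isBuiltIn(name, providers[lang]);
    if (actual !== expected) {
      failures.push(`${name} in ${lang}: expected ${expected}, got ${actual}`);
    }
  }
  return failures;
}

const failures = runIsolationCases();
```

An empty `failures` array means every name is filtered only in the language that declares it.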

mrwogu added 12 commits March 26, 2026 10:15
…N_NAMES

Add builtInNames field to LanguageProviderConfig. Rewrite noise-filter.ts
to accept a LanguageProvider and check provider.builtInNames instead of
a global Set. Update all 3 call sites to pass their existing provider.

Built-in entries will be added per-language in subsequent commits.
…ests

- Update ingestion-utils.test.ts to pass provider to isBuiltInOrNoise
- Add noise-filter.test.ts with 15 cross-language isolation tests
- Fix Java heritage test: serialize() is now correctly unfiltered for Java
  (was false-positive noise from global PHP serialize entry)

vercel Bot commented Mar 26, 2026

@mrwogu is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

Comment thread gitnexus/src/core/ingestion/utils/noise-filter.ts Outdated
Comment thread gitnexus/src/core/ingestion/utils/noise-filter.ts Outdated

github-actions Bot commented Mar 26, 2026

Contributor

CI Report

All checks passed

Pipeline Status

| Stage | Status | Details |
|---|---|---|
| ✅ Typecheck | success | tsc --noEmit |
| ✅ Tests | success | unit tests, 3 platforms |
| ✅ E2E | success | gitnexus-web changes only |

Test Results

| Tests | Passed | Failed | Skipped | Duration |
|---|---|---|---|---|
| 4346 | 4345 | 0 | 1 | 175s |

✅ All 4345 tests passed

1 test skipped:
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } (closure param inference, a different feature)

Code Coverage

Tests

| Metric | Coverage | Covered | Base | Delta | Status |
|---|---|---|---|---|---|
| Statements | 70.29% | 10799/15363 | 70.22% | 📈 +0.1 | 🟢 ██████████████░░░░░░ |
| Branches | 60.71% | 7324/12062 | 60.64% | 📈 +0.1 | 🟢 ████████████░░░░░░░░ |
| Functions | 73.93% | 939/1270 | 73.91% | 📈 +0.0 | 🟢 ██████████████░░░░░░ |
| Lines | 72.79% | 9645/13249 | 72.73% | 📈 +0.1 | 🟢 ██████████████░░░░░░ |

@magyargergo

Collaborator

@claude Can you check whether the built-ins are correct for each language? Find gaps and anything that doesn't belong in a language's built-ins.

github-actions Bot commented Mar 26, 2026

Contributor

Claude finished @magyargergo's task in 3m 8s

Built-in Names Audit

  • Read all per-language built-in definitions
  • Audit each language for correctness, gaps, and misplaced entries
  • Report findings

Findings

JavaScript / TypeScript (typescript.ts)

Misplaced — framework entries, not language built-ins:

  • React Hooks: useState, useEffect, useCallback, useMemo, useRef, useContext, useReducer, useLayoutEffect, useImperativeHandle, useDebugValue — these are from the react package, not the JS/TS language
  • React API: createElement, createContext, createRef, forwardRef, memo, lazy — same, react package

Filtering these means any user-defined function named useEffect, memo, etc. in non-React code gets silently suppressed. More importantly, tracking React hook usage is signal worth keeping.

Overly generic names that will suppress user code:

  • log, warn, error, info, debug — these are console methods, but as standalone names they match many user-defined functions (e.g. a service's own log(), error())
  • parse, stringify — methods of JSON, but also very common user-defined function names
  • resolve, reject, then, catch, finally — Promise methods, but resolve, reject, and then are common in domain code
  • map, filter, reduce, forEach, find, findIndex, some, every, includes, indexOf, slice, splice, concat, join, split, push, pop, shift, unshift, sort, reverse — these are Array prototype methods, not global functions. Filtering them will suppress calls to domain services that happen to be named find(), sort(), etc.

Not function call names:

  • import, export — these are keywords, they won't appear as call expressions

Missing globals:

  • structuredClone, queueMicrotask, globalThis, URL, URLSearchParams, Blob, File, FormData, AbortController, TextEncoder, TextDecoder, crypto (Web Crypto), performance, requestAnimationFrame, cancelAnimationFrame
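
The risk with overly generic names can be illustrated with a small sketch. It assumes the filter only sees the bare called name, as the `isBuiltInOrNoise(name, provider)` signature suggests. The `CallSite` shape, the receiver names, and the three-entry set are hypothetical:

```typescript
// Hypothetical call-site record: the receiver is known to the parser,
// but a name-only filter never looks at it.
interface CallSite {
  readonly receiver?: string;
  readonly name: string;
}

// Illustrative subset of generic entries flagged by the audit.
const jsBuiltInNames: ReadonlySet<string> = new Set(["log", "find", "parse"]);

const calls: CallSite[] = [
  { receiver: "console", name: "log" },         // real built-in noise
  { receiver: "auditLogger", name: "log" },     // user-defined service method
  { receiver: "userRepository", name: "find" }, // user-defined domain method
  { name: "renderInvoice" },                    // plain user function
];

// Name-only filtering: drop every call whose bare name is in the set.
const keptNames: string[] = calls
  .filter((call) => !jsBuiltInNames.has(call.name))
  .map((call) => call.name);
// keptNames is ["renderInvoice"]: both user-defined calls were dropped too.
```

Only `console.log` is genuine noise here, yet `auditLogger.log` and `userRepository.find` lose their call edges as well, because the receiver is never consulted.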

Python (python.ts)

Missing built-ins — Python has ~70 built-in functions. Notable gaps:

  • open, input (print is present, but input is missing)
  • map, filter, any, all — very commonly used built-in higher-order functions
  • bool, bytes, complex, chr, ord, hex, oct, bin
  • id, hash, iter, next, callable, super, object, property
  • format, repr, vars, dir, help
  • exec, eval, compile
  • round, pow, divmod

Overly generic method names (correct as built-ins but risky):

  • append, extend, update — list/dict/set methods; a domain model's update() will be suppressed

Kotlin (kotlin.ts)

Misplaced — library functions, not Kotlin stdlib:

The Coroutines section comes from kotlinx.coroutines, a separate library dependency:

  • launch, async, runBlocking, withContext, coroutineScope, supervisorScope, delay

The Flow section comes from kotlinx.coroutines.flow:

  • flow, flowOf, collect, emit, onEach, catch, buffer, conflate, distinctUntilChanged, flatMapLatest, flatMapMerge, combine, stateIn, shareIn, launchIn

Including third-party library names as "built-ins" defeats the purpose of this filter — emit(), collect(), catch(), and launch() are all plausible user-defined function names that will now be silently dropped.


C / C++ (c-cpp.ts)

Both C and C++ share the same list, which already warrants a note — C++ has its own idioms not covered here (new, delete, dynamic_cast, RAII patterns).

Linux kernel–specific entries that don't belong in general C/C++ built-ins:

  • Kernel macros: BUG, BUG_ON, WARN, WARN_ON, WARN_ONCE, IS_ERR, PTR_ERR, ERR_PTR, IS_ERR_OR_NULL, ARRAY_SIZE, container_of, list_for_each_entry, list_for_each_entry_safe
  • Kernel logging: pr_info, pr_warn, pr_err, pr_debug, pr_notice, pr_crit, pr_emerg, printk, dev_info, dev_warn, dev_err, dev_dbg
  • Kernel memory: kfree, kmalloc, kzalloc, kcalloc, krealloc, kvmalloc, kvfree, GFP_KERNEL, GFP_ATOMIC
  • Kernel locking: spin_lock, spin_unlock, spin_lock_irqsave, spin_unlock_irqrestore, mutex_lock, mutex_unlock, mutex_init
  • Kernel branch hints and reference counting: likely, unlikely, get, put

For any non-kernel C project, these are irrelevant. For a kernel project, this is useful — but the real concern is get and put: these are extremely common names for user-defined accessor functions in any codebase.


C# (csharp.ts)

Not a real .NET method name:

  • Throw — this is a C# keyword, not a callable method. (ArgumentNullException.ThrowIfNull is real but listed separately as ThrowIfNull)

Overly generic LINQ/collection method names — every one of these will suppress user-defined methods with the same name:

  • Add, Remove, Contains, Clear, Count — fundamental collection members, but also ubiquitous in domain models
  • Any, All — LINQ methods but extremely common domain predicates
  • Run: the Task.Run() pattern, but also a very common domain method name (e.g., a pipeline's Run())

PHP (php.ts)

Framework helpers, not PHP language built-ins:

  • dd — Laravel's "dump and die"; not a PHP built-in
  • dump — Symfony's VarDumper; not a PHP built-in

In a non-Laravel/Symfony project, a user-defined dump() function would be incorrectly suppressed.


Swift (swift.ts)

UIKit framework methods — not Swift language built-ins:

  • View lifecycle: addSubview, removeFromSuperview, layoutSubviews, setNeedsLayout, layoutIfNeeded, setNeedsDisplay, invalidateIntrinsicContentSize
  • Control events: addTarget, removeTarget, addGestureRecognizer
  • Auto Layout: addConstraint, addConstraints, removeConstraint, removeConstraints
  • TableView/CollectionView: reloadData, reloadSections, reloadRows, performBatchUpdates, register, dequeueReusableCell, dequeueReusableSupplementaryView, beginUpdates, endUpdates, insertRows, deleteRows, insertSections, deleteSections
  • Navigation: present, dismiss, pushViewController, popViewController, popToRootViewController, performSegue, prepare

Combine framework (not Swift stdlib):

  • sink, store, assign, receive, subscribe

For server-side Swift (Vapor, Hummingbird), none of these UIKit/Combine entries apply, and prepare, assign, store, post are all reasonable user-defined method names.

Particularly risky generics:

  • async, sync: DispatchQueue.async/sync, but also natural user method names
  • post: NotificationCenter.post, but common in REST/HTTP domain code
  • type: a legitimate Swift stdlib function, but also a very common variable/method name

Rust (rust.ts)

Generally the cleanest of all languages — the entries closely track actual trait methods and macros from std.

Minor concerns:

  • read — RwLock method, but one of the most common I/O function names in any Rust codebase
  • write — same issue; it's both a Rust macro (write!) and a ubiquitous I/O function
  • from, into — these are the From/Into trait methods and are everywhere in Rust; blanket suppression may hide important conversion call edges

Missing:

  • Common macros: assert_eq!, assert_ne!, debug_assert! (if macros are tracked as calls)
  • new as a constructor pattern (though this may be handled elsewhere)

Ruby (ruby.ts)

Ruby's list is well-targeted. A few gaps:

Missing common aliases:

  • map (alias for collect, but more commonly used)
  • reduce (alias for inject, but more commonly used)
  • find (alias for detect)
  • each_slice, each_cons, zip, take, drop

Missing Kernel methods:

  • sleep, exit, abort, rand, print (puts is present, but print is missing), gets
  • Integer(), Float(), String(), Array() (coercion functions)

Dart (dart.ts)

Almost entirely Flutter-specific. The list is biased toward Flutter widget lifecycle/navigation, with virtually no core Dart language built-ins.

Missing Dart core built-ins:

  • print, toString, hashCode, identical, assert
  • int.parse, double.parse (commonly appear as just parse)
  • Type coercions and core class constructors

Risky generic names in the current list:

  • push, pop — Navigator methods but common data structure operations in user code
  • listen — Stream method but very common event-handling name
  • then, catchError — Future methods; a user-defined then() would be suppressed

Summary Table

| Language | Framework entries mixed in | Overly generic names | Notable gaps |
|---|---|---|---|
| JS/TS | React hooks + APIs | log, error, map, filter, etc. | Web Crypto, modern globals |
| Python | None | append, update | map, filter, any, all, open, input + ~30 others |
| Kotlin | kotlinx.coroutines, Flow | emit, collect, catch | |
| C/C++ | Linux kernel macros/functions | get, put | C++ stdlib, C++ specific |
| C# | None | Add, Remove, Count, Run | Throw is not a method |
| PHP | dd, dump (Laravel/Symfony) | | |
| Swift | UIKit, Combine | async, sync, post, prepare | Server-side Swift has zero coverage |
| Rust | None | read, write | |
| Ruby | None | | map, reduce, find aliases |
| Dart | Nearly all Flutter | push, pop, listen | Core Dart built-ins almost absent |

mrwogu commented Mar 26, 2026

Contributor Author

Great audit. The per-language architecture is working as intended — this PR delivers the mechanism (provider-aware filtering, cross-language isolation, compile-time exhaustiveness). The audit now reveals that the content of each list needs work too.

Three categories of issues surfaced:

1. Framework entries mixed with language built-ins — React hooks in JS/TS, kotlinx.coroutines/Flow in Kotlin, UIKit/Combine in Swift, Laravel/Symfony dd/dump in PHP, Flutter lifecycle in Dart. These suppress legitimate user-defined names in non-framework codebases.

2. Overly generic names: log, error, map, filter, push, pop, get, put, Add, Remove, Run etc. are console, Array, or collection methods, but they also match user-defined functions. The original global list had the same problem; we just moved it per-language without auditing.

3. Missing core built-ins — Python is missing ~30 stdlib functions (map, filter, any, all, open, input...), Ruby is missing common aliases (map, reduce, find), Dart has almost no core Dart entries.

@magyargergo — would you be OK with merging this PR as the architectural foundation and opening a follow-up for the content audit? The mechanism change (per-language isolation, provider field, call site updates) is clean and tested. The list curation is a separate concern that benefits from per-language domain expertise and can be iterated on incrementally.

@magyargergo

Collaborator

I want to remove the file I pointed out in an in-line comment.

Per review feedback: delete noise-filter.ts entirely and move the check
into LanguageProvider as isBuiltInName(name) method, generated by
defineLanguage() from the builtInNames set.

Call sites now use provider.isBuiltInName(calledName) directly.

mrwogu commented Mar 26, 2026

Contributor Author

Done — addressed both inline review comments:

  • Deleted noise-filter.ts entirely
  • Added isBuiltInName(name) method to LanguageProvider interface, generated by defineLanguage() from the builtInNames set
  • All 3 call sites now use provider.isBuiltInName(calledName) directly
  • Tests updated accordingly
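
A rough sketch of how `defineLanguage()` might generate `isBuiltInName` from the `builtInNames` set. The interfaces are reduced to the two fields discussed in this PR, and `shouldRecordCall` is a hypothetical stand-in for a call site, so this is an assumption about the shape rather than the merged code:

```typescript
// Hypothetical, simplified config; the real one carries more fields.
interface LanguageProviderConfig {
  readonly id: string;
  readonly builtInNames?: ReadonlySet<string>;
}

// The provider exposes the check as a method instead of a free function.
interface LanguageProvider extends LanguageProviderConfig {
  isBuiltInName(name: string): boolean;
}

// Shared empty set for languages that omit builtInNames.
const NO_BUILT_INS: ReadonlySet<string> = new Set<string>();

// Builds the provider and derives isBuiltInName from the configured set.
function defineLanguage(config: LanguageProviderConfig): LanguageProvider {
  const names = config.builtInNames ?? NO_BUILT_INS;
  return {
    ...config,
    isBuiltInName: (name: string): boolean => names.has(name),
  };
}

const php = defineLanguage({ id: "php", builtInNames: new Set(["serialize"]) });
const java = defineLanguage({ id: "java" });

// Hypothetical call-site helper: skip built-ins before recording a CALLS edge.
function shouldRecordCall(provider: LanguageProvider, calledName: string): boolean {
  return !provider.isBuiltInName(calledName);
}
```

Keeping the lookup on the provider means call sites never touch the raw set, so a language that omits `builtInNames` needs no special handling.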

@magyargergo magyargergo merged commit 546128c into abhigyanpatwari:main Mar 26, 2026
9 of 10 checks passed
motolese pushed a commit to motolese/datamoto-gitnexus that referenced this pull request Apr 23, 2026
…ds (abhigyanpatwari#523)

* refactor: make isBuiltInOrNoise provider-aware, remove global BUILT_IN_NAMES

Add builtInNames field to LanguageProviderConfig. Rewrite noise-filter.ts
to accept a LanguageProvider and check provider.builtInNames instead of
a global Set. Update all 3 call sites to pass their existing provider.

Built-in entries will be added per-language in subsequent commits.

* refactor(js/ts): add per-language builtInNames to JS/TS providers

* refactor(python): add per-language builtInNames

* refactor(kotlin): add per-language builtInNames

* refactor(c/cpp): add per-language builtInNames

* refactor(csharp): add per-language builtInNames

* refactor(php): add per-language builtInNames

* refactor(swift): add per-language builtInNames

* refactor(rust): add per-language builtInNames

* refactor(ruby): add per-language builtInNames

* refactor(dart): add per-language builtInNames

* test: update noise-filter tests for per-language API, add isolation tests

- Update ingestion-utils.test.ts to pass provider to isBuiltInOrNoise
- Add noise-filter.test.ts with 15 cross-language isolation tests
- Fix Java heritage test: serialize() is now correctly unfiltered for Java
  (was false-positive noise from global PHP serialize entry)

* refactor: remove noise-filter.ts, add provider.isBuiltInName() method

Per review feedback: delete noise-filter.ts entirely and move the check
into LanguageProvider as isBuiltInName(name) method, generated by
defineLanguage() from the builtInNames set.

Call sites now use provider.isBuiltInName(calledName) directly.

Development

Successfully merging this pull request may close these issues.

refactor: split global BUILT_IN_NAMES into per-language noise filters

2 participants