fix(community): improve heuristic cluster labeling (#413)#457
fix(community): improve heuristic cluster labeling (#413)#457ShunsukeHayashi wants to merge 4 commits into
Conversation
…N names (#413) community-processor.ts: - Expand generic folder skip list from 7 to 22 entries (adds: internal, pkg, app, modules, components, services, controllers, models, views, types, interfaces, test, tests, __tests__, spec, specs, dist, build, out, bin, scripts) - Add grandparent directory fallback for deeply nested structures (e.g., src/auth/middleware/ now considers "auth" as a candidate) - Add dominant node type + symbol name as label before Cluster_N fallback (e.g., "Function_validateUser" instead of "Cluster_7") Before: Clusters defaulting to Cluster_N when parent folder was generic. After: Three additional labeling strategies before falling back to Cluster_N. Addresses #413
|
@ShunsukeHayashi is attempting to deploy a commit to the NexusCore Team on Vercel. A member of the Team first needs to authorize it. |
CI Report✅ All checks passed Pipeline
Tests
✅ All 3674 tests passed across 1059 files 20 test(s) skipped
Coverage
📋 Full run · Coverage from Ubuntu · Generated by CI |
|
@claude review performance and check what's stated in the summary is actually delivered in this PR. |
|
Claude finished @magyargergo's task in 1m 24s —— View job PR Review: Performance & Summary Verification
Summary VerificationAll three strategies described in the PR summary are implemented correctly.
Correctness Issues1. Skip list count is wrong in the PR description The PR claims the list grows from 7 → 22 entries, but the actual The PR description uses Performance Issues2. // community-processor.ts line 304
memberIds.forEach(nodeId => {
// ...
const GENERIC_FOLDERS = new Set([ // ← new Set() on EVERY node
'src', 'lib', 'core', ...28 items
]);For a cluster with N members, this allocates N // At module scope, outside the function
const GENERIC_FOLDERS = new Set([
'src', 'lib', 'core', 'utils', 'common', 'shared', 'helpers',
// ...
]);3. Three separate passes over
For large clusters these can be merged into a single pass. That said, community sizes are typically bounded and the per-iteration work is cheap (map lookups/sets), so this is a minor concern compared to issue #2. 4. // line 382
const sortedNames = names.sort((a, b) => a.length - b.length);
Minor Observations
VerdictThe implementation delivers what the summary promises and the logic is correct. The main actionable issue is moving |
There was a problem hiding this comment.
Pull request overview
Improves community (cluster) heuristic labeling to reduce generic Cluster_N names by extracting stronger signals from member file paths and symbol metadata, addressing issue #413.
Changes:
- Expands the list of “generic” folder names that should not be used as labels.
- Adds a grandparent-directory fallback (weighted) for deeply nested paths when the parent is generic.
- Adds a final fallback that combines the dominant node type with a short symbol name when no folder/prefix label is found.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const GENERIC_FOLDERS = new Set([ | ||
| 'src', 'lib', 'core', 'utils', 'common', 'shared', 'helpers', | ||
| 'internal', 'pkg', 'app', 'modules', 'components', 'services', | ||
| 'controllers', 'models', 'views', 'types', 'interfaces', | ||
| 'test', 'tests', '__tests__', 'spec', 'specs', | ||
| 'dist', 'build', 'out', 'bin', 'scripts', | ||
| ]); |
There was a problem hiding this comment.
GENERIC_FOLDERS is re-created for every memberIds.forEach iteration, which adds avoidable overhead and makes the set harder to maintain. Define it once (module-level constant, or at least once per generateHeuristicLabel call) and reuse it inside the loop.
| const labels = graph.getNodeAttribute(nodeId, 'labels'); | ||
| const nodeType = Array.isArray(labels) ? labels[0] : (typeof labels === 'string' ? labels : ''); |
There was a problem hiding this comment.
The dominant-type labeling logic reads graph.getNodeAttribute(nodeId, 'labels'), but nodes added in buildGraphologyGraph use the attribute key type (and name/filePath). As written, labels will be undefined and dominantType will never be selected, so this fallback won’t work. Use the existing type attribute (or ensure labels is actually set on nodes) when counting types.
| const labels = graph.getNodeAttribute(nodeId, 'labels'); | |
| const nodeType = Array.isArray(labels) ? labels[0] : (typeof labels === 'string' ? labels : ''); | |
| // Prefer explicit `type` attribute; fall back to `labels` if present | |
| const typeAttr = graph.getNodeAttribute(nodeId, 'type'); | |
| const labelsAttr = typeAttr ? undefined : graph.getNodeAttribute(nodeId, 'labels'); | |
| const nodeType = typeof typeAttr === 'string' | |
| ? typeAttr | |
| : (Array.isArray(labelsAttr) | |
| ? labelsAttr[0] | |
| : (typeof labelsAttr === 'string' ? labelsAttr : '')); |
| // Use dominant type + most-connected symbol name | ||
| if (dominantType && names.length > 0) { | ||
| const sortedNames = names.sort((a, b) => a.length - b.length); | ||
| const shortestName = sortedNames[0]; | ||
| return `${dominantType}_${shortestName}`; |
There was a problem hiding this comment.
The comment says "Use dominant type + most-connected symbol name", but the implementation sorts by string length and picks the shortest name. Please update the comment to match the actual behavior (or adjust the implementation if “most-connected” was intended).
Resolves the critical issue flagged in review: the GENERIC_FOLDERS Set (28 entries) was being recreated inside memberIds.forEach on every node iteration. For a cluster with N members this allocates N Set objects unnecessarily. Moved to a module-level const so it is created once at load time and shared across all calls to generateHeuristicLabel.
Spread to a new array before sorting to avoid mutating the original names array, as noted in review feedback.
|
Both issues from the review have been addressed:
Thanks for the thorough review! |
|
@claude run a final review if it is delivered what's in the pr summary. Do we clustering properly? |
|
Claude finished @magyargergo's task in 1m 15s —— View job Final Review: Cluster Labeling Implementation
Summary Verification
Critical Bug: Strategy 3 Is Dead CodeThe dominant-type fallback at line 370 reads the // community-processor.ts line 370
const labels = graph.getNodeAttribute(nodeId, 'labels');
const nodeType = Array.isArray(labels) ? labels[0] : (typeof labels === 'string' ? labels : '');But nodes are added to the graphology graph in // community-processor.ts line 213–217
graph.addNode(node.id, {
name: node.properties.name,
filePath: node.properties.filePath,
type: node.label, // ← key is 'type'
});
The fix is a one-liner: // Change:
const labels = graph.getNodeAttribute(nodeId, 'labels');
const nodeType = Array.isArray(labels) ? labels[0] : (typeof labels === 'string' ? labels : '');
// To:
const nodeType = graph.getNodeAttribute(nodeId, 'type') ?? '';Minor IssuesComment mismatch (line 384): The comment says "Use dominant type + most-connected symbol name", but the code sorts by string length and picks the shortest name, not most-connected. Either update the comment or implement degree-based selection (most-connected by graph edges is arguably better for label quality). Low priority, but misleading. Cascade is weighted, not ordered: The PR description implies a strict priority order (grandparent only considered if parent fails), but the actual implementation blends parent and grandparent into the same
VerdictStrategies 1 and 2 are correctly implemented. Strategy 3 is broken due to a wrong attribute key ( |
|
@ShunsukeHayashi Could you please look at these findings? 🙏 |
The dominant node type fallback was reading `graph.getNodeAttribute(nodeId, 'labels')` but nodes are stored with key `type` (set in buildGraphologyGraph as `type: node.label`). This caused `labels` to always be undefined, `typeCounts` to always be empty, and Strategy 3 to silently no-op — falling straight through to `Cluster_N`. Fixes: - Use `graph.getNodeAttribute(nodeId, 'type')` instead of `labels` - Fix comment: 'most-connected symbol name' → 'shortest symbol name' - Use `[...names].sort()` to avoid mutating the original array Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Thanks for the review @magyargergo! The findings are correct — Strategy 3 was indeed dead code. Fixes applied in
All three strategies in the cascade are now functional. |
|
@claude could you propose a good e2e test for this to usecase the strength of this algorithm? Also think about negative cases. |
|
Claude finished @magyargergo's task in 4m 13s —— View job E2E Test Proposal for Cluster Labeling
Proposed test file:
|
| Test | Strategy | Type |
|---|---|---|
| non-generic parent folder used as label | 1 | ✅ positive |
| folder name is capitalized | 1 | ✅ positive |
| most-common folder wins across mixed paths | 1 | ✅ positive |
generic folders (src, utils, lib, …) are never used |
1 | ❌ negative |
| grandparent used when parent is generic | 2 | ✅ positive |
| strong parent (×3) beats shared grandparent (×1.0) | 2 | ✅ positive |
| generic grandparent also skipped | 2 | ❌ negative |
| common prefix extracted from ≥3 names in generic folder | 3 | ✅ positive |
| prefix of ≤2 chars skipped | 3 | ❌ negative |
| prefix skipped when cluster has exactly 2 members | 3 | ❌ negative |
Function_<shortest> when no folder/prefix signal |
4 | ✅ positive |
Class_<shortest> when majority are Classes |
4 | ✅ positive |
| shortest name wins over alphabetically-first name | 4 | ❌ negative |
| folder wins over name prefix (cascade order) | cascade | ✅ positive |
| name prefix wins over dominant type (cascade order) | cascade | ✅ positive |
| singleton nodes excluded from results | edge | ✅ positive |
empty filePath handled gracefully |
edge | ✅ positive |
| two isolated cliques → two distinct descriptive labels | edge | ✅ positive |
Notes
-
Cluster_Nis now unreachable — all community members have a valid graphologytypeattribute set bybuildGraphologyGraph, sodominantTypeis always populated. TheCluster_Nfallback at line 394 is dead code. It might be worth adding an explicit test that confirms this and/or removing the dead branch. -
names.length > 2threshold is undocumented — the test for exactly-2-member clusters makes this guard explicit and prevents regressions if someone changes it. -
Grandparent blending vs strict cascade — the grandparent node can win outright over a parent if many nodes share the same grandparent. The test
'strong parent (×3) beats shared grandparent (×1.0)'documents this weighted behaviour so it doesn't get "fixed" away.
|
@ShunsukeHayashi can you please look into this? 🙏 |
|
Summary
Most clusters get generic
Cluster_Nnames because the heuristic labeling only looks at the immediate parent directory and skips a short list of 7 common folder names. This PR improves labeling with three strategies.Addresses #413
Changes
1. Expanded generic folder skip list (7 → 28)
Added:
internal,pkg,app,modules,components,services,controllers,models,views,types,interfaces,test(s),__tests__,spec(s),dist,build,out,bin,scripts2. Grandparent directory fallback
For deeply nested structures like
src/auth/middleware/validate.ts, the parent directory ismiddleware(generic). Now also considersauth(grandparent) as a candidate label, with 0.5 weight.3. Dominant node type + symbol name
When no folder-based label works and no common name prefix exists, instead of
Cluster_7:Function_validateUserLabeling cascade
Tests