fix(community): improve heuristic cluster labeling (#413) #457

Closed
ShunsukeHayashi wants to merge 4 commits into abhigyanpatwari:main from Miyabi-G-K:fix/cluster-label-413

@ShunsukeHayashi ShunsukeHayashi commented Mar 22, 2026

Contributor

Summary

Most clusters get generic Cluster_N names because the heuristic labeling only looks at the immediate parent directory and skips a short list of 7 common folder names. This PR improves labeling with three strategies.

Addresses #413

Changes

1. Expanded generic folder skip list (7 → 28)

Added: internal, pkg, app, modules, components, services, controllers, models, views, types, interfaces, test(s), __tests__, spec(s), dist, build, out, bin, scripts

2. Grandparent directory fallback

For deeply nested structures like src/auth/middleware/validate.ts, the parent directory is middleware (generic). Now also considers auth (grandparent) as a candidate label, with 0.5 weight.
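A minimal sketch of the weighting described above, assuming every member votes 1.0 for its parent folder and 0.5 for its grandparent, and that generic names are skipped at both levels. The name `countFolders` and the shortened skip list `GENERIC_SAMPLE` are illustrative and not taken from community-processor.ts.

```typescript
// Illustrative subset of the skip list (assumption: the real set is larger).
const GENERIC_SAMPLE: ReadonlySet<string> = new Set([
  'src', 'lib', 'core', 'utils', 'common', 'shared', 'helpers', 'services',
]);

// Tally weighted folder votes for a list of member file paths.
function countFolders(filePaths: string[]): Map<string, number> {
  const counts = new Map<string, number>();

  // Add `weight` to `folder` unless it is missing or generic.
  const bump = (folder: string | undefined, weight: number): void => {
    if (!folder || GENERIC_SAMPLE.has(folder.toLowerCase())) return;
    counts.set(folder, (counts.get(folder) ?? 0) + weight);
  };

  for (const filePath of filePaths) {
    const parts = filePath.split('/').filter(Boolean);
    bump(parts[parts.length - 2], 1);   // parent directory
    bump(parts[parts.length - 3], 0.5); // grandparent directory
  }
  return counts;
}

const folderVotes = countFolders([
  'src/auth/services/login.ts',  // parent "services" is generic, grandparent "auth" gets 0.5
  'src/auth/services/logout.ts', // grandparent "auth" gets another 0.5
  'src/payment/charge.ts',       // parent "payment" gets 1.0, grandparent "src" is generic
]);
console.log(folderVotes.get('auth'), folderVotes.get('payment')); // 1 1
```

Because both levels land in one map, two half-votes for a grandparent tie with one full vote for a parent, so the weights behave as a ranking rather than a strict fallback.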

3. Dominant node type + symbol name

When no folder-based label works and no common name prefix exists, instead of Cluster_7:

  • Check the dominant node type in the cluster (Function, Class, etc.)
  • Combine with the shortest symbol name: Function_validateUser

Labeling cascade

1. Most-common non-generic parent folder → "Auth"
2. Grandparent directory (0.5 weight) → "Auth"  
3. Common name prefix → "validate"
4. Dominant type + shortest name → "Function_validateUser"
5. Fallback → "Cluster_7"
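The ordering of the cascade can be sketched as a small pure function. This is only an illustration under stated assumptions: `LabelSignals` and `pickLabel` are hypothetical names, steps 1 and 2 are represented by a single precomputed folder candidate, folder names are capitalized to match the "Auth" example, and the prefix is returned unchanged to match the "validate" example.

```typescript
// Hypothetical container for signals computed elsewhere.
interface LabelSignals {
  topFolder?: string;    // winning non-generic parent/grandparent folder
  commonPrefix?: string; // shared prefix of member symbol names
  dominantType?: string; // e.g. 'Function' or 'Class'
  shortestName?: string; // shortest member symbol name
}

const capitalize = (s: string): string => s.charAt(0).toUpperCase() + s.slice(1);

// Return the first label the cascade can produce, in priority order.
function pickLabel(signals: LabelSignals, clusterIndex: number): string {
  if (signals.topFolder) return capitalize(signals.topFolder);      // steps 1-2
  if (signals.commonPrefix) return signals.commonPrefix;            // step 3
  if (signals.dominantType && signals.shortestName) {
    return `${signals.dominantType}_${signals.shortestName}`;       // step 4
  }
  return `Cluster_${clusterIndex}`;                                 // step 5
}

console.log(pickLabel({ topFolder: 'auth', commonPrefix: 'validate' }, 7)); // Auth
console.log(pickLabel({ dominantType: 'Function', shortestName: 'validateUser' }, 7)); // Function_validateUser
console.log(pickLabel({}, 7)); // Cluster_7
```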

Tests

1900 passed | 1 skipped
npx tsc --noEmit  # No errors

…N names (#413)

community-processor.ts:
- Expand generic folder skip list from 7 to 22 entries
  (adds: internal, pkg, app, modules, components, services, controllers,
  models, views, types, interfaces, test, tests, __tests__, spec, specs,
  dist, build, out, bin, scripts)
- Add grandparent directory fallback for deeply nested structures
  (e.g., src/auth/middleware/ now considers "auth" as a candidate)
- Add dominant node type + symbol name as label before Cluster_N fallback
  (e.g., "Function_validateUser" instead of "Cluster_7")

Before: Clusters defaulting to Cluster_N when parent folder was generic.
After: Three additional labeling strategies before falling back to Cluster_N.

Addresses #413
@vercel

vercel Bot commented Mar 22, 2026

@ShunsukeHayashi is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions

github-actions Bot commented Mar 22, 2026

Contributor

CI Report

All checks passed · 6117f09

Pipeline

Stage Status Ubuntu Windows macOS
Typecheck success
Tests success

Tests

Metric Value
Total 3694
Passed 3674
Skipped 20
Files 1059
Duration 2m 0s

✅ All 3674 tests passed across 1059 files

20 test(s) skipped
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature
  • Python match/case as-pattern type binding > resolves u.save() to User#save via match/case as-pattern binding
  • Python match/case as-pattern type binding > does NOT resolve u.save() to Repo#save (negative disambiguation)
  • Swift constructor-inferred type resolution > detects User and Repo classes, both with save methods
  • Swift constructor-inferred type resolution > resolves user.save() to Models/User.swift via constructor-inferred type
  • Swift constructor-inferred type resolution > resolves repo.save() to Models/Repo.swift via constructor-inferred type
  • Swift constructor-inferred type resolution > emits exactly 2 save() CALLS edges (one per receiver type)
  • Swift self resolution > detects User and Repo classes, each with a save function
  • Swift self resolution > resolves self.save() inside User.process to User.save, not Repo.save
  • Swift parent resolution > detects BaseModel and User classes plus Serializable protocol
  • Swift parent resolution > emits EXTENDS edge: User → BaseModel
  • Swift parent resolution > emits IMPLEMENTS edge: User → Serializable (protocol conformance)
  • Swift cross-file User.init() inference > resolves user.save() via User.init(name:) inference
  • Swift cross-file User.init() inference > resolves user.greet() via User.init(name:) inference
  • Swift return type inference > detects User class and getUser function
  • Swift return type inference > detects save function on User (Swift class methods are Function nodes)
  • Swift return type inference > resolves user.save() to User#save via return type of getUser() -> User
  • Swift return-type inference via function return type > resolves user.save() to User#save via return type of getUser()
  • Swift return-type inference via function return type > user.save() does NOT resolve to Repo#save
  • Swift return-type inference via function return type > resolves repo.save() to Repo#save via return type of getRepo()

Coverage

| Metric | Coverage | Covered | Base (main) | Delta |
| --- | --- | --- | --- | --- |
| Statements | 69.29% | 9253/13354 | 69.06% | 📈 +0.2% |
| Branches | 60.2% | 6258/10394 | 60.01% | 📈 +0.2% |
| Functions | 72.39% | 826/1141 | 71.88% | 📈 +0.5% |
| Lines | 71.54% | 8256/11539 | 71.34% | 📈 +0.2% |

📋 Full run · Coverage from Ubuntu · Generated by CI

@magyargergo

Collaborator

@claude review performance and check that what's stated in the summary is actually delivered in this PR.

@magyargergo magyargergo requested a review from Copilot March 22, 2026 13:48
@github-actions

github-actions Bot commented Mar 22, 2026

Contributor

Claude finished @magyargergo's task in 1m 24s —— View job


PR Review: Performance & Summary Verification

  • Gather context
  • Read changed file (community-processor.ts)
  • Verify PR summary claims are delivered in code
  • Analyze performance implications
  • Post full review

Summary Verification

All three strategies described in the PR summary are implemented correctly.

| Claim | Delivered? | Notes |
| --- | --- | --- |
| Expanded generic folder skip list | ✅ Yes | But count is wrong; see below |
| Grandparent directory fallback with 0.5 weight | ✅ Yes | Implemented at lines 322–328 |
| Dominant node type + shortest symbol name | ✅ Yes | Implemented at lines 363–388 |
| Labeling cascade (5-step order) | ✅ Yes | Order matches description |

Correctness Issues

1. Skip list count is wrong in the PR description

The PR claims the list grows from 7 → 22 entries, but the actual Set in the code contains 28 entries:

src, lib, core, utils, common, shared, helpers,        ← original 7
internal, pkg, app, modules, components, services,
controllers, models, views, types, interfaces,
test, tests, __tests__, spec, specs,
dist, build, out, bin, scripts                         ← 21 additions = 28 total

The PR description uses test(s) and spec(s) as shorthand, but the code correctly adds both test/tests and spec/specs as separate entries (which is good). The count just needs correcting in the description — minor doc issue.


Performance Issues

2. GENERIC_FOLDERS Set is recreated on every loop iteration — highest impact issue

// community-processor.ts line 304
memberIds.forEach(nodeId => {
  // ...
  const GENERIC_FOLDERS = new Set([  // ← new Set() on EVERY node
    'src', 'lib', 'core', // ... 28 items in total
  ]);
  // ...
});

For a cluster with N members, this allocates N Set objects, and each construction inserts the same 28 strings. The set never changes, so it should be a module-level constant:

// At module scope, outside the function
const GENERIC_FOLDERS = new Set([
  'src', 'lib', 'core', 'utils', 'common', 'shared', 'helpers',
  // ...
]);

3. Three separate passes over memberIds

memberIds is iterated three times in sequence:

  • Line 304: folder counting
  • Line 350: name collection
  • Line 365: type counting

For large clusters these can be merged into a single pass. That said, community sizes are typically bounded and the per-iteration work is cheap (map lookups/sets), so this is a minor concern compared to item 2 above.
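A rough sketch of what a merged pass could look like, assuming each member exposes `name`, `filePath` and `type`. The `Member` and `Tallies` shapes and the function `tallyMembers` are hypothetical, and the grandparent vote is left out to keep the example short.

```typescript
// Hypothetical member shape; the real code reads these from graph attributes.
interface Member { name: string; filePath: string; type: string }

interface Tallies {
  folderCounts: Map<string, number>;
  typeCounts: Map<string, number>;
  names: string[];
}

// Collect folder votes, symbol names and type counts in one loop.
function tallyMembers(members: Member[], generic: ReadonlySet<string>): Tallies {
  const folderCounts = new Map<string, number>();
  const typeCounts = new Map<string, number>();
  const names: string[] = [];

  const add = (map: Map<string, number>, key: string): void => {
    map.set(key, (map.get(key) ?? 0) + 1);
  };

  for (const { name, filePath, type } of members) {
    const parts = filePath.split('/').filter(Boolean);
    const parent = parts[parts.length - 2];
    if (parent && !generic.has(parent)) add(folderCounts, parent); // folder counting
    if (name) names.push(name);                                    // name collection
    if (type) add(typeCounts, type);                               // type counting
  }
  return { folderCounts, typeCounts, names };
}

const tallies = tallyMembers(
  [
    { name: 'login', filePath: 'src/auth/login.ts', type: 'Function' },
    { name: 'User', filePath: 'src/models/User.ts', type: 'Class' },
    { name: 'logout', filePath: 'src/auth/logout.ts', type: 'Function' },
  ],
  new Set(['src', 'models']),
);
console.log(tallies.folderCounts.get('auth'), tallies.typeCounts.get('Function')); // 2 2
```

The trade-off is that the later strategies are computed even when the folder label wins, so the merged loop does slightly more work per member in exchange for fewer traversals.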

4. names.sort() mutates the array in-place

// line 382
const sortedNames = names.sort((a, b) => a.length - b.length);

Array.sort mutates names. The names array is only read once before this point (for findCommonPrefix), so there's no actual bug, but it's a subtle footgun. [...names].sort(...) or names.toSorted(...) (an ES2023 method, typed from TypeScript 5.2 with the es2023 lib) would be safer.
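A short illustration of the difference, using made-up symbol names. Copying before sorting leaves the source array in its original order, while calling sort() directly reorders the array in place and returns that same array.

```typescript
const names = ['validateUser', 'dispatch', 'handleEvent'];

// Copy first, then sort by length: `names` is left untouched.
const byLength = [...names].sort((a, b) => a.length - b.length);
console.log(byLength[0]); // dispatch
console.log(names[0]);    // validateUser

// Without the copy, sort() reorders the array it is called on.
const scratch = ['ccc', 'a', 'bb'];
const sorted = scratch.sort((a, b) => a.length - b.length);
console.log(sorted === scratch); // true
console.log(scratch[0]);         // a
```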


Minor Observations

  • The PR description's labeling cascade describes grandparent as a standalone step 2, but in the code it's blended into the same folderCounts map as step 1 (just with 0.5 weight). This means grandparent can win over parent if multiple nodes share the same grandparent. The behavior is correct but the documentation implies a stricter priority order than what's implemented.
  • No new tests were added for the three new strategies. The PR cites 1900 passed (likely pre-existing tests) — it would be good to have at least one test covering the grandparent fallback and dominant-type paths.

Verdict

The implementation delivers what the summary promises and the logic is correct. The main actionable issue is moving GENERIC_FOLDERS outside the loop — it's a straightforward fix that eliminates unnecessary allocations per node in every cluster labeling call.

Copilot AI left a comment

Pull request overview

Improves community (cluster) heuristic labeling to reduce generic Cluster_N names by extracting stronger signals from member file paths and symbol metadata, addressing issue #413.

Changes:

  • Expands the list of “generic” folder names that should not be used as labels.
  • Adds a grandparent-directory fallback (weighted) for deeply nested paths when the parent is generic.
  • Adds a final fallback that combines the dominant node type with a short symbol name when no folder/prefix label is found.

Comment on lines +312 to +318
const GENERIC_FOLDERS = new Set([
  'src', 'lib', 'core', 'utils', 'common', 'shared', 'helpers',
  'internal', 'pkg', 'app', 'modules', 'components', 'services',
  'controllers', 'models', 'views', 'types', 'interfaces',
  'test', 'tests', '__tests__', 'spec', 'specs',
  'dist', 'build', 'out', 'bin', 'scripts',
]);

Copilot AI Mar 22, 2026

GENERIC_FOLDERS is re-created for every memberIds.forEach iteration, which adds avoidable overhead and makes the set harder to maintain. Define it once (module-level constant, or at least once per generateHeuristicLabel call) and reuse it inside the loop.

Comment on lines +366 to +367
const labels = graph.getNodeAttribute(nodeId, 'labels');
const nodeType = Array.isArray(labels) ? labels[0] : (typeof labels === 'string' ? labels : '');

Copilot AI Mar 22, 2026

The dominant-type labeling logic reads graph.getNodeAttribute(nodeId, 'labels'), but nodes added in buildGraphologyGraph use the attribute key type (and name/filePath). As written, labels will be undefined and dominantType will never be selected, so this fallback won’t work. Use the existing type attribute (or ensure labels is actually set on nodes) when counting types.

Suggested change
- const labels = graph.getNodeAttribute(nodeId, 'labels');
- const nodeType = Array.isArray(labels) ? labels[0] : (typeof labels === 'string' ? labels : '');
+ // Prefer explicit `type` attribute; fall back to `labels` if present
+ const typeAttr = graph.getNodeAttribute(nodeId, 'type');
+ const labelsAttr = typeAttr ? undefined : graph.getNodeAttribute(nodeId, 'labels');
+ const nodeType = typeof typeAttr === 'string'
+   ? typeAttr
+   : (Array.isArray(labelsAttr)
+     ? labelsAttr[0]
+     : (typeof labelsAttr === 'string' ? labelsAttr : ''));

Comment on lines +380 to +384
// Use dominant type + most-connected symbol name
if (dominantType && names.length > 0) {
  const sortedNames = names.sort((a, b) => a.length - b.length);
  const shortestName = sortedNames[0];
  return `${dominantType}_${shortestName}`;

Copilot AI Mar 22, 2026

The comment says "Use dominant type + most-connected symbol name", but the implementation sorts by string length and picks the shortest name. Please update the comment to match the actual behavior (or adjust the implementation if “most-connected” was intended).

Resolves the critical issue flagged in review: the GENERIC_FOLDERS Set
(28 entries) was being recreated inside memberIds.forEach on every node
iteration. For a cluster with N members this allocates N Set objects
unnecessarily.

Moved to a module-level const so it is created once at load time and
shared across all calls to generateHeuristicLabel.

Spread to a new array before sorting to avoid mutating the original
names array, as noted in review feedback.
@ShunsukeHayashi

Contributor Author

Both issues from the review have been addressed:

  • GENERIC_FOLDERS module scope (commit 908b264): Moved from inside the memberIds.forEach loop to a module-level constant — resolves the N Set allocations per community.
  • names.sort() mutation (commit b6ff67f): Changed to [...names].sort() to avoid mutating the original array.

Thanks for the thorough review!

@magyargergo

Collaborator

@claude run a final review to check whether what's in the PR summary is delivered. Are we clustering properly?

@github-actions

github-actions Bot commented Mar 22, 2026

Contributor

Claude finished @magyargergo's task in 1m 15s —— View job


Final Review: Cluster Labeling Implementation

  • Gather context
  • Read changed file (community-processor.ts)
  • Verify all 3 PR summary strategies are delivered
  • Check for the labels vs type attribute bug flagged by Copilot
  • Assess overall clustering correctness
  • Post final review

Summary Verification

| Strategy | Status | Notes |
| --- | --- | --- |
| 1. Expanded generic folder skip list | ✅ Delivered | 28 entries at module scope (fixed from inner loop in follow-up commits) |
| 2. Grandparent directory fallback (0.5 weight) | ✅ Delivered | Lines 327–332, blended into same folderCounts map |
| 3. Dominant node type + shortest symbol name | Broken (dead code) | See critical bug below |

Critical Bug: Strategy 3 Is Dead Code

The dominant-type fallback at line 370 reads the labels attribute:

// community-processor.ts line 370
const labels = graph.getNodeAttribute(nodeId, 'labels');
const nodeType = Array.isArray(labels) ? labels[0] : (typeof labels === 'string' ? labels : '');

But nodes are added to the graphology graph in buildGraphologyGraph with the attribute key type, not labels:

// community-processor.ts line 213–217
graph.addNode(node.id, {
  name: node.properties.name,
  filePath: node.properties.filePath,
  type: node.label,   // ← key is 'type'
});

graph.getNodeAttribute(nodeId, 'labels') will always return undefined. This means nodeType is always '', typeCounts is always empty, and dominantType is never set. The entire Strategy 3 fallback never fires — the code falls straight through to Cluster_N, the exact outcome this PR was meant to reduce.

The fix is a one-liner:

// Change:
const labels = graph.getNodeAttribute(nodeId, 'labels');
const nodeType = Array.isArray(labels) ? labels[0] : (typeof labels === 'string' ? labels : '');

// To:
const nodeType = graph.getNodeAttribute(nodeId, 'type') ?? '';
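The silent no-op can be reproduced without the graph library. The sketch below uses a plain Map as a hypothetical stand-in for the node attribute store, assuming (as described above) that reading a key that was never set yields undefined rather than throwing. `attrs`, `getAttr` and `countTypes` are illustrative names.

```typescript
// Hypothetical stand-in for node attributes, stored under the key 'type'.
const attrs = new Map<string, Record<string, unknown>>([
  ['n1', { name: 'login', filePath: 'src/utils/login.ts', type: 'Function' }],
  ['n2', { name: 'User', filePath: 'src/models/User.ts', type: 'Class' }],
  ['n3', { name: 'logout', filePath: 'src/utils/logout.ts', type: 'Function' }],
]);

// Reading a key that was never set returns undefined.
function getAttr(nodeId: string, key: string): unknown {
  return attrs.get(nodeId)?.[key];
}

// Count node types using the given attribute key.
function countTypes(key: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const nodeId of attrs.keys()) {
    const value = getAttr(nodeId, key);
    const nodeType = typeof value === 'string' ? value : '';
    if (nodeType) counts.set(nodeType, (counts.get(nodeType) ?? 0) + 1);
  }
  return counts;
}

console.log(countTypes('labels').size);          // 0, so no dominant type is ever found
console.log(countTypes('type').get('Function')); // 2
```

Nothing throws in the wrong-key case, which is why the problem only shows up as labels that stay on Cluster_N.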


Minor Issues

Comment mismatch (line 384): The comment says "Use dominant type + most-connected symbol name", but the code sorts by string length and picks the shortest name, not most-connected. Either update the comment or implement degree-based selection (most-connected by graph edges is arguably better for label quality). Low priority, but misleading.

Cascade is weighted, not ordered: The PR description implies a strict priority order (grandparent only considered if parent fails), but the actual implementation blends parent and grandparent into the same folderCounts map with weights. A grandparent shared by many nodes can outrank a parent. The behavior is reasonable but differs from the documented cascade — worth noting in the PR description.

names.length > 2 threshold: The prefix check requires more than 2 names. This means a 2-member cluster skips the prefix step entirely. Likely intentional (prefix of 2 strings is noisy), but undocumented.
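To make the guard concrete, here is a minimal sketch of a prefix step. The implementation of findCommonPrefix is not shown in this thread, so `commonPrefix` and `prefixLabel` below are hypothetical, and the minimum prefix length of 3 characters is an assumption.

```typescript
// Longest shared leading substring of all names (empty string if none).
function commonPrefix(names: string[]): string {
  let prefix = names[0] ?? '';
  for (const name of names.slice(1)) {
    let i = 0;
    while (i < prefix.length && i < name.length && prefix[i] === name[i]) i++;
    prefix = prefix.slice(0, i);
    if (prefix === '') break;
  }
  return prefix;
}

// Apply the guards: more than 2 names, and (assumed) more than 2 characters.
function prefixLabel(names: string[]): string | undefined {
  if (names.length <= 2) return undefined;
  const prefix = commonPrefix(names);
  return prefix.length > 2 ? prefix : undefined;
}

console.log(prefixLabel(['validateUser', 'validateOrder', 'validatePayment'])); // validate
console.log(prefixLabel(['validateUser', 'validateOrder']));                    // undefined
console.log(prefixLabel(['fnApply', 'fnCompose', 'fnPipe']));                   // undefined
```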


Verdict

Strategies 1 and 2 are correctly implemented. Strategy 3 is broken due to a wrong attribute key ('labels' vs 'type'). The PR will continue producing Cluster_N names for any community that has no clear folder signal or name prefix, because the dominant-type fallback silently no-ops. This needs to be fixed before merge.

@magyargergo

Collaborator

@ShunsukeHayashi Could you please look at these findings? 🙏

The dominant node type fallback was reading `graph.getNodeAttribute(nodeId, 'labels')`
but nodes are stored with key `type` (set in buildGraphologyGraph as `type: node.label`).
This caused `labels` to always be undefined, `typeCounts` to always be empty, and
Strategy 3 to silently no-op — falling straight through to `Cluster_N`.

Fixes:
- Use `graph.getNodeAttribute(nodeId, 'type')` instead of `labels`
- Fix comment: 'most-connected symbol name' → 'shortest symbol name'
- Use `[...names].sort()` to avoid mutating the original array

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ShunsukeHayashi

Contributor Author

Thanks for the review @magyargergo! The findings are correct — Strategy 3 was indeed dead code.

Fixes applied in 6117f09:

  1. Critical bug fixed: Changed graph.getNodeAttribute(nodeId, 'labels') → graph.getNodeAttribute(nodeId, 'type') ?? ''

    • Nodes are stored with key type in buildGraphologyGraph (type: node.label), so reading 'labels' always returned undefined
    • Strategy 3 will now correctly fire and produce labels like Function_validateUser
  2. Comment mismatch fixed: // Use dominant type + most-connected symbol name → // Use dominant type + shortest symbol name

  3. Non-mutating sort: names.sort(...) → [...names].sort(...) (the remote branch already had this from b6ff67f, now confirmed in the rebased commit)

All three strategies in the cascade are now functional.

@magyargergo

Collaborator

@claude could you propose a good e2e test for this to showcase the strength of this algorithm? Also think about negative cases.

@github-actions

github-actions Bot commented Mar 23, 2026

Contributor

Claude finished @magyargergo's task in 4m 13s —— View job


E2E Test Proposal for Cluster Labeling

  • Gather context
  • Read implementation and existing tests
  • Propose e2e tests (positive and negative cases)

Proposed test file: test/unit/cluster-labeling.test.ts

The existing cohesion-consistency.test.ts already establishes the right pattern: exercise generateHeuristicLabel indirectly through processCommunities using isolated fully-connected cliques that Leiden will deterministically keep together. The proposed tests follow that pattern.

/**
 * Unit tests for the heuristic cluster labeling strategies in generateHeuristicLabel.
 *
 * Since generateHeuristicLabel is not exported, tests exercise it indirectly through
 * processCommunities. Each test builds an isolated fully-connected clique so that
 * Leiden assigns all nodes to a single community deterministically.
 *
 * Cascade under test:
 *   1. Most-common non-generic parent folder  → e.g. "Auth"
 *   2. Grandparent directory (0.5 weight)     → e.g. "Auth" when parent is generic
 *   3. Common name prefix (> 2 members, > 2 chars) → e.g. "Validate"
 *   4. Dominant type + shortest symbol name   → e.g. "Function_dispatch"
 *   5. Fallback                               → "Cluster_N"
 */
import { describe, it, expect } from 'vitest';
import { createKnowledgeGraph } from '../../src/core/graph/graph.js';
import type { GraphNode, GraphRelationship } from '../../src/core/graph/types.js';
import { processCommunities } from '../../src/core/ingestion/community-processor.js';

// ============================================================================
// HELPERS
// ============================================================================

function makeNode(id: string, name: string, label: GraphNode['label'], filePath: string): GraphNode {
  return { id, label, properties: { name, filePath, startLine: 1, endLine: 10, isExported: false } };
}

function makeRel(id: string, src: string, tgt: string): GraphRelationship {
  return { id, sourceId: src, targetId: tgt, type: 'CALLS', confidence: 1.0, reason: '' };
}

/**
 * Build a fully-connected clique and return the heuristic label of the resulting community.
 * All nodes default to 'Function' unless overridden.
 */
async function getLabelForNodes(
  nodes: Array<{ id: string; name: string; filePath: string; label?: GraphNode['label'] }>,
): Promise<string> {
  const graph = createKnowledgeGraph();
  for (const n of nodes) {
    graph.addNode(makeNode(n.id, n.name, n.label ?? 'Function', n.filePath));
  }
  let relIdx = 0;
  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      graph.addRelationship(makeRel(`r${relIdx++}`, nodes[i].id, nodes[j].id));
    }
  }
  const result = await processCommunities(graph);
  // A single isolated clique should produce exactly one community
  expect(result.communities).toHaveLength(1);
  return result.communities[0].heuristicLabel;
}

// ============================================================================
// STRATEGY 1: Parent folder label
// ============================================================================

describe('generateHeuristicLabel — cluster labeling', () => {
  describe('Strategy 1: parent folder label', () => {
    it('uses a non-generic parent folder as the label', async () => {
      const label = await getLabelForNodes([
        { id: 'n1', name: 'login',   filePath: 'src/auth/login.ts' },
        { id: 'n2', name: 'logout',  filePath: 'src/auth/logout.ts' },
        { id: 'n3', name: 'refresh', filePath: 'src/auth/refresh.ts' },
      ]);
      expect(label).toBe('Auth');
    });

    it('capitalizes the first letter of the folder name', async () => {
      const label = await getLabelForNodes([
        { id: 'n1', name: 'charge', filePath: 'src/payment/charge.ts' },
        { id: 'n2', name: 'refund', filePath: 'src/payment/refund.ts' },
        { id: 'n3', name: 'verify', filePath: 'src/payment/verify.ts' },
      ]);
      expect(label).toBe('Payment');
    });

    it('picks the most-common folder when members span multiple non-generic folders', async () => {
      // 3 nodes in "auth", 1 node in "billing" → "Auth" wins
      const label = await getLabelForNodes([
        { id: 'n1', name: 'login',   filePath: 'src/auth/login.ts' },
        { id: 'n2', name: 'logout',  filePath: 'src/auth/logout.ts' },
        { id: 'n3', name: 'refresh', filePath: 'src/auth/refresh.ts' },
        { id: 'n4', name: 'charge',  filePath: 'src/billing/charge.ts' },
      ]);
      expect(label).toBe('Auth');
    });

    // ── NEGATIVE ──────────────────────────────────────────────────────────────
    it('does NOT produce a label from generic parent folders (utils, src, lib, …)', async () => {
      const genericFolders = [
        'src', 'lib', 'core', 'utils', 'common', 'shared', 'helpers',
        'components', 'services', 'controllers', 'models',
      ];
      for (const folder of genericFolders) {
        const label = await getLabelForNodes([
          { id: `${folder}_n1`, name: 'fn1', filePath: `${folder}/action1.ts` },
          { id: `${folder}_n2`, name: 'fn2', filePath: `${folder}/action2.ts` },
          { id: `${folder}_n3`, name: 'fn3', filePath: `${folder}/action3.ts` },
        ]);
        const capitalized = folder.charAt(0).toUpperCase() + folder.slice(1);
        expect(label, `folder "${folder}" should be skipped`).not.toBe(capitalized);
      }
    });
  });

  // ============================================================================
  // STRATEGY 2: Grandparent directory fallback
  // ============================================================================

  describe('Strategy 2: grandparent directory fallback', () => {
    it('falls back to grandparent when immediate parent is generic (e.g. services)', async () => {
      // src/auth/services/validateToken.ts → parent="services" (generic), grandparent="auth"
      // The parent must be a GENERIC_FOLDERS entry; "middleware" is not in the 28-entry list.
      const label = await getLabelForNodes([
        { id: 'n1', name: 'validateToken',    filePath: 'src/auth/services/validateToken.ts' },
        { id: 'n2', name: 'sanitizeInput',    filePath: 'src/auth/services/sanitizeInput.ts' },
        { id: 'n3', name: 'checkPermissions', filePath: 'src/auth/services/checkPermissions.ts' },
      ]);
      expect(label).toBe('Auth');
    });

    it('strong parent folder (count=3) beats a shared grandparent (count=1.0)', async () => {
      // payment parent direct × 3  vs  auth grandparent × 2 nodes (each 0.5 = 1.0)
      const label = await getLabelForNodes([
        { id: 'n1', name: 'charge',  filePath: 'src/payment/charge.ts' },
        { id: 'n2', name: 'refund',  filePath: 'src/payment/refund.ts' },
        { id: 'n3', name: 'verify',  filePath: 'src/payment/verify.ts' },
        { id: 'n4', name: 'login',   filePath: 'src/auth/services/login.ts' },  // grandparent=auth +0.5
        { id: 'n5', name: 'logout',  filePath: 'src/auth/services/logout.ts' }, // grandparent=auth +0.5 = total 1.0
      ]);
      // payment(3.0) > auth(1.0) → "Payment"
      expect(label).toBe('Payment');
    });

    // ── NEGATIVE ──────────────────────────────────────────────────────────────
    it('does NOT use grandparent when it is also generic', async () => {
      // src/lib/utils/validate.ts → parent="utils" (generic), grandparent="lib" (generic)
      const label = await getLabelForNodes([
        { id: 'n1', name: 'validateEmail', filePath: 'src/lib/utils/validateEmail.ts' },
        { id: 'n2', name: 'validatePhone', filePath: 'src/lib/utils/validatePhone.ts' },
        { id: 'n3', name: 'validateDate',  filePath: 'src/lib/utils/validateDate.ts' },
      ]);
      expect(label).not.toBe('Utils');
      expect(label).not.toBe('Lib');
    });
  });

  // ============================================================================
  // STRATEGY 3: Common name prefix
  // ============================================================================

  describe('Strategy 3: common name prefix', () => {
    it('extracts common prefix from 3+ symbol names in a generic folder', async () => {
      const label = await getLabelForNodes([
        { id: 'n1', name: 'validateUser',    filePath: 'src/utils/validateUser.ts' },
        { id: 'n2', name: 'validateOrder',   filePath: 'src/utils/validateOrder.ts' },
        { id: 'n3', name: 'validatePayment', filePath: 'src/utils/validatePayment.ts' },
      ]);
      expect(label).toBe('Validate');
    });

    // ── NEGATIVE ──────────────────────────────────────────────────────────────
    it('does NOT use a prefix of 2 characters or fewer', async () => {
      // Common prefix "fn" — only 2 chars, should be skipped
      const label = await getLabelForNodes([
        { id: 'n1', name: 'fnApply',    filePath: 'src/utils/fnApply.ts' },
        { id: 'n2', name: 'fnCompose',  filePath: 'src/utils/fnCompose.ts' },
        { id: 'n3', name: 'fnPipe',     filePath: 'src/utils/fnPipe.ts' },
      ]);
      expect(label).not.toBe('Fn');
    });

    // ── NEGATIVE ──────────────────────────────────────────────────────────────
    it('skips prefix strategy when the cluster has only 2 members (names.length <= 2 check)', async () => {
      // Exactly 2 nodes — the `names.length > 2` guard blocks prefix extraction
      const label = await getLabelForNodes([
        { id: 'n1', name: 'validateUser',  filePath: 'src/utils/validateUser.ts' },
        { id: 'n2', name: 'validateOrder', filePath: 'src/utils/validateOrder.ts' },
      ]);
      // Falls through to dominant type strategy
      expect(label).toMatch(/^Function_/);
    });
  });

  // ============================================================================
  // STRATEGY 4: Dominant type + shortest symbol name
  // ============================================================================

  describe('Strategy 4: dominant type + shortest symbol name', () => {
    it('produces Function_<shortest> when no folder or prefix signal exists', async () => {
      // All generic folders, names have no common prefix
      const label = await getLabelForNodes([
        { id: 'n1', name: 'processRequest', filePath: 'src/utils/processRequest.ts' },
        { id: 'n2', name: 'handleEvent',    filePath: 'lib/helpers/handleEvent.ts' },
        { id: 'n3', name: 'dispatch',       filePath: 'src/core/dispatch.ts' },
      ]);
      // Sorted by length: dispatch(8) < handleEvent(11) < processRequest(14)
      expect(label).toBe('Function_dispatch');
    });

    it('uses the dominant Class type when the majority of nodes are Classes', async () => {
      const label = await getLabelForNodes([
        { id: 'n1', name: 'UserService', label: 'Class',    filePath: 'src/services/UserService.ts' },
        { id: 'n2', name: 'AuthService', label: 'Class',    filePath: 'src/services/AuthService.ts' },
        { id: 'n3', name: 'configure',   label: 'Function', filePath: 'src/services/configure.ts' },
      ]);
      // 2 Class, 1 Function — dominant=Class, shortest name = "configure" (9 chars)
      expect(label).toBe('Class_configure');
    });

    // ── NEGATIVE ──────────────────────────────────────────────────────────────
    it('picks the shortest name, not the alphabetically first name', async () => {
      const label = await getLabelForNodes([
        { id: 'n1', name: 'aaaVeryLongFunctionName', filePath: 'src/utils/a.ts' },
        { id: 'n2', name: 'zzz',                     filePath: 'lib/helpers/b.ts' },
        { id: 'n3', name: 'mmmmmmm',                 filePath: 'src/core/c.ts' },
      ]);
      // "zzz" is shortest (3 chars) even though it sorts last alphabetically
      expect(label).toBe('Function_zzz');
    });
  });

  // ============================================================================
  // CASCADE PRIORITY
  // ============================================================================

  describe('Cascade priority', () => {
    it('folder label takes precedence over a common name prefix', async () => {
      // Both signals present: nodes in "auth/" AND names share "validate" prefix
      const label = await getLabelForNodes([
        { id: 'n1', name: 'validateToken',   filePath: 'src/auth/validateToken.ts' },
        { id: 'n2', name: 'validateSession', filePath: 'src/auth/validateSession.ts' },
        { id: 'n3', name: 'validateCookie',  filePath: 'src/auth/validateCookie.ts' },
      ]);
      expect(label).toBe('Auth');   // not "Validate"
    });

    it('name prefix takes precedence over dominant-type label', async () => {
      const label = await getLabelForNodes([
        { id: 'n1', name: 'parseJSON', filePath: 'src/utils/parseJSON.ts' },
        { id: 'n2', name: 'parseXML',  filePath: 'src/utils/parseXML.ts' },
        { id: 'n3', name: 'parseCSV',  filePath: 'src/utils/parseCSV.ts' },
      ]);
      // "parse" (5 chars > 2) wins over Function_parseCSV
      expect(label).toBe('Parse');
    });
  });

  // ============================================================================
  // EDGE CASES
  // ============================================================================

  describe('Edge cases', () => {
    it('singleton nodes are excluded from community results', async () => {
      const graph = createKnowledgeGraph();
      graph.addNode(makeNode('lone', 'loneFunction', 'Function', 'src/auth/lone.ts'));
      // No edges — will be a singleton and filtered out
      const result = await processCommunities(graph);
      expect(result.communities).toHaveLength(0);
    });

    it('nodes with empty filePath fall back gracefully to type-based label', async () => {
      const label = await getLabelForNodes([
        { id: 'n1', name: 'anonymousFn', filePath: '' },
        { id: 'n2', name: 'anotherFn',   filePath: '' },
        { id: 'n3', name: 'thirdFn',     filePath: '' },
      ]);
      // No path signal, no common prefix → type-based label
      expect(label).toMatch(/^(Function_|Cluster_)/);
    });

    it('two isolated cliques in one graph get distinct, descriptive labels', async () => {
      const graph = createKnowledgeGraph();

      // Clique 1: auth
      for (let i = 0; i < 3; i++) {
        graph.addNode(makeNode(`a${i}`, `authFn${i}`, 'Function', `src/auth/f${i}.ts`));
      }
      graph.addRelationship(makeRel('r1', 'a0', 'a1'));
      graph.addRelationship(makeRel('r2', 'a1', 'a2'));
      graph.addRelationship(makeRel('r3', 'a0', 'a2'));

      // Clique 2: payment (isolated — no cross edges)
      for (let i = 0; i < 3; i++) {
        graph.addNode(makeNode(`p${i}`, `paymentFn${i}`, 'Function', `src/payment/f${i}.ts`));
      }
      graph.addRelationship(makeRel('r4', 'p0', 'p1'));
      graph.addRelationship(makeRel('r5', 'p1', 'p2'));
      graph.addRelationship(makeRel('r6', 'p0', 'p2'));

      const result = await processCommunities(graph);
      expect(result.communities).toHaveLength(2);

      const labels = result.communities.map(c => c.heuristicLabel);
      expect(labels).toContain('Auth');
      expect(labels).toContain('Payment');
    });
  });
});
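
For readers without the implementation open, the last three cascade tiers these tests exercise (common name prefix, then dominant type plus shortest name, then `Cluster_N`) can be sketched roughly as below. This is a minimal illustration, not the code in community-processor.ts: `SymbolNode`, `prefixLabel`, `typeLabel`, and `fallbackLabel` are hypothetical names, and the default node type of `Function` is an assumption taken from how the tests above omit `label`.

```typescript
// Hypothetical node shape; only the fields the sketch needs.
interface SymbolNode {
  name: string;
  label?: string; // node type; ASSUMED to default to 'Function' when omitted
}

function capitalize(s: string): string {
  return s.charAt(0).toUpperCase() + s.slice(1);
}

// Longest prefix shared by every name (case-sensitive).
function commonPrefix(names: string[]): string {
  if (names.length === 0) return '';
  let prefix = names[0];
  for (const name of names.slice(1)) {
    // Shrink the candidate until this name starts with it ('' always matches).
    while (!name.startsWith(prefix)) prefix = prefix.slice(0, -1);
  }
  return prefix;
}

// Tier 3: needs more than 2 names and a prefix longer than 2 characters.
function prefixLabel(nodes: SymbolNode[]): string | null {
  const names = nodes.map((n) => n.name);
  if (names.length <= 2) return null;
  const prefix = commonPrefix(names);
  return prefix.length > 2 ? capitalize(prefix) : null;
}

// Tier 4: dominant node type combined with the shortest symbol name.
function typeLabel(nodes: SymbolNode[]): string | null {
  if (nodes.length === 0) return null;
  const counts = new Map<string, number>();
  for (const n of nodes) {
    const type = n.label ?? 'Function';
    counts.set(type, (counts.get(type) ?? 0) + 1);
  }
  const dominant = Array.from(counts.entries()).sort((a, b) => b[1] - a[1])[0][0];
  // Shortest name across ALL members, not only those of the dominant type.
  const shortest = nodes.map((n) => n.name).reduce((a, b) => (b.length < a.length ? b : a));
  return `${dominant}_${shortest}`;
}

// Tiers 3 to 5 chained; the folder tiers (1 and 2) would run before this.
function fallbackLabel(nodes: SymbolNode[], clusterIndex: number): string {
  return prefixLabel(nodes) ?? typeLabel(nodes) ?? `Cluster_${clusterIndex}`;
}
```

In this sketch, `typeLabel` only returns `null` for an empty cluster, which is why the `Cluster_N` branch is effectively reachable only when there are no members at all.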

Coverage summary

| Test | Strategy | Type |
| --- | --- | --- |
| non-generic parent folder used as label | 1 | ✅ positive |
| folder name is capitalized | 1 | ✅ positive |
| most-common folder wins across mixed paths | 1 | ✅ positive |
| generic folders (src, utils, lib, …) are never used | 1 | ❌ negative |
| grandparent used when parent is generic | 2 | ✅ positive |
| strong parent (×3) beats shared grandparent (×1.0) | 2 | ✅ positive |
| generic grandparent also skipped | 2 | ❌ negative |
| common prefix extracted from ≥3 names in generic folder | 3 | ✅ positive |
| prefix of ≤2 chars skipped | 3 | ❌ negative |
| prefix skipped when cluster has exactly 2 members | 3 | ❌ negative |
| Function_<shortest> when no folder/prefix signal | 4 | ✅ positive |
| Class_<shortest> when majority are Classes | 4 | ✅ positive |
| shortest name wins over alphabetically-first name | 4 | ❌ negative |
| folder wins over name prefix (cascade order) | cascade | ✅ positive |
| name prefix wins over dominant type (cascade order) | cascade | ✅ positive |
| singleton nodes excluded from results | edge | ✅ positive |
| empty filePath handled gracefully | edge | ✅ positive |
| two isolated cliques → two distinct descriptive labels | edge | ✅ positive |

Notes

  1. Cluster_N is now unreachable — all community members have a valid graphology type attribute set by buildGraphologyGraph, so dominantType is always populated. The Cluster_N fallback at line 394 is dead code. It might be worth adding an explicit test that confirms this and/or removing the dead branch.

  2. names.length > 2 threshold is undocumented — the test for exactly-2-member clusters makes this guard explicit and prevents regressions if someone changes it.

  3. Grandparent blending vs strict cascade: grandparent votes are blended with parent votes rather than used only as a fallback, so a grandparent directory can win outright over a parent directory if many nodes share the same grandparent. The test 'strong parent (×3) beats shared grandparent (×1.0)' documents this weighted behaviour so it doesn't get "fixed" away.
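
To make note 3 concrete, here is a small sketch of weighted folder voting. It assumes every file casts a full vote for its parent folder and a 0.5 vote for its grandparent (the weight stated in the PR description), with generic folders skipped. `folderLabel` and the short `GENERIC_FOLDERS` list are illustrative stand-ins, not the actual implementation or the full skip list.

```typescript
// Illustrative subset only; the real skip list in the PR is longer.
const GENERIC_FOLDERS = new Set(['src', 'lib', 'utils', 'middleware', 'services', 'components']);

// Grandparent weight as stated in the PR description.
const GRANDPARENT_WEIGHT = 0.5;

function capitalizeFolder(s: string): string {
  return s.charAt(0).toUpperCase() + s.slice(1);
}

// Hypothetical helper: returns the winning folder label, or null if none qualifies.
function folderLabel(filePaths: string[]): string | null {
  const scores = new Map<string, number>();

  // Add `weight` to a folder's score unless it is missing or generic.
  const vote = (folder: string | undefined, weight: number): void => {
    if (!folder || GENERIC_FOLDERS.has(folder.toLowerCase())) return;
    scores.set(folder, (scores.get(folder) ?? 0) + weight);
  };

  for (const filePath of filePaths) {
    const parts = filePath.split('/').filter(Boolean);
    // The last segment is the file name; the two before it are the candidates.
    vote(parts[parts.length - 2], 1);
    vote(parts[parts.length - 3], GRANDPARENT_WEIGHT);
  }

  // Pick the highest-scoring folder; ties go to the first one seen.
  let best: string | null = null;
  let bestScore = 0;
  for (const [folder, score] of Array.from(scores.entries())) {
    if (score > bestScore) {
      best = folder;
      bestScore = score;
    }
  }
  return best === null ? null : capitalizeFolder(best);
}
```

Under these assumptions, three files in `billing/invoice/`, `billing/refund/`, and `billing/tax/` give each parent a score of 1 while the shared grandparent `billing` accumulates 1.5 and wins, whereas three files all in `shared/payment/` give `payment` 3 against 1.5 for `shared`.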

@magyargergo

Collaborator

@ShunsukeHayashi can you please look into this? 🙏

@magyargergo

Collaborator

⚠️ Upcoming Prettier formatting — rebase instructions

PR #563 adds Prettier as the code formatter for the repo. When it merges, the bulk format commit will touch ~350 files (style-only: whitespace, quotes, trailing commas). Your branch will likely conflict.

After #563 merges, rebase your branch:

git fetch origin
git checkout <your-branch>
git rebase origin/main

# Conflicts will be formatting-only. Accept your version
# (during a rebase, --theirs is the commit being replayed, i.e. your branch):
git checkout --theirs .
git add .
git rebase --continue

# Then re-format your branch to match the new style:
npx prettier --write .
git add -A
git commit -m "style: apply prettier formatting"
git push --force-with-lease

New setup step: Run npm install at the repo root (not just in gitnexus/) to get prettier + activate the pre-commit hook. The hook auto-formats staged files on every commit going forward.

