fix(file): remove jschardet confidence check for encoding detection #1007

Merged
yamadashy merged 4 commits into main from fix/remove-confidence-check
Dec 14, 2025

Conversation

@yamadashy
Owner

@yamadashy yamadashy commented Dec 14, 2025

Summary

Remove the confidence < 0.2 check that was causing valid UTF-8/ASCII files to be incorrectly skipped.

Changes

  • Remove jschardet confidence check from encoding error detection
  • Only skip files with actual decode errors (U+FFFD replacement characters)
  • Use TextDecoder('utf-8', { fatal: true }) to distinguish real decode errors from legitimate U+FFFD in UTF-8 files
  • Add comprehensive tests for encoding detection scenarios
  • Use os.tmpdir() for test temp directory instead of tests/fixtures
  • Downgrade isbinaryfile from v6.0.0 to v5.0.2 for Node.js 20+ compatibility (v6 requires Node.js >= 24)
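
The decode-error rule described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in src/core/file/fileRead.ts: the name `classifyUtf8` and the `ReadResult` type are hypothetical, the bytes are assumed to have already been detected as UTF-8, and a non-fatal TextDecoder stands in for the project's iconv-lite decode step.

```typescript
type ReadResult =
  | { content: string }
  | { content: null; skippedReason: "encoding-error" };

// Hypothetical helper; assumes the bytes were already detected as UTF-8.
function classifyUtf8(bytes: Uint8Array): ReadResult {
  // Lenient decode: invalid sequences become U+FFFD instead of throwing.
  const lenient = new TextDecoder("utf-8").decode(bytes);
  if (!lenient.includes("\uFFFD")) {
    return { content: lenient };
  }
  try {
    // Strict decode throws on invalid bytes, so reaching the return means
    // the U+FFFD was really present in the file, not produced by decoding.
    const strict = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return { content: strict };
  } catch {
    return { content: null, skippedReason: "encoding-error" };
  }
}
```

Under these assumptions, a file that legitimately contains U+FFFD is returned as content, while a file with invalid byte sequences is skipped as an encoding error.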

Issues Fixed

Why This is Safe

The isbinaryfile library (added in PR #1006) now handles binary detection more accurately with UTF-16/CJK support, making the confidence-based heuristic unnecessary.

Checklist

  • Run npm run test
  • Run npm run lint

Fixes #869

Remove the confidence < 0.2 check that was causing valid UTF-8/ASCII files
to be incorrectly skipped. Files are now only skipped if they contain actual
decode errors (U+FFFD replacement characters).

This fixes issues where:
- Valid Python files were skipped with confidence=0.00 (#869)
- HTML files with Thymeleaf syntax (~{}) were incorrectly detected as binary (#847)

The isbinaryfile library (added in PR #1006) now handles binary detection more
accurately, making the confidence-based heuristic unnecessary.

Fixes #869
@coderabbitai
Contributor

coderabbitai bot commented Dec 14, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Changes simplify encoding detection logic by removing jschardet confidence threshold dependency. The code now skips files only when actual decode errors (U+FFFD replacement characters) are present, not based on confidence values. Comprehensive test suite added covering normal files, low-confidence UTF-8 content, empty files, decode errors, size limits, and binary file detection.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Encoding Detection Simplification (src/core/file/fileRead.ts) | Removed jschardet confidence value from encoding detection logic. Skip condition for decoding errors now depends solely on presence of U+FFFD replacement characters, eliminating confidence-based gating. Updated related log messaging. |
| File Reading Test Suite (tests/core/file/fileRead.test.ts) | Added comprehensive test suite for readRawFile covering normal text files, low-confidence UTF-8 detection, HTML with Thymeleaf syntax, empty files, decode errors, size limits, and binary file detection by extension and content. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Logic change validation: Verify that removing the confidence threshold doesn't negatively impact file filtering or create false negatives for actual encoding issues
  • Test coverage appropriateness: Ensure test cases adequately validate the new behavior, particularly around the removal of confidence-based skip conditions and the reliance on U+FFFD detection for encoding errors
  • Edge case handling: Review interaction between the simplified skip logic and the various skip reasons (binary-content, binary-extension, size-limit, encoding-error)

Possibly related issues

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: removing the jschardet confidence check for encoding detection, which is the primary objective of this pull request. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description check | ✅ Passed | The pull request description is comprehensive and well-structured, covering summary, changes, issues fixed, and rationale. Both required checklist items are marked as completed. |

@gemini-code-assist
Contributor

Summary of Changes

Hello @yamadashy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves the accuracy of file encoding detection by no longer relying on jschardet's confidence score to determine if a file should be skipped. Instead, it focuses solely on the presence of actual decoding errors (U+FFFD characters) to identify problematic files. This change resolves issues where valid Python and HTML files were erroneously marked as having encoding problems, leveraging the isbinaryfile library for more robust binary detection.

Highlights

  • Encoding Detection Logic Refinement: The pull request removes the confidence < 0.2 check from the jschardet encoding detection logic, which was previously causing valid files to be incorrectly skipped.
  • Error-Based Skipping: Files are now only skipped if they contain actual decode errors, indicated by the presence of U+FFFD replacement characters, rather than relying on jschardet's confidence score.
  • Comprehensive Testing: New, comprehensive tests have been added to cover various encoding detection scenarios, including cases where low jschardet confidence should not lead to skipping, and cases with genuine decode errors.
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively resolves issues with file encoding detection by removing the unreliable jschardet confidence score check. The new logic, which only skips files containing actual decoding errors (U+FFFD), is more robust and prevents valid files from being incorrectly ignored. The changes in src/core/file/fileRead.ts are clear and directly address the problem. Furthermore, the addition of a comprehensive test suite in tests/core/file/fileRead.test.ts is a significant improvement, with excellent test cases that cover the specific bugs being fixed, as well as other edge cases. This ensures the change is safe and effective. Overall, this is a high-quality contribution that improves the reliability of file processing.

@codecov
codecov bot commented Dec 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.90%. Comparing base (a022d89) to head (47398ae).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1007      +/-   ##
==========================================
+ Coverage   89.83%   89.90%   +0.07%     
==========================================
  Files         120      120              
  Lines        9235     9241       +6     
  Branches     1683     1685       +2     
==========================================
+ Hits         8296     8308      +12     
+ Misses        939      933       -6     

@cloudflare-workers-and-pages
cloudflare-workers-and-pages bot commented Dec 14, 2025

Deploying repomix with Cloudflare Pages

Latest commit: 47398ae
Status: ✅  Deploy successful!
Preview URL: https://d26ba589.repomix.pages.dev
Branch Preview URL: https://fix-remove-confidence-check.repomix.pages.dev

@claude
Contributor

claude bot commented Dec 14, 2025

Code Review Summary

Overall Assessment: Approve

This PR correctly addresses issues #869 and #847 by removing the unreliable jschardet confidence-based heuristic for encoding detection. The change is well-justified and properly tested.

What's Good

  • Correct root cause fix: The confidence < 0.2 check was causing false positives for valid UTF-8/ASCII files. Removing it and relying solely on U+FFFD detection is the right approach.
  • Good test coverage: Comprehensive tests covering the exact regression scenarios (Python files with low confidence, HTML with Thymeleaf syntax, empty files)
  • Clean code: The implementation is simple and easy to understand
  • Good timing: PR #1006 (feat(core): Replace istextorbinary with is-binary-path and isbinaryfile), now merged, added isbinaryfile with better binary detection, making this confidence check redundant

Minor Observations

Details
  1. Test isolation: Tests create fixtures in tests/fixtures/fileRead/ and clean up afterwards - good practice.

  2. Edge case handling: The test for U+FFFD detection (should skip file with actual decode errors) is well-crafted, using a UTF-8 BOM to force UTF-8 decoding with invalid continuation bytes.

  3. Empty file handling: Good catch on testing empty files since jschardet can return 0 confidence for them.

Premortem Analysis

Potential failure scenarios and mitigations
| Scenario | Risk Level | Mitigation |
| --- | --- | --- |
| Files with legitimate U+FFFD characters | Low | Unlikely in source code; if encountered, user can exclude file via config |
| Mixed-encoding files (valid but containing U+FFFD after decode) | Low | These are genuinely problematic files that should be skipped |
| UTF-16/32 files without BOM | Low | isbinaryfile from PR #1006 handles these; iconv-lite fallback to UTF-8 |
| Performance regression from string search | Very Low | includes('\uFFFD') is O(n) but runs only once per file, negligible impact |

Verification

The logic change is minimal and safe:

  • Removed: confidence < 0.2 check
  • Kept: content.includes('\uFFFD') check (actual decode errors)
  • Updated: Log message to not include confidence value
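
The before/after skip condition summarized in this list can be sketched as two small predicates. This is an illustration only: the function names are hypothetical, and it assumes the two original conditions were combined with a logical OR, which the thread implies (files were skipped on confidence alone) but does not show.

```typescript
const REPLACEMENT_CHAR = "\uFFFD";

// Before (assumed shape): low detector confidence alone was enough to skip.
function shouldSkipBefore(content: string, confidence: number): boolean {
  return confidence < 0.2 || content.includes(REPLACEMENT_CHAR);
}

// After: only evidence of a decode error in the decoded text triggers a skip.
function shouldSkipAfter(content: string): boolean {
  return content.includes(REPLACEMENT_CHAR);
}
```

With a valid ASCII Python file reported at confidence 0.00, the first predicate skips the file and the second keeps it, which matches the behavior change reported for #869.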

Recommendation: Merge after CI passes.

🤖 Generated with Claude Code

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a022d89 and 72b27e4.

📒 Files selected for processing (2)
  • src/core/file/fileRead.ts (1 hunks)
  • tests/core/file/fileRead.test.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/core/file/fileRead.test.ts (1)
src/core/file/fileRead.ts (1)
  • readRawFile (21-63)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: Build and run (windows-latest, 20.x)
  • GitHub Check: Build and run (windows-latest, 22.x)
  • GitHub Check: Build and run (windows-latest, 24.x)
  • GitHub Check: Test (ubuntu-latest, 24.x)
  • GitHub Check: Test (windows-latest, 24.x)
  • GitHub Check: Test (windows-latest, 22.x)
  • GitHub Check: Build and run with Bun (windows-latest, latest)
  • GitHub Check: claude-review
🔇 Additional comments (2)
tests/core/file/fileRead.test.ts (2)

21-128: Nice coverage across size-limit, binary-extension/content, and “low confidence but valid UTF-8” scenarios.


75-90: The test is stable and does not have flakiness concerns. jschardet definitively treats UTF-8 BOMs as detection signals and returns UTF-8 encoding when present, not as a probabilistic heuristic. Combined with iconv-lite's guaranteed behavior of replacing invalid UTF-8 sequences with U+FFFD, the test deterministically produces the expected encoding-error skip reason.
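
As a rough illustration of the kind of input that test relies on (a UTF-8 BOM followed by an invalid continuation byte), the sketch below builds such a buffer by hand. The exact bytes used in tests/core/file/fileRead.test.ts are not shown in this thread, so the byte values and helper names here are assumptions, and TextDecoder stands in for iconv-lite.

```typescript
// UTF-8 BOM (EF BB BF), then "hi", then 0xC3 (a two-byte lead byte)
// followed by 0x28 "(", which is not a valid continuation byte.
const invalidUtf8WithBom = new Uint8Array([
  0xef, 0xbb, 0xbf, 0x68, 0x69, 0xc3, 0x28,
]);

// Lenient decode: the BOM is stripped and the bad sequence becomes U+FFFD.
function decodeLenient(bytes: Uint8Array): string {
  return new TextDecoder("utf-8").decode(bytes);
}

// Strict check: returns false when the bytes are not well-formed UTF-8.
function isStrictlyValidUtf8(bytes: Uint8Array): boolean {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}
```

Because both outcomes follow from the UTF-8 decoding rules rather than from a heuristic, an input like this yields the same result on every run.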

- Use TextDecoder('utf-8', { fatal: true }) to distinguish actual decode
  errors from legitimate U+FFFD characters in UTF-8 files
- Change test temp directory from tests/fixtures to os.tmpdir() to avoid
  clobbering committed fixtures and reduce parallel-run collisions
- Non-UTF-8 files still use iconv.decode() fallback behavior

Addresses CodeRabbit review comments on PR #1007
@claude
Contributor

claude bot commented Dec 14, 2025

Additional Code Review Observations

After reviewing the existing comments and the code changes, here are some incremental observations:

Code Quality Issue - Potential Bug

The new UTF-8 re-decoding logic at lines 55-62 has a subtle issue:

if (encoding.toLowerCase() === 'utf-8') {
  try {
    let utf8 = new TextDecoder('utf-8', { fatal: true }).decode(buffer);
    if (utf8.charCodeAt(0) === 0xfeff) utf8 = utf8.slice(1); // strip UTF-8 BOM
    return { content: utf8 };
  } catch {
    // fall through to skip below
  }
}

Issue: This re-decoding duplicates work already done by iconv.decode() on line 49. The logic path is:

  1. iconv.decode(buffer, encoding, { stripBOM: true }) - decodes and strips BOM
  2. If U+FFFD found AND encoding is UTF-8, re-decode with TextDecoder

This creates an inconsistency: for UTF-8 files with legitimate U+FFFD characters in the source (rare but valid), the function returns different content depending on whether TextDecoder({ fatal: true }) throws.

Suggestion: Consider simplifying to only use TextDecoder({ fatal: true }) for UTF-8 from the start, or document this edge case behavior.
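
One possible shape for that simplification is sketched below, as an assumption rather than a proposal taken from the PR. `decodeDetected` and `fallbackDecode` are hypothetical names; `fallbackDecode` stands in for whatever non-UTF-8 decoder the project uses (iconv-lite in this thread), and `null` means "skip as an encoding error".

```typescript
type FallbackDecode = (bytes: Uint8Array, encoding: string) => string;

// Hypothetical: strict-decode UTF-8 up front instead of re-decoding later.
function decodeDetected(
  bytes: Uint8Array,
  encoding: string,
  fallbackDecode: FallbackDecode,
): string | null {
  if (encoding.toLowerCase() === "utf-8") {
    try {
      // Valid UTF-8 is returned as-is, including any legitimate U+FFFD.
      return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    } catch {
      return null; // invalid byte sequence
    }
  }
  // Other encodings keep the "U+FFFD after decode means error" rule.
  const text = fallbackDecode(bytes, encoding);
  return text.includes("\uFFFD") ? null : text;
}
```

The trade-off in this sketch is that UTF-8 input is decoded only once, at the cost of having two decoders with potentially different BOM and error-handling behavior in the same function.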

Test Coverage Gap

Details

Missing test case: File with legitimate U+FFFD character in source content

The tests cover:

  • Normal files ✓
  • Low confidence UTF-8 ✓
  • Files with actual decode errors ✓

But they don't test: a valid UTF-8 file that intentionally contains U+FFFD (e.g., a file documenting the replacement character itself). With the current logic, such files would be processed via the TextDecoder fallback path and succeed, but this behavior isn't explicitly tested.
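
A minimal sketch of the input such a test would need is shown below. It is not the test later added to the PR; the sample text and the helper name `containsEncodedReplacementChar` are made up for illustration.

```typescript
// A file that documents the replacement character contains it literally.
const sourceText = "The replacement character is \uFFFD.";
const fileBytes = new TextEncoder().encode(sourceText);

// U+FFFD is encoded in UTF-8 as the three bytes EF BF BD.
function containsEncodedReplacementChar(bytes: Uint8Array): boolean {
  for (let i = 0; i + 2 < bytes.length; i++) {
    if (bytes[i] === 0xef && bytes[i + 1] === 0xbf && bytes[i + 2] === 0xbd) {
      return true;
    }
  }
  return false;
}

// Strict decoding succeeds because the bytes are well-formed UTF-8.
const roundTripped = new TextDecoder("utf-8", { fatal: true }).decode(fileBytes);
```

Here the decoded string contains U+FFFD even though no decode error occurred, which is exactly the case a plain `includes('\uFFFD')` check cannot tell apart from corruption.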

Overall Assessment

The core fix (removing confidence check) is correct and well-justified. The issues noted above are minor edge cases that won't affect typical usage. The PR addresses the reported issues #869 and #847 effectively.

Recommendation: Approve with the suggestion to consider the UTF-8 re-decoding path simplification in a future PR.


🤖 Generated with Claude Code

isbinaryfile v6.0.0 requires Node.js >= 24.0.0, but repomix supports
Node.js >= 20.0.0. Downgrade to v5.0.2 (requires Node.js >= 18.0.0)
to maintain compatibility with current LTS versions.

Addresses gemini-code-assist review on PR #1006
@claude
Contributor

claude bot commented Dec 14, 2025

Code Review - PR #1007

Overall Assessment: Approve

This PR correctly addresses issues #869 and #847 by removing the problematic jschardet confidence-based heuristic. I've reviewed the existing comments and have a few incremental observations:

What's Good

  • Root cause fix: The confidence < 0.2 threshold was causing false positives for valid ASCII/UTF-8 files. The new approach correctly relies only on actual decode errors (U+FFFD).
  • Smart UTF-8 handling: The TextDecoder({ fatal: true }) fallback at lines 55-62 distinguishes real decoding errors from legitimate U+FFFD characters in source files.
  • Good test coverage: Tests cover the specific regressions (Python files, Thymeleaf HTML) plus edge cases (empty files, binary detection).
  • Clean temp directory usage: Using os.tmpdir() instead of tests/fixtures is the right approach.
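
The temp-directory approach mentioned in the last point can be sketched as follows. The helper name `withTempDir` and the `fileRead-test-` prefix are hypothetical; only `os.tmpdir()` is taken from the PR description, and the actual test setup may differ (for example, it may use async hooks).

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: run `body` with a fresh, uniquely named directory
// under the OS temp dir, and remove it afterwards even if `body` throws.
function withTempDir<T>(body: (dir: string) => T): T {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "fileRead-test-"));
  try {
    return body(dir);
  } finally {
    fs.rmSync(dir, { recursive: true, force: true });
  }
}
```

Because `mkdtempSync` appends random characters to the prefix, parallel test runs get separate directories, and nothing is written under the repository's committed fixtures.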

isbinaryfile Downgrade

The PR mentions downgrading isbinaryfile from v6.0.0 to v5.0.2 for Node.js compatibility (v6 requires Node.js >= 24). This is a reasonable decision given the project supports Node.js 20+.

Potential considerations
  • v5 vs v6 differences: Confirm no binary detection regressions in v5. Based on the changelog, v6 mainly adds Promise-based API changes and the Node.js version bump.
  • Future upgrade path: Consider adding a comment or tracking issue for upgrading to v6 when Node.js 24 becomes the minimum supported version.

Premortem Analysis

| Scenario | Risk | Mitigation |
| --- | --- | --- |
| Files with legitimate U+FFFD characters | Low | TextDecoder({ fatal: true }) fallback handles this correctly |
| Non-UTF-8 files containing U+FFFD after decode | Medium | These are skipped, which is correct behavior for corrupted/misdetected files |
| Performance from string search | Very Low | includes('\uFFFD') is O(n) but negligible for typical source files |

Codecov Note

The 1 uncovered line is in the catch block at lines 60-61, which handles TextDecoder throwing on invalid UTF-8. This is difficult to cover without crafting a file that:

  1. Contains U+FFFD after iconv.decode()
  2. Is detected as UTF-8 by jschardet
  3. Has invalid bytes that cause TextDecoder({ fatal: true }) to throw

The test at lines 72-87 does cover this path successfully.

Recommendation: Ready to merge after CI passes.


🤖 Generated with Claude Code

Verify that files containing intentional U+FFFD characters in the source
are correctly read (not skipped), testing the TextDecoder validation path.
@claude
Contributor

claude bot commented Dec 14, 2025

Code Review - PR #1007

Overall Assessment: ✅ Approve

This PR correctly addresses issues #869 and #847 by removing the problematic jschardet confidence-based heuristic. The implementation is clean and well-tested.

Summary

| Aspect | Assessment |
| --- | --- |
| Code Quality | Good - clean, focused changes |
| Test Coverage | Excellent - covers regression cases and edge cases |
| Bug Fix | Correct - addresses root cause |
| Risk Level | Low |

What's Good

  • Root cause fix: The confidence < 0.2 threshold was causing false positives for valid ASCII/UTF-8 files (e.g., Python files, Thymeleaf HTML). Removing this heuristic is the right approach.
  • Smart UTF-8 handling: The TextDecoder({ fatal: true }) fallback (lines 55-62) correctly distinguishes real decoding errors from legitimate U+FFFD characters in source files.
  • Comprehensive tests: Tests cover the exact regression scenarios (Python files with low confidence, Thymeleaf HTML, empty files), plus a new test for files containing legitimate U+FFFD characters.
  • Clean temp directory usage: Using os.tmpdir() instead of tests/fixtures is the proper approach for test isolation.

isbinaryfile Downgrade

The downgrade from v6.0.0 to v5.0.2 is reasonable since v6 requires Node.js >= 24 while the project supports Node.js 20+.

Considerations
  • Based on the changelog, v6 primarily adds Promise-based API changes and the Node.js version requirement bump
  • Consider adding a tracking comment or issue for upgrading to v6 when Node.js 24 becomes the minimum supported version

Premortem Analysis

Potential failure scenarios
| Scenario | Risk | Mitigation |
| --- | --- | --- |
| Files with legitimate U+FFFD characters | Low | TextDecoder({ fatal: true }) fallback handles this correctly - tested at lines 72-84 |
| Non-UTF-8 files containing U+FFFD after decode | Low | These are skipped, which is correct behavior for corrupted/misdetected encodings |
| Performance from includes('\uFFFD') | Very Low | O(n) string search but negligible for typical source files |
| Empty files | None | Explicitly tested and passes (jschardet may return 0 confidence for empty files) |

Codecov Note

The 1 uncovered line is in the catch block at lines 60-61, which handles TextDecoder throwing on invalid UTF-8. This is covered by the test at lines 86-101 (should skip file with actual decode errors), so actual branch coverage is good.

Previous Review Status

I note that CodeRabbit initially requested changes but has since approved the PR. The existing review comments from other bots (Gemini, Claude) have been addressed.

Recommendation: Ready to merge after CI passes.


🤖 Generated with Claude Code

@yamadashy yamadashy merged commit bb7649d into main Dec 14, 2025
54 checks passed
@yamadashy yamadashy deleted the fix/remove-confidence-check branch December 14, 2025 10:47
Development

Successfully merging this pull request may close these issues.

Chardet returns 0% confidence for valid UTF-8/ASCII files causing them to be skipped