t3227: fix output-format duplication in paddleocr-helper.sh inline Python script#4685

Merged
alex-solovyev merged 1 commit into main from
bugfix/t3227-paddleocr-helper-critical
Mar 14, 2026

Conversation

@marcusquinn
Owner

@marcusquinn marcusquinn commented Mar 14, 2026

Summary

  • Refactors the inline Python OCR script in paddleocr-helper.sh to normalize both API paths into a common entries list before output formatting
  • Eliminates the triplicated json/tsv/plain output-format logic that existed separately in both the new API (.predict()) and legacy API (.ocr()) branches
  • Dismisses three false-positive date findings from the review of PR #2671 (t1369.5: Verify PaddleOCR installation on Linux and fix 3.4.0 API breaking changes); the review bot's training cutoff predates 2026

Changes

.agents/scripts/paddleocr-helper.sh — inline Python script refactor:

Before: output-format handling (json/tsv/plain) was duplicated inside both if use_new_api: and else: branches — 6 separate output blocks.

After: both API paths populate a common entries = [] list, then a single set of output-format handlers runs once. Net: -24 lines, zero behaviour change.
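The "After" shape can be sketched as a single formatter that runs once over the normalized list. This is an illustration, not the script's actual code: the function name `format_entries`, the JSON layout, and the one-text-per-line plain format are assumptions. Only the TSV header and the `text`/`confidence`/`box` field names come from the review snippets quoted in this PR.

```python
import json


def format_entries(entries, output_format="plain"):
    """Render normalized OCR entries once, whichever API produced them.

    Hypothetical helper: the name, JSON layout, and plain format are assumed.
    """
    if output_format == "json":
        return json.dumps(entries, ensure_ascii=False, indent=2)
    if output_format == "tsv":
        # Header taken from the TSV snippet quoted in the review comments.
        lines = ["text\tconfidence\tx1\ty1\tx2\ty2\tx3\ty3\tx4\ty4"]
        for entry in entries:
            text = entry.get("text", "")
            confidence = entry.get("confidence", 0.0)
            box = entry.get("box")
            if box:
                coords = "\t".join(f"{p[0]:.0f}\t{p[1]:.0f}" for p in box)
            else:
                coords = "\t".join(["0"] * 8)
            lines.append(f"{text}\t{confidence:.4f}\t{coords}")
        return "\n".join(lines)
    # Assumed plain format: one recognized text per line.
    return "\n".join(entry.get("text", "") for entry in entries)


if __name__ == "__main__":
    sample = [{"text": "hi", "confidence": 0.9,
               "box": [[0, 0], [10, 0], [10, 5], [0, 5]]}]
    print(format_entries(sample, "tsv"))
```

Because both API branches only append to `entries`, adding a fourth output format would touch this one function rather than each branch.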

Review Findings Addressed

| Finding | Severity | Action |
| --- | --- | --- |
| Code duplication between new/legacy API output blocks | CRITICAL | Fixed: normalized to common entries list |
| Date 2026-03-01 flagged as future | MEDIUM | Dismissed: false positive, today is 2026-03-14 |
| Date Jan 2026 flagged as future | MEDIUM | Dismissed: false positive, PaddleOCR 3.4.0 released Jan 2026 |

Closes #3227

Summary by CodeRabbit

  • Refactor
    • Enhanced OCR result processing consistency across different API versions.
    • Streamlined result formatting for improved reliability in confidence scoring and coordinate handling.
    • Maintained backward compatibility with legacy systems while optimizing normalization logic.

…ython script

Normalize both new API (.predict()) and legacy API (.ocr()) paths into a
common entries list before handling json/tsv/plain output. This removes the
triplicated output-format logic that existed in both branches, addressing
the code duplication finding from PR #2671 review.

False-positive date findings (2026-03-01, Jan 2026) dismissed with
explanatory comments — bot training cutoff predates 2026.

Closes #3227
@github-actions github-actions bot added the bug label (Auto-created from TODO.md tag) Mar 14, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the maintainability and readability of the paddleocr-helper.sh script by refactoring its inline Python component. The core change involves centralizing the output formatting logic, which was previously duplicated across different API branches. This consolidation reduces code redundancy and simplifies future modifications, while also addressing specific review feedback regarding false-positive date detections.

Highlights

  • Code Refactoring: Refactored the inline Python OCR script within paddleocr-helper.sh to normalize results from both new and legacy API paths into a common entries list before output formatting.
  • Duplication Elimination: Eliminated duplicated JSON, TSV, and plain text output-format logic that previously existed separately in both the new API (.predict()) and legacy API (.ocr()) branches.
  • Review Findings Addressed: Dismissed three false-positive date findings that were flagged during a previous review, clarifying that bot training cutoff predates 2026.
Changelog
  • .agents/scripts/paddleocr-helper.sh
    • Fixed critical code duplication between new/legacy API output blocks by normalizing to a common entries list.
    • Dismissed false-positive date finding '2026-03-01' as the bot training cutoff predates 2026.
    • Dismissed false-positive date finding 'Jan 2026' as PaddleOCR 3.4.0 was released in Jan 2026.

@coderabbitai
Contributor

coderabbitai bot commented Mar 14, 2026

Walkthrough

This refactor consolidates duplicated OCR result processing logic from two API response paths (new and legacy formats) into a unified entries structure, eliminating parallel formatting branches and centralizing output rendering.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| OCR Result Processing Unification (.agents/scripts/paddleocr-helper.sh) | Normalizes both new API (rec_texts/rec_scores/rec_polys) and legacy API response formats into a common entries list containing text, confidence, and box fields. Replaces duplicated JSON/TSV/plain-text formatting logic with a single iteration over unified entries structure. Net reduction of 24 lines eliminates prior branching divergence. |
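The normalization step described in this walkthrough could look roughly like the sketch below. It works on plain Python data and does not call PaddleOCR. The function names are hypothetical, and the input shapes are assumptions based on the field names mentioned here: parallel `rec_texts`/`rec_scores`/`rec_polys` lists per page for the new API, and per-page lists of `[box, (text, score)]` items for the legacy API.

```python
def normalize_new_api(results):
    """Flatten .predict()-style pages (assumed: dict-like, parallel lists)."""
    entries = []
    for page in results or []:
        texts = page.get("rec_texts", [])
        scores = page.get("rec_scores", [])
        polys = page.get("rec_polys", [])
        for i, text in enumerate(texts):
            has_poly = i < len(polys)
            entries.append({
                "text": text,
                "confidence": float(scores[i]) if i < len(scores) else 0.0,
                "box": [[float(x), float(y)] for x, y in polys[i]] if has_poly else None,
            })
    return entries


def normalize_legacy_api(results):
    """Flatten .ocr()-style pages (assumed: lists of [box, (text, score)])."""
    entries = []
    for page in results or []:
        # A page with no detections may be None rather than an empty list.
        for box, (text, score) in page or []:
            entries.append({
                "text": text,
                "confidence": float(score),
                "box": [[float(x), float(y)] for x, y in box],
            })
    return entries


if __name__ == "__main__":
    quad = [[0, 0], [1, 0], [1, 1], [0, 1]]
    new_style = [{"rec_texts": ["a"], "rec_scores": [0.5], "rec_polys": [quad]}]
    legacy_style = [[[quad, ("a", 0.5)]]]
    # Both shapes collapse to the same entries list.
    print(normalize_new_api(new_style) == normalize_legacy_api(legacy_style))
```

Converting coordinates to plain floats keeps the entries JSON-serializable even if the library hands back array types.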

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

refactor

Poem

📋 Two APIs, once divided—now as one,
Duplicate paths converge before output is spun,
Text and confidence dance in entries so clean,
Where boxes align and the logic convenes. ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: fixing output-format duplication in the paddleocr-helper script by normalizing API results. |
| Linked Issues check | ✅ Passed | The PR successfully addresses all objectives from issue #3227: eliminates code duplication by normalizing API results into common entries list, and dismisses false-positive future-date findings with explanations. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to issue #3227 requirements: refactoring inline Python OCR result processing and addressing code review feedback; no extraneous modifications detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

@github-actions
Contributor

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 347 code smells

[INFO] Recent monitoring activity:
Sat Mar 14 07:09:08 UTC 2026: Code review monitoring started
Sat Mar 14 07:09:08 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 347

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 347
  • VULNERABILITIES: 0

Generated on: Sat Mar 14 07:09:11 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively refactors the inline Python script in paddleocr-helper.sh to remove significant code duplication. By normalizing the data from both new and legacy API paths into a common list before formatting the output, the code is now much more maintainable. The refactoring is well-executed. I have one security-related suggestion to further improve the robustness and integrity of the TSV output by sanitizing the text field.

elif output_format == "tsv":
    print("text\tconfidence\tx1\ty1\tx2\ty2\tx3\ty3\tx4\ty4")
    for entry in entries:
        text = entry.get("text", "")

security-medium medium

To ensure the integrity of the TSV output, it's important to sanitize the text field by removing any tab or newline characters. These characters can break the column structure of the TSV format, leading to parsing errors downstream.

Suggested change
-        text = entry.get("text", "")
+        text = entry.get("text", "").replace("\t", " ").replace("\n", " ")
References
  1. Employ a defense-in-depth strategy for handling data. Sanitize data at each point of use, especially before outputting it in a specific format like TSV, to prevent parsing errors and maintain integrity.
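To see why an unsanitized text field breaks the column structure, the sketch below builds one row with and without the replacement and counts the resulting columns. The helper name `sanitize_tsv_field` is hypothetical, and replacing carriage returns as well is an addition beyond the suggested change above.

```python
def sanitize_tsv_field(value):
    """Replace characters that would split a TSV row or column with spaces.

    Hypothetical helper; handling "\r" goes beyond the reviewer's suggestion.
    """
    return str(value).replace("\t", " ").replace("\r", " ").replace("\n", " ")


def tsv_row(text, confidence, coords):
    """Join one row; `coords` is an already formatted list of 8 strings."""
    return "\t".join([sanitize_tsv_field(text), f"{confidence:.4f}", *coords])


if __name__ == "__main__":
    zeros = ["0"] * 8
    raw_text = "Total\t42\nUSD"
    unsafe = "\t".join([raw_text, "0.9876", *zeros])
    safe = tsv_row(raw_text, 0.9876, zeros)
    # The unsafe row spans two lines and gains an extra column.
    print(len(unsafe.split("\n")), len(unsafe.split("\n")[0].split("\t")))
    # The safe row stays on one line with the 10 columns the header declares.
    print(len(safe.split("\n")), len(safe.split("\t")))
```

Replacing with a space is lossy: a consumer cannot tell an original space from a replaced tab. If that matters downstream, the json output format preserves the text exactly.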

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
.agents/scripts/paddleocr-helper.sh (1)

515-523: Stabilize TSV column count for non-quad polygons.

The header is fixed to 4 points, but coordinate emission currently depends on the runtime polygon length. Consider normalizing to exactly 4 points (pad/truncate) before printing to keep downstream TSV parsing deterministic.

Proposed hardening
 elif output_format == "tsv":
     print("text\tconfidence\tx1\ty1\tx2\ty2\tx3\ty3\tx4\ty4")
     for entry in entries:
         text = entry.get("text", "")
         confidence = entry.get("confidence", 0.0)
         box = entry.get("box")
         if box:
-            coords = "\t".join(f"{p[0]:.0f}\t{p[1]:.0f}" for p in box)
+            pts = list(box)[:4]
+            while len(pts) < 4:
+                pts.append((0, 0))
+            coords = "\t".join(f"{p[0]:.0f}\t{p[1]:.0f}" for p in pts)
         else:
             coords = "\t".join(["0"] * 8)
         print(f"{text}\t{confidence:.4f}\t{coords}")

As per coding guidelines, "Automation scripts - focus on: Reliability and robustness".
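The pad/truncate idea in the proposed hardening can also be written as a small standalone helper, sketched below. The names `to_quad` and `tsv_coords` are hypothetical. Padding with (0, 0) follows the reviewer's diff, and it means a padded point cannot be told apart from a real point at the origin.

```python
def to_quad(box, fill=(0, 0)):
    """Return exactly four (x, y) points: truncate extras, pad with `fill`."""
    # `is not None` avoids truth-testing array-like boxes.
    points = [tuple(p[:2]) for p in (box if box is not None else [])][:4]
    points += [fill] * (4 - len(points))
    return points


def tsv_coords(box):
    """Build the eight coordinate columns matching the x1..y4 header."""
    return "\t".join(f"{x:.0f}\t{y:.0f}" for x, y in to_quad(box))


if __name__ == "__main__":
    triangle = [[1, 2], [3, 4], [5, 6]]
    pentagon = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
    for shape in (triangle, pentagon, None):
        # Every shape yields 8 coordinate columns.
        print(len(tsv_coords(shape).split("\t")))
```

Truncation silently drops the fifth and later points, so a consumer that needs the full polygon would be better served by the json output than by the fixed-width TSV.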

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/paddleocr-helper.sh around lines 515 - 523, The TSV output
can emit a variable number of coordinate columns because box may have fewer or
more than 4 points; normalize the polygon to exactly 4 points before building
coords in the loop over entries by padding with (0,0) points or truncating extra
points as needed so coords is always 8 values; update the code around the
entries loop (variables: entries, entry, box, coords) to create a normalized_box
of length 4 and then join its coordinates to produce a stable header-aligned TSV
row.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 50189040-41c8-42bc-9212-1b9f8fd1244b

📥 Commits

Reviewing files that changed from the base of the PR and between 88a1ad6 and 69ed924.

📒 Files selected for processing (1)
  • .agents/scripts/paddleocr-helper.sh

@alex-solovyev alex-solovyev merged commit db3dcab into main Mar 14, 2026
29 checks passed
@alex-solovyev alex-solovyev deleted the bugfix/t3227-paddleocr-helper-critical branch March 14, 2026 07:16
alex-solovyev pushed a commit that referenced this pull request Mar 14, 2026
…ython script (#4685)


Labels

bug Auto-created from TODO.md tag

Development

Successfully merging this pull request may close these issues.

quality-debt: .agents/scripts/paddleocr-helper.sh — PR #2671 review feedback (critical)

2 participants