t3227: fix output-format duplication in paddleocr-helper.sh inline Python script (#4685)
Conversation
…ython script

Normalize both new API (`.predict()`) and legacy API (`.ocr()`) paths into a common `entries` list before handling json/tsv/plain output. This removes the triplicated output-format logic that existed in both branches, addressing the code-duplication finding from the PR #2671 review. False-positive date findings (2026-03-01, Jan 2026) dismissed with explanatory comments (bot training cutoff predates 2026).

Closes #3227
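The consolidation described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual script: the result field names (`rec_texts`, `rec_scores`, `rec_polys`) and the legacy line layout are assumptions about the two PaddleOCR result shapes, and the real helper may differ.

```python
import json


def normalize_new(res):
    """Normalize a new-API (.predict()) style result into common entries."""
    return [
        {"text": t, "confidence": s, "box": b}
        for t, s, b in zip(res["rec_texts"], res["rec_scores"], res["rec_polys"])
    ]


def normalize_legacy(lines):
    """Normalize a legacy-API (.ocr()) style result into the same shape."""
    return [{"text": t, "confidence": s, "box": b} for b, (t, s) in lines]


def render(entries, output_format):
    """Single output-format handler shared by both API paths."""
    if output_format == "json":
        return json.dumps(entries, ensure_ascii=False)
    if output_format == "tsv":
        rows = ["text\tconfidence\tx1\ty1\tx2\ty2\tx3\ty3\tx4\ty4"]
        for e in entries:
            coords = "\t".join(f"{p[0]:.0f}\t{p[1]:.0f}" for p in e["box"])
            rows.append(f"{e['text']}\t{e['confidence']:.4f}\t{coords}")
        return "\n".join(rows)
    return "\n".join(e["text"] for e in entries)  # plain
```

With this shape, each API branch only builds `entries`; the three output formats are rendered in exactly one place instead of six.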
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the maintainability and readability of the inline Python script in paddleocr-helper.sh.

Highlights
Changelog
Activity
Walkthrough

This refactor consolidates duplicated OCR result processing logic from two API response paths (new and legacy formats) into a unified entries structure, eliminating parallel formatting branches and centralizing output rendering.

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 5 passed
🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:

📈 Current Quality Metrics
Generated on: Sat Mar 14 07:09:11 UTC 2026 · Generated by AI DevOps Framework Code Review Monitoring
Code Review
This pull request effectively refactors the inline Python script in paddleocr-helper.sh to remove significant code duplication. By normalizing the data from both new and legacy API paths into a common list before formatting the output, the code is now much more maintainable. The refactoring is well-executed. I have one security-related suggestion to further improve the robustness and integrity of the TSV output by sanitizing the text field.
```python
elif output_format == "tsv":
    print("text\tconfidence\tx1\ty1\tx2\ty2\tx3\ty3\tx4\ty4")
    for entry in entries:
        text = entry.get("text", "")
```
To ensure the integrity of the TSV output, it's important to sanitize the text field by removing any tab or newline characters. These characters can break the column structure of the TSV format, leading to parsing errors downstream.
Suggested change:

```python
text = entry.get("text", "").replace("\t", " ").replace("\n", " ")
```
References
- Employ a defense-in-depth strategy for handling data. Sanitize data at each point of use, especially before outputting it in a specific format like TSV, to prevent parsing errors and maintain integrity.
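The sanitization suggested above can be factored into a small helper. This is a minimal sketch; the function name `sanitize_tsv_field` is hypothetical, and whether to also strip carriage returns is a design choice the review comment does not address.

```python
def sanitize_tsv_field(text: str) -> str:
    """Replace tab and newline characters that would break TSV row/column structure."""
    return text.replace("\t", " ").replace("\n", " ")
```

With this in place, a row like `f"{sanitize_tsv_field(text)}\t{confidence:.4f}"` always contains exactly one tab per column boundary, regardless of what the OCR engine returned in `text`.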
🧹 Nitpick comments (1)
.agents/scripts/paddleocr-helper.sh (1)
515-523: Stabilize TSV column count for non-quad polygons.

The header is fixed to 4 points, but coordinate emission currently depends on the runtime polygon length. Consider normalizing to exactly 4 points (pad/truncate) before printing to keep downstream TSV parsing deterministic.
Proposed hardening
```diff
 elif output_format == "tsv":
     print("text\tconfidence\tx1\ty1\tx2\ty2\tx3\ty3\tx4\ty4")
     for entry in entries:
         text = entry.get("text", "")
         confidence = entry.get("confidence", 0.0)
         box = entry.get("box")
         if box:
-            coords = "\t".join(f"{p[0]:.0f}\t{p[1]:.0f}" for p in box)
+            pts = list(box)[:4]
+            while len(pts) < 4:
+                pts.append((0, 0))
+            coords = "\t".join(f"{p[0]:.0f}\t{p[1]:.0f}" for p in pts)
         else:
             coords = "\t".join(["0"] * 8)
         print(f"{text}\t{confidence:.4f}\t{coords}")
```

As per coding guidelines, "Automation scripts - focus on: Reliability and robustness".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/paddleocr-helper.sh around lines 515 - 523, The TSV output can emit a variable number of coordinate columns because box may have fewer or more than 4 points; normalize the polygon to exactly 4 points before building coords in the loop over entries by padding with (0,0) points or truncating extra points as needed so coords is always 8 values; update the code around the entries loop (variables: entries, entry, box, coords) to create a normalized_box of length 4 and then join its coordinates to produce a stable header-aligned TSV row.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In @.agents/scripts/paddleocr-helper.sh:
- Around line 515-523: The TSV output can emit a variable number of coordinate
columns because box may have fewer or more than 4 points; normalize the polygon
to exactly 4 points before building coords in the loop over entries by padding
with (0,0) points or truncating extra points as needed so coords is always 8
values; update the code around the entries loop (variables: entries, entry, box,
coords) to create a normalized_box of length 4 and then join its coordinates to
produce a stable header-aligned TSV row.
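The pad/truncate normalization the prompt describes can be written as a standalone helper. A minimal sketch, assuming points are `(x, y)` pairs; the name `normalize_box` is hypothetical:

```python
def normalize_box(box, n=4):
    """Pad with (0, 0) or truncate so the polygon has exactly n points."""
    pts = list(box)[:n]
    pts.extend([(0, 0)] * (n - len(pts)))
    return pts
```

Joining the coordinates of `normalize_box(box)` then always yields 8 values, matching the fixed 10-column TSV header whatever the runtime polygon length.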
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 50189040-41c8-42bc-9212-1b9f8fd1244b
📒 Files selected for processing (1)
.agents/scripts/paddleocr-helper.sh



Summary
- Refactors the inline Python script in `paddleocr-helper.sh` to normalize both API paths into a common `entries` list before output formatting
- Removes the duplicated output-format logic from the new API (`.predict()`) and legacy API (`.ocr()`) branches

Changes

- `.agents/scripts/paddleocr-helper.sh` (inline Python script refactor):
  - Before: output-format handling (json/tsv/plain) was duplicated inside both `if use_new_api:` and `else:` branches (6 separate output blocks).
  - After: both API paths populate a common `entries = []` list, then a single set of output-format handlers runs once. Net: -24 lines, zero behaviour change.

Review Findings Addressed

- `2026-03-01` flagged as a future date: false positive, dismissed with an explanatory comment (bot training cutoff predates 2026)
- `Jan 2026` flagged as a future date: false positive, dismissed with an explanatory comment

Closes #3227