feat: add document extraction subagent, workflow, and CLI helper (t073) #667

marcusquinn merged 1 commit into main
Conversation
- Add document-extraction-helper.sh: CLI wrapper for Docling+ExtractThinker+Presidio pipeline with extract, batch, pii-scan, pii-redact, convert, install, status commands
- Add extraction-workflow.md: tool selection decision tree and pipeline orchestration guide
- Update document-extraction.md: add quick start, helper script refs, workflow link
- Update subagent-index.toon: register new workflow subagent and helper script

Chose Docling+ExtractThinker+Presidio as primary stack — matches existing PRD and provides PII redaction that alternatives (DocStrange, Unstract) lack. Helper script uses isolated venv at ~/.aidevops/.agent-workspace/python-env/document-extraction/ to avoid conflicts.
Walkthrough

This PR introduces a new Bash orchestration script for privacy-preserving document extraction workflows. The script exposes commands for parsing, PII detection/redaction, structured extraction, and batch processing across multiple LLM backends. Supporting documentation and subagent index entries describe the workflow architecture and integration points.
Sequence Diagrams

```mermaid
sequenceDiagram
    actor User as User/CLI
    participant Helper as document-extraction-helper.sh
    participant Env as Python Environment
    participant Docling as Docling (Parser)
    participant Presidio as Presidio (PII)
    participant ExtractThinker as ExtractThinker
    participant LLM as LLM Backend<br/>(Ollama/Cloud)
    User->>Helper: extract <file> <schema>
    Helper->>Env: activate venv
    Env-->>Helper: ready
    Helper->>Docling: parse document
    Docling-->>Helper: structured markdown/json
    alt Privacy Mode: Local/Edge
        Helper->>Presidio: scan PII (local)
        Presidio-->>Helper: entities detected
        Helper->>Presidio: anonymize
        Presidio-->>Helper: sanitized content
    else Privacy Mode: None
        Helper->>Helper: skip PII operations
    end
    Helper->>ExtractThinker: extract(schema, content)
    ExtractThinker->>LLM: invoke extraction request
    LLM-->>ExtractThinker: structured output
    ExtractThinker-->>Helper: validated extraction
    Helper->>User: json/markdown/csv/text output
```
```mermaid
sequenceDiagram
    actor User as User/CLI
    participant Helper as document-extraction-helper.sh
    participant Installer as Installation Manager
    participant VEnv as Virtual Environment
    participant Pip as pip/Package Manager
    participant Deps as Dependencies<br/>(docling, presidio,<br/>extractthinker, etc.)
    User->>Helper: install
    Helper->>Helper: parse options (core/pii/llm)
    alt First Run
        Helper->>Helper: ensure_workspace()
        Helper->>VEnv: create ~/.aidevops/document-extraction/
        VEnv-->>Helper: venv created
        Helper->>Helper: activate_venv()
    else Existing
        Helper->>VEnv: activate existing
        VEnv-->>Helper: activated
    end
    Helper->>Installer: install_core()
    Installer->>Pip: install docling, extractthinker
    Pip->>Deps: resolve + fetch
    Deps-->>Pip: installed
    Pip-->>Installer: success
    Helper->>Installer: install_pii()
    Installer->>Pip: install presidio-analyzer/anonymizer
    Pip->>Deps: resolve + fetch
    Deps-->>Pip: installed
    Helper->>Helper: check Python version (3.10+)
    Helper->>Helper: do_status() report
    Helper->>User: installation complete
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity: 📈 Current Quality Metrics

Generated on: Sun Feb 8 21:34:06 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In @.agents/scripts/document-extraction-helper.sh:
- Line 22: The script advertises "csv" output but neither do_convert nor
do_extract support it; update the code so csv is either actually implemented or
removed from the advertised formats. To implement CSV: add "csv" as a valid
branch in do_convert (alongside json/markdown/text) and wire it to a CSV
serializer, then modify do_extract to conditionally write CSV when output_format
== "csv" (serialize extracted records into comma-separated rows with a header
derived from keys or a configured schema). Alternatively, if you choose to
remove CSV, delete "csv" from the header comment and the help text reference and
ensure do_convert rejects only the remaining advertised formats; update
references to output_format in do_extract to only handle the supported formats.
Ensure you modify the functions named do_convert and do_extract referenced in
the diff so behavior and help text stay consistent.
- Around line 205-211: The current ollama model count uses
ollama_models="$(ollama list ... | grep -c "." || echo "0")" which counts the
header row; change the pipeline to exclude the header before counting (e.g.,
pipe through tail -n +2 or awk 'NR>1' to drop the first line) and then compute
ollama_models so the reported number equals actual models; update the echo
message to use the corrected ollama_models variable (symbol: ollama_models and
the ollama list invocation).
- Around line 269-292: The inline Python here interpolates shell variables
directly (e.g., '${input_file}', '${output_file}', '${output_format}'), which
allows injection and breaks on filenames with quotes; fix by exporting the
values to environment variables (e.g., INPUT_FILE, OUTPUT_FILE, OUTPUT_FORMAT)
before invoking the embedded Python and then read them inside Python using
os.environ (use DocumentConverter() and call
converter.convert(os.environ['INPUT_FILE']), choose format based on
os.environ['OUTPUT_FORMAT'], write to os.environ['OUTPUT_FILE']); apply the same
environment-variable pattern to the other inline blocks referenced by
do_pii_scan, do_pii_redact, and do_extract to eliminate direct shell-to-Python
interpolation and handle arbitrary filenames safely.
- Around line 503-508: The catch-all case block ending with "auto|*)" currently
sets schema_code to a blank schema and silently accepts unknown schema names;
update that case to detect unrecognized schema values and emit a clear warning
(e.g., using echo to stderr) indicating the provided schema was unknown and that
auto mode will be used, and keep the schema_code assignment as fallback; also
reconcile the header advertisement and the do_schemas() implementation by either
removing "custom" from the header comment on line 23 or adding explicit handling
for "custom" in the schema dispatch (ensure references to do_schemas(), the
"auto|*" case, and the schema_code variable are updated consistently).
In @.agents/subagent-index.toon:
- Line 83: Update the TOON metadata header to reflect the correct count of
entries: change the token "TOON:scripts[46]{name,purpose}" to use the actual
number of scripts present in the block (the entries spanning lines 84–130),
e.g., "TOON:scripts[47]{name,purpose}", so the bracketed count matches the
actual number of entries.
In @.agents/tools/document/extraction-workflow.md:
- Around line 209-213: The doc suggests switching models but the helper script
currently hardcodes "ollama/llama3.2" and offers no --model flag; either add a
--model CLI option to the helper script (parse a --model argument, default to
"ollama/llama3.2" and propagate it where the helper constructs or calls the
model) or update the markdown at the recommended line to explicitly state that
choosing a different Ollama model requires using the Python ExtractThinker API
directly; reference the hardcoded "ollama/llama3.2" string, the helper script's
CLI parsing logic, and ExtractThinker as the places to change.
🧹 Nitpick comments (2)
.agents/scripts/document-extraction-helper.sh (2)
690-720: Argument parsing works but has a subtle ordering quirk.

In each option handler (e.g., `--schema`), the value is assigned with a default (`schema="${2:-auto}"`) before `shift 2` validates that `$2` exists. This is functionally safe since the `|| return 1` aborts the function on failure, but the pre-assignment is misleading during debugging. A minor nit — not blocking.
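The quirk can be seen in isolation (the function and flag names below are illustrative, not the helper's actual parser):

```shell
# Illustrative reduction of the pattern: the default is assigned before
# anything checks that a value was actually supplied, so a bare --schema
# silently falls back to "auto" instead of failing fast.
demo_parse() {
    local schema
    schema="${2:-auto}"   # assigned even when $2 is absent
    echo "schema=${schema}"
}

demo_parse --schema invoice   # schema=invoice
demo_parse --schema           # schema=auto (missing value masked)
```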
596-601: do_batch returns 0 even when all files fail.

The function always exits with `return 0` regardless of the `failed` count. This means `set -e` and calling scripts won't detect batch failures. Consider returning non-zero when `failed > 0`.

♻️ Proposed fix

```diff
     print_success "Batch complete: ${count} succeeded, ${failed} failed"
     print_info "Output directory: ${WORKSPACE_DIR}"
-    return 0
+    if [[ "$failed" -gt 0 ]] && [[ "$count" -eq 0 ]]; then
+        return 1
+    fi
+    return 0
 }
```
```bash
# help         Show this help
#
# Privacy modes: local (Ollama), edge (Cloudflare), cloud (OpenAI/Anthropic), none
# Output formats: json, markdown, csv, text
```
csv output format is advertised but not implemented.
The header (line 22) and help text (line 657) list csv as a supported output format, but neither do_convert nor do_extract handle it. do_convert will return "Unsupported output format: csv" and do_extract always writes JSON regardless of output_format. Either implement CSV output or remove it from the advertised formats.
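If CSV is implemented, the serialization side could look roughly like this — a hedged sketch only; `json_to_csv` is a hypothetical helper mirroring the script's inline-Python style, not existing code:

```shell
# Hypothetical CSV branch: turn a flat JSON object (one extracted record)
# into a header row derived from its keys plus one data row.
json_to_csv() {
    JSON_DATA="$1" python3 -c '
import csv, json, os, sys
data = json.loads(os.environ["JSON_DATA"])   # flat dict of extracted fields
writer = csv.writer(sys.stdout, lineterminator="\n")
writer.writerow(data.keys())
writer.writerow(data.values())
'
}

json_to_csv '{"invoice_number": "INV-001", "total": 42.5}'
# invoice_number,total
# INV-001,42.5
```

Nested schemas (line items, addresses) would need a flattening policy before this works, which is part of why removing `csv` from the advertised formats may be the cheaper fix.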
```bash
    if command -v ollama &>/dev/null; then
        local ollama_models
        ollama_models="$(ollama list 2>/dev/null | grep -c "." || echo "0")"
        echo "  ollama: installed (${ollama_models} models)"
    else
        echo "  ollama: not installed"
    fi
```
Off-by-one: ollama list includes a header row, inflating the model count.
grep -c "." counts every non-empty line including the header. A user with 2 models sees "3 models."
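The inflation is easy to reproduce with any header-plus-rows listing (the literal below stands in for `ollama list` output):

```shell
# Simulated `ollama list` output: one header line plus two model rows.
listing=$'NAME           SIZE\nllama3.2       2.0GB\nphi-4          9.1GB'

with_header="$(printf '%s\n' "$listing" | grep -c '.')"                   # 3
without_header="$(printf '%s\n' "$listing" | tail -n +2 | grep -c '.')"   # 2

echo "with header: ${with_header}, without: ${without_header}"
# with header: 3, without: 2
```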
🐛 Proposed fix

```diff
-    ollama_models="$(ollama list 2>/dev/null | grep -c "." || echo "0")"
+    ollama_models="$(ollama list 2>/dev/null | tail -n +2 | grep -c "." || echo "0")"
```
```bash
    "${VENV_DIR}/bin/python3" -c "
import sys
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert('${input_file}')

output_format = '${output_format}'
if output_format in ('markdown', 'md'):
    content = result.document.export_to_markdown()
elif output_format == 'json':
    import json
    content = json.dumps(result.document.export_to_dict(), indent=2)
else:
    content = result.document.export_to_markdown()

with open('${output_file}', 'w') as f:
    f.write(content)

print(f'Converted: ${output_file}')
" || {
        print_error "Conversion failed"
        return 1
    }
```
Critical: Shell variable interpolation into inline Python enables injection and breaks on common filenames.
Filenames containing single quotes (e.g., John's Invoice.pdf) will produce a Python SyntaxError. In batch mode (where filenames come from directory iteration), a crafted filename could inject arbitrary Python. This same pattern repeats in do_pii_scan, do_pii_redact, and do_extract.
Pass values via environment variables and read them with os.environ in Python:
🔒 Proposed fix for do_convert (apply same pattern to all inline Python blocks)

```diff
-    "${VENV_DIR}/bin/python3" -c "
-import sys
-from docling.document_converter import DocumentConverter
-
-converter = DocumentConverter()
-result = converter.convert('${input_file}')
-
-output_format = '${output_format}'
-if output_format in ('markdown', 'md'):
-    content = result.document.export_to_markdown()
-elif output_format == 'json':
-    import json
-    content = json.dumps(result.document.export_to_dict(), indent=2)
-else:
-    content = result.document.export_to_markdown()
-
-with open('${output_file}', 'w') as f:
-    f.write(content)
-
-print(f'Converted: ${output_file}')
-" || {
+    INPUT_FILE="$input_file" OUTPUT_FILE="$output_file" OUTPUT_FORMAT="$output_format" \
+    "${VENV_DIR}/bin/python3" -c "
+import os, sys
+from docling.document_converter import DocumentConverter
+
+input_file = os.environ['INPUT_FILE']
+output_file = os.environ['OUTPUT_FILE']
+output_format = os.environ['OUTPUT_FORMAT']
+
+converter = DocumentConverter()
+result = converter.convert(input_file)
+
+if output_format in ('markdown', 'md'):
+    content = result.document.export_to_markdown()
+elif output_format == 'json':
+    import json
+    content = json.dumps(result.document.export_to_dict(), indent=2)
+else:
+    content = result.document.export_to_markdown()
+
+with open(output_file, 'w') as f:
+    f.write(content)
+
+print(f'Converted: {output_file}')
+" || {
```

The same fix is needed in do_pii_scan (lines 312–336), do_pii_redact (lines 364–385), and do_extract (lines 510–543). As per coding guidelines, .agents/scripts/*.sh scripts must focus on reliability and robustness.
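The safety difference is easy to demonstrate in isolation: a filename containing a single quote round-trips intact through the environment, whereas interpolating it into a `'…'` Python literal would raise a SyntaxError (or execute injected code):

```shell
# A filename that breaks single-quoted interpolation but is harmless
# when passed through the environment.
tricky="John's Invoice.pdf"

INPUT_FILE="$tricky" python3 -c '
import os
print("parsing:", os.environ["INPUT_FILE"])
'
# parsing: John's Invoice.pdf
```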
```bash
        auto|*)
            schema_code="
schema_class = None
"
            ;;
    esac
```
Catch-all auto|* silently swallows unrecognized schema names.
A typo like --schema invioce falls through to auto mode with no warning. Consider emitting a warning for unknown schemas so users know they didn't get the structured extraction they expected.
♻️ Proposed fix

```diff
-        auto|*)
+        auto)
             schema_code="
 schema_class = None
 "
             ;;
+        *)
+            print_warning "Unknown schema '${schema}', falling back to auto mode"
+            schema_code="
+schema_class = None
+"
+            ;;
```

Also, line 23 in the header advertises custom as a schema, but do_schemas() (line 604) doesn't list it and there's no dedicated handling. Align the header comment with the actual supported schemas.
```diff
 -->

-<!--TOON:scripts[45]{name,purpose}:
+<!--TOON:scripts[46]{name,purpose}:
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Locate and examine the subagent-index.toon file
fd 'subagent-index.toon' -type f
```

Repository: marcusquinn/aidevops
Length of output: 234

🏁 Script executed:

```shell
#!/bin/bash
# Read the file and count TOON:scripts entries
if [ -f ".agents/subagent-index.toon" ]; then
  echo "=== Extracting TOON:scripts block ==="
  awk '
    /<!--TOON:scripts/ {
      print "Header line: " $0
      found=1
      next
    }
    /^-->/ && found {
      print "Total entries: " count
      exit
    }
    found && NF {
      count++
      if (count <= 5 || count > 42) {
        print "Line " NR ": " $0
      }
    }
  ' ".agents/subagent-index.toon"
fi
```

Repository: marcusquinn/aidevops
Length of output: 1051
Length of output: 1051
TOON scripts count is off-by-one: header states [46] but there are 47 entries.
The TOON metadata at line 83 needs updating to reflect the actual number of script entries in the block (lines 84–130).
```diff
-<!--TOON:scripts[46]{name,purpose}:
+<!--TOON:scripts[47]{name,purpose}:
```
```markdown
### Out of memory during extraction

- Use smaller Ollama models (e.g., `phi-4` instead of `llama3.2:70b`)
- Process documents one at a time instead of batch
- Use `cloud` privacy mode to offload to API
```
Troubleshooting suggests model alternatives the helper script can't actually use.
Line 211 recommends phi-4 instead of llama3.2:70b for OOM, but the helper script hardcodes ollama/llama3.2 with no --model flag. Either add a --model CLI option to the helper or note that model selection requires direct Python/ExtractThinker usage.
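If the --model route is taken, the CLI parsing could look roughly like this — a sketch only; `parse_model_flag` and the flag name are illustrative, and only the `ollama/llama3.2` default comes from the script:

```shell
# Hypothetical --model flag, defaulting to the script's current hardcoded
# model so that existing invocations keep working unchanged.
parse_model_flag() {
    local model="ollama/llama3.2"   # today's hardcoded value
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --model)
                model="${2:?--model requires a value}"
                shift 2
                ;;
            *)
                shift
                ;;
        esac
    done
    echo "$model"
}

parse_model_flag --schema invoice       # ollama/llama3.2
parse_model_flag --model ollama/phi-4   # ollama/phi-4
```

The troubleshooting advice above then holds as written, since the user can pass `--model ollama/phi-4` without touching the Python layer.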



Summary

- document-extraction-helper.sh: CLI wrapper for the Docling+ExtractThinker+Presidio document extraction pipeline
- extraction-workflow.md: subagent with tool selection decision tree and pipeline orchestration guide
- document-extraction.md: updated with quick start, helper script references, and workflow link
- subagent-index.toon: updated to register new workflow subagent and helper script

Details

Helper Script (document-extraction-helper.sh)

Full CLI for document extraction with commands:

- extract <file> - Structured extraction with schema (invoice, receipt, contract, id-document)
- batch <dir> - Batch process all documents in a directory
- pii-scan <file> - Scan text for PII entities (Presidio)
- pii-redact <file> - Redact PII from text files
- convert <file> - Convert documents to markdown/JSON (Docling, no LLM needed)
- install - Install dependencies into isolated venv
- status - Check installed components
- schemas - List available extraction schemas

Privacy modes: local (Ollama), edge (Cloudflare), cloud (OpenAI/Anthropic), none (auto)

Workflow Subagent (extraction-workflow.md)

Decision tree for selecting the right extraction tool.

Quality

- local var="$1", explicit returns

Closes #504
Summary by CodeRabbit

- New Features
- Documentation