feat: add document extraction subagent, workflow, and CLI helper (t073) #667

marcusquinn merged 1 commit into main
Conversation
- Add document-extraction-helper.sh: CLI wrapper for Docling+ExtractThinker+Presidio pipeline with extract, batch, pii-scan, pii-redact, convert, install, status commands
- Add extraction-workflow.md: tool selection decision tree and pipeline orchestration guide
- Update document-extraction.md: add quick start, helper script refs, workflow link
- Update subagent-index.toon: register new workflow subagent and helper script

Chose Docling+ExtractThinker+Presidio as primary stack — matches existing PRD and provides PII redaction that alternatives (DocStrange, Unstract) lack. Helper script uses isolated venv at ~/.aidevops/.agent-workspace/python-env/document-extraction/ to avoid conflicts.
Walkthrough

This PR introduces a new Bash orchestration script for privacy-preserving document extraction workflows. The script exposes commands for parsing, PII detection/redaction, structured extraction, and batch processing across multiple LLM backends. Supporting documentation and subagent index entries describe the workflow architecture and integration points.
Sequence Diagrams

```mermaid
sequenceDiagram
    actor User as User/CLI
    participant Helper as document-extraction-helper.sh
    participant Env as Python Environment
    participant Docling as Docling (Parser)
    participant Presidio as Presidio (PII)
    participant ExtractThinker as ExtractThinker
    participant LLM as LLM Backend<br/>(Ollama/Cloud)
    User->>Helper: extract <file> <schema>
    Helper->>Env: activate venv
    Env-->>Helper: ready
    Helper->>Docling: parse document
    Docling-->>Helper: structured markdown/json
    alt Privacy Mode: Local/Edge
        Helper->>Presidio: scan PII (local)
        Presidio-->>Helper: entities detected
        Helper->>Presidio: anonymize
        Presidio-->>Helper: sanitized content
    else Privacy Mode: None
        Helper->>Helper: skip PII operations
    end
    Helper->>ExtractThinker: extract(schema, content)
    ExtractThinker->>LLM: invoke extraction request
    LLM-->>ExtractThinker: structured output
    ExtractThinker-->>Helper: validated extraction
    Helper->>User: json/markdown/csv/text output
```
```mermaid
sequenceDiagram
    actor User as User/CLI
    participant Helper as document-extraction-helper.sh
    participant Installer as Installation Manager
    participant VEnv as Virtual Environment
    participant Pip as pip/Package Manager
    participant Deps as Dependencies<br/>(docling, presidio,<br/>extractthinker, etc.)
    User->>Helper: install
    Helper->>Helper: parse options (core/pii/llm)
    alt First Run
        Helper->>Helper: ensure_workspace()
        Helper->>VEnv: create ~/.aidevops/document-extraction/
        VEnv-->>Helper: venv created
        Helper->>Helper: activate_venv()
    else Existing
        Helper->>VEnv: activate existing
        VEnv-->>Helper: activated
    end
    Helper->>Installer: install_core()
    Installer->>Pip: install docling, extractthinker
    Pip->>Deps: resolve + fetch
    Deps-->>Pip: installed
    Pip-->>Installer: success
    Helper->>Installer: install_pii()
    Installer->>Pip: install presidio-analyzer/anonymizer
    Pip->>Deps: resolve + fetch
    Deps-->>Pip: installed
    Helper->>Helper: check Python version (3.10+)
    Helper->>Helper: do_status() report
    Helper->>User: installation complete
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity: 📈 Current Quality Metrics

Generated on: Sun Feb 8 21:34:06 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In @.agents/scripts/document-extraction-helper.sh:
- Line 22: The script advertises "csv" output but neither do_convert nor
do_extract support it; update the code so csv is either actually implemented or
removed from the advertised formats. To implement CSV: add "csv" as a valid
branch in do_convert (alongside json/markdown/text) and wire it to a CSV
serializer, then modify do_extract to conditionally write CSV when output_format
== "csv" (serialize extracted records into comma-separated rows with a header
derived from keys or a configured schema). Alternatively, if you choose to
remove CSV, delete "csv" from the header comment and the help text reference and
ensure do_convert rejects only the remaining advertised formats; update
references to output_format in do_extract to only handle the supported formats.
Ensure you modify the functions named do_convert and do_extract referenced in
the diff so behavior and help text stay consistent.
- Around line 205-211: The current ollama model count uses
ollama_models="$(ollama list ... | grep -c "." || echo "0")" which counts the
header row; change the pipeline to exclude the header before counting (e.g.,
pipe through tail -n +2 or awk 'NR>1' to drop the first line) and then compute
ollama_models so the reported number equals actual models; update the echo
message to use the corrected ollama_models variable (symbol: ollama_models and
the ollama list invocation).
- Around line 269-292: The inline Python here interpolates shell variables
directly (e.g., '${input_file}', '${output_file}', '${output_format}'), which
allows injection and breaks on filenames with quotes; fix by exporting the
values to environment variables (e.g., INPUT_FILE, OUTPUT_FILE, OUTPUT_FORMAT)
before invoking the embedded Python and then read them inside Python using
os.environ (use DocumentConverter() and call
converter.convert(os.environ['INPUT_FILE']), choose format based on
os.environ['OUTPUT_FORMAT'], write to os.environ['OUTPUT_FILE']); apply the same
environment-variable pattern to the other inline blocks referenced by
do_pii_scan, do_pii_redact, and do_extract to eliminate direct shell-to-Python
interpolation and handle arbitrary filenames safely.
- Around line 503-508: The catch-all case block ending with "auto|*)" currently
sets schema_code to a blank schema and silently accepts unknown schema names;
update that case to detect unrecognized schema values and emit a clear warning
(e.g., using echo to stderr) indicating the provided schema was unknown and that
auto mode will be used, and keep the schema_code assignment as fallback; also
reconcile the header advertisement and the do_schemas() implementation by either
removing "custom" from the header comment on line 23 or adding explicit handling
for "custom" in the schema dispatch (ensure references to do_schemas(), the
"auto|*" case, and the schema_code variable are updated consistently).
In @.agents/subagent-index.toon:
- Line 83: Update the TOON metadata header to reflect the correct count of
entries: change the token "TOON:scripts[46]{name,purpose}" to use the actual
number of scripts present in the block (the entries spanning lines 84–130),
e.g., "TOON:scripts[47]{name,purpose}", so the bracketed count matches the
actual number of entries.
In @.agents/tools/document/extraction-workflow.md:
- Around line 209-213: The doc suggests switching models but the helper script
currently hardcodes "ollama/llama3.2" and offers no --model flag; either add a
--model CLI option to the helper script (parse a --model argument, default to
"ollama/llama3.2" and propagate it where the helper constructs or calls the
model) or update the markdown at the recommended line to explicitly state that
choosing a different Ollama model requires using the Python ExtractThinker API
directly; reference the hardcoded "ollama/llama3.2" string, the helper script's
CLI parsing logic, and ExtractThinker as the places to change.
🧹 Nitpick comments (2)
.agents/scripts/document-extraction-helper.sh (2)
690-720: Argument parsing works but has a subtle ordering quirk.

In each option handler (e.g., `--schema`), the value is assigned with a default (`schema="${2:-auto}"`) before `shift 2` validates that `$2` exists. This is functionally safe since the `|| return 1` aborts the function on failure, but the pre-assignment is misleading during debugging. A minor nit — not blocking.
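The quirk can be seen in isolation (the function and flag names below are illustrative, not the helper's actual parser):

```shell
# Illustrative reduction of the pattern: the default is assigned before
# anything checks that a value was actually supplied, so a bare --schema
# silently falls back to "auto" instead of failing fast.
demo_parse() {
    local schema
    schema="${2:-auto}"   # assigned even when $2 is absent
    echo "schema=${schema}"
}

demo_parse --schema invoice   # schema=invoice
demo_parse --schema           # schema=auto (missing value masked)
```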
596-601: do_batch returns 0 even when all files fail.

The function always exits with `return 0` regardless of the `failed` count. This means `set -e` and calling scripts won't detect batch failures. Consider returning non-zero when `failed > 0`.

♻️ Proposed fix

```diff
     print_success "Batch complete: ${count} succeeded, ${failed} failed"
     print_info "Output directory: ${WORKSPACE_DIR}"
-    return 0
+    if [[ "$failed" -gt 0 ]] && [[ "$count" -eq 0 ]]; then
+        return 1
+    fi
+    return 0
 }
```
```bash
# help         Show this help
#
# Privacy modes: local (Ollama), edge (Cloudflare), cloud (OpenAI/Anthropic), none
# Output formats: json, markdown, csv, text
```
csv output format is advertised but not implemented.
The header (line 22) and help text (line 657) list csv as a supported output format, but neither do_convert nor do_extract handle it. do_convert will return "Unsupported output format: csv" and do_extract always writes JSON regardless of output_format. Either implement CSV output or remove it from the advertised formats.
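If CSV is implemented, the serialization side could look roughly like this — a hedged sketch only; `json_to_csv` is a hypothetical helper mirroring the script's inline-Python style, not existing code:

```shell
# Hypothetical CSV branch: turn a flat JSON object (one extracted record)
# into a header row derived from its keys plus one data row.
json_to_csv() {
    JSON_DATA="$1" python3 -c '
import csv, json, os, sys
data = json.loads(os.environ["JSON_DATA"])   # flat dict of extracted fields
writer = csv.writer(sys.stdout, lineterminator="\n")
writer.writerow(data.keys())
writer.writerow(data.values())
'
}

json_to_csv '{"invoice_number": "INV-001", "total": 42.5}'
# invoice_number,total
# INV-001,42.5
```

Nested schemas (line items, addresses) would need a flattening policy before this works, which is part of why removing `csv` from the advertised formats may be the cheaper fix.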
```bash
    if command -v ollama &>/dev/null; then
        local ollama_models
        ollama_models="$(ollama list 2>/dev/null | grep -c "." || echo "0")"
        echo "  ollama: installed (${ollama_models} models)"
    else
        echo "  ollama: not installed"
    fi
```
Off-by-one: ollama list includes a header row, inflating the model count.
grep -c "." counts every non-empty line including the header. A user with 2 models sees "3 models."
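The inflation is easy to reproduce with any header-plus-rows listing (the literal below stands in for `ollama list` output):

```shell
# Simulated `ollama list` output: one header line plus two model rows.
listing=$'NAME           SIZE\nllama3.2       2.0GB\nphi-4          9.1GB'

with_header="$(printf '%s\n' "$listing" | grep -c '.')"                   # 3
without_header="$(printf '%s\n' "$listing" | tail -n +2 | grep -c '.')"   # 2

echo "with header: ${with_header}, without: ${without_header}"
# with header: 3, without: 2
```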
🐛 Proposed fix

```diff
-    ollama_models="$(ollama list 2>/dev/null | grep -c "." || echo "0")"
+    ollama_models="$(ollama list 2>/dev/null | tail -n +2 | grep -c "." || echo "0")"
```
```bash
    "${VENV_DIR}/bin/python3" -c "
import sys
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert('${input_file}')

output_format = '${output_format}'
if output_format in ('markdown', 'md'):
    content = result.document.export_to_markdown()
elif output_format == 'json':
    import json
    content = json.dumps(result.document.export_to_dict(), indent=2)
else:
    content = result.document.export_to_markdown()

with open('${output_file}', 'w') as f:
    f.write(content)

print(f'Converted: ${output_file}')
" || {
        print_error "Conversion failed"
        return 1
    }
```
Critical: Shell variable interpolation into inline Python enables injection and breaks on common filenames.
Filenames containing single quotes (e.g., John's Invoice.pdf) will produce a Python SyntaxError. In batch mode (where filenames come from directory iteration), a crafted filename could inject arbitrary Python. This same pattern repeats in do_pii_scan, do_pii_redact, and do_extract.
Pass values via environment variables and read them with os.environ in Python:
🔒 Proposed fix for do_convert (apply same pattern to all inline Python blocks)

```diff
-    "${VENV_DIR}/bin/python3" -c "
-import sys
-from docling.document_converter import DocumentConverter
-
-converter = DocumentConverter()
-result = converter.convert('${input_file}')
-
-output_format = '${output_format}'
-if output_format in ('markdown', 'md'):
-    content = result.document.export_to_markdown()
-elif output_format == 'json':
-    import json
-    content = json.dumps(result.document.export_to_dict(), indent=2)
-else:
-    content = result.document.export_to_markdown()
-
-with open('${output_file}', 'w') as f:
-    f.write(content)
-
-print(f'Converted: ${output_file}')
-" || {
+    INPUT_FILE="$input_file" OUTPUT_FILE="$output_file" OUTPUT_FORMAT="$output_format" \
+    "${VENV_DIR}/bin/python3" -c "
+import os, sys
+from docling.document_converter import DocumentConverter
+
+input_file = os.environ['INPUT_FILE']
+output_file = os.environ['OUTPUT_FILE']
+output_format = os.environ['OUTPUT_FORMAT']
+
+converter = DocumentConverter()
+result = converter.convert(input_file)
+
+if output_format in ('markdown', 'md'):
+    content = result.document.export_to_markdown()
+elif output_format == 'json':
+    import json
+    content = json.dumps(result.document.export_to_dict(), indent=2)
+else:
+    content = result.document.export_to_markdown()
+
+with open(output_file, 'w') as f:
+    f.write(content)
+
+print(f'Converted: {output_file}')
+" || {
```

The same fix is needed in do_pii_scan (lines 312–336), do_pii_redact (lines 364–385), and do_extract (lines 510–543). As per coding guidelines, .agents/scripts/*.sh scripts must focus on reliability and robustness.
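The safety difference is easy to demonstrate in isolation: a filename containing a single quote round-trips intact through the environment, whereas interpolating it into a `'…'` Python literal would raise a SyntaxError (or execute injected code):

```shell
# A filename that breaks single-quoted interpolation but is harmless
# when passed through the environment.
tricky="John's Invoice.pdf"

INPUT_FILE="$tricky" python3 -c '
import os
print("parsing:", os.environ["INPUT_FILE"])
'
# parsing: John's Invoice.pdf
```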
```bash
        auto|*)
            schema_code="
schema_class = None
"
            ;;
    esac
```
Catch-all auto|* silently swallows unrecognized schema names.
A typo like --schema invioce falls through to auto mode with no warning. Consider emitting a warning for unknown schemas so users know they didn't get the structured extraction they expected.
♻️ Proposed fix

```diff
-        auto|*)
+        auto)
             schema_code="
 schema_class = None
 "
             ;;
+        *)
+            print_warning "Unknown schema '${schema}', falling back to auto mode"
+            schema_code="
+schema_class = None
+"
+            ;;
```

Also, line 23 in the header advertises custom as a schema, but do_schemas() (line 604) doesn't list it and there's no dedicated handling. Align the header comment with the actual supported schemas.
```diff
 -->

-<!--TOON:scripts[45]{name,purpose}:
+<!--TOON:scripts[46]{name,purpose}:
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Locate and examine the subagent-index.toon file
fd 'subagent-index.toon' -type f
```

Repository: marcusquinn/aidevops
Length of output: 234

🏁 Script executed:

```shell
#!/bin/bash
# Read the file and count TOON:scripts entries
if [ -f ".agents/subagent-index.toon" ]; then
  echo "=== Extracting TOON:scripts block ==="
  awk '
    /<!--TOON:scripts/ {
      print "Header line: " $0
      found=1
      next
    }
    /^-->/ && found {
      print "Total entries: " count
      exit
    }
    found && NF {
      count++
      if (count <= 5 || count > 42) {
        print "Line " NR ": " $0
      }
    }
  ' ".agents/subagent-index.toon"
fi
```

Repository: marcusquinn/aidevops
Length of output: 1051
Length of output: 1051
TOON scripts count is off-by-one: header states [46] but there are 47 entries.
The TOON metadata at line 83 needs updating to reflect the actual number of script entries in the block (lines 84–130).
```diff
-<!--TOON:scripts[46]{name,purpose}:
+<!--TOON:scripts[47]{name,purpose}:
```
```markdown
### Out of memory during extraction

- Use smaller Ollama models (e.g., `phi-4` instead of `llama3.2:70b`)
- Process documents one at a time instead of batch
- Use `cloud` privacy mode to offload to API
```
Troubleshooting suggests model alternatives the helper script can't actually use.
Line 211 recommends phi-4 instead of llama3.2:70b for OOM, but the helper script hardcodes ollama/llama3.2 with no --model flag. Either add a --model CLI option to the helper or note that model selection requires direct Python/ExtractThinker usage.
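If the --model route is taken, the CLI parsing could look roughly like this — a sketch only; `parse_model_flag` and the flag name are illustrative, and only the `ollama/llama3.2` default comes from the script:

```shell
# Hypothetical --model flag, defaulting to the script's current hardcoded
# model so that existing invocations keep working unchanged.
parse_model_flag() {
    local model="ollama/llama3.2"   # today's hardcoded value
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --model)
                model="${2:?--model requires a value}"
                shift 2
                ;;
            *)
                shift
                ;;
        esac
    done
    echo "$model"
}

parse_model_flag --schema invoice       # ollama/llama3.2
parse_model_flag --model ollama/phi-4   # ollama/phi-4
```

The troubleshooting advice above then holds as written, since the user can pass `--model ollama/phi-4` without touching the Python layer.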



Summary

- document-extraction-helper.sh: CLI wrapper for the Docling+ExtractThinker+Presidio document extraction pipeline
- extraction-workflow.md: subagent with tool selection decision tree and pipeline orchestration guide
- document-extraction.md: updated with quick start, helper script references, and workflow link
- subagent-index.toon: updated to register new workflow subagent and helper script

Details

Helper Script (document-extraction-helper.sh)

Full CLI for document extraction with commands:

- extract <file> - Structured extraction with schema (invoice, receipt, contract, id-document)
- batch <dir> - Batch process all documents in a directory
- pii-scan <file> - Scan text for PII entities (Presidio)
- pii-redact <file> - Redact PII from text files
- convert <file> - Convert documents to markdown/JSON (Docling, no LLM needed)
- install - Install dependencies into isolated venv
- status - Check installed components
- schemas - List available extraction schemas

Privacy modes: local (Ollama), edge (Cloudflare), cloud (OpenAI/Anthropic), none (auto)

Workflow Subagent (extraction-workflow.md)

Decision tree for selecting the right extraction tool.

Quality

- local var="$1", explicit returns

Closes #504
Summary by CodeRabbit

- New Features
- Documentation