
feat: add document extraction subagent, workflow, and CLI helper (t073)#667

Merged
marcusquinn merged 1 commit into main from feature/t073 on Feb 8, 2026

Conversation

@marcusquinn (Owner) commented Feb 8, 2026

Summary

  • Add document-extraction-helper.sh CLI wrapper for the Docling+ExtractThinker+Presidio document extraction pipeline
  • Add extraction-workflow.md subagent with tool selection decision tree and pipeline orchestration guide
  • Update document-extraction.md with quick start, helper script references, and workflow link
  • Update subagent-index.toon to register new workflow subagent and helper script

Details

Helper Script (document-extraction-helper.sh)

Full CLI for document extraction with commands:

  • extract <file> - Structured extraction with schema (invoice, receipt, contract, id-document)
  • batch <dir> - Batch process all documents in a directory
  • pii-scan <file> - Scan text for PII entities (Presidio)
  • pii-redact <file> - Redact PII from text files
  • convert <file> - Convert documents to markdown/JSON (Docling, no LLM needed)
  • install - Install dependencies into isolated venv
  • status - Check installed components
  • schemas - List available extraction schemas

Privacy modes: local (Ollama), edge (Cloudflare), cloud (OpenAI/Anthropic), none (auto)
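A hedged sketch of how these commands are typically dispatched in a Bash helper of this shape (the case arms mirror the command list above; the `main` function name and echo bodies are placeholders, not the real 779-line implementation):

```shell
#!/usr/bin/env bash
# Sketch only: command dispatch skeleton; the real handlers (do_extract, etc.) live in the PR.
main() {
  local cmd="${1:-help}"
  shift || true
  case "$cmd" in
    extract|batch|pii-scan|pii-redact|convert|install|status|schemas)
      # Real script would call the matching do_* handler here
      echo "dispatch: $cmd $*"
      ;;
    help)
      echo "usage: document-extraction-helper.sh <command> [args]"
      ;;
    *)
      echo "unknown command: $cmd" >&2
      return 1
      ;;
  esac
}

main convert sample.pdf
```

Unknown commands return non-zero, so calling scripts can detect misuse.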

Workflow Subagent (extraction-workflow.md)

Decision tree for selecting the right extraction tool:

  • Docling+ExtractThinker+Presidio: Custom pipelines with PII redaction
  • DocStrange: Quick extraction, good OCR, simpler setup
  • Unstract: Enterprise ETL with visual schema builder
  • MinerU: PDF to markdown (layout-aware)
  • Pandoc: Simple format conversion
  • GLM-OCR: Local OCR only
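The decision tree above can be sketched as a small lookup helper (the need keywords are hypothetical labels for illustration; the tool names come from the list):

```shell
# Map a primary requirement to the recommended extraction tool (sketch only)
pick_tool() {
  local need="$1"
  case "$need" in
    pii-redaction)  echo "Docling+ExtractThinker+Presidio" ;;
    quick-ocr)      echo "DocStrange" ;;
    enterprise-etl) echo "Unstract" ;;
    pdf-markdown)   echo "MinerU" ;;
    format-convert) echo "Pandoc" ;;
    local-ocr)      echo "GLM-OCR" ;;
    *)              echo "unknown need: $need" >&2; return 1 ;;
  esac
}

pick_tool pii-redaction
```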

Quality

  • ShellCheck: zero violations
  • Markdownlint: zero violations
  • Follows existing script patterns (shared-constants.sh, local var="$1", explicit returns)

Closes #504

Summary by CodeRabbit

  • New Features

    • Introduced document extraction helper tool with commands for document conversion, structured data extraction, batch processing, and PII scanning/redaction.
    • Support for multiple output formats (JSON, Markdown, CSV, Text) and privacy modes.
  • Documentation

    • Added comprehensive extraction workflow guide and updated installation instructions.

- Add document-extraction-helper.sh: CLI wrapper for Docling+ExtractThinker+Presidio
  pipeline with extract, batch, pii-scan, pii-redact, convert, install, status commands
- Add extraction-workflow.md: tool selection decision tree and pipeline orchestration guide
- Update document-extraction.md: add quick start, helper script refs, workflow link
- Update subagent-index.toon: register new workflow subagent and helper script

Chose Docling+ExtractThinker+Presidio as primary stack — matches existing PRD and provides
PII redaction that alternatives (DocStrange, Unstract) lack. Helper script uses isolated
venv at ~/.aidevops/.agent-workspace/python-env/document-extraction/ to avoid conflicts.
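The isolation pattern described above can be sketched as follows (using a temporary directory rather than the real ~/.aidevops path so the sketch is self-contained; --without-pip keeps it fast, whereas the real installer needs pip):

```shell
# Create an isolated venv and use its interpreter directly -- no `activate` step,
# and nothing leaks into the system Python.
VENV_DIR="$(mktemp -d)/document-extraction"
python3 -m venv --without-pip "$VENV_DIR"

# Installs and imports would go through the venv's own binaries, e.g.:
#   "$VENV_DIR/bin/pip" install docling extract-thinker
"$VENV_DIR/bin/python3" -c 'import sys; print(sys.version_info[0])'
```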
@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

coderabbitai bot commented Feb 8, 2026

Walkthrough

This PR introduces a new Bash orchestration script for privacy-preserving document extraction workflows. The script exposes commands for parsing, PII detection/redaction, structured extraction, and batch processing across multiple LLM backends. Supporting documentation and subagent index entries describe the workflow architecture and integration points.

Changes

Document Extraction Helper Script (.agents/scripts/document-extraction-helper.sh)
New 779-line Bash script providing CLI orchestration for document extraction workflows. Implements workspace/venv setup, dependency installation (core/PII/LLM), status reporting, document conversion via Docling, PII scanning/redaction via Presidio, structured extraction via ExtractThinker with schema support, batch processing, and help utilities. Supports local, edge, cloud, and headless privacy modes with multiple output formats.

Subagent Registry & Helper Index (.agents/subagent-index.toon)
Updated TOON scripts index from [45] to [46]; added 16 new script/helper entries including document-extraction-helper.sh with PII redaction capabilities and related automation helpers (privacy-filter, self-improve, agent-test, cron, runner, matrix-dispatch, supervisor, objective-runner, schema-validator, speech-to-speech, voice, seo-content-analyzer). Updated document tool description to include workflow orchestration.

Extraction Workflow Documentation (.agents/tools/document/extraction-workflow.md)
New 224-line comprehensive guide documenting the document extraction pipeline. Details workflow steps (parsing, PII scanning, anonymization, extraction, output), pipeline architecture, tool comparison matrix, custom schema usage with Python examples, troubleshooting guidance for common failures (Docling parsing, Ollama, PII scanning, memory errors), and subagent command specifications with privacy mode options.

Document Extraction Quick Reference (.agents/tools/document/document-extraction.md)
Enhanced user-facing documentation with expanded Quick Start guide (install, extract, pii-scan, status), detailed installation sections distinguishing helper script (recommended) vs. manual setup, venv path documentation, and updated Related links to extraction workflow and helper script.

Sequence Diagrams

sequenceDiagram
    actor User as User/CLI
    participant Helper as document-extraction-helper.sh
    participant Env as Python Environment
    participant Docling as Docling (Parser)
    participant Presidio as Presidio (PII)
    participant ExtractThinker as ExtractThinker
    participant LLM as LLM Backend<br/>(Ollama/Cloud)

    User->>Helper: extract <file> <schema>
    Helper->>Env: activate venv
    Env-->>Helper: ready
    Helper->>Docling: parse document
    Docling-->>Helper: structured markdown/json
    
    alt Privacy Mode: Local/Edge
        Helper->>Presidio: scan PII (local)
        Presidio-->>Helper: entities detected
        Helper->>Presidio: anonymize
        Presidio-->>Helper: sanitized content
    else Privacy Mode: None
        Helper->>Helper: skip PII operations
    end
    
    Helper->>ExtractThinker: extract(schema, content)
    ExtractThinker->>LLM: invoke extraction request
    LLM-->>ExtractThinker: structured output
    ExtractThinker-->>Helper: validated extraction
    Helper->>User: json/markdown/csv/text output
sequenceDiagram
    actor User as User/CLI
    participant Helper as document-extraction-helper.sh
    participant Installer as Installation Manager
    participant VEnv as Virtual Environment
    participant Pip as pip/Package Manager
    participant Deps as Dependencies<br/>(docling, presidio,<br/>extractthinker, etc.)

    User->>Helper: install
    Helper->>Helper: parse options (core/pii/llm)
    
    alt First Run
        Helper->>Helper: ensure_workspace()
        Helper->>VEnv: create ~/.aidevops/document-extraction/
        VEnv-->>Helper: venv created
        Helper->>Helper: activate_venv()
    else Existing
        Helper->>VEnv: activate existing
        VEnv-->>Helper: activated
    end
    
    Helper->>Installer: install_core()
    Installer->>Pip: install docling, extractthinker
    Pip->>Deps: resolve + fetch
    Deps-->>Pip: installed
    Pip-->>Installer: success
    
    Helper->>Installer: install_pii()
    Installer->>Pip: install presidio-analyzer/anonymizer
    Pip->>Deps: resolve + fetch
    Deps-->>Pip: installed
    
    Helper->>Helper: check Python version (3.10+)
    Helper->>Helper: do_status() report
    Helper->>User: installation complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

📄 Documents dance through parsing flames,
PII shadows vanish by Presidio's names,
Schemas extract with LLM's grace,
Local or cloud—choose your safe space! 🔐

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. Title accurately and concisely describes the primary addition: a document extraction subagent with workflow and CLI helper tool, including the issue reference (t073).
  • Linked Issues Check: ✅ Passed. The PR implements all core coding objectives from issue #504: document-extraction-helper.sh with extract/batch/pii-scan/pii-redact/convert/install/status/schemas commands, extraction-workflow.md describing tool selection and pipeline, documentation updates, and subagent-index registration.
  • Out of Scope Changes Check: ✅ Passed. All changes are in-scope: new helper script, workflow documentation, documentation updates, and subagent index registration directly align with issue #504 objectives and do not introduce unrelated functionality.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 81.25%, which is sufficient; the required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



github-actions bot commented Feb 8, 2026

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 40 code smells

[INFO] Recent monitoring activity:
Sun Feb 8 21:34:03 UTC 2026: Code review monitoring started
Sun Feb 8 21:34:03 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 40

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 40
  • VULNERABILITIES: 0

Generated on: Sun Feb 8 21:34:06 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring


sonarqubecloud bot commented Feb 8, 2026


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Fix all issues with AI agents
In @.agents/scripts/document-extraction-helper.sh:
- Line 22: The script advertises "csv" output but neither do_convert nor
do_extract support it; update the code so csv is either actually implemented or
removed from the advertised formats. To implement CSV: add "csv" as a valid
branch in do_convert (alongside json/markdown/text) and wire it to a CSV
serializer, then modify do_extract to conditionally write CSV when output_format
== "csv" (serialize extracted records into comma-separated rows with a header
derived from keys or a configured schema). Alternatively, if you choose to
remove CSV, delete "csv" from the header comment and the help text reference and
ensure do_convert rejects only the remaining advertised formats; update
references to output_format in do_extract to only handle the supported formats.
Ensure you modify the functions named do_convert and do_extract referenced in
the diff so behavior and help text stay consistent.
- Around line 205-211: The current ollama model count uses
ollama_models="$(ollama list ... | grep -c "." || echo "0")" which counts the
header row; change the pipeline to exclude the header before counting (e.g.,
pipe through tail -n +2 or awk 'NR>1' to drop the first line) and then compute
ollama_models so the reported number equals actual models; update the echo
message to use the corrected ollama_models variable (symbol: ollama_models and
the ollama list invocation).
- Around line 269-292: The inline Python here interpolates shell variables
directly (e.g., '${input_file}', '${output_file}', '${output_format}'), which
allows injection and breaks on filenames with quotes; fix by exporting the
values to environment variables (e.g., INPUT_FILE, OUTPUT_FILE, OUTPUT_FORMAT)
before invoking the embedded Python and then read them inside Python using
os.environ (use DocumentConverter() and call
converter.convert(os.environ['INPUT_FILE']), choose format based on
os.environ['OUTPUT_FORMAT'], write to os.environ['OUTPUT_FILE']); apply the same
environment-variable pattern to the other inline blocks referenced by
do_pii_scan, do_pii_redact, and do_extract to eliminate direct shell-to-Python
interpolation and handle arbitrary filenames safely.
- Around line 503-508: The catch-all case block ending with "auto|*)" currently
sets schema_code to a blank schema and silently accepts unknown schema names;
update that case to detect unrecognized schema values and emit a clear warning
(e.g., using echo to stderr) indicating the provided schema was unknown and that
auto mode will be used, and keep the schema_code assignment as fallback; also
reconcile the header advertisement and the do_schemas() implementation by either
removing "custom" from the header comment on line 23 or adding explicit handling
for "custom" in the schema dispatch (ensure references to do_schemas(), the
"auto|*" case, and the schema_code variable are updated consistently).

In @.agents/subagent-index.toon:
- Line 83: Update the TOON metadata header to reflect the correct count of
entries: change the token "TOON:scripts[46]{name,purpose}" to use the actual
number of scripts present in the block (the entries spanning lines 84–130),
e.g., "TOON:scripts[47]{name,purpose}", so the bracketed count matches the
actual number of entries.

In @.agents/tools/document/extraction-workflow.md:
- Around line 209-213: The doc suggests switching models but the helper script
currently hardcodes "ollama/llama3.2" and offers no --model flag; either add a
--model CLI option to the helper script (parse a --model argument, default to
"ollama/llama3.2" and propagate it where the helper constructs or calls the
model) or update the markdown at the recommended line to explicitly state that
choosing a different Ollama model requires using the Python ExtractThinker API
directly; reference the hardcoded "ollama/llama3.2" string, the helper script's
CLI parsing logic, and ExtractThinker as the places to change.
🧹 Nitpick comments (2)
.agents/scripts/document-extraction-helper.sh (2)

690-720: Argument parsing works but has a subtle ordering quirk.

In each option handler (e.g., --schema), the value is assigned with a default (schema="${2:-auto}") before shift 2 validates that $2 exists. This is functionally safe since the || return 1 aborts the function on failure, but the pre-assignment is misleading during debugging. A minor nit — not blocking.
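The less surprising ordering the nitpick suggests can be sketched like this (hypothetical parser; the real script's option handling may differ): validate that a value for --schema exists before assigning, instead of assigning a default first and relying on `shift 2 || return 1` to abort.

```shell
# Sketch: check-then-assign option parsing (function name is hypothetical)
parse_schema() {
  local schema="auto"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --schema)
        # Fail fast if the value is missing, *then* assign
        if [ "$#" -lt 2 ]; then
          echo "error: --schema requires a value" >&2
          return 1
        fi
        schema="$2"
        shift 2
        ;;
      *)
        shift
        ;;
    esac
  done
  echo "$schema"
}

parse_schema --schema invoice
```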


596-601: do_batch returns 0 even when all files fail.

The function always exits with return 0 regardless of the failed count. This means set -e and calling scripts won't detect batch failures. Consider returning non-zero when failed > 0.

♻️ Proposed fix
     print_success "Batch complete: ${count} succeeded, ${failed} failed"
     print_info "Output directory: ${WORKSPACE_DIR}"
-    return 0
+    if [[ "$failed" -gt 0 ]] && [[ "$count" -eq 0 ]]; then
+        return 1
+    fi
+    return 0
 }

# help Show this help
#
# Privacy modes: local (Ollama), edge (Cloudflare), cloud (OpenAI/Anthropic), none
# Output formats: json, markdown, csv, text

⚠️ Potential issue | 🟡 Minor

csv output format is advertised but not implemented.

The header (line 22) and help text (line 657) list csv as a supported output format, but neither do_convert nor do_extract handle it. do_convert will return "Unsupported output format: csv" and do_extract always writes JSON regardless of output_format. Either implement CSV output or remove it from the advertised formats.
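One way to close the gap, sketched here under the same env-var-passing pattern this review recommends elsewhere (the variable names and the sample record are hypothetical, not taken from the script):

```shell
# Hypothetical CSV branch for do_extract: serialize extracted records to CSV,
# passing data via the environment rather than interpolating into Python source.
extracted_json='[{"invoice_no": "A-1", "total": "99.00"}]'

csv_output="$(EXTRACTED_JSON="$extracted_json" python3 -c '
import csv, io, json, os

records = json.loads(os.environ["EXTRACTED_JSON"])
buf = io.StringIO()
# Header row derived from the first record keys
writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(records)
print(buf.getvalue(), end="")
')"

printf "%s\n" "$csv_output"
```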


Comment on lines +205 to +211
if command -v ollama &>/dev/null; then
local ollama_models
ollama_models="$(ollama list 2>/dev/null | grep -c "." || echo "0")"
echo " ollama: installed (${ollama_models} models)"
else
echo " ollama: not installed"
fi

⚠️ Potential issue | 🟡 Minor

Off-by-one: ollama list includes a header row, inflating the model count.

grep -c "." counts every non-empty line including the header. A user with 2 models sees "3 models."

🐛 Proposed fix
-        ollama_models="$(ollama list 2>/dev/null | grep -c "." || echo "0")"
+        ollama_models="$(ollama list 2>/dev/null | tail -n +2 | grep -c "." || echo "0")"

Comment on lines +269 to +292
"${VENV_DIR}/bin/python3" -c "
import sys
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert('${input_file}')

output_format = '${output_format}'
if output_format in ('markdown', 'md'):
content = result.document.export_to_markdown()
elif output_format == 'json':
import json
content = json.dumps(result.document.export_to_dict(), indent=2)
else:
content = result.document.export_to_markdown()

with open('${output_file}', 'w') as f:
f.write(content)

print(f'Converted: ${output_file}')
" || {
print_error "Conversion failed"
return 1
}

⚠️ Potential issue | 🔴 Critical

Critical: Shell variable interpolation into inline Python enables injection and breaks on common filenames.

Filenames containing single quotes (e.g., John's Invoice.pdf) will produce a Python SyntaxError. In batch mode (where filenames come from directory iteration), a crafted filename could inject arbitrary Python. This same pattern repeats in do_pii_scan, do_pii_redact, and do_extract.

Pass values via environment variables and read them with os.environ in Python:

🔒 Proposed fix for do_convert (apply same pattern to all inline Python blocks)
-    "${VENV_DIR}/bin/python3" -c "
-import sys
-from docling.document_converter import DocumentConverter
-
-converter = DocumentConverter()
-result = converter.convert('${input_file}')
-
-output_format = '${output_format}'
-if output_format in ('markdown', 'md'):
-    content = result.document.export_to_markdown()
-elif output_format == 'json':
-    import json
-    content = json.dumps(result.document.export_to_dict(), indent=2)
-else:
-    content = result.document.export_to_markdown()
-
-with open('${output_file}', 'w') as f:
-    f.write(content)
-
-print(f'Converted: ${output_file}')
-" || {
+    INPUT_FILE="$input_file" OUTPUT_FILE="$output_file" OUTPUT_FORMAT="$output_format" \
+    "${VENV_DIR}/bin/python3" -c "
+import os, sys
+from docling.document_converter import DocumentConverter
+
+input_file = os.environ['INPUT_FILE']
+output_file = os.environ['OUTPUT_FILE']
+output_format = os.environ['OUTPUT_FORMAT']
+
+converter = DocumentConverter()
+result = converter.convert(input_file)
+
+if output_format in ('markdown', 'md'):
+    content = result.document.export_to_markdown()
+elif output_format == 'json':
+    import json
+    content = json.dumps(result.document.export_to_dict(), indent=2)
+else:
+    content = result.document.export_to_markdown()
+
+with open(output_file, 'w') as f:
+    f.write(content)
+
+print(f'Converted: {output_file}')
+" || {

The same fix is needed in do_pii_scan (lines 312–336), do_pii_redact (lines 364–385), and do_extract (lines 510–543). As per coding guidelines, .agents/scripts/*.sh scripts must focus on reliability and robustness.


Comment on lines +503 to +508
auto|*)
schema_code="
schema_class = None
"
;;
esac

⚠️ Potential issue | 🟡 Minor

Catch-all auto|* silently swallows unrecognized schema names.

A typo like --schema invioce falls through to auto mode with no warning. Consider emitting a warning for unknown schemas so users know they didn't get the structured extraction they expected.

♻️ Proposed fix
-        auto|*)
+        auto)
             schema_code="
 schema_class = None
 "
             ;;
+        *)
+            print_warning "Unknown schema '${schema}', falling back to auto mode"
+            schema_code="
+schema_class = None
+"
+            ;;

Also, line 23 in the header advertises custom as a schema, but do_schemas() (line 604) doesn't list it and there's no dedicated handling. Align the header comment with the actual supported schemas.


-->

<!--TOON:scripts[45]{name,purpose}:
<!--TOON:scripts[46]{name,purpose}:

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Locate and examine the subagent-index.toon file
fd 'subagent-index.toon' -type f

Repository: marcusquinn/aidevops

Length of output: 234


🏁 Script executed:

#!/bin/bash
# Read the file and count TOON:scripts entries
if [ -f ".agents/subagent-index.toon" ]; then
  echo "=== Extracting TOON:scripts block ==="
  awk '
    /<!--TOON:scripts/ {
      print "Header line: " $0
      found=1
      next
    }
    /^-->/ && found {
      print "Total entries: " count
      exit
    }
    found && NF {
      count++
      if (count <= 5 || count > 42) {
        print "Line " NR ": " $0
      }
    }
  ' ".agents/subagent-index.toon"
fi

Repository: marcusquinn/aidevops

Length of output: 1051


TOON scripts count is off-by-one: header states [46] but there are 47 entries.

The TOON metadata at line 83 needs updating to reflect the actual number of script entries in the block (lines 84–130).

-<!--TOON:scripts[46]{name,purpose}:
+<!--TOON:scripts[47]{name,purpose}:

Comment on lines +209 to +213
### Out of memory during extraction

- Use smaller Ollama models (e.g., `phi-4` instead of `llama3.2:70b`)
- Process documents one at a time instead of batch
- Use `cloud` privacy mode to offload to API

⚠️ Potential issue | 🟡 Minor

Troubleshooting suggests model alternatives the helper script can't actually use.

Line 211 recommends phi-4 instead of llama3.2:70b for OOM, but the helper script hardcodes ollama/llama3.2 with no --model flag. Either add a --model CLI option to the helper or note that model selection requires direct Python/ExtractThinker usage.
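If the first option is taken, the shape could look like this (a sketch; --model does not exist in the helper yet, and MODEL_OVERRIDE / EXTRACTION_MODEL are hypothetical names):

```shell
# Default to the currently hardcoded model, allow an override, and hand the
# choice to the embedded Python via the environment (avoiding interpolation).
model="${MODEL_OVERRIDE:-ollama/llama3.2}"

EXTRACTION_MODEL="$model" python3 -c 'import os; print("model:", os.environ["EXTRACTION_MODEL"])'
```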



Labels

code-reviews-actioned All review feedback has been actioned

Projects

None yet

Development

Successfully merging this pull request may close these issues.

t073: Document Extraction Subagent & Workflow

1 participant