feat: add Unstract MCP integration for document processing #185

Merged
marcusquinn merged 5 commits into main from feature/add-unstract-mcp-integration
Jan 25, 2026

@marcusquinn
Owner

@marcusquinn marcusquinn commented Jan 24, 2026

Summary

  • Adds Unstract as an on-demand MCP server for LLM-powered structured data extraction from unstructured documents (PDF, images, DOCX, etc.)
  • MCP is disabled globally in OpenCode and loads only when document processing tasks are detected (on-demand pattern)
  • Docker-based execution via unstract/mcp-server container

Changes

| File | Purpose |
| --- | --- |
| .agent/services/document-processing/unstract.md | New subagent with full setup docs, tool reference, and use cases |
| configs/mcp-templates/unstract.json | OpenCode MCP config template (disabled by default) |
| .agent/aidevops/mcp-integrations.md | Added Unstract to integrations list and quick reference |
| .agent/subagent-index.toon | Added document-processing service entry |

Design Decisions

  • On-demand only: Follows the established pattern (like FluentCRM) where MCP is enabled: false globally and activated per-agent via subagent frontmatter
  • Docker-based: Uses official unstract/mcp-server Docker image for isolation and easy updates
  • Credential pattern: UNSTRACT_API_KEY + UNSTRACT_API_BASE_URL stored in ~/.config/aidevops/mcp-env.sh
  • New service category: Created services/document-processing/ for document extraction tools

Testing

  • Preflight linters pass (pre-existing ShellCheck/secretlint issues unrelated)
  • TOON syntax validation passes
  • Config follows established JSON template pattern

Summary by CodeRabbit

  • New Features

    • Added Unstract MCP integration for document processing with support for invoices, statements, claims, KYC, and contract analysis.
  • Documentation

    • Added comprehensive setup guides, configuration templates, and deployment instructions for the new document processing integration.

Add support for Unstract (https://github.com/Zipstack/unstract) as an
on-demand MCP server for LLM-powered structured data extraction from
unstructured documents. Configured for OpenCode with disabled-by-default
pattern, loading only when document processing tasks are detected.

- New subagent: services/document-processing/unstract.md
- New MCP template: configs/mcp-templates/unstract.json (Docker-based)
- Updated mcp-integrations.md with setup instructions
- Updated subagent-index.toon with new entry
@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's capabilities by integrating Unstract, a powerful LLM-driven platform for extracting structured data from diverse document types. This integration allows AI agents to perform complex document analysis tasks, such as invoice processing or KYC, by dynamically enabling the Unstract service only when needed, ensuring efficient resource utilization and expanding the range of automated workflows.

Highlights

  • Unstract MCP Integration: Introduces Unstract as a new MCP (Model Context Protocol) server for LLM-powered structured data extraction from various unstructured documents (PDFs, images, DOCX, etc.).
  • On-Demand Loading Pattern: The Unstract MCP is configured to be disabled globally and loads only when document processing tasks are detected, following an on-demand pattern similar to existing integrations like FluentCRM.
  • Docker-Based Execution: Utilizes the official unstract/mcp-server Docker container for isolated and easily updateable execution of the Unstract service.
  • New Service Category: A new services/document-processing/ category has been created to house tools related to document extraction, with Unstract being the first entry.

@coderabbitai
Contributor

coderabbitai bot commented Jan 24, 2026

Walkthrough

This pull request introduces the Unstract MCP (Model Context Protocol) integration to the aidevops framework, adding comprehensive documentation, configuration templates, and subagent registry entries to enable LLM-powered document processing capabilities across the platform.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Unstract MCP Integration Documentation: .agent/aidevops/mcp-integrations.md, .agent/services/document-processing/unstract.md | Introduces Unstract MCP as a new document processing integration with setup instructions, API key/base URL environment variables, deployment configurations (Docker), and usage examples for OpenCode and Claude Desktop environments. Includes supported file types, workflow details, and integration use cases. |
| Subagent Registry Update: .agent/subagent-index.toon | Updates the TOON subagents index count from [31] to [32] and registers the new document-processing subagent with key file reference to unstract. |
| MCP Configuration Template: configs/mcp-templates/unstract.json | Adds a local MCP configuration template for running the Unstract service via Docker, disabled by default, with environment variable sourcing and volume mount specifications. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • #140: Adds MCP on-demand/indexing tooling, docs, and generator hooks that complement the MCP integration surface expanded by this PR.
  • #64: Adds a different MCP integration (FluentCRM) with parallel updates to MCP integrations documentation and mcp-templates configuration area.

Poem

📄✨ Unstract arrives, with documents to parse,
LLM-powered extraction, intelligent and fast,
Templates configured, subagents aligned,
Processing flows that beautifully intertwine! 🚀

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding Unstract MCP integration for document processing, which directly aligns with all file additions and modifications. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request successfully adds a new integration for Unstract MCP, enabling LLM-powered document processing. The changes include new documentation, configuration templates, and updates to the subagent index, all following the established on-demand loading pattern. My review focuses on improving the clarity and consistency of the documentation to ensure a smooth setup experience for users. I've identified a couple of minor inconsistencies in environment variable naming and placeholder values within the configuration examples that should be addressed.

- **Purpose**: Extract structured data from unstructured documents (PDFs, images, DOCX, etc.)
- **MCP Server**: `unstract/mcp-server` (Docker) or `@unstract/mcp-server` (npx)
- **Tool**: `unstract_tool` - submits files to Unstract API, polls for completion, returns structured JSON
- **Credentials**: `UNSTRACT_API_KEY` + `API_BASE_URL` in `~/.config/aidevops/mcp-env.sh`

medium

There's an inconsistency in the environment variable name for the Unstract API base URL. This line mentions API_BASE_URL, but the configuration example in lines 78-79 and the PR description specify UNSTRACT_API_BASE_URL. To avoid confusion for users, it's best to consistently use UNSTRACT_API_BASE_URL in the user-facing documentation. The OpenCode configuration correctly handles mapping this to API_BASE_URL for the container, but the documentation should be consistent.

Suggested change
-- **Credentials**: `UNSTRACT_API_KEY` + `API_BASE_URL` in `~/.config/aidevops/mcp-env.sh`
+- **Credentials**: `UNSTRACT_API_KEY` + `UNSTRACT_API_BASE_URL` in `~/.config/aidevops/mcp-env.sh`
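
The host-to-container remapping this comment refers to can be sketched in shell. This is a minimal illustration rather than the project's actual template: `build_env_args` is a hypothetical helper and the URL is a placeholder. It relies on two standard `docker run` forms: `-e NAME` forwards the host's value under the same name, while `-e NAME=value` sets the container variable explicitly.

```shell
# Hypothetical helper (not part of the PR): emit the docker -e arguments,
# one per line, that remap the host variable to the container's name.
build_env_args() {
  # "-e UNSTRACT_API_KEY" forwards the host value under the same name.
  # "-e API_BASE_URL=..." sets the container variable from the host's
  # UNSTRACT_API_BASE_URL, which is the remapping under discussion.
  printf '%s\n' \
    -e UNSTRACT_API_KEY \
    -e "API_BASE_URL=${UNSTRACT_API_BASE_URL:?set UNSTRACT_API_BASE_URL first}"
}

# Placeholder value for illustration only.
UNSTRACT_API_BASE_URL="https://example.invalid/deployment/api/demo/"
build_env_args
```

Because the container only ever sees `API_BASE_URL`, the host-side name is purely a documentation and namespacing choice, which is why the reviewers disagree about it without either form being broken.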

"unstract"
],
"env": {
"UNSTRACT_API_KEY": "",

medium

The placeholder for UNSTRACT_API_KEY is an empty string, which could be confusing as it's not immediately obvious that it requires a value. Using a more descriptive placeholder like your_api_key_here would make it clearer for the user that they need to replace it with their actual key. This aligns with the placeholder style used elsewhere in the documentation (e.g., line 78).

Suggested change
-"UNSTRACT_API_KEY": "",
+"UNSTRACT_API_KEY": "your_api_key_here",

@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 447 code smells

[INFO] Recent monitoring activity:
Sat Jan 24 23:00:47 UTC 2026: Code review monitoring started
Sat Jan 24 23:00:47 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 447
Sat Jan 24 23:00:47 UTC 2026: Qlty - 0 issues found, auto-formatting applied
Sat Jan 24 23:00:49 UTC 2026: Codacy analysis completed with auto-fixes

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 447
  • VULNERABILITIES: 0

Generated on: Sat Jan 24 23:02:11 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

- Document both cloud and self-hosted (local Docker Compose) deployment options
- Pass DISABLE_TELEMETRY=true in MCP Docker command
- Add analytics/telemetry section confirming MCP server has no telemetry
- Note REACT_APP_ENABLE_POSTHOG=false for self-hosted frontend
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.agent/aidevops/mcp-integrations.md:
- Around line 212-261: Update .agent/aidevops/mcp-integrations.md to explicitly
document the host-to-container env var remapping and list the UNSTRACT env vars:
add a short note under the OpenCode example stating that UNSTRACT_API_BASE_URL
on the host is passed into the container as API_BASE_URL (reference the OpenCode
JSON block and the API_BASE_URL/UNSTRACT_API_BASE_URL symbols), and add the two
exports (UNSTRACT_API_KEY and UNSTRACT_API_BASE_URL) to the Environment
Variables section (the section that lists other MCP env exports and references
services/document-processing/unstract.md) so the variables are consistent with
the Claude Desktop mcpServers example and the unstract_tool usage.

In @.agent/services/document-processing/unstract.md:
- Around line 23-27: Update the Setup section to use the official environment
variable name API_BASE_URL (instead of UNSTRACT_API_BASE_URL) so it matches the
Quick Reference and Unstract MCP spec; adjust any example export/usage lines
that reference UNSTRACT_API_BASE_URL to API_BASE_URL and ensure UNSTRACT_API_KEY
remains unchanged, and verify references to unstract_tool and the MCP Server
description still reference the corrected variable name.
🧹 Nitpick comments (2)
.agent/services/document-processing/unstract.md (1)

74-80: Add a file-permissions hardening step for API keys.
Given the plaintext API key storage, a simple chmod 600 keeps credentials from leaking via permissive perms.

🔐 Proposed doc addition
 export UNSTRACT_API_KEY="your_api_key_here"
 export UNSTRACT_API_BASE_URL="https://us-central.unstract.com/deployment/api/your-deployment-id/"
+chmod 600 ~/.config/aidevops/mcp-env.sh
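
A slightly more defensive variant of this hardening step is sketched below. It is illustrative only: the `write_env_file` helper and the temporary path are hypothetical, and the key and URL are placeholders. Setting `umask 077` before the file is created means it is never readable by other users, even briefly, and the explicit `chmod 600` covers a file that already existed with looser permissions.

```shell
# Hypothetical helper: write credentials to "$1" with owner-only permissions.
write_env_file() {
  env_file="$1"
  (
    # Restrict permissions for anything created inside this subshell.
    umask 077
    mkdir -p "$(dirname "$env_file")"
    {
      printf 'export UNSTRACT_API_KEY="%s"\n' "$2"
      printf 'export UNSTRACT_API_BASE_URL="%s"\n' "$3"
    } > "$env_file"
  )
  # Also fix up a pre-existing file that may have looser permissions.
  chmod 600 "$env_file"
}

# Demo against a throwaway directory rather than ~/.config/aidevops.
demo_dir="$(mktemp -d)"
write_env_file "$demo_dir/mcp-env.sh" "your_api_key_here" "https://example.invalid/api/"
ls -l "$demo_dir/mcp-env.sh"
```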
configs/mcp-templates/unstract.json (1)

4-8: Pin the Docker image tag for supply-chain reproducibility.
Relying on the implicit latest tag introduces unpredictability into automation. Unstract does not publish semver versions in public docs, so pin either by digest (OCI best practice) or parameterize with an environment override.

🔧 Proposed refactor (environment-parameterized fallback)
-      "source ~/.config/aidevops/mcp-env.sh && docker run -i --rm -v /tmp:/tmp -e UNSTRACT_API_KEY -e API_BASE_URL=\"$UNSTRACT_API_BASE_URL\" unstract/mcp-server unstract"
+      "source ~/.config/aidevops/mcp-env.sh && docker run -i --rm -v /tmp:/tmp -e UNSTRACT_API_KEY -e API_BASE_URL=\"$UNSTRACT_API_BASE_URL\" unstract/mcp-server:${UNSTRACT_MCP_IMAGE_TAG:-latest} unstract"

Alternatively, pull a known-good digest locally and reference it by SHA256 for guaranteed reproducibility across deployments.
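
The two pinning options in this comment can be compared in a short sketch. The helper name `image_ref` and the `UNSTRACT_MCP_IMAGE_DIGEST` variable are hypothetical; `UNSTRACT_MCP_IMAGE_TAG` comes from the proposed refactor. The tag value below is a placeholder, not a published version.

```shell
# Hypothetical helper: resolve the image reference to pass to `docker run`.
image_ref() {
  repo="unstract/mcp-server"
  if [ -n "${UNSTRACT_MCP_IMAGE_DIGEST:-}" ]; then
    # Digest form (repo@sha256:...) always resolves to the same image.
    printf '%s@%s\n' "$repo" "$UNSTRACT_MCP_IMAGE_DIGEST"
  else
    # Tag form; falls back to "latest" when no override is set.
    printf '%s:%s\n' "$repo" "${UNSTRACT_MCP_IMAGE_TAG:-latest}"
  fi
}

unset UNSTRACT_MCP_IMAGE_TAG UNSTRACT_MCP_IMAGE_DIGEST
image_ref    # unstract/mcp-server:latest

UNSTRACT_MCP_IMAGE_TAG="v1-example"   # placeholder tag for illustration
image_ref    # unstract/mcp-server:v1-example
```

A tag override keeps the default behaviour unchanged for existing users, while a digest gives the stronger guarantee because a tag can be re-pointed at a different image. The final commit in this PR lists the variable as `UNSTRACT_IMAGE_TAG`.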

@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 447 code smells

[INFO] Recent monitoring activity:
Sat Jan 24 23:07:10 UTC 2026: Code review monitoring started
Sat Jan 24 23:07:10 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 447
Sat Jan 24 23:07:10 UTC 2026: Qlty - 0 issues found, auto-formatting applied
Sat Jan 24 23:07:12 UTC 2026: Codacy analysis completed with auto-fixes

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 447
  • VULNERABILITIES: 0

Generated on: Sat Jan 24 23:08:32 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

- New helper script: install/start/stop/status/logs/uninstall/configure-llm
- Installs to ~/.aidevops/unstract/ with analytics disabled by default
- configure-llm command detects existing API keys and shows how to add as adapters
- Supports Ollama for fully local/offline operation (no cloud keys needed)
- Added to setup-mcp-integrations.sh as 'unstract' target
- Updated MCP template to default to local self-hosted endpoint
- Updated subagent docs with adapter key table and management commands
@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 447 code smells

[INFO] Recent monitoring activity:
Sat Jan 24 23:12:03 UTC 2026: Code review monitoring started
Sat Jan 24 23:12:03 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 447
Sat Jan 24 23:12:03 UTC 2026: Qlty - 0 issues found, auto-formatting applied
Sat Jan 24 23:12:05 UTC 2026: Codacy analysis completed with auto-fixes
Sat Jan 24 23:12:06 UTC 2026: Applied 1 automatic fixes

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 447
  • VULNERABILITIES: 0

Generated on: Sat Jan 24 23:13:25 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

- Use API_BASE_URL (not UNSTRACT_API_BASE_URL) to match Unstract MCP spec
- Add chmod 600 for mcp-env.sh in setup examples
- Add UNSTRACT_IMAGE_TAG env var for Docker image pinning
- Add DISABLE_TELEMETRY=true to Claude Desktop example
- Update Environment Variables section with Unstract entries
- Fix helper script to use correct env var names
@sonarqubecloud

@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 447 code smells

[INFO] Recent monitoring activity:
Sun Jan 25 00:22:28 UTC 2026: Code review monitoring started
Sun Jan 25 00:22:28 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 447
Sun Jan 25 00:22:28 UTC 2026: Qlty - 0 issues found, auto-formatting applied
Sun Jan 25 00:22:30 UTC 2026: Codacy analysis completed with auto-fixes
Sun Jan 25 00:22:31 UTC 2026: Applied 1 automatic fixes

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 447
  • VULNERABILITIES: 0

Generated on: Sun Jan 25 00:23:51 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@marcusquinn
Owner Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Jan 25, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@marcusquinn
Owner Author

@coderabbitai resolve

@coderabbitai
Contributor

coderabbitai bot commented Jan 25, 2026

✅ Actions performed

Comments resolved and changes approved.

@marcusquinn marcusquinn merged commit b7adf90 into main Jan 25, 2026
9 checks passed