
Commit f14e7c7

Merge branch 'main' into UN-2717-Sigterm-handling-in-tool-sidecar
2 parents d2ae867 + a9b090a commit f14e7c7

84 files changed (+8035, -3630 lines)


README.md

Lines changed: 29 additions & 20 deletions
```diff
@@ -3,9 +3,8 @@
 
 # Unstract
 
-## No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents
+## The Data Layer for your Agentic Workflows—Automate Document-based workflows with close to 100% accuracy!
 
-##
 
 ![Python Version from PEP 621 TOML](https://img.shields.io/python/required-version-toml?tomlFilePath=https%3A%2F%2Fraw.githubusercontent.com%2FZipstack%2Funstract%2Frefs%2Fheads%2Fmain%2Fpyproject.toml)
 [![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)
@@ -23,26 +22,44 @@
 
 ## 🤖 Prompt Studio
 
-Prompt Studio's primary reason for existence is so you can develop the necessary prompts for document data extraction super efficiently. It is a purpose-built environment that makes this not just easy for you—but, lot of fun! The document sample, its variants, the prompts you're developing, outputs from different LLMs, the schema you're developing, costing details of the extraction and various tools that let you measure the effectiveness of your prompts are just a click away and easily accessible. Prompt Studio is designed for effective and high speed development and iteration of prompts for document data extraction. Welcome to IDP 2.0!
-
+Prompt Studio is a purpose-built environment that supercharges your schema definition efforts. Compare outputs from different LLMs side-by-side, keep tab on costs while you develop generic prompts that work across wide-ranging document variations. And when you're ready, launch extraction APIs with a single click.
 
 ![img Prompt Studio](docs/assets/prompt_studio.png)
 
-## 🧘‍♀️ Three step nirvana with Workflow Studio
+## 🔌 Integrations that suit your environment
+
+Once you've used Prompt Studio to define your schema, Unstract makes it easy to integrate into your existing workflows. Simply choose the integration type that best fits your environment:
+
+| Integration Type | Description | Best For | Documentation |
+|------------------|-------------|----------|---------------|
+| 🖥️ **MCP Servers** | Run Unstract as an MCP Server to provide structured data extraction to Agents or LLMs in your ecosystem. | Developers building **Agentic/LLM apps/tools** that speak MCP. | [Unstract MCP Server Docs](https://docs.unstract.com/unstract/unstract_platform/mcp/unstract_platform_mcp_server/) |
+| 🌐 **API Deployments** | Turn any document into JSON with an API call. Deploy any Prompt Studio project as a REST API endpoint with a single click. | Teams needing **programmatic access** in apps, services, or custom tooling. | [API Deployment Docs](https://docs.unstract.com/unstract/unstract_platform/api_deployment/unstract_api_deployment_intro/) |
+| ⚙️ **ETL Pipelines** | Embed Unstract directly into your ETL jobs to transform unstructured data before loading it into your warehouse / database. | **Engineering and Data engineering teams** that need to batch process documents into clean JSON. | [ETL Pipelines Docs](https://docs.unstract.com/unstract/unstract_platform/etl_pipeline/unstract_etl_pipeline_intro/) |
+| 🧩 **n8n Nodes** | Use Unstract as ready-made nodes in n8n workflows for drag-and-drop automation. | **Low-code users** and **ops teams** automating workflows. | [Unstract n8n Nodes Docs](https://docs.unstract.com/unstract/unstract_platform/api_deployment/unstract_api_deployment_n8n_custom_node/) |
+
+## ☁️ Getting Started (Cloud / Enterprise)
 
-Automate critical business processes that involve complex documents with a human in the loop. Go beyond RPA with the power of Large Language Models.
+The easy-peasy way to try Unstract is to [sign up for a **14-day free trial**](https://unstract.com/start-for-free/). Give Unstract a spin now!
 
-🌟 **Step 1**: Add documents to no-code Prompt Studio and do prompt engineering to extract required fields <br>
-🌟 **Step 2**: Configure Prompt Studio project as API deployment or configure input source and output destination for ETL Pipeline<br>
-🌟 **Step 3**: Deploy Workflows as unstructured data APIs or unstructured data ETL Pipelines!
+Unstract Cloud also comes with some really awesome features that give serious accuracy boosts to agentic/LLM-powered document-centric workflows in the enterprise.
 
-![img Using Unstract](docs/assets/Using_Unstract.png)
+| Feature | Description | Documentation |
+|---------|-------------|---------------|
+| 🧪 **LLMChallenge** | Uses two Large Language Models to ensure trustworthy output. You either get the right response or no response at all. | [Docs](https://docs.unstract.com/unstract/unstract_platform/features/llm_challenge/llm_challenge_intro/) |
+|**SinglePass Extraction** | Reduces LLM token usage by up to **8x**, dramatically cutting costs. | [Docs](https://docs.unstract.com/unstract/editions/cloud_edition/#singlepass-extraction) |
+| 📉 **SummarizedExtraction** | Reduces LLM token usage by up to **6x**, saving costs while keeping accuracy. | [Docs](https://docs.unstract.com/unstract/unstract_platform/features/summarized_extraction/summarized_extraction_intro/) |
+| 👀 **Human-In-The-Loop** | Side-by-side comparison of extracted value and source document, with highlighting for human review and tweaking. | [Docs](https://docs.unstract.com/unstract/unstract_platform/human_quality_review/human_quality_review_intro/) |
+| 🔐 **SSO Support** | Enterprise-ready authentication options for seamless onboarding and off-boarding. | [Docs](https://docs.unstract.com/unstract/editions/cloud_edition/#enterprise-features) |
+
+## ⏩ Quick Start Guide
+
+Unstract comes well documented. You can get introduced to the [basics of Unstract](https://docs.unstract.com/unstract/), and [learn how to connect](https://docs.unstract.com/unstract/unstract_platform/setup_accounts/whats_needed) various systems like LLMs, Vector Databases, Embedding Models and Text Extractors to it. The easiest way to wet your feet is to go through our [Quick Start Guide](https://docs.unstract.com/unstract/unstract_platform/quick_start) where you actually get to do some prompt engineering in Prompt Studio and launch an API to structure varied credit card statements!
 
-## 🚀 Getting started
+## 🚀 Getting started (self-hosted)
 
 ### System Requirements
 
-- 8GB RAM (recommended)
+- 8GB RAM (minimum)
 
 ### Prerequisites
 
@@ -57,7 +74,6 @@ Next, either download a release or clone this repo and do the following:
 ✅ Now visit [http://frontend.unstract.localhost](http://frontend.unstract.localhost) in your browser <br>
 ✅ Use username and password `unstract` to login
 
-
 That's all there is to it!
 
 Follow [these steps](backend/README.md#authentication) to change the default username and password.
@@ -93,10 +109,6 @@ Unstract supports a wide range of file formats for document processing:
 | | TIFF | Tagged Image File Format |
 | | WEBP | Web Picture Format |
 
-## ⏩ Quick Start Guide
-
-Unstract comes well documented. You can get introduced to the [basics of Unstract](https://docs.unstract.com/unstract/), and [learn how to connect](https://docs.unstract.com/unstract/unstract_platform/setup_accounts/whats_needed) various systems like LLMs, Vector Databases, Embedding Models and Text Extractors to it. The easiest way to wet your feet is to go through our [Quick Start Guide](https://docs.unstract.com/unstract/unstract_platform/quick_start) where you actually get to do some prompt engineering in Prompt Studio and launch an API to structure varied credit card statements!
-
 ## 🤝 Ecosystem support
 
 ### LLM Providers
@@ -113,7 +125,6 @@ Unstract comes well documented. You can get introduced to the [basics of Unstrac
 | <img src="docs/assets/3rd_party/anyscale.png" width="32"/> | Anyscale | ✅ Working |
 | <img src="docs/assets/3rd_party/mistral_ai.png" width="32"/> | Mistral AI | ✅ Working |
 
-
 ### Vector Databases
 
 || Provider | Status |
@@ -124,8 +135,6 @@ Unstract comes well documented. You can get introduced to the [basics of Unstrac
 |<img src="docs/assets/3rd_party/postgres.png" width="32"/>| PostgreSQL | ✅ Working |
 |<img src="docs/assets/3rd_party/milvus.png" width="32"/>| Milvus | ✅ Working |
 
-
-
 ### Embeddings
 
 || Provider | Status |
```

backend/account_v2/templates/login.html

Lines changed: 2 additions & 4 deletions
```diff
@@ -94,8 +94,7 @@
     .logo-box{
       width: 100%;
       text-align: center;
-      margin-bottom: 20px;
-      margin-top: 20px;
+      margin: 20px 0;
     }
     .login-heading{
       font-size: 24px;
@@ -109,9 +108,8 @@
   <!-- Spinner animation -->
   <div class="lds-dual-ring"></div>
 </div>
-{% load static %}
 <div class="logo-box">
-  <img src="{% static 'logo.svg' %}" alt="My image">
+  <img src="/icons/logo.svg" alt="Unstract Logo">
 </div>
 <h2 class="login-heading">Login</h2>
 {% if error_message %}
```

backend/api_v2/api_deployment_views.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -70,6 +70,7 @@ def post(
         tag_names = serializer.validated_data.get(ApiExecution.TAGS)
         llm_profile_id = serializer.validated_data.get(ApiExecution.LLM_PROFILE_ID)
         hitl_queue_name = serializer.validated_data.get(ApiExecution.HITL_QUEUE_NAME)
+        custom_data = serializer.validated_data.get(ApiExecution.CUSTOM_DATA)
 
         if presigned_urls:
             DeploymentHelper.load_presigned_files(presigned_urls, file_objs)
@@ -85,6 +86,7 @@ def post(
             tag_names=tag_names,
             llm_profile_id=llm_profile_id,
             hitl_queue_name=hitl_queue_name,
+            custom_data=custom_data,
            request_headers=dict(request.headers),
        )
        if "error" in response and response["error"]:
```

backend/api_v2/constants.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -11,3 +11,4 @@ class ApiExecution:
     LLM_PROFILE_ID: str = "llm_profile_id"
     HITL_QUEUE_NAME: str = "hitl_queue_name"
     PRESIGNED_URLS: str = "presigned_urls"
+    CUSTOM_DATA: str = "custom_data"
```

backend/api_v2/deployment_helper.py

Lines changed: 3 additions & 0 deletions
```diff
@@ -155,6 +155,7 @@ def execute_workflow(
         tag_names: list[str] = [],
         llm_profile_id: str | None = None,
         hitl_queue_name: str | None = None,
+        custom_data: dict[str, Any] | None = None,
         request_headers=None,
     ) -> ReturnDict:
         """Execute workflow by api.
@@ -168,6 +169,7 @@ def execute_workflow(
             tag_names (list(str)): list of tag names
             llm_profile_id (str, optional): LLM profile ID for overriding tool settings
             hitl_queue_name (str, optional): Custom queue name for manual review
+            custom_data (dict[str, Any], optional): JSON data for custom_data variable replacement in prompts
 
         Returns:
             ReturnDict: execution status/ result
@@ -234,6 +236,7 @@ def execute_workflow(
             use_file_history=use_file_history,
             llm_profile_id=llm_profile_id,
             hitl_queue_name=hitl_queue_name,
+            custom_data=custom_data,
        )
        result.status_api = DeploymentHelper.construct_status_endpoint(
            api_endpoint=api.api_endpoint, execution_id=execution_id
```
backend/api_v2/serializers.py

Lines changed: 28 additions & 15 deletions
```diff
@@ -195,21 +195,23 @@ class ExecutionRequestSerializer(TagParamsSerializer):
     """Execution request serializer.
 
     Attributes:
-        timeout (int): Timeout for the API deployment, maximum value can be 300s.
-            If -1 it corresponds to async execution. Defaults to -1
-        include_metadata (bool): Flag to include metadata in API response
-        include_metrics (bool): Flag to include metrics in API response
-        use_file_history (bool): Flag to use FileHistory to save and retrieve
-            responses quickly. This is undocumented to the user and can be
-            helpful for demos.
-        tags (str): Comma-separated List of tags to associate with the execution.
-            e.g:'tag1,tag2-name,tag3_name'
-        llm_profile_id (str): UUID of the LLM profile to override the default profile.
-            If not provided, the default profile will be used.
-        hitl_queue_name (str, optional): Document class name for manual review queue.
-            If not provided, uses API name as document class.
-        presigned_urls (list): List of presigned URLs to fetch files from.
-            URLs are validated for HTTPS and S3 endpoint requirements.
+        timeout (int): Timeout for the API deployment, maximum value can be 300s.
+            If -1 it corresponds to async execution. Defaults to -1
+        include_metadata (bool): Flag to include metadata in API response
+        include_metrics (bool): Flag to include metrics in API response
+        use_file_history (bool): Flag to use FileHistory to save and retrieve
+            responses quickly. This is undocumented to the user and can be
+            helpful for demos.
+        tags (str): Comma-separated List of tags to associate with the execution.
+            e.g:'tag1,tag2-name,tag3_name'
+        llm_profile_id (str): UUID of the LLM profile to override the default profile.
+            If not provided, the default profile will be used.
+        hitl_queue_name (str, optional): Document class name for manual review queue.
+            If not provided, uses API name as document class.
+        presigned_urls (list): List of presigned URLs to fetch files from.
+            URLs are validated for HTTPS and S3 endpoint requirements.
+        custom_data (dict, optional): User-provided data for variable replacement in prompts.
+            Can be accessed in prompts using {{custom_data.key}} syntax for dot notation traversal.
     """
 
     MAX_FILES_ALLOWED = 32
@@ -224,6 +226,7 @@ class ExecutionRequestSerializer(TagParamsSerializer):
     presigned_urls = ListField(child=URLField(), required=False)
     llm_profile_id = CharField(required=False, allow_null=True, allow_blank=True)
     hitl_queue_name = CharField(required=False, allow_null=True, allow_blank=True)
+    custom_data = JSONField(required=False, allow_null=True)
 
     def validate_hitl_queue_name(self, value: str | None) -> str | None:
         """Validate queue name format using enterprise validation if available."""
@@ -244,6 +247,16 @@ def validate_hitl_queue_name(self, value: str | None) -> str | None:
             )
         return value
 
+    def validate_custom_data(self, value):
+        """Validate custom_data is a valid JSON object."""
+        if value is None:
+            return value
+
+        if not isinstance(value, dict):
+            raise ValidationError("custom_data must be a JSON object")
+
+        return value
+
     files = ListField(
         child=FileField(),
         required=False,
```
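
To make the documented `{{custom_data.key}}` behaviour concrete, here is a toy resolver for the dot-notation idea. This is not the platform's implementation (the commit only adds the serializer field, validator, and plumbing shown above); it is a self-contained sketch of how such placeholders could be expanded against a dict that `validate_custom_data` accepts.

```python
# Toy resolver for the "{{custom_data.key}}" dot-notation idea described in
# the docstring above. NOT the platform's implementation; illustration only.
import re
from typing import Any


def resolve_custom_data(prompt: str, custom_data: dict[str, Any]) -> str:
    pattern = re.compile(r"\{\{\s*custom_data\.([A-Za-z0-9_.]+)\s*\}\}")

    def _lookup(match: re.Match) -> str:
        node: Any = custom_data
        for part in match.group(1).split("."):
            if not isinstance(node, dict) or part not in node:
                return match.group(0)  # leave unknown placeholders untouched
            node = node[part]
        return str(node)

    return pattern.sub(_lookup, prompt)


print(
    resolve_custom_data(
        "Summarize spend for {{custom_data.customer.name}} in {{custom_data.fiscal_year}}.",
        {"customer": {"name": "Acme Corp"}, "fiscal_year": 2024},
    )
)
# -> Summarize spend for Acme Corp in 2024.
```

Unknown keys are left untouched in this sketch so a typo in a prompt stays visible rather than being silently blanked; the real behaviour in Unstract may differ.
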

backend/prompt_studio/prompt_studio_registry_v2/prompt_studio_registry_helper.py

Lines changed: 2 additions & 4 deletions
```diff
@@ -125,8 +125,7 @@ def get_tool_by_prompt_registry_id(
         # Suppress all exceptions to allow processing
         except Exception as e:
             logger.warning(
-                "Error while fetching for prompt registry "
-                f"ID {prompt_registry_id}: {e} "
+                f"Error while fetching for prompt registry ID {prompt_registry_id}: {e} "
             )
             return None
         return Tool(
@@ -215,8 +214,7 @@ def update_or_create_psr_tool(
             return obj
         except IntegrityError as error:
             logger.error(
-                "Integrity Error - Error occurred while "
-                f"exporting custom tool : {error}"
+                f"Integrity Error - Error occurred while exporting custom tool : {error}"
             )
             raise ToolSaveError
 
```
backend/pyproject.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -37,7 +37,7 @@ dependencies = [
     "social-auth-core==4.4.2", # For OAuth
     # TODO: Temporarily removing the extra dependencies of aws and gcs from unstract-sdk
     # to resolve lock file. Will have to be re-looked into
-    "unstract-sdk[azure]~=0.77.1",
+    "unstract-sdk[azure]~=0.77.3",
     "gcsfs==2024.10.0",
     "s3fs==2024.10.0",
     "azure-identity==1.16.0",
```

backend/sample.env

Lines changed: 2 additions & 2 deletions
```diff
@@ -78,9 +78,9 @@ PROMPT_STUDIO_FILE_PATH=/app/prompt-studio-data
 
 # Structure Tool Image (Runs prompt studio exported tools)
 # https://hub.docker.com/r/unstract/tool-structure
-STRUCTURE_TOOL_IMAGE_URL="docker:unstract/tool-structure:0.0.86"
+STRUCTURE_TOOL_IMAGE_URL="docker:unstract/tool-structure:0.0.88"
 STRUCTURE_TOOL_IMAGE_NAME="unstract/tool-structure"
-STRUCTURE_TOOL_IMAGE_TAG="0.0.86"
+STRUCTURE_TOOL_IMAGE_TAG="0.0.88"
 
 # Feature Flags
 EVALUATION_SERVER_IP=unstract-flipt
```
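
Both variables above must point at the same tool-structure release, which is exactly what this bump keeps in sync. A trivial, purely illustrative check (assuming values from a `.env` derived from `sample.env` are loaded into the process environment):

```python
import os

# Assumes a .env derived from sample.env has been exported into the environment.
url = os.environ["STRUCTURE_TOOL_IMAGE_URL"]  # e.g. docker:unstract/tool-structure:0.0.88
tag = os.environ["STRUCTURE_TOOL_IMAGE_TAG"]  # e.g. 0.0.88

assert url.endswith(f":{tag}"), "IMAGE_URL and IMAGE_TAG should reference the same version"
```
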

backend/tool_instance_v2/views.py

Lines changed: 1 addition & 2 deletions
```diff
@@ -119,8 +119,7 @@ def create(self, request: Any) -> Response:
             self.perform_create(serializer)
         except IntegrityError:
             raise DuplicateData(
-                f"{ToolInstanceErrors.TOOL_EXISTS}, "
-                f"{ToolInstanceErrors.DUPLICATE_API}"
+                f"{ToolInstanceErrors.TOOL_EXISTS}, {ToolInstanceErrors.DUPLICATE_API}"
             )
         instance: ToolInstance = serializer.instance
         ToolInstanceHelper.update_metadata_with_default_values(
```
