UN-3001 [FIX] Extract text in Prompt Studio when extractor metadata is updated #1661

chandrasekharan-zipstack · 2025-11-17T18:46:11Z

What

Refactored extraction status tracking to use x2text_config_hash instead of doc_id
X2Text config hash isolates extraction from indexing-related parameters (vector DB, embeddings, chunk size, etc.)
Prevents unnecessary re-extractions when only indexing params change
Unified mark_extraction_status() handles both success and failure cases
Added extraction failure tracking with error messages

Why

Previously, doc_id included vector DB, embedding model, and chunking parameters, causing unnecessary re-extractions when users only changed indexing settings. By using x2text_config_hash (hash of X2Text config metadata), we decouple extraction from indexing params. Extraction only depends on X2Text config + enable_highlight.

How

Added compute_x2text_config_hash() to hash X2Text metadata
Updated mark_extraction_status() to:
- Use x2text_config_hash as key
- Store single atomic entry per config (no accumulation)
- Track failures with error messages
Updated check_extraction_status() to:
- Use hash-based lookup
- Detect and allow retry on previous failures
- Validate highlight setting matches
Updated dynamic_extractor() to compute and pass x2text_config_hash
Removed doc_id from extraction status methods
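
A minimal sketch of the hashing step described above. The function name matches the one listed in this PR, but the body is illustrative: the PR delegates to ToolUtils.hash_str(), whose algorithm is not shown in this thread, so hashlib.sha256 is used here as an assumed stand-in. The sample metadata keys are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def compute_x2text_config_hash(metadata: Optional[Dict[str, Any]]) -> str:
    """Return a stable hash of X2Text adapter metadata (illustrative sketch)."""
    # Guard against None metadata so serialization never sees a null config.
    normalized = metadata or {}
    # sort_keys=True makes the serialization independent of key insertion order.
    serialized = json.dumps(normalized, sort_keys=True)
    # Assumption: SHA-256 stands in for ToolUtils.hash_str().
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


# Hypothetical metadata: same content in a different key order, then a changed value.
hash_a = compute_x2text_config_hash({"url": "http://x2text.example", "mode": "ocr"})
hash_b = compute_x2text_config_hash({"mode": "ocr", "url": "http://x2text.example"})
hash_c = compute_x2text_config_hash({"mode": "native", "url": "http://x2text.example"})
```

Because only X2Text metadata feeds the hash, changing the vector DB, embedding model, or chunk size leaves the key unchanged, which is what prevents the unnecessary re-extraction.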

Can this PR break any existing features

No

We force an extraction to avoid a data migration
check_extraction_status() handles boolean True legacy format
Extraction status stored in same IndexManager.extraction_status field
No database schema changes
Fallback to re-extraction if hash lookup fails

Database Migrations

None required. Existing extraction_status data remains usable with new hash-based lookups.

Env Config

None required. No new environment variables introduced.

Relevant Docs

Extraction optimization logic documented in mark_extraction_status() and check_extraction_status()
Hash computation in ToolUtils.hash_str()

Related Issues or PRs

Related: UN-3001

Notes on Testing

Verify extraction status is tracked using x2text_config_hash
Test re-extraction prevented when config unchanged
Test re-extraction triggered when X2Text config changes
Test extraction failure tracking with error messages
Test enable_highlight mismatch triggers re-extraction
Test backward compatibility with existing extraction_status data

Screenshots

Checklist

I have read and understood the Contribution Guidelines

…ation
- Refactored extraction status tracking to use x2text_config_hash instead of doc_id
- X2Text config hash isolates extraction from indexing-related parameters
- Prevents unnecessary re-extractions when only vector DB/embeddings change
- Single atomic update_or_create operation for extraction_status
- Added extraction failure tracking with error messages for debugging
- Unified mark_extraction_status() method handles both success and failure
- Added USE_SDK_V2 feature flag support for sdk imports
- Simplified signature of dynamic_extractor() and summarize() methods

coderabbitai · 2025-11-17T18:46:28Z

Summary by CodeRabbit

Release Notes

Refactor
- Streamlined internal extraction status tracking mechanisms for improved reliability and consistency during content processing.
- Enhanced error handling and logging throughout the extraction workflow to provide better visibility into processing failures and recovery paths.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly summarizes the main change: refactoring extraction status tracking to use x2text_config_hash instead of doc_id when extractor metadata is updated.
Description check	✅ Passed	The description comprehensively covers all template sections: What (refactoring details), Why (rationale for decoupling), How (implementation steps), breaking changes (none), migrations (none), env config (none), docs, related issues, testing notes, and checklist completed.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to Reviews > Disable Cache setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between ff754c8 and 8178e87.

📒 Files selected for processing (2)

backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py (6 hunks)
backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py (3 hunks)

🧰 Additional context used

🪛 Ruff (0.14.5)

backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py

145-145: Consider moving this statement to an else block

(TRY300)

148-148: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

153-153: Redundant exception object included in logging.exception call

(TRY401)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build

🔇 Additional comments (4)

backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py (3)

70-85: ToolUtils imports are consistent with feature-flagged SDK variants.

The conditional imports for ToolUtils under sdk1 vs sdk keep hashing behavior aligned across both paths; no issues here.

1318-1404: Hash-based extraction tracking and failure logging in dynamic_extractor look correct.

Guarding metadata with profile_manager.x2text.metadata or {} and hashing via ToolUtils.hash_str(json.dumps(..., sort_keys=True)) provides a stable, config‑driven key.

Using check_extraction_status(..., x2text_config_hash, enable_highlight=...) to short‑circuit when both config and highlight match is aligned with the PR goal of avoiding unnecessary re‑extraction.

On success and failure, mark_extraction_status is invoked with the same hash and highlight flag, and the boolean return is checked with a warning if the status can’t be persisted, which addresses earlier concerns about silent failures.

Overall, the new flow (hash computation → status check → extraction → status mark) is coherent and matches the intended semantics.
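
The flow summarized above (status check, extraction, status mark) can be sketched as follows. This is an assumption-laden illustration, not the PR's code: run_extraction is a hypothetical helper, a plain dict stands in for the IndexManager.extraction_status field, and the extract callable stands in for the real X2Text call.

```python
from typing import Callable, Dict, Optional


def run_extraction(
    store: Dict[str, dict],
    config_hash: str,
    enable_highlight: bool,
    extract: Callable[[], str],
) -> Optional[str]:
    """Return freshly extracted text, or None when a prior extraction is reusable."""
    entry = store.get(config_hash)
    reusable = (
        isinstance(entry, dict)
        and entry.get("extracted") is True
        and entry.get("enable_highlight") == enable_highlight
    )
    if reusable:
        # Same X2Text config and same highlight setting: skip re-extraction.
        return None

    try:
        text = extract()
    except Exception as exc:
        # Record the failure so a later call retries, then propagate the error.
        store[config_hash] = {
            "extracted": False,
            "enable_highlight": enable_highlight,
            "error": str(exc),
        }
        raise

    store[config_hash] = {"extracted": True, "enable_highlight": enable_highlight}
    return text


calls = []


def fake_extract() -> str:
    calls.append("extract")
    return "extracted text"


def failing_extract() -> str:
    raise RuntimeError("adapter unreachable")


store: Dict[str, dict] = {}
first = run_extraction(store, "hash-a", False, fake_extract)
second = run_extraction(store, "hash-a", False, fake_extract)
```

The second call returns without invoking the extractor because the stored entry matches both the hash and the highlight flag.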

429-430: All call sites verified—no stale references or signature mismatches found.

The summarize() call at line 428–430 correctly passes (file_name, org_id, run_id, tool) matching the definition at line 485. Only one call site exists across the codebase; no stale uses of the old signature remain. The refactoring is complete and safe.

backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py (1)

158-230: check_extraction_status: config-hash semantics and highlight handling match the new extraction flow.

Filtering by (document_manager, profile_manager) and then looking up extraction_status[x2text_config_hash] is consistent with how mark_extraction_status writes the data.

The dict-based status interpretation (extracted, enable_highlight, optional error) cleanly distinguishes:

Missing status → first-time extraction.

Failed prior extraction → log and retry.

Completed extraction with highlight match → reuse.

Completed extraction with highlight mismatch → log and re-extract.

On unexpected errors, raising IndexingAPIError is appropriate so callers don’t silently proceed on corrupted status.

Given the current write path always stores dict-shaped entries for x2text_config_hash, the assumptions here are coherent.
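
The four cases listed above reduce to a small decision function. The sketch below is hypothetical (is_extraction_reusable is not a name from the PR) and assumes that any entry that is not a dict, including a legacy boolean, is treated as not reusable so that extraction runs again.

```python
from typing import Any, Dict, Optional


def is_extraction_reusable(
    extraction_status: Optional[Dict[str, Any]],
    x2text_config_hash: str,
    enable_highlight: bool,
) -> bool:
    """Decide whether a stored extraction can be reused (illustrative sketch)."""
    status = (extraction_status or {}).get(x2text_config_hash)

    # Missing status (or an unrecognized shape): first-time extraction.
    if not isinstance(status, dict):
        return False

    # Failed prior extraction: allow a retry.
    if not status.get("extracted", False):
        return False

    # Completed extraction: reuse only when the highlight setting matches.
    return status.get("enable_highlight", False) == enable_highlight


stored = {
    "hash-ok": {"extracted": True, "enable_highlight": True},
    "hash-failed": {"extracted": False, "enable_highlight": True, "error": "timeout"},
}
```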

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py (1)
226-228: Use logging.exception for better error diagnostics.

The error handling appropriately raises an exception, but the logging could be improved to include stack traces.

Apply this diff:
         except Exception as e:
-            logger.error(f"Unexpected error while checking extraction status: {e}")
+            logger.exception(f"Unexpected error while checking extraction status: {e}")
             raise IndexingAPIError(f"Error checking extraction status {str(e)}") from e
Based on static analysis hints.
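
A small standalone sketch of why logger.exception helps here: it logs at ERROR level and appends the active traceback automatically, so the message itself does not need to interpolate the exception. The logger name, the in-memory handler, and the local IndexingAPIError class are stand-ins for illustration, not the project's definitions.

```python
import io
import logging

# Capture log output in memory so the effect of logger.exception is visible.
logger = logging.getLogger("extraction_status_demo")
logger.setLevel(logging.ERROR)
logger.propagate = False
buffer = io.StringIO()
logger.addHandler(logging.StreamHandler(buffer))


class IndexingAPIError(Exception):
    """Local stand-in for the project's exception class."""


def check_status() -> bool:
    try:
        # Simulate an unexpected failure while reading stored status.
        raise KeyError("extraction_status")
    except Exception as e:
        # The traceback is attached automatically; no need to format `e` here.
        logger.exception("Unexpected error while checking extraction status")
        raise IndexingAPIError(f"Error checking extraction status {e}") from e


try:
    check_status()
except IndexingAPIError as err:
    caught = err

log_output = buffer.getvalue()
```

Leaving the exception out of the message also avoids the TRY401 finding that Ruff reports for redundant exception objects in logging.exception calls.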

🧹 Nitpick comments (3)

backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py (3)
106-114: LGTM - Clean status_data structure with conditional error field.

The structured status data with extracted, enable_highlight, and optional error fields provides clear semantics for both success and failure cases.

Optional defensive validation: Consider validating that error_message is only provided when extracted=False:
         status_data = {
             "extracted": extracted,
             "enable_highlight": enable_highlight,
         }
 
+        # Validate that error is only provided on failure
+        if extracted and error_message:
+            logger.warning(
+                f"error_message provided with extracted=True for {x2text_config_hash}. "
+                "Ignoring error_message."
+            )
+
         # Add error message if extraction failed
         if not extracted and error_message:
             status_data["error"] = error_message
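
As a standalone illustration of the status shape discussed in this comment, the sketch below builds the entry for both outcomes. build_status_data is a hypothetical helper; in the PR this logic sits inline in mark_extraction_status().

```python
from typing import Any, Dict, Optional


def build_status_data(
    extracted: bool,
    enable_highlight: bool,
    error_message: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the per-config status entry (illustrative sketch)."""
    status_data: Dict[str, Any] = {
        "extracted": extracted,
        "enable_highlight": enable_highlight,
    }
    # The error field is only stored for failed extractions.
    if not extracted and error_message:
        status_data["error"] = error_message
    return status_data


success = build_status_data(True, False, error_message="ignored on success")
failure = build_status_data(False, True, error_message="adapter timeout")
```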
122-143: LGTM - Comprehensive logging for success and failure paths.

The logging provides good observability with document IDs, config hashes, and error messages. The distinction between creation and update is helpful for debugging.

Minor: The log at lines 122-124 includes index_ids_history, which is unrelated to extraction status tracking. Consider removing it or moving it to a more relevant location:
-                logger.info(
-                    f"Index manager {index_manager} {index_manager.index_ids_history}"
-                )
-
208-221: LGTM - Highlight validation ensures extraction format compatibility.

The validation logic correctly detects when extraction needs to be redone due to highlight setting changes, preventing format mismatches.

Minor simplification: The condition at line 215 is redundant:
-            if is_extracted and stored_highlight == enable_highlight:
+            if stored_highlight == enable_highlight:
                 logger.info(
                     f"Extraction already complete for document: {document_id} "
                     f"with x2text_config_hash: {x2text_config_hash} "
                     f"(highlight={enable_highlight})"
                 )
                 return True
-            elif is_extracted and stored_highlight != enable_highlight:
+            else:
                 logger.info(
                     f"Extraction exists but highlight mismatch for {x2text_config_hash}. "
                     f"Stored: {stored_highlight}, Requested: {enable_highlight}. "
                     f"Re-extraction needed."
                 )
                 return False
-            else:
-                logger.info(f"Extraction NOT complete for document: {document_id}")
-                return False
The lines 222-224 become unreachable with this simplification.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to Reviews > Disable Cache setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between a748c93 and ff754c8.

📒 Files selected for processing (2)

backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py (6 hunks)
backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py (3 hunks)

🧰 Additional context used

🪛 Ruff (0.14.5)

backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py

145-145: Consider moving this statement to an else block

(TRY300)

148-148: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

151-151: Do not catch blind exception: Exception

(BLE001)

152-152: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build

🔇 Additional comments (11)

backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py (6)

77-77: LGTM - ToolUtils imports are correctly feature-flagged.

The imports are properly conditional based on the sdk1 feature flag, and both SDK versions provide the necessary ToolUtils.hash_str() method used at line 1318.

Also applies to: 84-84

429-429: LGTM - Simplified summarize() call aligns with updated signature.

The removal of document_id and doc_id parameters is consistent with the updated method signature at line 485. These parameters were not used in the summarization logic.

1333-1338: LGTM - Hash-based extraction status check enables config-aware caching.

The updated call correctly uses x2text_config_hash to check if extraction has already been completed with the same X2Text config and highlight settings, avoiding unnecessary re-extraction.

1375-1380: LGTM - Success case correctly marks extraction status.

The call appropriately uses the default extracted=True to record successful extraction with the config hash and highlight setting.

1386-1393: LGTM - Failure tracking enables retry with diagnostics.

The addition of failure recording with extracted=False and error_message is a good practice. It allows:

Future retry attempts (check_extraction_status returns False for failures)

Diagnostic visibility into past extraction failures

Error propagation remains intact since the ExtractionAPIError is still raised

485-485: No breaking changes detected. All callers updated.

The verification confirms that the only caller of summarize() at line 428 has already been updated to use the new 4-parameter signature. No remaining callers use the old signature with document_id and doc_id parameters.

backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py (5)

71-78: LGTM - Signature change aligns with hash-based tracking.

The updated signature replaces doc_id with x2text_config_hash and adds enable_highlight, extracted, and error_message parameters to support both success and failure tracking.

156-161: LGTM - Signature aligns with hash-based lookup requirements.

The updated signature provides all necessary parameters for config-aware extraction status checking.

177-183: LGTM - Safe IndexManager lookup with appropriate fallback.

Using .first() avoids raising exceptions, and returning False when no record exists is the correct behavior.

185-193: LGTM - Safe hash-based lookup with proper defaults.

The defensive handling of None values and use of .get() prevents KeyError exceptions.

195-206: LGTM - Failure retry logic enables recovery from transient errors.

The logic correctly identifies previous extraction failures and returns False to allow retry attempts, with helpful error logging for diagnostics.

backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py

backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py

…ion logging
- Guard against None metadata when adapter_metadata_b is None to prevent TypeError
- Changed from logger.error() to logger.exception() to capture full stack traces
- Added return value checks on mark_extraction_status() calls with warning logs
- Improves debugging visibility when status updates fail

Fixes extraction crashes and provides better error tracking for monitoring.

github-actions · 2025-11-18T07:41:40Z

Test Results

Summary

✅ Runner Tests: 11 passed, 0 failed (11 total)
✅ SDK1 Tests: 66 passed, 0 failed (66 total)

Runner Tests - Full Report

filepath	function	passed	SUBTOTAL
runner/src/unstract/runner/clients/test_docker.py	test_logs	1	1
runner/src/unstract/runner/clients/test_docker.py	test_cleanup	1	1
runner/src/unstract/runner/clients/test_docker.py	test_cleanup_skip	1	1
runner/src/unstract/runner/clients/test_docker.py	test_client_init	1	1
runner/src/unstract/runner/clients/test_docker.py	test_get_image_exists	1	1
runner/src/unstract/runner/clients/test_docker.py	test_get_image	1	1
runner/src/unstract/runner/clients/test_docker.py	test_get_container_run_config	1	1
runner/src/unstract/runner/clients/test_docker.py	test_get_container_run_config_without_mount	1	1
runner/src/unstract/runner/clients/test_docker.py	test_run_container	1	1
runner/src/unstract/runner/clients/test_docker.py	test_get_image_for_sidecar	1	1
runner/src/unstract/runner/clients/test_docker.py	test_sidecar_container	1	1
TOTAL		11	11

SDK1 Tests - Full Report

sonarqubecloud · 2025-11-18T07:41:54Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

chandrasekharan-zipstack self-assigned this Nov 17, 2025

chandrasekharan-zipstack requested review from Deepak-Kesavan, harini-venkataraman and jagadeeswaran-zipstack November 17, 2025 18:50

chandrasekharan-zipstack changed the title ~~UN-3001 [FEAT] Track X2Text config hash for extraction re-use optimization~~ UN-3001 [FIX] Extract text when extractor metadata is updated Nov 17, 2025

coderabbitai bot reviewed Nov 17, 2025

chandrasekharan-zipstack changed the title ~~UN-3001 [FIX] Extract text when extractor metadata is updated~~ UN-3001 [FIX] Extract text in Prompt Studio when extractor metadata is updated Nov 17, 2025

Deepak-Kesavan approved these changes Nov 18, 2025

harini-venkataraman approved these changes Nov 18, 2025

jagadeeswaran-zipstack approved these changes Nov 18, 2025

Merge branch 'main' into UN-3001-extract-hash-optimization

afdae20

johnyrahul merged commit 1889ddd into main Nov 18, 2025
7 checks passed

johnyrahul deleted the UN-3001-extract-hash-optimization branch November 18, 2025 09:59

6 participants
