-
Notifications
You must be signed in to change notification settings - Fork 571
UN-3001 [FIX] Extract text in Prompt Studio when extractor metadata is updated #1661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ation - Refactored extraction status tracking to use x2text_config_hash instead of doc_id - X2Text config hash isolates extraction from indexing-related parameters - Prevents unnecessary re-extractions when only vector DB/embeddings change - Single atomic update_or_create operation for extraction_status - Added extraction failure tracking with error messages for debugging - Unified mark_extraction_status() method handles both success and failure - Added USE_SDK_V2 feature flag support for sdk imports - Simplified signature of dynamic_extractor() and summarize() methods
Summary by CodeRabbitRelease Notes
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro Cache: Disabled due to Reviews > Disable Cache setting Knowledge base: Disabled due to 📒 Files selected for processing (2)
🧰 Additional context used🪛 Ruff (0.14.5)backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py145-145: Consider moving this statement to an (TRY300) 148-148: Use Replace with (TRY400) 153-153: Redundant exception object included in (TRY401) ⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
🔇 Additional comments (4)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py (1)
226-228: Use logging.exception for better error diagnostics.The error handling appropriately raises an exception, but the logging could be improved to include stack traces.
Apply this diff:
except Exception as e: - logger.error(f"Unexpected error while checking extraction status: {e}") + logger.exception(f"Unexpected error while checking extraction status: {e}") raise IndexingAPIError(f"Error checking extraction status {str(e)}") from eBased on static analysis hints.
🧹 Nitpick comments (3)
backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py (3)
106-114: LGTM - Clean status_data structure with conditional error field.The structured status data with
extracted,enable_highlight, and optionalerrorfields provides clear semantics for both success and failure cases.Optional defensive validation: Consider validating that error_message is only provided when extracted=False:
status_data = { "extracted": extracted, "enable_highlight": enable_highlight, } + # Validate that error is only provided on failure + if extracted and error_message: + logger.warning( + f"error_message provided with extracted=True for {x2text_config_hash}. " + "Ignoring error_message." + ) + # Add error message if extraction failed if not extracted and error_message: status_data["error"] = error_message
122-143: LGTM - Comprehensive logging for success and failure paths.The logging provides good observability with document IDs, config hashes, and error messages. The distinction between creation and update is helpful for debugging.
Minor: The log at lines 122-124 includes
index_ids_history, which is unrelated to extraction status tracking. Consider removing it or moving it to a more relevant location:- logger.info( - f"Index manager {index_manager} {index_manager.index_ids_history}" - ) -
208-221: LGTM - Highlight validation ensures extraction format compatibility.The validation logic correctly detects when extraction needs to be redone due to highlight setting changes, preventing format mismatches.
Minor simplification: The condition at line 215 is redundant:
- if is_extracted and stored_highlight == enable_highlight: + if stored_highlight == enable_highlight: logger.info( f"Extraction already complete for document: {document_id} " f"with x2text_config_hash: {x2text_config_hash} " f"(highlight={enable_highlight})" ) return True - elif is_extracted and stored_highlight != enable_highlight: + else: logger.info( f"Extraction exists but highlight mismatch for {x2text_config_hash}. " f"Stored: {stored_highlight}, Requested: {enable_highlight}. " f"Re-extraction needed." ) return False - else: - logger.info(f"Extraction NOT complete for document: {document_id}") - return FalseThe lines 222-224 become unreachable with this simplification.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py(6 hunks)backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py(3 hunks)
🧰 Additional context used
🪛 Ruff (0.14.5)
backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py
145-145: Consider moving this statement to an else block
(TRY300)
148-148: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
151-151: Do not catch blind exception: Exception
(BLE001)
152-152: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (11)
backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py (6)
77-77: LGTM - ToolUtils imports are correctly feature-flagged.The imports are properly conditional based on the sdk1 feature flag, and both SDK versions provide the necessary ToolUtils.hash_str() method used at line 1318.
Also applies to: 84-84
429-429: LGTM - Simplified summarize() call aligns with updated signature.The removal of
document_idanddoc_idparameters is consistent with the updated method signature at line 485. These parameters were not used in the summarization logic.
1333-1338: LGTM - Hash-based extraction status check enables config-aware caching.The updated call correctly uses
x2text_config_hashto check if extraction has already been completed with the same X2Text config and highlight settings, avoiding unnecessary re-extraction.
1375-1380: LGTM - Success case correctly marks extraction status.The call appropriately uses the default
extracted=Trueto record successful extraction with the config hash and highlight setting.
1386-1393: LGTM - Failure tracking enables retry with diagnostics.The addition of failure recording with
extracted=Falseanderror_messageis a good practice. It allows:
- Future retry attempts (check_extraction_status returns False for failures)
- Diagnostic visibility into past extraction failures
- Error propagation remains intact since the ExtractionAPIError is still raised
485-485: No breaking changes detected. All callers updated.The verification confirms that the only caller of
summarize()at line 428 has already been updated to use the new 4-parameter signature. No remaining callers use the old signature withdocument_idanddoc_idparameters.backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py (5)
71-78: LGTM - Signature change aligns with hash-based tracking.The updated signature replaces
doc_idwithx2text_config_hashand addsenable_highlight,extracted, anderror_messageparameters to support both success and failure tracking.
156-161: LGTM - Signature aligns with hash-based lookup requirements.The updated signature provides all necessary parameters for config-aware extraction status checking.
177-183: LGTM - Safe IndexManager lookup with appropriate fallback.Using
.first()avoids raising exceptions, and returning False when no record exists is the correct behavior.
185-193: LGTM - Safe hash-based lookup with proper defaults.The defensive handling of None values and use of
.get()prevents KeyError exceptions.
195-206: LGTM - Failure retry logic enables recovery from transient errors.The logic correctly identifies previous extraction failures and returns False to allow retry attempts, with helpful error logging for diagnostics.
backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py
Outdated
Show resolved
Hide resolved
backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py
Show resolved
Hide resolved
backend/prompt_studio/prompt_studio_index_manager_v2/prompt_studio_index_helper.py
Show resolved
Hide resolved
…ion logging - Guard against None metadata when adapter_metadata_b is None to prevent TypeError - Changed from logger.error() to logger.exception() to capture full stack traces - Added return value checks on mark_extraction_status() calls with warning logs - Improves debugging visibility when status updates fail Fixes extraction crashes and provides better error tracking for monitoring.
Test ResultsSummary
Runner Tests - Full Report
SDK1 Tests - Full Report
|
|



What
Why
Previously,
doc_idincluded vector DB, embedding model, and chunking parameters, causing unnecessary re-extractions when users only changed indexing settings. By using x2text_config_hash (hash of X2Text config metadata), we decouple extraction from indexing params. Extraction only depends on X2Text config + enable_highlight.How
compute_x2text_config_hash()to hash X2Text metadatamark_extraction_status()to:check_extraction_status()to:Can this PR break any existing features
No
Database Migrations
None required. Existing extraction_status data remains usable with new hash-based lookups.
Env Config
None required. No new environment variables introduced.
Relevant Docs
Related Issues or PRs
Notes on Testing
Screenshots
Checklist