Managed batches - Misc bug fixes #21157
Conversation
Consolidated Bug Fixes

1. S3 logger crashes on file delete callbacks
2. Cost tracking KeyError: 'stream' (health check callbacks don't include a 'stream' key)
3. Cost tracking skips when model=None (non-model call types such as health checks have no model)
4. Batch rate limiter 403 on managed files
5. Service logger KeyError: 'call_type' (batch polling callbacks don't include call_type)
6. Batch polling 403 — background job can't access managed files
7. afile_retrieve called without credentials for output files
8. Deleted managed files return 403 instead of 404
9. afile_retrieve returns raw provider ID for output files (when the stored file_object comes back from the DB, its .id is the raw provider ID)
10. batches.retrieve returns raw provider input_file_id
11. Batch cost calculation ignores deployment-level custom pricing: batch_cost_calculator() only looked up pricing from the global model_prices_and_context_window.json map. Deployment-level custom batch pricing (input_cost_per_token_batches / output_cost_per_token_batches set in model_info) was never passed through, so cost was always 0 for models not in the global map.
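The pricing fix in item 11 can be sketched as follows. This is a simplified, hypothetical signature to illustrate the precedence change, not the actual litellm.cost_calculator code:

```python
# Simplified sketch: deployment-level model_info pricing takes precedence
# over the global cost map. Names and shapes are illustrative only.
GLOBAL_COST_MAP = {
    "gpt-4o": {
        "input_cost_per_token_batches": 1.25e-06,
        "output_cost_per_token_batches": 5.0e-06,
    }
}

def batch_cost_calculator(usage, model, model_info=None):
    """Compute batch cost, preferring deployment-level model_info pricing."""
    if model_info is None:
        # Previous behavior: only the global map was consulted, so models
        # absent from it always priced out at 0.
        model_info = GLOBAL_COST_MAP.get(model, {})
    input_rate = model_info.get("input_cost_per_token_batches", 0.0)
    output_rate = model_info.get("output_cost_per_token_batches", 0.0)
    return (
        usage["prompt_tokens"] * input_rate
        + usage["completion_tokens"] * output_rate
    )
```

With a custom deployment absent from the global map, passing its model_info now yields a non-zero cost instead of 0.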
Greptile Overview

Greptile Summary

This PR fixes a bug where the S3 v2 logger raised a ValueError when standard_logging_object was missing from the callback kwargs (e.g. for afile_delete call types).

Confidence Score: 5/5
| Filename | Overview |
|---|---|
| litellm/integrations/s3_v2.py | Replaces a ValueError raise with a graceful debug log + early return when standard_logging_object is missing (e.g., for afile_delete call types). This aligns with the S3 v1 logger's approach and prevents unnecessary error metrics from being incremented. |
| tests/test_litellm/integrations/test_s3_v2.py | Adds a well-structured test that verifies the graceful skip behavior when standard_logging_object is missing. Uses mocks appropriately (no real network calls). Minor style note: test function is defined outside the existing test class. |
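The graceful-skip behavior described above can be sketched as a small, self-contained function. Names are hypothetical; the real fix lives inside `S3Logger._async_log_event_base`:

```python
import logging

verbose_logger = logging.getLogger("s3_v2_sketch")

def enqueue_batch_logging_element(log_queue, element, call_type):
    """Append a batch logging element to the queue, skipping gracefully
    when it is None (e.g. afile_delete has no standard_logging_object)."""
    if element is None:
        # Before the fix this raised ValueError and incremented error metrics.
        verbose_logger.debug(
            "s3_batch_logging_element is None for call_type=%s, skipping event",
            call_type,
        )
        return False
    log_queue.append(element)
    return True
```

The queue is left untouched on the skip path, matching the "log_queue remains empty" note in the diagram below.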
Sequence Diagram
sequenceDiagram
participant Client
participant LiteLLM as LiteLLM (afile_delete)
participant S3Logger as S3Logger._async_log_event_base
participant Queue as log_queue
Client->>LiteLLM: File delete request
LiteLLM->>S3Logger: async_log_success_event(kwargs={call_type: "afile_delete"})
S3Logger->>S3Logger: create_s3_batch_logging_element(standard_logging_payload=None)
S3Logger-->>S3Logger: Returns None (no payload)
alt Before Fix
S3Logger->>S3Logger: raise ValueError("s3_batch_logging_element is None")
S3Logger->>S3Logger: handle_callback_failure() — increments error metrics
end
alt After Fix (this PR)
S3Logger->>S3Logger: verbose_logger.debug("skipping event...")
S3Logger-->>LiteLLM: return (graceful skip)
end
Note over Queue: log_queue remains empty — no None appended
Last reviewed commit: 3eed4ba
…ile_content directly
batch_cost_calculator only checked the global cost map, ignoring deployment-level custom pricing (input_cost_per_token_batches etc.). Add optional model_info param through the batch cost chain and pass it from CheckBatchCost.
> # Fix: afile_retrieve returns raw provider ID for batch output files

Can you remove this file?
Greptile Summary

A collection of bug fixes for managed batches addressing several issues: the batch polling job (CheckBatchCost) hitting 403s on managed files, afile_retrieve returning raw provider IDs, missing call_type in batch polling callbacks, and batch cost calculation ignoring deployment-level custom pricing.

Confidence Score: 3/5
| Filename | Overview |
|---|---|
| enterprise/litellm_enterprise/proxy/common_utils/check_batch_cost.py | Refactored to bypass managed files hook and call afile_content directly with deployment credentials. Threads model_info for custom batch pricing. Logic is correct; pre-existing indentation issue with completed_jobs update inside loop is not introduced by this PR. |
| enterprise/litellm_enterprise/proxy/hooks/managed_files.py | Three changes: (1) can_user_call_unified_file_id returns True for missing records (security concern), (2) async_post_call_success_hook uses router credentials for afile_retrieve, (3) afile_retrieve Case 2 sets unified ID on stored file_object. The credential fix and ID fix are sound; the access-control change needs review. |
| fix_afile_retrieve_returns_unified_id.md | Development note file accidentally committed to repo root. Should be removed before merge. |
| litellm/_service_logger.py | Changed kwargs["call_type"] to kwargs.get("call_type", "unknown") to avoid KeyError in batch polling callbacks. Clean defensive fix. |
| litellm/batches/batch_utils.py | Threads model_info parameter through batch cost calculation pipeline to support deployment-specific batch pricing. Clean, backwards-compatible addition. |
| litellm/cost_calculator.py | Allows model_info to be passed in to batch_cost_calculator, skipping global lookup when deployment-level pricing is provided. Backwards-compatible change. |
| litellm/integrations/s3_v2.py | Changed raise ValueError to graceful skip when s3_batch_logging_element is None (e.g. for afile_delete calls). Prevents crash for non-model call types. |
| litellm/proxy/batches_endpoints/endpoints.py | Added input_file_id → unified ID resolution in two code paths. Logic is correct but duplicated across 3 locations and uses direct DB queries in the request path. |
| litellm/proxy/hooks/proxy_track_cost_callback.py | Added early return when sl_object is None and model is falsy. Prevents spurious cost-tracking error for non-model call types like afile_delete. |
| litellm/proxy/openai_files_endpoints/common_utils.py | Added input_file_id resolution in get_batch_from_database. Same duplicated pattern as in endpoints.py. |
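The `_service_logger.py` and `proxy_track_cost_callback.py` fixes in the table share the same defensive pattern for non-model call types. A minimal sketch with simplified signatures (not the actual LiteLLM code):

```python
def get_call_type(kwargs):
    """Batch polling callbacks may omit call_type; default instead of raising."""
    # Was kwargs["call_type"], which raised KeyError for batch polling callbacks.
    return kwargs.get("call_type", "unknown")

def should_track_cost(sl_object, model):
    """Skip cost tracking for non-model call types (e.g. afile_delete)
    that carry neither a standard logging payload nor a model."""
    if sl_object is None and not model:
        return False
    return True
```

Both guards turn a crash or spurious error into a clean no-op for call types that never had a model response to begin with.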
Sequence Diagram
sequenceDiagram
participant BG as CheckBatchCost (Background)
participant Router as LLM Router
participant Provider as LLM Provider (Azure/OpenAI)
participant DB as Prisma DB
participant CostCalc as Batch Cost Calculator
BG->>DB: find_many(status=validating/in_progress)
DB-->>BG: pending batch jobs
loop For each pending job
BG->>Router: aretrieve_batch(model_id, batch_id)
Router->>Provider: retrieve batch status
Provider-->>Router: batch response (status, output_file_id)
Router-->>BG: batch response
alt status == completed
BG->>BG: decode unified output_file_id → raw provider file ID
BG->>Router: get_deployment_credentials_with_provider(model_id)
Router-->>BG: {api_key, api_base, custom_llm_provider}
BG->>Provider: afile_content(raw_file_id, **credentials)
Note right of BG: Previously went through managed_files hook → 403
Provider-->>BG: file content (JSONL)
BG->>Router: get_deployment(model_id) → model_info
BG->>CostCalc: calculate_batch_cost_and_usage(content, model_info)
Note right of CostCalc: Now uses deployment-level batch pricing
CostCalc-->>BG: cost, usage, models
BG->>DB: update status → complete
end
end
Last reviewed commit: 7d794b5
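The "decode unified output_file_id" step in the diagram relies on unified IDs being base64-encoded strings (one appears verbatim in the test assertion later in this PR). A sketch of what a check like `_is_base64_encoded_unified_file_id` plausibly does; the `litellm_proxy` prefix and decoding rules are assumptions, not the actual implementation:

```python
import base64
import binascii

def is_base64_encoded_unified_file_id(file_id: str) -> bool:
    """Heuristic sketch: unified file IDs decode from base64 to a string
    starting with 'litellm_proxy'. Raw provider IDs either fail to decode
    or decode to something else."""
    try:
        decoded = base64.b64decode(file_id, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return False
    return decoded.startswith("litellm_proxy")
```

This is the distinction that lets the polling job tell a unified ID apart from a raw provider file ID before hitting the provider directly.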
# Fix: afile_retrieve returns raw provider ID for batch output files

## Bug

`managed_files.afile_retrieve()` Case 2 (file_object already in DB) returned the stored `file_object` without replacing `.id` with the unified file ID. Case 3 (fetch from provider) did this correctly at line 1028.

## Fix

One-line change in `enterprise/litellm_enterprise/proxy/hooks/managed_files.py`:

```python
# Before (lines 1013-1014)
if stored_file_object and stored_file_object.file_object:
    return stored_file_object.file_object

# After
if stored_file_object and stored_file_object.file_object:
    stored_file_object.file_object.id = file_id
    return stored_file_object.file_object
```

## Test

```bash
poetry run pytest tests/test_litellm/enterprise/proxy/test_afile_retrieve_returns_unified_id.py -s -vvvv
```

## Test failure (before fix)

```
FAILED tests/test_litellm/enterprise/proxy/test_afile_retrieve_returns_unified_id.py::test_should_return_unified_id_when_file_object_exists_in_db
AssertionError: afile_retrieve should return the unified ID 'bGl0ZWxsbV9wcm94eTp1bmlmaWVkX291dHB1dF9maWxl', but got raw provider ID 'batch_20260214-output-file-1'
assert 'batch_20260214-output-file-1' == 'bGl0ZWxsbV9wcm94eTp1bmlmaWVkX291dHB1dF9maWxl'
=================== 1 failed, 1 retried in 102.95s ===================
```

## Test pass (after fix)

```
tests/test_litellm/enterprise/proxy/test_afile_retrieve_returns_unified_id.py::test_should_return_unified_id_when_file_object_exists_in_db PASSED
============================== 1 passed in 0.11s ===============================
```

## Files changed

- `enterprise/litellm_enterprise/proxy/hooks/managed_files.py` — one-line fix
- `tests/test_litellm/enterprise/proxy/test_afile_retrieve_returns_unified_id.py` — new test
Stray markdown file in repo root
This file appears to be a development note/changelog for one of the fixes in this PR. It should not be committed to the repository root — it will clutter the top-level directory. Consider removing it before merging, or moving it into a docs directory if it's intended to be kept.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
```python
if (
    unified_batch_id
    and hasattr(response, "input_file_id")
    and response.input_file_id
    and not _is_base64_encoded_unified_file_id(response.input_file_id)
    and prisma_client
):
    try:
        _managed_file = await prisma_client.db.litellm_managedfiletable.find_first(
            where={"flat_model_file_ids": {"has": response.input_file_id}}
        )
        if _managed_file:
            response.input_file_id = _managed_file.unified_file_id
    except Exception:
        pass
```
Duplicated input_file_id resolution logic across 3 locations
The same input_file_id → unified ID resolution block is copy-pasted here, again at line ~502 in this file, and in common_utils.py:get_batch_from_database. This triples the maintenance burden and risks divergence if the logic ever changes. Consider extracting this into a shared helper, e.g. _resolve_input_file_id_to_unified(response, prisma_client), and calling it from all three places.
```python
try:
    _managed_file = await prisma_client.db.litellm_managedfiletable.find_first(
        where={"flat_model_file_ids": {"has": response.input_file_id}}
    )
    if _managed_file:
        response.input_file_id = _managed_file.unified_file_id
except Exception:
    pass
```
Direct DB query in request-handling endpoint
Both occurrences of prisma_client.db.litellm_managedfiletable.find_first(...) are direct database queries in the critical request path (the retrieve_batch endpoint). Per the repository's custom rules, the critical path should avoid direct DB queries and instead use helper functions (e.g. get_team/get_user/get_key patterns).
Consider wrapping this lookup in a helper (e.g. in common_utils.py) to keep the endpoint code clean and to allow caching or batching of these lookups in the future.
Context Used: Rule from dashboard - What: In critical path of request, there should be no direct db queries. Only allow them to be made ... (source)
Addressed. Per discussion with @Sameerlite we are not removing the db query from the request path as it makes sense to have a live db query for security checks
```python
# When DB record is missing (file was deleted), allow through so downstream returns 404.
# Matches can_user_call_unified_object_id which also returns True for missing records.
return True
```
Security consideration: returning True for missing records bypasses access control
Changing the return from False to True when the managed file record is not found means that any user can pass the access check for a file ID that doesn't exist in the DB. The comment says this is to allow downstream to return 404, but it also means:
- If there's ever a race condition where a record is temporarily unavailable, access control is bypassed.
- An attacker could probe arbitrary file IDs and get past this check (though they'd hit 404 downstream).
The existing behavior for can_user_call_unified_object_id (also returning True for missing) is cited as justification, but that may itself be a design concern. Consider instead raising a 404 directly here rather than returning True, which more accurately represents "record not found" without weakening the access-control contract.
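The alternative suggested here (raise a 404 at the check instead of returning True) could be sketched as follows; the exception type and simplified signature are assumptions, not the proxy's actual code:

```python
class ManagedFileNotFoundError(Exception):
    """Stands in for an HTTP 404 response in this sketch."""
    status_code = 404

def can_user_call_unified_file_id(managed_file_record, user_id):
    """Access check that surfaces 'not found' explicitly instead of
    returning True and relying on a downstream 404."""
    if managed_file_record is None:
        raise ManagedFileNotFoundError("managed file record not found")
    return managed_file_record.get("created_by") == user_id
```

Raising here keeps "record missing" and "access granted" as distinct outcomes, so a race or probe can never be mistaken for a pass.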
Merged 72a1bd6 into BerriAI:litellm_oss_staging_02_14_20262
A collection of bug fixes for managed batches. See comment for more details. CC @Sameerlite
Relevant issues
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR:

- I have added testing in the `tests/litellm/` directory (adding at least 1 test is a hard requirement - see details)
- My PR passes all unit tests on `make test-unit`
- My PR has been reviewed by `@greptileai` and received a Confidence Score of at least 4/5 before requesting a maintainer review

CI (LiteLLM team)
Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:
Type
🆕 New Feature
🐛 Bug Fix
🧹 Refactoring
📖 Documentation
🚄 Infrastructure
✅ Test
Changes