
Managed batches - Misc bug fixes#21157

Merged
Sameerlite merged 11 commits into BerriAI:litellm_oss_staging_02_14_20262 from
Point72:ephrimstanley/s3-logger-skip-missing-standard-logging-object
Feb 16, 2026

Conversation

@ephrimstanley
Contributor

@ephrimstanley ephrimstanley commented Feb 13, 2026

A collection of bug fixes for managed batches. See comment for more details. CC @Sameerlite

Relevant issues

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/litellm/ directory (adding at least 1 test is a hard requirement; see details)
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

CI (LiteLLM team)

CI status guideline:

  • 50-55 passing tests: main is stable with minor issues.
  • 45-49 passing tests: acceptable but needs attention.
  • <= 40 passing tests: unstable; be careful with your merges and assess the risk.
  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Type

🆕 New Feature
🐛 Bug Fix
🧹 Refactoring
📖 Documentation
🚄 Infrastructure
✅ Test

Changes

@vercel

vercel bot commented Feb 13, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| litellm | Ready | Preview, Comment | Feb 16, 2026 0:59am |

Request Review

@ephrimstanley
Contributor Author

ephrimstanley commented Feb 13, 2026

Consolidated Bug Fixes

1. S3 logger crashes on file delete callbacks

afile_delete produces no standard_logging_object, so the S3 logger receives None and raises a ValueError.

  • litellm/integrations/s3_v2.py:250 — Changed raise ValueError to return with debug log. Skip gracefully when no payload to persist.
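A minimal sketch of the before/after behavior. The function name and queue are hypothetical stand-ins for the real method in litellm/integrations/s3_v2.py:

```python
import logging

logger = logging.getLogger("s3_v2_sketch")

def log_event(batch_element, log_queue):
    """Sketch: skip gracefully when there is no payload to persist.

    Previously this path raised ValueError("s3_batch_logging_element is None"),
    which triggered handle_callback_failure and incremented error metrics
    for call types like afile_delete that never produce a payload.
    """
    if batch_element is None:
        logger.debug("skipping event: no standard_logging_object to persist")
        return  # graceful skip instead of raise
    log_queue.append(batch_element)

queue = []
log_event(None, queue)         # afile_delete path: nothing logged, no error
log_event({"id": "x"}, queue)  # normal path still enqueues
assert queue == [{"id": "x"}]
```

The key property is that the log queue stays empty for the no-payload path, so no None element is ever flushed to S3.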

2. Cost tracking KeyError: 'stream'

Health check callbacks don't include stream in kwargs, causing kwargs["stream"] to raise KeyError.

  • litellm/proxy/hooks/proxy_track_cost_callback.py:211 — Changed kwargs["stream"] to kwargs.get("stream").
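The defensive-access pattern in isolation. `dict.get` returns None (or a supplied default) instead of raising when the key is absent; the same pattern is used for `call_type` in fix 5 below:

```python
kwargs = {"model": "gpt-4o"}  # health-check callback: no "stream" key

# Before: kwargs["stream"] raised KeyError for health-check callbacks.
# After: .get() returns None when the key is absent.
is_stream = kwargs.get("stream")
assert is_stream is None

# Same pattern with an explicit default, as applied to call_type
# in _service_logger.py:
call_type = kwargs.get("call_type", "unknown")
assert call_type == "unknown"
```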

3. Cost tracking skips when model=None

Non-model call types (health checks, afile_delete) trigger cost tracking with no model or standard_logging_object.

  • litellm/proxy/hooks/proxy_track_cost_callback.py:205 — Added early return when both model and standard_logging_object are missing.
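A sketch of the early-return guard with a hypothetical signature; the real hook's return value and parameters differ:

```python
def track_cost(kwargs, sl_object):
    """Skip cost tracking when there is neither a model nor a
    standard_logging_object, as happens for non-model call types
    such as health checks and afile_delete."""
    model = kwargs.get("model")
    if sl_object is None and not model:
        return None  # early return: nothing to cost-track
    # stand-in for the real cost-tracking work
    return {"model": model, "payload": sl_object}

assert track_cost({}, None) is None  # health check / afile_delete: skipped
assert track_cost({"model": "gpt-4o"}, None) == {"model": "gpt-4o", "payload": None}
```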

4. Batch rate limiter 403 on managed files

count_input_file_usage called litellm.afile_content() which went through the HTTP endpoint and hit the managed files access check.

  • litellm/proxy/hooks/batch_rate_limiter.py:262 — Detect managed file IDs and call the managed files hook directly with user credentials, bypassing the HTTP endpoint.
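A sketch of the detection step only. It assumes unified managed-file IDs are base64 strings whose decoded payload carries a `litellm_proxy` prefix (an assumption modeled on the `_is_base64_encoded_unified_file_id` check and the encoded ID shown elsewhere in this PR); the subsequent direct hook call is not shown:

```python
import base64
import binascii

def is_unified_file_id(file_id: str) -> bool:
    """Sketch: detect a managed (unified) file ID.

    Assumption: unified IDs are valid base64 that decodes to a
    'litellm_proxy'-prefixed payload; raw provider IDs are not.
    """
    try:
        decoded = base64.b64decode(file_id, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return False
    return decoded.startswith("litellm_proxy")

unified = base64.b64encode(b"litellm_proxy:unified_output_file").decode()
assert is_unified_file_id(unified)
assert not is_unified_file_id("file-abc123")  # raw provider IDs fail validation
```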

5. Service logger KeyError: 'call_type'

Batch polling callbacks don't include call_type in kwargs, and _service_logger.py accessed it with kwargs["call_type"].

  • litellm/_service_logger.py:318 — Changed kwargs["call_type"] to kwargs.get("call_type", "unknown").

6. Batch polling 403 — background job can't access managed files

check_batch_cost runs as default_user_id and gets 403 when retrieving output file content through the HTTP endpoint.

  • enterprise/litellm_enterprise/proxy/common_utils/check_batch_cost.py:110-127 — Extract raw provider file ID from the unified ID and call litellm.afile_content directly with deployment credentials, bypassing access-control hooks.
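The bypass flow in miniature, with hypothetical stand-ins for the router and provider calls (the real code uses `llm_router.get_deployment_credentials_with_provider` and `litellm.afile_content`); the unified-to-raw ID decoding is assumed to have already happened:

```python
import asyncio

def get_deployment_credentials_with_provider(model_id):
    # stand-in for the router call that resolves deployment credentials
    return {"api_key": "sk-test", "api_base": "https://example.invalid",
            "custom_llm_provider": "azure"}

async def afile_content(file_id, **credentials):
    # stand-in for the direct provider call (no access-control hooks)
    return {"file_id": file_id, "authed": "api_key" in credentials}

async def fetch_output_file_content(raw_file_id, model_id):
    """Sketch: call the provider directly with deployment credentials
    instead of going through the proxy's HTTP endpoint, so the
    background job is not rejected by the managed-files 403 check."""
    creds = get_deployment_credentials_with_provider(model_id)
    return await afile_content(raw_file_id, **creds)

result = asyncio.run(fetch_output_file_content("file-abc", "model-1"))
assert result == {"file_id": "file-abc", "authed": True}
```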

7. afile_retrieve called without credentials for output files

async_post_call_success_hook calls litellm.afile_retrieve() with only custom_llm_provider and file_id — no api_key/api_base. Azure returns 500.

  • enterprise/litellm_enterprise/proxy/hooks/managed_files.py:917-923 — Use llm_router.get_deployment_credentials_with_provider(model_id) to pass proper credentials. Falls back to old behavior if router unavailable.
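The credentials-with-fallback shape, sketched with a fake router. The kwargs-building function is hypothetical; the real code passes these straight into `litellm.afile_retrieve`:

```python
class FakeRouter:
    """Stand-in for llm_router."""
    def get_deployment_credentials_with_provider(self, model_id):
        return {"api_key": "sk-test", "api_base": "https://example.invalid",
                "custom_llm_provider": "azure"}

def build_afile_retrieve_kwargs(file_id, custom_llm_provider,
                                llm_router=None, model_id=None):
    # Prefer deployment credentials from the router; otherwise fall back
    # to the old provider-only call (the shape that made Azure return 500).
    kwargs = {"file_id": file_id, "custom_llm_provider": custom_llm_provider}
    if llm_router is not None and model_id is not None:
        kwargs.update(llm_router.get_deployment_credentials_with_provider(model_id))
    return kwargs

with_creds = build_afile_retrieve_kwargs("file-1", "azure", FakeRouter(), "m-1")
assert with_creds["api_key"] == "sk-test"

fallback = build_afile_retrieve_kwargs("file-1", "azure")  # no router available
assert "api_key" not in fallback
```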

8. Deleted managed files return 403 instead of 404

can_user_call_unified_file_id() returns False when the DB record is missing (file deleted), causing the caller to raise 403 instead of letting downstream return 404.

  • enterprise/litellm_enterprise/proxy/hooks/managed_files.py:233 — Changed return False to return True when record not found. Matches the existing can_user_call_unified_object_id pattern.
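A sketch of the missing-record handling. Note that per the review discussion later in this thread, the final version raises a 404 directly rather than returning True; the signature and record shape here are hypothetical:

```python
class HTTPException(Exception):
    """Minimal stand-in for fastapi.HTTPException."""
    def __init__(self, status_code, detail):
        self.status_code = status_code
        super().__init__(detail)

def can_user_call_unified_file_id(record, user_id):
    """Distinguish 'record missing' (file deleted -> 404) from
    'record exists but belongs to someone else' (caller raises 403)."""
    if record is None:
        # Deleted file: surface 404 instead of a misleading 403.
        raise HTTPException(status_code=404, detail="file not found")
    return record.get("user_id") == user_id

try:
    can_user_call_unified_file_id(None, "u1")
except HTTPException as e:
    assert e.status_code == 404
assert can_user_call_unified_file_id({"user_id": "u1"}, "u1") is True
assert can_user_call_unified_file_id({"user_id": "u2"}, "u1") is False
```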

9. afile_retrieve returns raw provider ID for output files

When the stored file_object exists in the DB (Case 2), afile_retrieve returned it with the raw provider file ID instead of the unified ID.

  • enterprise/litellm_enterprise/proxy/hooks/managed_files.py:1014 — Added stored_file_object.file_object.id = file_id to replace with unified ID, matching Case 3's response.id = file_id.
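The one-line fix in isolation, using SimpleNamespace objects in place of the real DB models:

```python
from types import SimpleNamespace

def return_stored_file_object(stored, unified_file_id):
    """Sketch of fix 9: overwrite the raw provider ID on the stored
    file_object with the unified ID before returning it, matching
    what Case 3 already did with response.id."""
    if stored and stored.file_object:
        stored.file_object.id = unified_file_id  # was: returned raw provider ID
        return stored.file_object
    return None

stored = SimpleNamespace(
    file_object=SimpleNamespace(id="batch_20260214-output-file-1")
)
result = return_stored_file_object(
    stored, "bGl0ZWxsbV9wcm94eTp1bmlmaWVkX291dHB1dF9maWxl"
)
assert result.id == "bGl0ZWxsbV9wcm94eTp1bmlmaWVkX291dHB1dF9maWxl"
```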

10. batches.retrieve returns raw provider input_file_id

async_post_call_success_hook replaces batch.id and output_file_id with unified IDs but not input_file_id. The stored batch retains the raw provider ID.

  • litellm/proxy/batches_endpoints/endpoints.py:381-396 — Resolve raw input_file_id to unified ID via flat_model_file_ids DB lookup (provider-sync path).
  • litellm/proxy/batches_endpoints/endpoints.py:499-514 — Same fix for the terminal-state early-return path.
  • litellm/proxy/openai_files_endpoints/common_utils.py:691-699 — Same fix in get_batch_from_database helper.
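The resolution logic all three call sites share, sketched as the shared helper the Greptile review later suggests extracting. The helper name and fake Prisma client are hypothetical, and the real code additionally skips IDs that are already base64-encoded unified IDs:

```python
import asyncio
from types import SimpleNamespace

async def resolve_input_file_id_to_unified(response, prisma_client):
    """Map a raw provider input_file_id back to its unified ID via the
    flat_model_file_ids lookup, swallowing failures so batch retrieval
    still succeeds if the lookup errors."""
    raw_id = getattr(response, "input_file_id", None)
    if not raw_id or prisma_client is None:
        return response
    try:
        managed = await prisma_client.find_first(
            where={"flat_model_file_ids": {"has": raw_id}}
        )
        if managed:
            response.input_file_id = managed.unified_file_id
    except Exception:
        pass  # best-effort: keep the raw ID on lookup failure
    return response

class FakePrisma:
    async def find_first(self, where):
        return SimpleNamespace(unified_file_id="unified-123")

resp = SimpleNamespace(input_file_id="file-raw")
asyncio.run(resolve_input_file_id_to_unified(resp, FakePrisma()))
assert resp.input_file_id == "unified-123"
```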

11. Batch cost calculation ignores deployment-level custom pricing

batch_cost_calculator() only looked up pricing from the global model_prices_and_context_window.json map. Deployment-level custom batch pricing (input_cost_per_token_batches / output_cost_per_token_batches set in model_info) was never passed through, so cost was always 0 for models not in the global map.
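A sketch of the pricing fallthrough. The pricing-key names come from the description above; the function signature is hypothetical:

```python
def batch_cost(input_tokens, output_tokens, model, global_map, model_info=None):
    """Prefer deployment-level batch pricing from model_info over the
    global cost map, so models absent from the global map no longer
    cost out at 0."""
    if model_info and "input_cost_per_token_batches" in model_info:
        pricing = model_info  # deployment-level custom pricing wins
    else:
        pricing = global_map.get(model, {})  # old behavior: global map only
    in_rate = pricing.get("input_cost_per_token_batches", 0.0)
    out_rate = pricing.get("output_cost_per_token_batches", 0.0)
    return input_tokens * in_rate + output_tokens * out_rate

global_map = {}  # model not present in model_prices_and_context_window.json
info = {"input_cost_per_token_batches": 1e-6,
        "output_cost_per_token_batches": 2e-6}

assert batch_cost(1000, 500, "my-custom-model", global_map) == 0.0       # before
assert abs(batch_cost(1000, 500, "my-custom-model", global_map, info)
           - 0.002) < 1e-9                                               # after
```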

@greptile-apps
Contributor

greptile-apps bot commented Feb 13, 2026

Greptile Overview

Greptile Summary

This PR fixes a bug where the S3 v2 logger raised a ValueError when standard_logging_object was missing from kwargs during operations like afile_delete. File delete operations intentionally don't produce a standard_logging_object, so the logger should skip gracefully rather than error. The fix replaces the raise ValueError with a debug log and early return, aligning with the S3 v1 logger's existing behavior.

  • Replaced raise ValueError("s3_batch_logging_element is None") with a debug log and return in _async_log_event_base, preventing unnecessary handle_callback_failure calls and error metric increments for file operations that don't produce logging payloads.
  • Added a comprehensive mock-only test that validates the graceful skip, verifying that handle_callback_failure is not called and the log_queue remains empty.
  • Note: other loggers (Langsmith, LiteralAI, GCS Bucket, PostHog) still raise ValueError for missing standard_logging_object — they may need similar fixes if they are also triggered by file operations.

Confidence Score: 5/5

  • This PR is safe to merge — it's a minimal, targeted bug fix that replaces an error with a graceful skip.
  • The change is small (4 lines of production code), follows an established pattern from S3 v1, includes a well-structured test, and addresses a real runtime error for file delete operations. No risk of regression — the early return path is only triggered when there is genuinely no data to log.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/integrations/s3_v2.py | Replaces a ValueError raise with a graceful debug log + early return when standard_logging_object is missing (e.g., for afile_delete call types). This aligns with the S3 v1 logger's approach and prevents unnecessary error metrics from being incremented. |
| tests/test_litellm/integrations/test_s3_v2.py | Adds a well-structured test that verifies the graceful skip behavior when standard_logging_object is missing. Uses mocks appropriately (no real network calls). Minor style note: test function is defined outside the existing test class. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant LiteLLM as LiteLLM (afile_delete)
    participant S3Logger as S3Logger._async_log_event_base
    participant Queue as log_queue

    Client->>LiteLLM: File delete request
    LiteLLM->>S3Logger: async_log_success_event(kwargs={call_type: "afile_delete"})
    S3Logger->>S3Logger: create_s3_batch_logging_element(standard_logging_payload=None)
    S3Logger-->>S3Logger: Returns None (no payload)

    alt Before Fix
        S3Logger->>S3Logger: raise ValueError("s3_batch_logging_element is None")
        S3Logger->>S3Logger: handle_callback_failure() — increments error metrics
    end

    alt After Fix (this PR)
        S3Logger->>S3Logger: verbose_logger.debug("skipping event...")
        S3Logger-->>LiteLLM: return (graceful skip)
    end

    Note over Queue: log_queue remains empty — no None appended
```

Last reviewed commit: 3eed4ba

Contributor

@greptile-apps greptile-apps bot left a comment


2 files reviewed, no comments


@ephrimstanley ephrimstanley force-pushed the ephrimstanley/s3-logger-skip-missing-standard-logging-object branch from 3eed4ba to bac6d11 Compare February 14, 2026 03:21
@ephrimstanley ephrimstanley changed the title Managed batches - Skip batch logger for file delete operations Managed batches - Fix errors in callbacks Feb 14, 2026
batch_cost_calculator only checked the global cost map, ignoring
deployment-level custom pricing (input_cost_per_token_batches etc.).
Add optional model_info param through the batch cost chain and pass
it from CheckBatchCost.
@ephrimstanley ephrimstanley changed the title Managed batches - Fix errors in callbacks Managed batches - Misc bug fixes Feb 16, 2026
@Sameerlite
Collaborator

@greptile-apps

Collaborator

@Sameerlite Sameerlite left a comment


.

@@ -0,0 +1,47 @@
# Fix: afile_retrieve returns raw provider ID for batch output files
Collaborator


Can you remove this file

Contributor Author


Removed

@greptile-apps
Contributor

greptile-apps bot commented Feb 16, 2026

Greptile Summary

A collection of bug fixes for managed batches addressing several issues: (1) batch polling job (check_batch_cost) now bypasses managed files access-control hooks and calls afile_content directly with deployment credentials, fixing 403 errors when the background job runs as default_user_id; (2) afile_retrieve Case 2 now returns the unified file ID instead of the raw provider ID; (3) batch retrieve endpoints resolve raw input_file_id to unified ID; (4) async_post_call_success_hook passes router credentials to afile_retrieve (fixing Azure 401/500); (5) deployment-level custom batch pricing (input_cost_per_token_batches) is threaded through the cost calculation pipeline; (6) defensive fixes for KeyError/ValueError crashes in service logger, S3 logger, and cost tracking callback for non-model call types.

  • Access control change needs review: can_user_call_unified_file_id now returns True for missing DB records (was False). While this aligns with can_user_call_unified_object_id, it weakens access control — consider raising 404 directly instead.
  • Code duplication: The input_file_id → unified ID resolution logic is copy-pasted in 3 locations (endpoints.py x2, common_utils.py). Should be extracted into a shared helper.
  • Direct DB queries in request path: The input_file_id lookups in endpoints.py use direct prisma_client.db calls in the critical request path, contrary to repository conventions.
  • Stray file: fix_afile_retrieve_returns_unified_id.md is a development note committed to the repo root and should be removed.
  • Good test coverage: 10 new test files with comprehensive mock-only tests covering all bug fixes.

Confidence Score: 3/5

  • Most changes are defensive bug fixes with good test coverage, but the access-control weakening in can_user_call_unified_file_id and duplicated DB queries in the request path warrant careful review before merging.
  • Score of 3 reflects: strong test coverage (10 new test files), correct defensive fixes for crashes (service logger, S3, cost callback), and proper credential threading. However, the access-control change (returning True for missing records) is a security-sensitive modification that could mask issues, the input_file_id resolution logic is duplicated 3 times creating maintenance risk, direct DB queries appear in the request-critical path against repo conventions, and a stray markdown file is committed to the repo root.
  • enterprise/litellm_enterprise/proxy/hooks/managed_files.py (access control change at line 235), litellm/proxy/batches_endpoints/endpoints.py (duplicated logic and direct DB queries), fix_afile_retrieve_returns_unified_id.md (should be removed)

Important Files Changed

| Filename | Overview |
| --- | --- |
| enterprise/litellm_enterprise/proxy/common_utils/check_batch_cost.py | Refactored to bypass the managed files hook and call afile_content directly with deployment credentials. Threads model_info for custom batch pricing. Logic is correct; the pre-existing indentation issue with the completed_jobs update inside the loop is not introduced by this PR. |
| enterprise/litellm_enterprise/proxy/hooks/managed_files.py | Three changes: (1) can_user_call_unified_file_id returns True for missing records (security concern), (2) async_post_call_success_hook uses router credentials for afile_retrieve, (3) afile_retrieve Case 2 sets the unified ID on the stored file_object. The credential fix and ID fix are sound; the access-control change needs review. |
| fix_afile_retrieve_returns_unified_id.md | Development note file accidentally committed to the repo root. Should be removed before merge. |
| litellm/_service_logger.py | Changed kwargs["call_type"] to kwargs.get("call_type", "unknown") to avoid KeyError in batch polling callbacks. Clean defensive fix. |
| litellm/batches/batch_utils.py | Threads the model_info parameter through the batch cost calculation pipeline to support deployment-specific batch pricing. Clean, backwards-compatible addition. |
| litellm/cost_calculator.py | Allows model_info to be passed into batch_cost_calculator, skipping the global lookup when deployment-level pricing is provided. Backwards-compatible change. |
| litellm/integrations/s3_v2.py | Changed raise ValueError to a graceful skip when s3_batch_logging_element is None (e.g., for afile_delete calls). Prevents crashes for non-model call types. |
| litellm/proxy/batches_endpoints/endpoints.py | Added input_file_id → unified ID resolution in two code paths. Logic is correct but duplicated across 3 locations and uses direct DB queries in the request path. |
| litellm/proxy/hooks/proxy_track_cost_callback.py | Added early return when sl_object is None and model is falsy. Prevents spurious cost-tracking errors for non-model call types like afile_delete. |
| litellm/proxy/openai_files_endpoints/common_utils.py | Added input_file_id resolution in get_batch_from_database. Same duplicated pattern as in endpoints.py. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant BG as CheckBatchCost (Background)
    participant Router as LLM Router
    participant Provider as LLM Provider (Azure/OpenAI)
    participant DB as Prisma DB
    participant CostCalc as Batch Cost Calculator

    BG->>DB: find_many(status=validating/in_progress)
    DB-->>BG: pending batch jobs
    loop For each pending job
        BG->>Router: aretrieve_batch(model_id, batch_id)
        Router->>Provider: retrieve batch status
        Provider-->>Router: batch response (status, output_file_id)
        Router-->>BG: batch response
        alt status == completed
            BG->>BG: decode unified output_file_id → raw provider file ID
            BG->>Router: get_deployment_credentials_with_provider(model_id)
            Router-->>BG: {api_key, api_base, custom_llm_provider}
            BG->>Provider: afile_content(raw_file_id, **credentials)
            Note right of BG: Previously went through managed_files hook → 403
            Provider-->>BG: file content (JSONL)
            BG->>Router: get_deployment(model_id) → model_info
            BG->>CostCalc: calculate_batch_cost_and_usage(content, model_info)
            Note right of CostCalc: Now uses deployment-level batch pricing
            CostCalc-->>BG: cost, usage, models
            BG->>DB: update status → complete
        end
    end
```

Last reviewed commit: 7d794b5

Contributor

@greptile-apps greptile-apps bot left a comment


21 files reviewed, 4 comments


Comment on lines +1 to +47
# Fix: afile_retrieve returns raw provider ID for batch output files

## Bug

`managed_files.afile_retrieve()` Case 2 (file_object already in DB) returned the stored `file_object` without replacing `.id` with the unified file ID. Case 3 (fetch from provider) did this correctly at line 1028.

## Fix

One-line change in `enterprise/litellm_enterprise/proxy/hooks/managed_files.py`:

```python
# Before (line 1013-1014)
if stored_file_object and stored_file_object.file_object:
    return stored_file_object.file_object

# After
if stored_file_object and stored_file_object.file_object:
    stored_file_object.file_object.id = file_id
    return stored_file_object.file_object
```

## Test

```bash
poetry run pytest tests/test_litellm/enterprise/proxy/test_afile_retrieve_returns_unified_id.py -s -vvvv
```

## Test failure (before fix)

```
FAILED tests/test_litellm/enterprise/proxy/test_afile_retrieve_returns_unified_id.py::test_should_return_unified_id_when_file_object_exists_in_db
AssertionError: afile_retrieve should return the unified ID 'bGl0ZWxsbV9wcm94eTp1bmlmaWVkX291dHB1dF9maWxl', but got raw provider ID 'batch_20260214-output-file-1'
assert 'batch_20260214-output-file-1' == 'bGl0ZWxsbV9wcm94eTp1bmlmaWVkX291dHB1dF9maWxl'
=================== 1 failed, 1 retried in 102.95s ===================
```

## Test pass (after fix)

```
tests/test_litellm/enterprise/proxy/test_afile_retrieve_returns_unified_id.py::test_should_return_unified_id_when_file_object_exists_in_db PASSED
============================== 1 passed in 0.11s ===============================
```

## Files changed

- `enterprise/litellm_enterprise/proxy/hooks/managed_files.py` — one-line fix
- `tests/test_litellm/enterprise/proxy/test_afile_retrieve_returns_unified_id.py` — new test
Contributor


Stray markdown file in repo root
This file appears to be a development note/changelog for one of the fixes in this PR. It should not be committed to the repository root — it will clutter the top-level directory. Consider removing it before merging, or moving it into a docs directory if it's intended to be kept.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +383 to +397
```python
if (
    unified_batch_id
    and hasattr(response, "input_file_id")
    and response.input_file_id
    and not _is_base64_encoded_unified_file_id(response.input_file_id)
    and prisma_client
):
    try:
        _managed_file = await prisma_client.db.litellm_managedfiletable.find_first(
            where={"flat_model_file_ids": {"has": response.input_file_id}}
        )
        if _managed_file:
            response.input_file_id = _managed_file.unified_file_id
    except Exception:
        pass
```
Contributor


Duplicated input_file_id resolution logic across 3 locations
The same input_file_id → unified ID resolution block is copy-pasted here, again at line ~502 in this file, and in common_utils.py:get_batch_from_database. This triples the maintenance burden and risks divergence if the logic ever changes. Consider extracting this into a shared helper, e.g. _resolve_input_file_id_to_unified(response, prisma_client), and calling it from all three places.


Contributor Author


Addressed

Comment on lines +390 to +397
```python
try:
    _managed_file = await prisma_client.db.litellm_managedfiletable.find_first(
        where={"flat_model_file_ids": {"has": response.input_file_id}}
    )
    if _managed_file:
        response.input_file_id = _managed_file.unified_file_id
except Exception:
    pass
```
Contributor


Direct DB query in request-handling endpoint
Both occurrences of prisma_client.db.litellm_managedfiletable.find_first(...) are direct database queries in the critical request path (the retrieve_batch endpoint). Per the repository's custom rules, the critical path should avoid direct DB queries and instead use helper functions (e.g. get_team/get_user/get_key patterns).

Consider wrapping this lookup in a helper (e.g. in common_utils.py) to keep the endpoint code clean and to allow caching or batching of these lookups in the future.

Context Used: Rule from dashboard - What: In critical path of request, there should be no direct db queries. Only allow them to be made ... (source)


Contributor Author


Addressed. Per discussion with @Sameerlite we are not removing the DB query from the request path, as it makes sense to keep a live DB query for security checks.

Comment on lines +233 to +235
```python
# When DB record is missing (file was deleted), allow through so downstream returns 404.
# Matches can_user_call_unified_object_id which also returns True for missing records.
return True
```
Contributor


Security consideration: returning True for missing records bypasses access control
Changing the return from False to True when the managed file record is not found means that any user can pass the access check for a file ID that doesn't exist in the DB. The comment says this is to allow downstream to return 404, but it also means:

  1. If there's ever a race condition where a record is temporarily unavailable, access control is bypassed.
  2. An attacker could probe arbitrary file IDs and get past this check (though they'd hit 404 downstream).

The existing behavior for can_user_call_unified_object_id (also returning True for missing) is cited as justification, but that may itself be a design concern. Consider instead raising a 404 directly here rather than returning True, which more accurately represents "record not found" without weakening the access-control contract.

Contributor Author


Done

@Sameerlite Sameerlite changed the base branch from main to litellm_oss_staging_02_14_20262 February 16, 2026 12:59
@Sameerlite Sameerlite merged commit 72a1bd6 into BerriAI:litellm_oss_staging_02_14_20262 Feb 16, 2026
11 of 18 checks passed