[Fix] Team Usage Spend Truncated Due to Pagination#22938

Merged
yuneng-jiang merged 1 commit into main from litellm_fix_team_usage_spend on Mar 6, 2026

Conversation

@yuneng-jiang
Collaborator

Relevant issues

Summary

Failure Path (Before Fix)

The /team/daily/activity endpoint used Prisma's find_many with skip/take pagination. The UI sends page_size=1000 but only fetches page 1. The LiteLLM_DailyTeamSpend table stores one row per unique (team_id, date, api_key, model, provider, endpoint) combination — a team with 141 keys and multiple models over 30 days produces ~1.3M rows (user confirmed total_pages: 1329). The total_spend in the response was computed only from the first 1000 rows.

This caused the team spend to appear identical across all date ranges (7-day, MTD, 30-day, YTD) since it always returned the same newest 1000 rows.
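The failure mode above can be sketched with a scaled-down simulation (helper and field names here are hypothetical; the real endpoint used Prisma's find_many with skip/take against LiteLLM_DailyTeamSpend):

```python
def paginated_fetch(rows, page, page_size):
    """Mimic skip/take pagination: return a single page of rows."""
    start = (page - 1) * page_size
    return rows[start:start + page_size]

# Scaled-down example: 10 rows with a page size of 3 (the real case was
# page_size=1000 against ~1.3M rows, i.e. 1329 pages).
rows = [{"spend": i} for i in range(10)]  # spends 0..9, true total 45

page_1 = paginated_fetch(rows, page=1, page_size=3)
truncated_total = sum(r["spend"] for r in page_1)  # only the first page
true_total = sum(r["spend"] for r in rows)         # all rows

print(truncated_total, true_total)
```

Because the UI never fetched pages 2..N, the reported total was always the partial sum of the first page, regardless of the requested date range.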

Fix

Switches /team/daily/activity from paginated Prisma queries to SQL GROUP BY via get_daily_activity_aggregated, returning all data in a single response. Adds include_entity_breakdown option to preserve per-team breakdown data in the response. Also adds timezone parameter support and api_key list filtering for the aggregated query path.

Changes

  • common_daily_activity.py: Added include_entity_id param to _build_aggregated_sql_query to optionally include entity_id in SELECT/GROUP BY. Widened api_key type to accept List[str] with proper SQL IN clause handling. Added include_entity_breakdown param to get_daily_activity_aggregated.
  • team_endpoints.py: /team/daily/activity now calls get_daily_activity_aggregated instead of get_daily_activity. Added timezone parameter. page/page_size kept in signature for backward compat but are no-ops.
  • test_team_endpoints.py: Updated existing tests to mock get_daily_activity_aggregated. Added test verifying include_entity_breakdown=True and timezone passthrough.
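The query-builder changes can be illustrated with a minimal sketch, modeled on the description above rather than on the actual LiteLLM implementation (the function name, placeholder style, and column list are assumptions):

```python
from typing import List, Optional, Tuple, Union

def build_aggregated_sql_query(
    table: str,
    where_conditions: List[str],
    params: List[object],
    api_key: Optional[Union[str, List[str]]] = None,
    include_entity_id: bool = False,
) -> Tuple[str, List[object]]:
    """Sketch of a GROUP BY builder: optional entity_id column plus
    api_key accepted as a single value or a list (SQL IN clause)."""
    conditions = list(where_conditions)
    if api_key is not None:
        if isinstance(api_key, list):
            # One positional placeholder per key, numbered after existing params.
            placeholders = ", ".join(
                f"${len(params) + i + 1}" for i in range(len(api_key))
            )
            conditions.append(f"api_key IN ({placeholders})")
            params = params + list(api_key)
        else:
            params = params + [api_key]
            conditions.append(f"api_key = ${len(params)}")

    entity_cols = "team_id, " if include_entity_id else ""
    where_clause = " AND ".join(conditions) or "TRUE"
    sql = (
        f"SELECT {entity_cols}date, api_key, model, "
        f"SUM(spend)::float AS spend "
        f'FROM "{table}" WHERE {where_clause} '
        f"GROUP BY {entity_cols}date, api_key, model "
        f"ORDER BY date DESC"
    )
    return sql, params

sql, params = build_aggregated_sql_query(
    "LiteLLM_DailyTeamSpend", [], [],
    api_key=["sk-1", "sk-2"], include_entity_id=True,
)
```

Grouping in SQL means the database returns one pre-aggregated row per combination, so no page boundary can cut the totals short.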

Testing

  • All 5 team daily activity tests pass
  • Verified with user that total_pages: 1329 confirms pagination truncation as root cause
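The mock-based test pattern described above can be sketched as follows (the endpoint and parameter names are hypothetical stand-ins for the LiteLLM internals, not the actual test code):

```python
import asyncio
from unittest.mock import AsyncMock

async def get_team_daily_activity(aggregator, team_ids, timezone=None):
    # Stand-in for the endpoint under test: it must call the aggregated
    # path with include_entity_breakdown=True and forward the timezone.
    return await aggregator(
        team_ids=team_ids,
        include_entity_breakdown=True,
        timezone_offset_minutes=timezone,
    )

# Mock the aggregated query so the test makes no database calls.
mock_aggregator = AsyncMock(return_value={"results": []})
asyncio.run(get_team_daily_activity(mock_aggregator, ["team-1"], timezone=480))

call_kwargs = mock_aggregator.call_args.kwargs
```

Asserting on the mock's call kwargs verifies both the breakdown flag and the timezone passthrough without touching Postgres.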

Type

🐛 Bug Fix
✅ Test

The /team/daily/activity endpoint used Prisma pagination (page_size=1000)
but the UI only fetched page 1. Teams with many keys/models easily exceed
1000 rows in LiteLLM_DailyTeamSpend, causing truncated totals.

Switches the endpoint to use SQL GROUP BY via get_daily_activity_aggregated
with include_entity_breakdown=True, returning all data in a single response
while preserving per-team breakdown. Also adds timezone parameter support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vercel
vercel bot commented Mar 6, 2026

The latest updates on your projects:

  • litellm — Deployment: Error — Updated (UTC): Mar 6, 2026 1:01am

@greptile-apps
Contributor

greptile-apps bot commented Mar 6, 2026

Greptile Summary

This PR fixes a data-correctness bug where /team/daily/activity silently returned truncated spend totals by replacing a paginated Prisma find_many call (capped at the first 1,000 rows of a result set of up to ~1.3M rows) with a SQL GROUP BY query that aggregates the entire dataset in the database before returning it. The fix is well-scoped: backward compatibility of page/page_size is preserved, a new timezone parameter is threaded through correctly, and the per-team entity breakdown is restored by optionally including entity_id in the GROUP BY.

Key changes:

  • _build_aggregated_sql_query now accepts include_entity_id to selectively add entity_id to SELECT/GROUP BY, enabling per-team breakdowns.
  • api_key filter upgraded to accept List[str] with a proper SQL IN clause (used when non-admin users are filtered to their own keys).
  • get_daily_activity_aggregated gains include_entity_breakdown; the team endpoint always passes True to preserve per-team spend breakdown.
  • All 5 existing tests updated and a new integration-style test added to verify the aggregated path with include_entity_breakdown=True and timezone passthrough.

Two items to be aware of:

  • The new timezone parameter docstring in team_endpoints.py says "offset in minutes from UTC (e.g., 480 for PST)", but the implementation uses JavaScript's Date.getTimezoneOffset() sign convention (positive = west of UTC), the opposite of standard UTC offset notation. Without explicit documentation of this convention, API consumers following the standard convention will apply timezone adjustments in the wrong direction.
  • The aggregated SQL query has no LIMIT clause. For admin-level requests covering all teams with no team_ids filter, the GROUP BY over many teams × keys × models × days could still produce a very large in-memory result set. This is an acceptable tradeoff for correctness, but worth monitoring in large deployments.

Confidence Score: 4/5

  • Safe to merge — the fix correctly eliminates spend truncation with no breaking changes; minor documentation and scalability concerns are non-blocking.
  • The root cause (pagination truncation) is well-diagnosed and the GROUP BY approach is the correct fix. Backward compat is preserved for page/page_size. Tests cover the core scenarios with mocks and no real network calls. Two non-critical concerns lower the score slightly: the timezone docstring uses JS convention without labeling it (could cause API misuse), and the absence of a LIMIT on the aggregated query is a latent scalability risk for very large unfiltered admin requests.
  • litellm/proxy/management_endpoints/common_daily_activity.py — review the unbounded SQL result set concern for large deployments; litellm/proxy/management_endpoints/team_endpoints.py — clarify the timezone offset sign convention in the docstring.

Important Files Changed

  • litellm/proxy/management_endpoints/common_daily_activity.py: Adds include_entity_id param to _build_aggregated_sql_query (controls whether entity_id is in SELECT/GROUP BY), upgrades api_key to accept List[str] with a correct IN clause, and adds include_entity_breakdown to get_daily_activity_aggregated. Logic is sound; the only concern is the lack of a LIMIT on the resulting SQL query, which could be expensive for very large unfiltered requests.
  • litellm/proxy/management_endpoints/team_endpoints.py: Switches /team/daily/activity from get_daily_activity (paginated Prisma find_many) to get_daily_activity_aggregated (SQL GROUP BY). Adds the timezone parameter; deprecates but retains page/page_size for backward compat. A misleading docstring for the new timezone parameter uses the JS convention (positive = west of UTC) without calling it out explicitly.
  • tests/test_litellm/proxy/management_endpoints/test_team_endpoints.py: All existing tests updated to mock get_daily_activity_aggregated instead of get_daily_activity. New test test_get_team_daily_activity_uses_aggregated_with_entity_breakdown verifies include_entity_breakdown=True and timezone passthrough. Tests are mock-only, in line with CI requirements. Coverage looks thorough.

Sequence Diagram

sequenceDiagram
    participant UI as UI / API Client
    participant TE as team_endpoints.py<br/>/team/daily/activity
    participant CDA as common_daily_activity.py<br/>get_daily_activity_aggregated
    participant SQL as _build_aggregated_sql_query
    participant DB as PostgreSQL<br/>LiteLLM_DailyTeamSpend

    UI->>TE: GET /team/daily/activity<br/>(team_ids, start_date, end_date,<br/>model, api_key, timezone, page, page_size)

    note over TE: page/page_size accepted<br/>but are no-ops (deprecated)

    TE->>TE: Auth check & team membership validation
    TE->>TE: Build final_api_key_filter<br/>(user keys if non-admin)
    TE->>CDA: get_daily_activity_aggregated(<br/>include_entity_breakdown=True,<br/>timezone_offset_minutes=timezone)

    CDA->>SQL: _build_aggregated_sql_query(<br/>include_entity_id=True)
    SQL-->>CDA: SQL + params<br/>GROUP BY (team_id, date, api_key,<br/>model, model_group, provider,<br/>mcp_tool, endpoint)

    CDA->>DB: query_raw(sql, *params)
    DB-->>CDA: Pre-aggregated rows<br/>(all matching rows, no pagination)

    CDA->>CDA: _aggregate_spend_records(<br/>entity_id_field="team_id",<br/>entity_metadata_field=team_alias_metadata)
    CDA-->>TE: SpendAnalyticsPaginatedResponse<br/>(page=1, total_pages=1, has_more=False)
    TE-->>UI: Full spend data with per-team breakdown

Last reviewed commit: d0e4804

page (int): Deprecated, kept for backward compatibility. All results are returned in a single page.
page_size (int): Deprecated, kept for backward compatibility.
exclude_team_ids (Optional[str]): Comma-separated list of team IDs to exclude.
timezone (Optional[int]): Timezone offset in minutes from UTC (e.g., 480 for PST).
Misleading timezone offset convention in docstring

The docstring says "Timezone offset in minutes from UTC (e.g., 480 for PST)", but standard UTC offset notation for PST is −480 (UTC−8). This uses JavaScript's Date.getTimezoneOffset() convention (positive = west/behind UTC), which is the opposite of the IANA/ISO standard.

While the underlying _adjust_dates_for_timezone implementation is consistent (positive = west of UTC, so PST = +480), API consumers who follow the standard UTC-offset convention would pass −480 for PST and get the end-date expansion applied to the wrong hemisphere. This is a functional risk for anyone calling this new timezone parameter.

Consider clarifying the docstring to explicitly call out the JS convention and distinguish it from standard UTC offsets:

Suggested change
timezone (Optional[int]): Timezone offset in minutes from UTC (e.g., 480 for PST).
timezone (Optional[int]): Timezone offset in minutes using JavaScript's Date.getTimezoneOffset() convention: positive values are *west* of UTC (e.g., 480 for PST = UTC-8). This is the opposite of standard UTC offset notation.
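The sign convention at a glance, as a hypothetical helper (not part of the PR or of LiteLLM): getTimezoneOffset() returns minutes *behind* UTC, i.e. the negation of the standard offset.

```python
def standard_offset_to_js_convention(utc_offset_minutes: int) -> int:
    """Convert a standard UTC offset (e.g. -480 for PST, UTC-8) to the
    Date.getTimezoneOffset()-style value this endpoint expects."""
    return -utc_offset_minutes

# PST is UTC-8 (standard -480) -> JS convention +480.
pst_js = standard_offset_to_js_convention(-480)
# IST is UTC+5:30 (standard +330) -> JS convention -330.
ist_js = standard_offset_to_js_convention(330)
```

A caller following ISO/IANA conventions who passes -480 for PST would have the end-date expansion applied in the opposite direction, which is exactly the misuse the docstring fix guards against.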

Comment on lines 561 to 583
@@ -563,7 +578,7 @@ def _build_aggregated_sql_query(
SUM(failed_requests)::bigint AS failed_requests
FROM "{pg_table}"
WHERE {where_clause}
GROUP BY date, api_key, model, model_group, custom_llm_provider,
GROUP BY {entity_group_by} date, api_key, model, model_group, custom_llm_provider,
mcp_namespaced_tool_name, endpoint
ORDER BY date DESC

Unbounded result set when querying without entity filter

The SQL query has no LIMIT clause. When include_entity_breakdown=True (always the case for the team endpoint now) and no team_ids filter is provided — e.g., a proxy admin loading the full dashboard — the query groups by (team_id, date, api_key, model, model_group, custom_llm_provider, mcp_namespaced_tool_name, endpoint). For a large deployment with many teams × keys × models × days, this can still produce tens or hundreds of thousands of grouped rows pulled entirely into Python memory in a single request.

The original paginated approach bounded memory per request via take=page_size. The new approach trades that safety valve for correctness (which is the right call), but the trade-off is worth surfacing. Consider either:

  1. Adding a defensive LIMIT with a generous cap (e.g., 500,000 rows) and logging a warning if it's hit, or
  2. Documenting this scalability assumption explicitly so operators are aware.

This is not a blocker for the fix itself, but is worth tracking for very large multi-team deployments.
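Option 1 could look roughly like the following sketch (hypothetical names; the merged PR deliberately ships without a LIMIT):

```python
import logging

logger = logging.getLogger(__name__)
MAX_AGGREGATED_ROWS = 500_000

def apply_defensive_limit(sql: str, cap: int = MAX_AGGREGATED_ROWS) -> str:
    """Append a LIMIT one past the cap so hitting the cap is detectable."""
    return f"{sql} LIMIT {cap + 1}"

def check_truncation(rows: list, cap: int = MAX_AGGREGATED_ROWS) -> list:
    """Warn and trim if the query returned more rows than the cap."""
    if len(rows) > cap:
        logger.warning(
            "Aggregated spend query exceeded the %d-row cap; "
            "results were truncated.", cap,
        )
        return rows[:cap]
    return rows
```

Fetching cap + 1 rows distinguishes "exactly at the cap" from "truncated", so the warning only fires when data was actually dropped.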

@yuneng-jiang yuneng-jiang merged commit 99c4f3c into main Mar 6, 2026
74 of 101 checks passed