[Fix] Team Usage Spend Truncated Due to Pagination #22938
The /team/daily/activity endpoint used Prisma pagination (page_size=1000) but the UI only fetched page 1. Teams with many keys/models easily exceed 1000 rows in LiteLLM_DailyTeamSpend, causing truncated totals. Switches the endpoint to use SQL GROUP BY via get_daily_activity_aggregated with include_entity_breakdown=True, returning all data in a single response while preserving per-team breakdown. Also adds timezone parameter support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Greptile Summary

This PR fixes a real-data bug where team spend totals were truncated by pagination. The key changes, and two items to be aware of, are detailed in the per-file table below.
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| litellm/proxy/management_endpoints/common_daily_activity.py | Adds include_entity_id param to _build_aggregated_sql_query (controls whether entity_id is in SELECT/GROUP BY), upgrades api_key to accept List[str] with correct IN clause, and adds include_entity_breakdown to get_daily_activity_aggregated. Logic is sound; the only concern is the lack of a LIMIT on the resulting SQL query which could be expensive for very large unfiltered requests. |
| litellm/proxy/management_endpoints/team_endpoints.py | Switches /team/daily/activity from get_daily_activity (paginated Prisma find_many) to get_daily_activity_aggregated (SQL GROUP BY). Adds timezone parameter; deprecates but retains page/page_size for backward compat. A misleading docstring for the new timezone parameter uses JS convention (positive=west-of-UTC) without calling it out explicitly. |
| tests/test_litellm/proxy/management_endpoints/test_team_endpoints.py | All existing tests updated to mock get_daily_activity_aggregated instead of get_daily_activity. New test test_get_team_daily_activity_uses_aggregated_with_entity_breakdown verifies include_entity_breakdown=True and timezone passthrough. Tests are mock-only, in line with CI requirements. Coverage looks thorough. |
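The `api_key` list widening described in the table above could be sketched roughly as follows. This is an illustrative helper, not the actual LiteLLM code; the function name and the `param_index` bookkeeping are assumptions about how a parameterized `IN` clause might be assembled:

```python
from typing import List, Optional, Tuple, Union


def build_api_key_filter(
    api_key: Optional[Union[str, List[str]]], param_index: int
) -> Tuple[str, List[str], int]:
    """Build a WHERE fragment for a single key or a list of keys.

    A list becomes a parameterized IN clause ($n, $n+1, ...), so values
    are bound as query parameters rather than interpolated into the SQL.
    Returns (clause, params, next_param_index).
    """
    if api_key is None:
        return "", [], param_index
    if isinstance(api_key, str):
        return f"api_key = ${param_index}", [api_key], param_index + 1
    placeholders = ", ".join(f"${param_index + i}" for i in range(len(api_key)))
    return f"api_key IN ({placeholders})", list(api_key), param_index + len(api_key)


clause, params, next_idx = build_api_key_filter(["sk-a", "sk-b"], 3)
print(clause)  # api_key IN ($3, $4)
print(params)  # ['sk-a', 'sk-b']
```

The point of threading `param_index` through is that the clause composes with other filters (dates, team IDs) that already occupy earlier placeholder positions.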
Sequence Diagram
```mermaid
sequenceDiagram
    participant UI as UI / API Client
    participant TE as team_endpoints.py<br/>/team/daily/activity
    participant CDA as common_daily_activity.py<br/>get_daily_activity_aggregated
    participant SQL as _build_aggregated_sql_query
    participant DB as PostgreSQL<br/>LiteLLM_DailyTeamSpend
    UI->>TE: GET /team/daily/activity<br/>(team_ids, start_date, end_date,<br/>model, api_key, timezone, page, page_size)
    note over TE: page/page_size accepted<br/>but are no-ops (deprecated)
    TE->>TE: Auth check & team membership validation
    TE->>TE: Build final_api_key_filter<br/>(user keys if non-admin)
    TE->>CDA: get_daily_activity_aggregated(<br/>include_entity_breakdown=True,<br/>timezone_offset_minutes=timezone)
    CDA->>SQL: _build_aggregated_sql_query(<br/>include_entity_id=True)
    SQL-->>CDA: SQL + params<br/>GROUP BY (team_id, date, api_key,<br/>model, model_group, provider,<br/>mcp_tool, endpoint)
    CDA->>DB: query_raw(sql, *params)
    DB-->>CDA: Pre-aggregated rows<br/>(all matching rows, no pagination)
    CDA->>CDA: _aggregate_spend_records(<br/>entity_id_field="team_id",<br/>entity_metadata_field=team_alias_metadata)
    CDA-->>TE: SpendAnalyticsPaginatedResponse<br/>(page=1, total_pages=1, has_more=False)
    TE-->>UI: Full spend data with per-team breakdown
```
Last reviewed commit: d0e4804
```
page (int): Deprecated, kept for backward compatibility. All results are returned in a single page.
page_size (int): Deprecated, kept for backward compatibility.
exclude_team_ids (Optional[str]): Comma-separated list of team IDs to exclude.
timezone (Optional[int]): Timezone offset in minutes from UTC (e.g., 480 for PST).
```
Misleading timezone offset convention in docstring
The docstring says "Timezone offset in minutes from UTC (e.g., 480 for PST)", but standard UTC offset notation for PST is −480 (UTC−8). This uses JavaScript's Date.getTimezoneOffset() convention (positive = west/behind UTC), which is the opposite of the IANA/ISO standard.
While the underlying _adjust_dates_for_timezone implementation is consistent (positive = west of UTC, so PST = +480), API consumers who follow the standard UTC-offset convention would pass −480 for PST and get the end-date expansion applied to the wrong hemisphere. This is a functional risk for anyone calling this new timezone parameter.
Consider clarifying the docstring to explicitly call out the JS convention and distinguish it from standard UTC offsets:
```diff
- timezone (Optional[int]): Timezone offset in minutes from UTC (e.g., 480 for PST).
+ timezone (Optional[int]): Timezone offset in minutes using JavaScript's Date.getTimezoneOffset() convention — positive values are *west* of UTC (e.g., 480 for PST = UTC-8). This is the opposite of standard UTC offset notation.
```
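A minimal sketch of the sign convention at issue. The helper name is hypothetical (the actual `_adjust_dates_for_timezone` is not shown); it only illustrates why passing the standard UTC-offset sign shifts the date window the wrong way:

```python
from datetime import datetime, timedelta


def adjust_end_date_for_timezone(end_date: datetime, timezone_offset_minutes: int) -> datetime:
    # JS getTimezoneOffset() convention: positive minutes mean the client
    # is WEST of (behind) UTC, so PST (UTC-8) is +480, not -480.
    # A client west of UTC needs the end date extended forward in UTC so
    # its local end of day is fully covered.
    return end_date + timedelta(minutes=timezone_offset_minutes)


utc_end = datetime(2024, 6, 30, 0, 0)
# PST client (UTC-8), JS convention (+480): window extends to 08:00 UTC next day.
print(adjust_end_date_for_timezone(utc_end, 480))   # 2024-06-30 08:00:00
# Standard UTC-offset sign (-480): window shrinks toward the wrong hemisphere.
print(adjust_end_date_for_timezone(utc_end, -480))  # 2024-06-29 16:00:00
```

The eight-hour swing in the wrong direction is exactly the "end-date expansion applied to the wrong hemisphere" risk described above.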
```diff
@@ -563,7 +578,7 @@ def _build_aggregated_sql_query(
         SUM(failed_requests)::bigint AS failed_requests
     FROM "{pg_table}"
     WHERE {where_clause}
-    GROUP BY date, api_key, model, model_group, custom_llm_provider,
+    GROUP BY {entity_group_by} date, api_key, model, model_group, custom_llm_provider,
         mcp_namespaced_tool_name, endpoint
     ORDER BY date DESC
```
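The `include_entity_id` switch in the hunk above can be pictured with a simplified sketch. This is not the actual `_build_aggregated_sql_query` (which selects many more columns); it only demonstrates how the entity column is threaded into both SELECT and GROUP BY:

```python
def build_aggregated_sql(pg_table: str, where_clause: str, include_entity_id: bool) -> str:
    # When include_entity_id is True, the entity column (team_id here) is
    # added to both SELECT and GROUP BY, so the per-entity breakdown
    # survives aggregation; otherwise rows collapse across entities.
    entity_cols = "team_id, " if include_entity_id else ""
    return (
        f"SELECT {entity_cols}date, api_key, model, "
        f"SUM(spend)::float AS spend "
        f'FROM "{pg_table}" '
        f"WHERE {where_clause} "
        f"GROUP BY {entity_cols}date, api_key, model "
        f"ORDER BY date DESC"
    )


sql = build_aggregated_sql("LiteLLM_DailyTeamSpend", "date >= $1", include_entity_id=True)
print(sql)  # ... GROUP BY team_id, date, api_key, model ...
```

Reusing the same fragment for SELECT and GROUP BY keeps the two lists in sync, which is what the real `{entity_group_by}` placeholder accomplishes in the diff.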
Unbounded result set when querying without entity filter
The SQL query has no LIMIT clause. When include_entity_breakdown=True (always the case for the team endpoint now) and no team_ids filter is provided — e.g., a proxy admin loading the full dashboard — the query groups by (team_id, date, api_key, model, model_group, custom_llm_provider, mcp_namespaced_tool_name, endpoint). For a large deployment with many teams × keys × models × days, this can still produce tens or hundreds of thousands of grouped rows pulled entirely into Python memory in a single request.
The original paginated approach bounded memory per request via take=page_size. The new approach trades that safety valve for correctness (which is the right call), but the trade-off is worth surfacing. Consider either:
- Adding a defensive `LIMIT` with a generous cap (e.g., 500,000 rows) and logging a warning if it's hit, or
- Documenting this scalability assumption explicitly so operators are aware.
This is not a blocker for the fix itself, but is worth tracking for very large multi-team deployments.
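The first option could look roughly like this. It is a sketch under assumptions: the cap value, the wrapper name, and the `query_raw` call shape are illustrative, not part of the PR:

```python
import logging

logger = logging.getLogger(__name__)

MAX_AGGREGATED_ROWS = 500_000  # hypothetical cap, not in the PR


def fetch_aggregated_rows(db, sql: str, params: list) -> list:
    # Request one row past the cap: if we get cap+1 rows back, the result
    # was truncated, so surface a warning for operators instead of
    # silently returning a partial total (the bug this PR fixes).
    rows = db.query_raw(f"{sql} LIMIT {MAX_AGGREGATED_ROWS + 1}", *params)
    if len(rows) > MAX_AGGREGATED_ROWS:
        logger.warning(
            "Aggregated spend query exceeded %d rows; result truncated.",
            MAX_AGGREGATED_ROWS,
        )
        rows = rows[:MAX_AGGREGATED_ROWS]
    return rows
```

The "cap + 1" trick distinguishes "exactly at the cap" from "truncated" without a second COUNT query.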
Relevant issues
Summary
Failure Path (Before Fix)
The `/team/daily/activity` endpoint used Prisma's `find_many` with `skip`/`take` pagination. The UI sends `page_size=1000` but only fetches page 1. The `LiteLLM_DailyTeamSpend` table stores one row per unique `(team_id, date, api_key, model, provider, endpoint)` combination — a team with 141 keys and multiple models over 30 days produces ~1.3M rows (user confirmed `total_pages: 1329`). The `total_spend` in the response was computed only from the first 1000 rows.

This caused the team spend to appear identical across all date ranges (7-day, MTD, 30-day, YTD) since it always returned the same newest 1000 rows.
Fix
Switches `/team/daily/activity` from paginated Prisma queries to SQL `GROUP BY` via `get_daily_activity_aggregated`, returning all data in a single response. Adds an `include_entity_breakdown` option to preserve per-team breakdown data in the response. Also adds `timezone` parameter support and `api_key` list filtering for the aggregated query path.

Changes
- `common_daily_activity.py`: Added `include_entity_id` param to `_build_aggregated_sql_query` to optionally include entity_id in SELECT/GROUP BY. Widened `api_key` type to accept `List[str]` with proper SQL IN clause handling. Added `include_entity_breakdown` param to `get_daily_activity_aggregated`.
- `team_endpoints.py`: `/team/daily/activity` now calls `get_daily_activity_aggregated` instead of `get_daily_activity`. Added `timezone` parameter. `page`/`page_size` kept in signature for backward compat but are no-ops.
- `test_team_endpoints.py`: Updated existing tests to mock `get_daily_activity_aggregated`. Added a test verifying `include_entity_breakdown=True` and timezone passthrough.

Testing
- `total_pages: 1329` confirms pagination truncation as root cause

Type
🐛 Bug Fix
✅ Test