Litellm update blog posts rss #23791
Conversation
Greptile Summary

This PR replaces the previous manual GitHub-hosted JSON blog post fetch with a live RSS feed parse from https://docs.litellm.ai/blog/rss.xml.

Confidence Score: 3/5
| Filename | Overview |
|---|---|
| litellm/litellm_core_utils/get_blog_posts.py | Replaces GitHub JSON fetch with RSS XML parsing. Contains a security concern: xml.etree.ElementTree is vulnerable to entity-expansion (Billion Laughs) DoS attacks. Also has dead code: BlogPost/BlogPostsResponse Pydantic models are defined but never used in the parsing pipeline. |
| tests/test_litellm/test_get_blog_posts.py | Tests updated to mock httpx.get returning RSS XML text. All tests use proper mocks — no real network calls. New tests for parse_rss_to_posts, including invalid XML and missing channel edge cases. |
| litellm/__init__.py | Changes the default blog_posts_url from a GitHub raw JSON URL to the Docusaurus RSS feed endpoint. Simple one-line change, no issues. |
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[get_blog_posts called] --> B{LITELLM_LOCAL_BLOG_POSTS=true?}
    B -- Yes --> C[load_local_blog_posts\nblog_posts.json]
    B -- No --> D{Cache valid?\nwithin TTL?}
    D -- Yes --> E[Return cached posts]
    D -- No --> F[fetch_rss_feed\nhttpx.get RSS URL]
    F -- Network/HTTP error --> G[load_local_blog_posts\nfallback]
    F -- Success: raw XML --> H[parse_rss_to_posts\nET.fromstring\nmax_posts=1]
    H -- Parse error --> G
    H -- Parsed posts --> I{validate_blog_posts\nnon-empty list?}
    I -- False --> G
    I -- True --> J[Cache posts\nReturn posts]
```
Comments Outside Diff (1)
litellm/litellm_core_utils/get_blog_posts.py, lines 27-35 (link)

Pydantic models `BlogPost`/`BlogPostsResponse` are defined but never used

`BlogPost` and `BlogPostsResponse` were defined to validate the blog post structure, but `parse_rss_to_posts` returns a raw `List[Dict[str, str]]` instead of `List[BlogPost]`. The models therefore provide no actual runtime validation in the production path; they only appear in two isolated test assertions. Either use them to validate/coerce the parsed dicts (which would catch malformed RSS responses early), or remove them to avoid dead code:

```python
# Option A – use the model for validation inside parse_rss_to_posts
posts.append(
    BlogPost(
        title=title_el.text or "",
        description=desc_el.text or "" if desc_el is not None else "",
        date=date_str,
        url=link_el.text or "",
    ).model_dump()
)

# Option B – remove the unused classes and imports
```
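For reference, here is roughly what the models Option A exercises would look like. The field shapes are inferred from the arguments in the snippet above, not copied from the PR, so treat this as a sketch:

```python
from typing import List
from pydantic import BaseModel

# Hypothetical definitions inferred from the fields Option A constructs;
# the actual classes in the PR may differ.
class BlogPost(BaseModel):
    title: str
    description: str = ""
    date: str = ""
    url: str

class BlogPostsResponse(BaseModel):
    posts: List[BlogPost] = []

# Routing each parsed dict through the model catches malformed RSS
# items (e.g. a missing <link>) at parse time instead of downstream.
post = BlogPost(
    title="Release notes",
    description="",
    date="2025-01-06",
    url="https://docs.litellm.ai/blog",
)
print(post.model_dump())
```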
Last reviewed commit: 12facb2
```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from importlib.resources import files
from typing import Any, Dict, List, Optional
```
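Of these imports, `parsedate_to_datetime` does the date normalisation: RSS `pubDate` values are RFC 2822 strings, which the parser reduces to `YYYY-MM-DD`. For example:

```python
from email.utils import parsedate_to_datetime

# RSS pubDate values are RFC 2822 formatted strings
pub_date = "Mon, 06 Jan 2025 00:00:00 GMT"
dt = parsedate_to_datetime(pub_date)
print(dt.strftime("%Y-%m-%d"))  # 2025-01-06
```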
Unused Any import
Any was used by the old validate_blog_posts(data: Any) signature but is no longer referenced after the refactor. It should be removed.
```diff
- from typing import Any, Dict, List, Optional
+ from typing import Dict, List, Optional
```
```diff
  try:
-     data = cls.fetch_remote_blog_posts(url)
+     xml_text = cls.fetch_rss_feed(url)
+     posts = cls.parse_rss_to_posts(xml_text)
```
max_posts=1 silently limits output to a single post
get_blog_posts calls parse_rss_to_posts(xml_text) without overriding max_posts, so the RSS feed is always truncated to one post. If the previous blog_posts.json contained multiple posts and the UI surfaces more than one, callers will now always receive a single-item list — a silent regression in the number of blog posts shown.
The default of 1 in parse_rss_to_posts is reasonable as a conservative safeguard, but the call site should explicitly set the intended limit (or use a named constant) so the behaviour is obvious and easy to change:
```diff
- posts = cls.parse_rss_to_posts(xml_text)
+ posts = cls.parse_rss_to_posts(xml_text, max_posts=10)
```
(Adjust the value to whatever the UI is designed to display.)
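To make the suggestion concrete, here is a sketch of a `parse_rss_to_posts` with `max_posts` plumbed through explicitly. The structure and field names follow the review comments above, not the PR's actual implementation:

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from typing import Dict, List

def parse_rss_to_posts(xml_text: str, max_posts: int = 1) -> List[Dict[str, str]]:
    """Hypothetical sketch: extract title/description/date/url per <item>."""
    posts: List[Dict[str, str]] = []
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    if channel is None:
        return posts  # missing <channel> -> empty, caller falls back
    for item in channel.findall("item"):
        if len(posts) >= max_posts:
            break  # honour the explicit display limit
        date_str = ""
        pub = item.findtext("pubDate", "")
        if pub:
            try:
                # RSS pubDate is RFC 2822; normalise to YYYY-MM-DD
                date_str = parsedate_to_datetime(pub).strftime("%Y-%m-%d")
            except (TypeError, ValueError):
                pass
        posts.append({
            "title": item.findtext("title", ""),
            "description": item.findtext("description", ""),
            "date": date_str,
            "url": item.findtext("link", ""),
        })
    return posts
```

With this shape, the call site reads `parse_rss_to_posts(xml_text, max_posts=10)` and the display limit is visible at a glance.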
```python
    Extracts title, description, date (YYYY-MM-DD), and url from each <item>.
    """
    root = ET.fromstring(xml_text)
```
xml.etree.ElementTree is vulnerable to XML entity expansion attacks
Python's own documentation explicitly states that xml.etree.ElementTree is not secure against maliciously constructed data and is vulnerable to "Billion Laughs" and "Quadratic Blowup" entity-expansion DoS attacks.
While the default URL (https://docs.litellm.ai/blog/rss.xml) is trusted, the URL is user-configurable via LITELLM_BLOG_POSTS_URL. If an operator points this to an attacker-controlled endpoint, the server can return a deeply nested entity-expansion payload that exhausts memory/CPU before the response is even fully processed.
The defusedxml library is the recommended drop-in replacement:
```python
import defusedxml.ElementTree as ET
```

This single swap neutralises billion-laughs, quadratic-blowup, and external-entity attacks without any other code changes. Alternatively, you can validate the response size before parsing (e.g., reject payloads over ~1 MB).
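To see why the reviewer flags this, here is a deliberately small-scale version of the entity-expansion pattern running against the stdlib parser. The real attack nests the entities about ten levels deep at 10x each, so under 1 KB of XML expands to ~10 GB of text:

```python
import xml.etree.ElementTree as ET

# Three levels of internal entities, each expanding 4x: the tiny
# payload below grows to 4**3 = 64 characters of text. Stdlib
# expat expands internal DTD entities by default, which is exactly
# the "Billion Laughs" vector.
payload = """<?xml version="1.0"?>
<!DOCTYPE root [
  <!ENTITY a "aaaa">
  <!ENTITY b "&a;&a;&a;&a;">
  <!ENTITY c "&b;&b;&b;&b;">
]>
<root>&c;</root>"""

root = ET.fromstring(payload)
print(len(root.text))  # 64 characters from a few lines of input
```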
@greptileai the URL is only configurable via a server-side env var (LITELLM_BLOG_POSTS_URL), meaning only operators who already have full system access can change it. It's not user-facing input. On "Unused Pydantic models": BlogPost and BlogPostsResponse are used inside public endpoints, so they are not dead code.
You're right that the threat model is limited here - if an operator has access to set LITELLM_BLOG_POSTS_URL, they already control the deployment's environment.

But if you want to skip the dependency and keep stdlib-only, validating the response size before parsing (e.g., reject if the body exceeds ~1 MB) is a reasonable mitigation. Either way, your point about the access model is valid - this isn't a high-severity issue given the current architecture.
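The stdlib-only mitigation discussed above could look something like this sketch; the 1 MB cap and the function name are illustrative. One caveat worth encoding: a classic billion-laughs payload is itself tiny, so a size cap alone limits quadratic blowup but not nested-entity expansion, which is why the sketch also refuses any document carrying a DTD (a legitimate RSS feed never needs one):

```python
import xml.etree.ElementTree as ET

MAX_RSS_BYTES = 1_000_000  # illustrative cap; tune to the real feed size

def parse_rss_safely(xml_text: str) -> ET.Element:
    # Reject oversized payloads before the parser touches them;
    # expansion happens during parsing, so checks must come first.
    if len(xml_text.encode("utf-8")) > MAX_RSS_BYTES:
        raise ValueError("RSS payload too large")
    # A billion-laughs document is small, so also refuse any DTD:
    # entity expansion requires one, and RSS feeds do not use DTDs.
    if "<!DOCTYPE" in xml_text:
        raise ValueError("DTD not allowed in RSS payload")
    return ET.fromstring(xml_text)
```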
Type
🧹 Refactoring
Changes
- Fetch blog posts from the live RSS feed (https://docs.litellm.ai/blog/rss.xml) instead of a manually-updated JSON file on GitHub
- Parse the feed with the standard library only (xml.etree.ElementTree and email.utils)
- Fall back to the bundled blog_posts.json on any failure (network error, invalid XML, etc.)