Litellm update blog posts rss #23791

Merged
ryan-crabbe merged 2 commits into litellm_ryan_march_16 from litellm_update-blog-posts-rss
Mar 16, 2026

@ryan-crabbe
Contributor

Type

🧹 Refactoring

Changes

  • Fetch blog posts from the docs site RSS feed (https://docs.litellm.ai/blog/rss.xml) instead of a manually updated JSON file on GitHub
  • Parse the RSS XML to extract title, description, date, and URL; no new dependencies (uses stdlib xml.etree.ElementTree and email.utils)
  • Fall back to the bundled local blog_posts.json on any failure (network error, invalid XML, etc.)
  • Blog posts now stay in sync with the docs site automatically; no more manual JSON updates
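
As a rough illustration of the stdlib-only parsing described in the list above, the sketch below turns RSS `<item>` elements into plain dicts. The function name `parse_rss`, the sample feed, and the choice to skip items that lack a title or link are assumptions made for this example; it is not the PR's actual `parse_rss_to_posts` implementation.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from typing import Dict, List

# Hypothetical sample feed, used only to exercise the sketch.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/blog/first</link>
      <description>Hello</description>
      <pubDate>Mon, 16 Mar 2026 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Undated post</title>
      <link>https://example.com/blog/second</link>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text: str) -> List[Dict[str, str]]:
    """Extract title, description, date (YYYY-MM-DD), and url from each <item>."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on invalid XML
    channel = root.find("channel")
    if channel is None:
        raise ValueError("RSS feed has no <channel> element")

    posts: List[Dict[str, str]] = []
    for item in channel.findall("item"):
        title = (item.findtext("title") or "").strip()
        url = (item.findtext("link") or "").strip()
        if not title or not url:
            continue  # assumption: an item without a title or link is unusable

        date = ""
        pub_date = item.findtext("pubDate")
        if pub_date:
            try:
                # RSS dates use the RFC 2822 format that email.utils understands.
                date = parsedate_to_datetime(pub_date).strftime("%Y-%m-%d")
            except (TypeError, ValueError):
                date = ""  # keep the post, drop the unparseable date

        posts.append(
            {
                "title": title,
                "description": (item.findtext("description") or "").strip(),
                "date": date,
                "url": url,
            }
        )
    return posts


posts = parse_rss(SAMPLE_RSS)
```

A caller would wrap `parse_rss` in a `try`/`except` and load the bundled JSON file when parsing raises, which matches the fallback behaviour listed above.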

@vercel
vercel bot commented Mar 16, 2026

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| litellm | Ready | Preview, Comment | Mar 16, 2026 11:24pm |

@ryan-crabbe changed the base branch from main to litellm_ryan_march_16 on March 16, 2026 23:11
@codspeed-hq
Contributor

codspeed-hq bot commented Mar 16, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing litellm_update-blog-posts-rss (67482db) with main (278c9ba) [1]

Footnotes

  1. No successful run was found on litellm_ryan_march_16 (4f2fe33) during the generation of this report, so main (278c9ba) was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@greptile-apps
Contributor

greptile-apps bot commented Mar 16, 2026

Greptile Summary

This PR replaces the previous manual GitHub-hosted JSON blog post fetch with a live RSS feed parse from https://docs.litellm.ai/blog/rss.xml, using only stdlib modules (xml.etree.ElementTree, email.utils) alongside the existing httpx dependency. The fallback to the bundled blog_posts.json and the in-process TTL cache are preserved.

Key changes:

  • litellm/__init__.py: Default blog_posts_url updated to point at the Docusaurus RSS endpoint.
  • get_blog_posts.py: fetch_remote_blog_posts → fetch_rss_feed (returns raw XML text); new parse_rss_to_posts method extracts post dicts from <item> elements; validate_blog_posts simplified to check for a non-empty list.
  • test_get_blog_posts.py: All tests updated to mock httpx.get returning RSS XML; new unit tests cover the XML parser, including invalid XML and missing <channel> edge cases. All tests are properly mocked with no real network calls.

Issues found:

  • xml.etree.ElementTree.fromstring is documented as insecure against maliciously constructed XML (Billion Laughs / entity-expansion DoS). Since blog_posts_url is operator-configurable via environment variable, this is an exploitable surface if it is ever pointed at an untrusted endpoint. Using defusedxml would resolve this with a one-line change.
  • BlogPost and BlogPostsResponse Pydantic models are defined but the parsing pipeline returns raw dicts, leaving the models as dead code that provides no runtime validation.

Confidence Score: 3/5

  • Functional logic is sound and well-tested, but the use of the unsafe xml.etree.ElementTree parser should be addressed before merging.
  • The core RSS-parsing logic is correct, the fallback chain works, and all tests are properly mocked. The score is lowered primarily because xml.etree.ElementTree is explicitly documented as vulnerable to entity-expansion DoS attacks, and the parsing URL is user-configurable via an environment variable. Additionally, the defined Pydantic models are dead code in the production path.
  • litellm/litellm_core_utils/get_blog_posts.py — review the XML parsing security concern and unused Pydantic models.

Important Files Changed

| Filename | Overview |
|---|---|
| litellm/litellm_core_utils/get_blog_posts.py | Replaces GitHub JSON fetch with RSS XML parsing. Contains a security concern: xml.etree.ElementTree is vulnerable to entity-expansion (Billion Laughs) DoS attacks. Also has dead code: BlogPost/BlogPostsResponse Pydantic models are defined but never used in the parsing pipeline. |
| tests/test_litellm/test_get_blog_posts.py | Tests updated to mock httpx.get returning RSS XML text. All tests use proper mocks, with no real network calls. New tests for parse_rss_to_posts, including invalid XML and missing channel edge cases. |
| litellm/__init__.py | Changes the default blog_posts_url from a GitHub raw JSON URL to the Docusaurus RSS feed endpoint. Simple one-line change, no issues. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[get_blog_posts called] --> B{LITELLM_LOCAL_BLOG_POSTS=true?}
    B -- Yes --> C[load_local_blog_posts\nblog_posts.json]
    B -- No --> D{Cache valid?\nwithin TTL?}
    D -- Yes --> E[Return cached posts]
    D -- No --> F[fetch_rss_feed\nhttpx.get RSS URL]
    F -- Network/HTTP error --> G[load_local_blog_posts\nfallback]
    F -- Success: raw XML --> H[parse_rss_to_posts\nET.fromstring\nmax_posts=1]
    H -- Parse error --> G
    H -- Parsed posts --> I{validate_blog_posts\nnon-empty list?}
    I -- False --> G
    I -- True --> J[Cache posts\nReturn posts]
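
The control flow in the flowchart above can be sketched roughly as follows. The class name `BlogPostCache`, its constructor parameters, and the one-hour default TTL are hypothetical and chosen for illustration; the fetch, parse, and local-load steps are injected as callables so the sketch stays independent of httpx and of the real module.

```python
import os
import time
from typing import Callable, Dict, List, Optional

Post = Dict[str, str]


class BlogPostCache:
    """Hypothetical stand-in for the getter in the flowchart (not the PR's class)."""

    def __init__(
        self,
        fetch: Callable[[], str],
        parse: Callable[[str], List[Post]],
        load_local: Callable[[], List[Post]],
        ttl_seconds: float = 3600.0,  # assumed TTL, not taken from the PR
    ) -> None:
        self._fetch = fetch
        self._parse = parse
        self._load_local = load_local
        self._ttl = ttl_seconds
        self._cached: Optional[List[Post]] = None
        self._cached_at = 0.0

    def get(self) -> List[Post]:
        # Operator opt-out: always serve the bundled file.
        if os.environ.get("LITELLM_LOCAL_BLOG_POSTS", "").lower() == "true":
            return self._load_local()

        # Serve from the in-process cache while it is within the TTL.
        now = time.monotonic()
        if self._cached is not None and now - self._cached_at < self._ttl:
            return self._cached

        # Any fetch or parse failure falls back to the bundled file.
        try:
            posts = self._parse(self._fetch())
        except Exception:
            return self._load_local()

        # Validation step: only a non-empty list is accepted and cached.
        if not isinstance(posts, list) or not posts:
            return self._load_local()

        self._cached, self._cached_at = posts, now
        return posts
```

Injecting the three callables also makes the flow easy to test with fakes, without real network calls.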

Comments Outside Diff (1)

  1. litellm/litellm_core_utils/get_blog_posts.py, lines 27-35

    Pydantic models BlogPost / BlogPostsResponse are defined but never used

    BlogPost and BlogPostsResponse were defined to validate the blog post structure, but parse_rss_to_posts returns raw List[Dict[str, str]] instead of List[BlogPost]. The models therefore provide no actual runtime validation in the production path — they only appear in two isolated test assertions.

    Either use them to validate/coerce the parsed dicts (which would catch malformed RSS responses early), or remove them to avoid dead code:

    # Option A – use the model for validation inside parse_rss_to_posts
    posts.append(
        BlogPost(
            title=title_el.text or "",
            description=desc_el.text or "" if desc_el is not None else "",
            date=date_str,
            url=link_el.text or "",
        ).model_dump()
    )
    # Option B – remove the unused classes and imports

Last reviewed commit: 12facb2

import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from importlib.resources import files
from typing import Any, Dict, List, Optional

Unused Any import

Any was used by the old validate_blog_posts(data: Any) signature but is no longer referenced after the refactor. It should be removed.

Suggested change
- from typing import Any, Dict, List, Optional
+ from typing import Dict, List, Optional

  try:
-     data = cls.fetch_remote_blog_posts(url)
+     xml_text = cls.fetch_rss_feed(url)
+     posts = cls.parse_rss_to_posts(xml_text)

max_posts=1 silently limits output to a single post

get_blog_posts calls parse_rss_to_posts(xml_text) without overriding max_posts, so the RSS feed is always truncated to one post. If the previous blog_posts.json contained multiple posts and the UI surfaces more than one, callers will now always receive a single-item list — a silent regression in the number of blog posts shown.

The default of 1 in parse_rss_to_posts is reasonable as a conservative safeguard, but the call site should explicitly set the intended limit (or use a named constant) so the behaviour is obvious and easy to change:

Suggested change
- posts = cls.parse_rss_to_posts(xml_text)
+ posts = cls.parse_rss_to_posts(xml_text, max_posts=10)

(Adjust the value to whatever the UI is designed to display.)


Extracts title, description, date (YYYY-MM-DD), and url from each <item>.
"""
root = ET.fromstring(xml_text)

xml.etree.ElementTree is vulnerable to XML entity expansion attacks

Python's own documentation explicitly states that xml.etree.ElementTree is not secure against maliciously constructed data and is vulnerable to "Billion Laughs" and "Quadratic Blowup" entity-expansion DoS attacks.

While the default URL (https://docs.litellm.ai/blog/rss.xml) is trusted, the URL is user-configurable via LITELLM_BLOG_POSTS_URL. If an operator points this to an attacker-controlled endpoint, the server can return a deeply nested entity-expansion payload that exhausts memory/CPU before the response is even fully processed.

The defusedxml library is the recommended drop-in replacement:

import defusedxml.ElementTree as ET

This single swap neutralises billion-laughs, quadratic-blowup, and external-entity attacks without any other code changes. Alternatively, you can validate the response size before parsing (e.g., reject payloads over ~1 MB).

@ryan-crabbe
Contributor Author

@greptileai the URL is only configurable via a server-side env var (LITELLM_BLOG_POSTS_URL), meaning only operators who already have full system access can change it. It's not user-facing input. Regarding "Unused Pydantic models": BlogPost and BlogPostsResponse are used (inside public endpoints); they are not dead code.

@greptile-apps
Contributor

greptile-apps bot commented Mar 16, 2026

You're right that the threat model is limited here - if an operator has access to set LITELLM_BLOG_POSTS_URL, they already have significant system access. That said, defusedxml is still worth considering for defense-in-depth:

  1. Zero additional complexity - it's a drop-in replacement (import defusedxml.ElementTree as ET)
  2. Guards against indirect attacks - if this env var is ever set via a config management system with broader access, or if the code is reused elsewhere
  3. Standard Python security guidance - the stdlib docs explicitly recommend against using xml.etree.ElementTree for untrusted data

But if you want to skip the dependency and keep stdlib-only, validating the response size before parsing (e.g., reject if len(response.text) > 1_000_000) would also mitigate the DoS risk without adding a new import.

Either way, your point about the access model is valid - this isn't a high-severity issue given the current architecture.
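
For the stdlib-only route mentioned above, a pre-parse guard might look like the sketch below. The names `parse_feed_guarded`, `UnsafeFeedError`, and `MAX_FEED_BYTES` are hypothetical. Note one addition that is not part of the suggestion in this thread: because entity-expansion payloads can be very small, a size cap on its own bounds the input rather than the expanded output, so the sketch also rejects any feed that declares a DTD or entities. This is an assumption-laden sketch, not an equivalent of defusedxml.

```python
import xml.etree.ElementTree as ET

MAX_FEED_BYTES = 1_000_000  # assumed cap, mirroring the ~1 MB figure discussed above


class UnsafeFeedError(ValueError):
    """Raised when a feed is rejected before it reaches the XML parser."""


def parse_feed_guarded(xml_text: str, max_bytes: int = MAX_FEED_BYTES) -> ET.Element:
    # 1. Bound the input size (this alone does not bound entity expansion).
    if len(xml_text.encode("utf-8")) > max_bytes:
        raise UnsafeFeedError(f"feed exceeds {max_bytes} bytes")

    # 2. Refuse DTDs and entity declarations outright. An RSS feed normally
    #    needs neither. XML keywords are case-sensitive, so a plain substring
    #    check is enough here; it can produce false positives if the literal
    #    text appears inside post content.
    if "<!DOCTYPE" in xml_text or "<!ENTITY" in xml_text:
        raise UnsafeFeedError("feed declares a DTD or entities")

    # 3. Only now hand the text to the stdlib parser.
    return ET.fromstring(xml_text)
```

Since `UnsafeFeedError` subclasses `ValueError`, a caller that already falls back to the bundled JSON on any exception would treat a rejected feed the same way as a parse error.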

@ryan-crabbe merged commit 0d45b1d into litellm_ryan_march_16 on Mar 16, 2026
59 of 65 checks passed
@ryan-crabbe deleted the litellm_update-blog-posts-rss branch on March 17, 2026 00:12