@amotl amotl commented May 16, 2025

About

Inquire information from https://llmtxt.dev/hub, in order to ...

about/docs/backlog.md

Lines 4 to 6 in 419a85c

- Bundle: Compare sizes of CrateDB's `llms.txt` files against sizes
of other vendors. Adjust when needed.
https://github.com/crate/about/issues/20

coderabbitai bot commented May 16, 2025

Walkthrough

A new CLI command named hub has been added, which, when executed, fetches and processes information from the "https://llmtxt.dev/hub" webpage. This is accomplished through a newly introduced module that defines data models and a scraper class for extracting and organizing hub item details.

Changes

| File(s) | Change Summary |
|---|---|
| src/cratedb_about/cli.py | Added a new CLI command hub to the CLI group. This command creates an instance of LLMsTxtHub, invokes its fetch() method to retrieve and process data from the hub webpage, pretty-prints the fetched items, and logs a completion message. Added imports for pprint and LLMsTxtHub. |
| src/cratedb_about/hub/model.py | Introduced a new module defining the Resource and LLMsTxtHubItem dataclasses to represent hub entries and their resources, and the LLMsTxtHub class to fetch and parse the hub webpage. The fetch method retrieves HTML, parses it for "website-card" elements, converts each card into a model instance, and stores them. The acquire_sizes method fetches each resource URL to update resource sizes. The card_to_model static method extracts details from HTML elements to populate the dataclasses. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant LLMsTxtHub
    participant WebPage

    User->>CLI: Invoke "hub" command
    CLI->>LLMsTxtHub: Create instance
    CLI->>LLMsTxtHub: Call fetch()
    LLMsTxtHub->>WebPage: GET https://llmtxt.dev/hub
    WebPage-->>LLMsTxtHub: Return HTML content
    LLMsTxtHub->>LLMsTxtHub: Parse HTML, extract items
    LLMsTxtHub->>WebPage: GET resource URLs (acquire_sizes)
    WebPage-->>LLMsTxtHub: Return resource content
    LLMsTxtHub-->>CLI: Return populated instance
    CLI->>CLI: Pretty-print items and log "Ready."

Poem

In the warren of code, a new hub appears,
Fetching web treasures, it joyfully cheers.
With dataclasses hopping and parsing in tow,
The CLI now fetches what rabbits must know.
"Ready!" it chirps, with a digital thump—
Another fine leap, with a data-rich jump!
🥕


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
src/cratedb_about/hub/model.py (2)

33-47: Fragile HTML traversal may raise AttributeError

The chain divs.find(...).text assumes every node exists. If the remote page layout changes, find() may return None and .text will crash.
Guard with if not x: continue or use get_text(strip=True) with default values to keep the scraper resilient.
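
One way to apply such a guard is sketched below. The helper name `safe_text` is hypothetical and not part of the PR. It is duck-typed, so it accepts whatever `find()` returned, including `None`; the `FakeTag` class only stands in for a BeautifulSoup tag so that the sketch runs without third-party packages.

```python
from typing import Any, Optional


def safe_text(node: Optional[Any], default: str = "") -> str:
    """Return the stripped text of a parsed node, or `default` when it is missing."""
    if node is None:
        return default
    # BeautifulSoup tags provide get_text(strip=True); any object with the
    # same method works here.
    return node.get_text(strip=True)


class FakeTag:
    """Stand-in for a BeautifulSoup tag, used only to illustrate the helper."""

    def __init__(self, text: str) -> None:
        self._text = text

    def get_text(self, strip: bool = False) -> str:
        return self._text.strip() if strip else self._text
```

With a helper like this, a call such as `safe_text(divs.find(name="h3"))` would yield an empty string instead of raising AttributeError when the heading is absent.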

🧰 Tools
🪛 Ruff (0.8.2)

34-39, 41, 43-45: Found commented-out code (ERA001)


34-45: Remove commented-out debug code & print statements

Nine blocks (#print, #continue, etc.) violate cleanup standards and trigger Ruff ERA001 warnings. Convert useful ones to logger.debug() and drop the rest.

-            #print("divs:", divs)
+            # logger.debug("divs: %s", divs)
...
-        #print("cards:", cards)
-        print("data:", data)
+        # logger.debug("cards: %s", cards)
+        logger.debug("Scraped %d hub items", len(items))

Also applies to: 57-58
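
A minimal sketch of the logging pattern suggested above, assuming a module-level logger named after the module. The function name `report_items` is illustrative only and does not appear in the PR.

```python
import logging

logger = logging.getLogger(__name__)


def report_items(items: list) -> int:
    """Log how many hub items were scraped instead of printing them."""
    # %-style arguments are only formatted if DEBUG is enabled for this logger.
    logger.debug("Scraped %d hub items", len(items))
    return len(items)
```

Unlike `print`, the message stays silent by default and shows up only when the application configures the DEBUG level.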


📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 419a85c and fce13e0.

📒 Files selected for processing (2)
  • src/cratedb_about/cli.py (2 hunks)
  • src/cratedb_about/hub/model.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
src/cratedb_about/hub/model.py

34-39, 41, 43-45, 56: Found commented-out code (ERA001)

53: SyntaxError: Positional argument cannot follow keyword argument

🪛 GitHub Actions: Tests
src/cratedb_about/hub/model.py

[error] 53-53: Ruff formatting check failed: Positional argument cannot follow keyword argument.

Comment on lines 27 to 31
def fetch(self):
    index_html = self.client.get(self.url)
    bs = BeautifulSoup(index_html, "html.parser")
    cards = bs.find_all(attrs={"class": "website-card"})
    data = []

🛠️ Refactor suggestion

No error handling around HTTP request & parsing

get_cache_client().get() may raise network-level exceptions or return non-200 responses, and BeautifulSoup parsing can fail if unexpected markup is returned.
Wrap the request in try/except and raise or log meaningful errors; consider passing a timeout.

🤖 Prompt for AI Agents
In src/cratedb_about/hub/model.py around lines 27 to 31, the fetch method lacks
error handling for the HTTP request and HTML parsing. Wrap the call to
self.client.get(self.url) in a try/except block to catch network exceptions and
handle non-200 HTTP responses by checking the status code. Also, add a timeout
parameter to the request if supported. Similarly, catch exceptions from
BeautifulSoup parsing and raise or log meaningful errors to improve robustness.
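
The request handling described above could look roughly like the following sketch. It makes several assumptions: the client is passed in as a callable `get` that accepts `timeout=` and returns an object with `status_code` and `text` (as requests-style clients do), and the names `fetch_html` and `HubFetchError` are hypothetical. Whether the PR's cache client behaves this way is not confirmed by the conversation.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


class HubFetchError(RuntimeError):
    """Raised when the hub page cannot be retrieved (hypothetical name)."""


def fetch_html(get: Callable[..., object], url: str, timeout: float = 10.0) -> str:
    """Fetch `url` through `get` and return the body text, or raise HubFetchError.

    `get` is assumed to accept `timeout=` and to return an object with
    `status_code` and `text` attributes.
    """
    try:
        response = get(url, timeout=timeout)
    except Exception as exc:  # network-level failure
        logger.error("Request to %s failed: %s", url, exc)
        raise HubFetchError(f"Request to {url} failed: {exc}") from exc
    status = getattr(response, "status_code", None)
    if status != 200:
        raise HubFetchError(f"Unexpected HTTP status {status} for {url}")
    return response.text
```

Raising one dedicated exception type lets the CLI layer catch a single class and turn it into a non-zero exit status.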

Comment on lines 103 to 112
@cli.command()
@click.pass_context
def hub(ctx: click.Context) -> None:
    """
    Inquire information from https://llmtxt.dev/hub.
    """
    LLMsTxtHub().fetch()
    logger.info("Ready.")

🛠️ Refactor suggestion

CLI command provides no output or feedback to the user

LLMsTxtHub().fetch() silently scrapes the site and the command ends with “Ready.” Users cannot see the harvested data or where it was stored.

Consider:

  1. Returning the list from fetch() and pretty-printing it (e.g. table, JSON, YAML) based on an optional --format flag.
  2. Catching and surfacing network errors so the CLI exits with a non-zero status on failure.
-@cli.command()
+@cli.command()
+@click.option("--format", "-f", type=click.Choice(["json", "yaml", "table"]), default="table")
 @click.pass_context
-def hub(ctx: click.Context) -> None:
+def hub(ctx: click.Context, format: str) -> None:
 ...
-    LLMsTxtHub().fetch()
-    logger.info("Ready.")
+    items = LLMsTxtHub().fetch()
+    if format == "json":
+        click.echo(json.dumps([dataclasses.asdict(i) for i in items], indent=2))
+    elif format == "yaml":
+        click.echo(yaml.safe_dump([dataclasses.asdict(i) for i in items]))
+    else:  # table
+        click.echo("\n".join(f"- {i.title}: {i.website}" for i in items))
+    logger.info("Scrape completed – %d items", len(items))

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/cratedb_about/cli.py around lines 103 to 110, the hub CLI command calls
LLMsTxtHub().fetch() but provides no output or feedback to the user about the
fetched data or its storage location. Modify the hub function to capture the
list returned by fetch(), then pretty-print this data in a user-friendly format
such as a table, JSON, or YAML, controlled by an optional --format flag.
Additionally, add error handling to catch network or fetch-related exceptions,
log an appropriate error message, and exit the CLI with a non-zero status code
on failure.
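
The formatting step can be separated from the CLI wiring, which keeps it testable without click. A minimal sketch, assuming a stand-in dataclass `Item` with only `title` and `website` fields (the real LLMsTxtHubItem has more) and a hypothetical function name `render_items`; YAML output is left out because it needs a third-party package.

```python
import dataclasses
import json
from typing import List


@dataclasses.dataclass
class Item:
    """Hypothetical stand-in for LLMsTxtHubItem with only two fields."""

    title: str
    website: str


def render_items(items: List[Item], fmt: str = "table") -> str:
    """Serialize items as JSON or as a plain bulleted list."""
    if fmt == "json":
        return json.dumps([dataclasses.asdict(i) for i in items], indent=2)
    if fmt == "table":
        return "\n".join(f"- {i.title}: {i.website}" for i in items)
    raise ValueError(f"Unsupported format: {fmt}")
```

The command body would then reduce to something like `click.echo(render_items(items, format))`.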

@amotl amotl force-pushed the this-and-that-2 branch 2 times, most recently from 348ec43 to d6017ce Compare May 16, 2025 09:25
Base automatically changed from this-and-that-2 to main May 16, 2025 11:09

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
src/cratedb_about/hub/model.py (1)

36-42: Add error handling for HTTP request and BeautifulSoup parsing

The fetch method lacks any error handling for HTTP requests or HTML parsing. If the request fails or the HTML structure is unexpected, it will raise exceptions that bubble up to the CLI without meaningful context.

Apply this diff to add proper error handling:

def fetch(self):
-    index_html = self.client.get(self.url)
-    bs = BeautifulSoup(index_html, "html.parser")
-    cards = bs.find_all(attrs={"class": "website-card"})
-    self.items = [self.card_to_model(card) for card in cards]
-    self.acquire_sizes()
-    return self
+    try:
+        index_html = self.client.get(self.url)
+        bs = BeautifulSoup(index_html, "html.parser")
+        cards = bs.find_all(attrs={"class": "website-card"})
+        self.items = [self.card_to_model(card) for card in cards]
+        self.acquire_sizes()
+        return self
+    except Exception as e:
+        logger.error(f"Failed to fetch hub data: {e}")
+        raise RuntimeError(f"Failed to fetch hub information from {self.url}: {e}") from e
src/cratedb_about/cli.py (1)

104-112: Improve CLI output with format options and error handling

The current implementation uses pprint directly and lacks output format options or error handling. This makes the command less flexible and robust than other commands in the CLI.

Apply this diff to add format options and error handling:

@cli.command()
+@click.option(
+    "--format", 
+    "-f", 
+    type=click.Choice(["json", "yaml", "table", "pretty"]), 
+    default="pretty",
+    help="Output format"
+)
@click.pass_context
-def hub(ctx: click.Context) -> None:
+def hub(ctx: click.Context, format: str) -> None:
    """
    Inquire information from https://llmtxt.dev/hub.
    """
-    txt_hub = LLMsTxtHub().fetch()
-    pprint(txt_hub.items)
-    logger.info("Ready.")
+    try:
+        txt_hub = LLMsTxtHub().fetch()
+        items = txt_hub.items
+        
+        if format == "json":
+            import json
+            click.echo(json.dumps([dataclasses.asdict(item) for item in items], indent=2))
+        elif format == "yaml":
+            import yaml
+            click.echo(yaml.safe_dump([dataclasses.asdict(item) for item in items]))
+        elif format == "table":
+            for item in items:
+                click.echo(f"- {item.title}: {item.website}")
+                if item.description:
+                    click.echo(f"  {item.description}")
+                if item.tags:
+                    click.echo(f"  Tags: {', '.join(item.tags)}")
+                if item.resources:
+                    click.echo(f"  Resources: {len(item.resources)}")
+                click.echo("")
+        else:  # pretty
+            from pprint import pprint
+            pprint(items)
+            
+        logger.info(f"Successfully retrieved {len(items)} items.")
+    except Exception as e:
+        logger.error(f"Failed to retrieve hub information: {e}")
+        ctx.exit(1)
🧰 Tools
🪛 Ruff (0.11.9)

111-111: pprint found

Remove pprint

(T203)

🧹 Nitpick comments (2)
src/cratedb_about/hub/model.py (1)

44-53: Consider batching or limiting the number of HTTP requests

The acquire_sizes method makes an HTTP request for each resource of each item, which could be many requests. This might hit rate limits or take a long time to complete.

Consider adding a limit parameter or implementing batch processing:

-def acquire_sizes(self):
+def acquire_sizes(self, limit: int = 100):
    logger.info(f"Acquiring sizes for {len(self.items)} items")
+    request_count = 0
    for item in self.items:
        logger.info(f"Acquiring size for {item}")
        for resource in item.resources:
+            if request_count >= limit:
+                logger.warning(f"Reached limit of {limit} requests, stopping size acquisition")
+                return
            try:
                response = self.client.get(resource.url)
                resource.size = len(response.text)
+                request_count += 1
            except Exception as e:
                logger.warning(f"Failed to acquire size for {item}: {e}")
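
The same cap can be expressed without a manual counter. A minimal sketch, assuming resources are plain objects with `url` and `size` attributes and that `get_text` is a hypothetical callable returning the body for a URL. Note that this variant counts attempts, whereas the diff above counts only successful requests.

```python
import itertools
import logging
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

logger = logging.getLogger(__name__)


@dataclass
class Resource:
    """Illustrative stand-in for the PR's Resource dataclass."""

    url: str
    size: Optional[int] = None


def acquire_sizes(
    resources: Iterable[Resource], get_text: Callable[[str], str], limit: int = 100
) -> int:
    """Fill in `size` for at most `limit` resources; return the number attempted."""
    attempted = 0
    # islice stops after `limit` resources, so no explicit counter check is needed.
    for resource in itertools.islice(resources, limit):
        attempted += 1
        try:
            resource.size = len(get_text(resource.url))
        except Exception as exc:
            logger.warning("Failed to acquire size for %s: %s", resource.url, exc)
    return attempted
```

Keep in mind that `len()` on a `str` counts characters, not bytes, so the result can differ from the on-disk size for non-ASCII content.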
src/cratedb_about/cli.py (1)

108-109: Enhance CLI command documentation

The command description is minimal and could provide more details about what information is being retrieved.

Apply this diff to improve the documentation:

    """
-    Inquire information from https://llmtxt.dev/hub.
+    Retrieve and display information about LLM tools and resources from https://llmtxt.dev/hub.
+    
+    This command scrapes the llmtxt.dev/hub webpage, extracts information about LLM tools,
+    and displays the results in the specified format.
    """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fce13e0 and 1f2b422.

📒 Files selected for processing (2)
  • src/cratedb_about/cli.py (2 hunks)
  • src/cratedb_about/hub/model.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
src/cratedb_about/hub/model.py

68-68: Line too long (131 > 100)

(E501)

src/cratedb_about/cli.py

111-111: pprint found

Remove pprint

(T203)

🪛 GitHub Actions: Tests
src/cratedb_about/hub/model.py

[error] 1-1: Ruff formatting check failed. File would be reformatted. Run 'ruff format' to fix code style issues.

Comment on lines +55 to +68
@staticmethod
def card_to_model(card):
    divs = card.find(name="div")
    title = divs.find(name="h3").text
    tags = []
    for tag in divs.find_all(name="span"):
        tags.append(tag.text)
    website = divs.find(name="p", attrs={"class": "text-sm"}).text
    description = divs.find(name="p", attrs={"class": "text-sm", "title": True}).text
    logo_url = divs.find(name="img").get("src")
    resources = []
    for anchor in divs.find_all(name="a"):
        resources.append(Resource(url=anchor.get("href")))
    return LLMsTxtHubItem(title=title, website=website, description=description, logo=logo_url, tags=tags, resources=resources)

⚠️ Potential issue

Add error handling and type annotations to card_to_model method

The card_to_model method lacks error handling for HTML parsing and is missing return type annotation. If the HTML structure changes, the method will fail with confusing errors.

Apply this diff to add error handling and type annotation:

@staticmethod
-def card_to_model(card):
-    divs = card.find(name="div")
-    title = divs.find(name="h3").text
-    tags = []
-    for tag in divs.find_all(name="span"):
-        tags.append(tag.text)
-    website = divs.find(name="p", attrs={"class": "text-sm"}).text
-    description = divs.find(name="p", attrs={"class": "text-sm", "title": True}).text
-    logo_url = divs.find(name="img").get("src")
-    resources = []
-    for anchor in divs.find_all(name="a"):
-        resources.append(Resource(url=anchor.get("href")))
-    return LLMsTxtHubItem(title=title, website=website, description=description, logo=logo_url, tags=tags, resources=resources)
+def card_to_model(card) -> LLMsTxtHubItem:
+    try:
+        divs = card.find(name="div")
+        if not divs:
+            raise ValueError("Card structure doesn't contain expected div element")
+            
+        title_elem = divs.find(name="h3")
+        if not title_elem:
+            raise ValueError("Missing title element (h3)")
+        title = title_elem.text
+        
+        tags = []
+        for tag in divs.find_all(name="span"):
+            tags.append(tag.text)
+            
+        website_elem = divs.find(name="p", attrs={"class": "text-sm"})
+        if not website_elem:
+            raise ValueError("Missing website element (p.text-sm)")
+        website = website_elem.text
+        
+        desc_elem = divs.find(name="p", attrs={"class": "text-sm", "title": True})
+        if not desc_elem:
+            raise ValueError("Missing description element (p.text-sm[title])")
+        description = desc_elem.text
+        
+        logo_elem = divs.find(name="img")
+        logo_url = logo_elem.get("src") if logo_elem else None
+        
+        resources = []
+        for anchor in divs.find_all(name="a"):
+            href = anchor.get("href")
+            if href:
+                resources.append(Resource(url=href))
+                
+        return LLMsTxtHubItem(
+            title=title, 
+            website=website, 
+            description=description, 
+            logo=logo_url, 
+            tags=tags, 
+            resources=resources
+        )
+    except Exception as e:
+        logger.error(f"Failed to parse card: {e}")
+        raise ValueError(f"Failed to parse card: {e}") from e

🤖 Prompt for AI Agents
In src/cratedb_about/hub/model.py around lines 55 to 68, the card_to_model
method lacks error handling for potential missing HTML elements and is missing a
return type annotation. Add appropriate try-except blocks to catch and handle
exceptions during HTML parsing, returning None or raising a clear error if
required elements are missing. Also, add a return type annotation indicating the
method returns an instance of LLMsTxtHubItem or None to improve code clarity and
robustness.
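
If a single malformed card should not abort the whole scrape, the caller can collect the cards that parse and skip the rest. A minimal sketch under that assumption; `parse_cards` is a hypothetical helper, and `parse_one` stands for any function that raises ValueError on bad input, such as the revised `card_to_model` proposed above.

```python
import logging
from typing import Callable, Iterable, List, TypeVar

logger = logging.getLogger(__name__)

T = TypeVar("T")


def parse_cards(cards: Iterable[object], parse_one: Callable[[object], T]) -> List[T]:
    """Parse every card, logging and skipping the ones that raise ValueError."""
    items: List[T] = []
    for index, card in enumerate(cards):
        try:
            items.append(parse_one(card))
        except ValueError as exc:
            logger.warning("Skipping card %d: %s", index, exc)
    return items
```

Whether skipping or failing fast is preferable depends on how the results are used; for a size comparison across vendors, a partial list with warnings may be acceptable.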

import logging
import typing as t
from pathlib import Path
from pprint import pprint

🛠️ Refactor suggestion

Use dedicated formatters instead of pprint

The pprint module is flagged by the linter as it's generally intended for debugging rather than production code.

Apply this diff to use structured formatters instead:

-from pprint import pprint
+import dataclasses
+import json

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/cratedb_about/cli.py at line 4, replace the use of the pprint module with
a dedicated structured formatter suitable for production, such as the logging
module's formatters or JSON formatting. Remove the import of pprint and update
the code to output data using the chosen formatter to ensure cleaner, more
maintainable, and production-appropriate output.
