-
Couldn't load subscription status.
- Fork 0
[WIP] Hub: Inquire information from https://llmtxt.dev/hub #36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughA new CLI command named Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant CLI
participant LLMsTxtHub
participant WebPage
User->>CLI: Invoke "hub" command
CLI->>LLMsTxtHub: Create instance
CLI->>LLMsTxtHub: Call fetch()
LLMsTxtHub->>WebPage: GET https://llmtxt.dev/hub
WebPage-->>LLMsTxtHub: Return HTML content
LLMsTxtHub->>LLMsTxtHub: Parse HTML, extract items
LLMsTxtHub->>WebPage: GET resource URLs (acquire_sizes)
WebPage-->>LLMsTxtHub: Return resource content
LLMsTxtHub-->>CLI: Return populated instance
CLI->>CLI: Pretty-print items and log "Ready."
Poem
✨ Finishing Touches
🧪 Generate Unit Tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (2)
src/cratedb_about/hub/model.py (2)
33-47: Fragile HTML traversal may raiseAttributeErrorThe chain
divs.find(...).textassumes every node exists. If the remote page layout changes,find()may returnNoneand.textwill crash.
Guard withif not x: continueor useget_text(strip=True)with default values to keep the scraper resilient.🧰 Tools
🪛 Ruff (0.8.2)
34-34: Found commented-out code
(ERA001)
35-35: Found commented-out code
(ERA001)
36-36: Found commented-out code
(ERA001)
37-37: Found commented-out code
(ERA001)
38-38: Found commented-out code
(ERA001)
39-39: Found commented-out code
(ERA001)
41-41: Found commented-out code
(ERA001)
43-43: Found commented-out code
(ERA001)
44-44: Found commented-out code
(ERA001)
45-45: Found commented-out code
(ERA001)
34-45: Remove commented-out debug code &Nine blocks (
#continue, etc.) violate cleanup standards and trigger Ruff ERA001 warnings. Convert useful ones tologger.debug()and drop the rest.- #print("divs:", divs) + # logger.debug("divs: %s", divs) ... - #print("cards:", cards) - print("data:", data) + # logger.debug("cards: %s", cards) + logger.debug("Scraped %d hub items", len(items))Also applies to: 57-58
🧰 Tools
🪛 Ruff (0.8.2)
34-34: Found commented-out code
(ERA001)
35-35: Found commented-out code
(ERA001)
36-36: Found commented-out code
(ERA001)
37-37: Found commented-out code
(ERA001)
38-38: Found commented-out code
(ERA001)
39-39: Found commented-out code
(ERA001)
41-41: Found commented-out code
(ERA001)
43-43: Found commented-out code
(ERA001)
44-44: Found commented-out code
(ERA001)
45-45: Found commented-out code
(ERA001)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/cratedb_about/cli.py(2 hunks)src/cratedb_about/hub/model.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
src/cratedb_about/hub/model.py
34-34: Found commented-out code
(ERA001)
35-35: Found commented-out code
(ERA001)
36-36: Found commented-out code
(ERA001)
37-37: Found commented-out code
(ERA001)
38-38: Found commented-out code
(ERA001)
39-39: Found commented-out code
(ERA001)
41-41: Found commented-out code
(ERA001)
43-43: Found commented-out code
(ERA001)
44-44: Found commented-out code
(ERA001)
45-45: Found commented-out code
(ERA001)
53-53: SyntaxError: Positional argument cannot follow keyword argument
56-56: Found commented-out code
(ERA001)
🪛 GitHub Actions: Tests
src/cratedb_about/hub/model.py
[error] 53-53: Ruff formatting check failed: Positional argument cannot follow keyword argument.
src/cratedb_about/hub/model.py
Outdated
| def fetch(self): | ||
| index_html = self.client.get(self.url) | ||
| bs = BeautifulSoup(index_html, "html.parser") | ||
| cards = bs.find_all(attrs={"class": "website-card"}) | ||
| data = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
No error handling around HTTP request & parsing
get_cache_client().get() may raise network-level exceptions or return non-200 responses, and BeautifulSoup parsing can fail if unexpected markup is returned.
Wrap the request in try/except and raise or log meaningful errors; consider passing a timeout.
🤖 Prompt for AI Agents
In src/cratedb_about/hub/model.py around lines 27 to 31, the fetch method lacks
error handling for the HTTP request and HTML parsing. Wrap the call to
self.client.get(self.url) in a try/except block to catch network exceptions and
handle non-200 HTTP responses by checking the status code. Also, add a timeout
parameter to the request if supported. Similarly, catch exceptions from
BeautifulSoup parsing and raise or log meaningful errors to improve robustness.
| @cli.command() | ||
| @click.pass_context | ||
| def hub(ctx: click.Context) -> None: | ||
| """ | ||
| Inquire information from https://llmtxt.dev/hub. | ||
| """ | ||
| LLMsTxtHub().fetch() | ||
| logger.info("Ready.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
CLI command provides no output or feedback to the user
LLMsTxtHub().fetch() silently scrapes the site and the command ends with “Ready.” Users cannot see the harvested data or where it was stored.
Consider:
- Returning the list from
fetch()and pretty-printing it (e.g. table, JSON, YAML) based on an optional--formatflag. - Catching and surfacing network errors so the CLI exits with a non-zero status on failure.
-@cli.command()
+@cli.command()
+@click.option("--format", "-f", type=click.Choice(["json", "yaml", "table"]), default="table")
@click.pass_context
-def hub(ctx: click.Context) -> None:
+def hub(ctx: click.Context, format: str) -> None:
...
- LLMsTxtHub().fetch()
- logger.info("Ready.")
+ items = LLMsTxtHub().fetch()
+ if format == "json":
+ click.echo(json.dumps([dataclasses.asdict(i) for i in items], indent=2))
+ elif format == "yaml":
+ click.echo(yaml.safe_dump([dataclasses.asdict(i) for i in items]))
+ else: # table
+ click.echo("\n".join(f"- {i.name}: {i.website}" for i in items))
+ logger.info("Scrape completed – %d items", len(items))Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/cratedb_about/cli.py around lines 103 to 110, the hub CLI command calls
LLMsTxtHub().fetch() but provides no output or feedback to the user about the
fetched data or its storage location. Modify the hub function to capture the
list returned by fetch(), then pretty-print this data in a user-friendly format
such as a table, JSON, or YAML, controlled by an optional --format flag.
Additionally, add error handling to catch network or fetch-related exceptions,
log an appropriate error message, and exit the CLI with a non-zero status code
on failure.
348ec43 to
d6017ce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/cratedb_about/hub/model.py (1)
36-42: Add error handling for HTTP request and BeautifulSoup parsingThe
fetchmethod lacks any error handling for HTTP requests or HTML parsing. If the request fails or the HTML structure is unexpected, it will raise exceptions that bubble up to the CLI without meaningful context.Apply this diff to add proper error handling:
def fetch(self): - index_html = self.client.get(self.url) - bs = BeautifulSoup(index_html, "html.parser") - cards = bs.find_all(attrs={"class": "website-card"}) - self.items = [self.card_to_model(card) for card in cards] - self.acquire_sizes() - return self + try: + index_html = self.client.get(self.url) + bs = BeautifulSoup(index_html, "html.parser") + cards = bs.find_all(attrs={"class": "website-card"}) + self.items = [self.card_to_model(card) for card in cards] + self.acquire_sizes() + return self + except Exception as e: + logger.error(f"Failed to fetch hub data: {e}") + raise RuntimeError(f"Failed to fetch hub information from {self.url}: {e}") from esrc/cratedb_about/cli.py (1)
104-112: Improve CLI output with format options and error handlingThe current implementation uses
pprintdirectly and lacks output format options or error handling. This makes the command less flexible and robust than other commands in the CLI.Apply this diff to add format options and error handling:
@cli.command() +@click.option( + "--format", + "-f", + type=click.Choice(["json", "yaml", "table", "pretty"]), + default="pretty", + help="Output format" +) @click.pass_context -def hub(ctx: click.Context) -> None: +def hub(ctx: click.Context, format: str) -> None: """ Inquire information from https://llmtxt.dev/hub. """ - txt_hub = LLMsTxtHub().fetch() - pprint(txt_hub.items) - logger.info("Ready.") + try: + txt_hub = LLMsTxtHub().fetch() + items = txt_hub.items + + if format == "json": + import json + click.echo(json.dumps([dataclasses.asdict(item) for item in items], indent=2)) + elif format == "yaml": + import yaml + click.echo(yaml.safe_dump([dataclasses.asdict(item) for item in items])) + elif format == "table": + for item in items: + click.echo(f"- {item.title}: {item.website}") + if item.description: + click.echo(f" {item.description}") + if item.tags: + click.echo(f" Tags: {', '.join(item.tags)}") + if item.resources: + click.echo(f" Resources: {len(item.resources)}") + click.echo("") + else: # pretty + from pprint import pprint + pprint(items) + + logger.info(f"Successfully retrieved {len(items)} items.") + except Exception as e: + logger.error(f"Failed to retrieve hub information: {e}") + ctx.exit(1)🧰 Tools
🪛 Ruff (0.11.9)
111-111:
pprintfoundRemove
pprint(T203)
🧹 Nitpick comments (2)
src/cratedb_about/hub/model.py (1)
44-53: Consider batching or limiting the number of HTTP requestsThe
acquire_sizesmethod makes an HTTP request for each resource of each item, which could be many requests. This might hit rate limits or take a long time to complete.Consider adding a limit parameter or implementing batch processing:
-def acquire_sizes(self): +def acquire_sizes(self, limit: int = 100): logger.info(f"Acquiring sizes for {len(self.items)} items") + request_count = 0 for item in self.items: logger.info(f"Acquiring size for {item}") for resource in item.resources: + if request_count >= limit: + logger.warning(f"Reached limit of {limit} requests, stopping size acquisition") + return try: response = self.client.get(resource.url) resource.size = len(response.text) + request_count += 1 except Exception as e: logger.warning(f"Failed to acquire size for {item}: {e}")src/cratedb_about/cli.py (1)
108-109: Enhance CLI command documentationThe command description is minimal and could provide more details about what information is being retrieved.
Apply this diff to improve the documentation:
""" - Inquire information from https://llmtxt.dev/hub. + Retrieve and display information about LLM tools and resources from https://llmtxt.dev/hub. + + This command scrapes the llmtxt.dev/hub webpage, extracts information about LLM tools, + and displays the results in the specified format. """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/cratedb_about/cli.py(2 hunks)src/cratedb_about/hub/model.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
src/cratedb_about/hub/model.py
68-68: Line too long (131 > 100)
(E501)
src/cratedb_about/cli.py
111-111: pprint found
Remove pprint
(T203)
🪛 GitHub Actions: Tests
src/cratedb_about/hub/model.py
[error] 1-1: Ruff formatting check failed. File would be reformatted. Run 'ruff format' to fix code style issues.
| @staticmethod | ||
| def card_to_model(card): | ||
| divs = card.find(name="div") | ||
| title = divs.find(name="h3").text | ||
| tags = [] | ||
| for tag in divs.find_all(name="span"): | ||
| tags.append(tag.text) | ||
| website = divs.find(name="p", attrs={"class": "text-sm"}).text | ||
| description = divs.find(name="p", attrs={"class": "text-sm", "title": True}).text | ||
| logo_url = divs.find(name="img").get("src") | ||
| resources = [] | ||
| for anchor in divs.find_all(name="a"): | ||
| resources.append(Resource(url=anchor.get("href"))) | ||
| return LLMsTxtHubItem(title=title, website=website, description=description, logo=logo_url, tags=tags, resources=resources) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add error handling and type annotations to card_to_model method
The card_to_model method lacks error handling for HTML parsing and is missing return type annotation. If the HTML structure changes, the method will fail with confusing errors.
Apply this diff to add error handling and type annotation:
@staticmethod
-def card_to_model(card):
- divs = card.find(name="div")
- title = divs.find(name="h3").text
- tags = []
- for tag in divs.find_all(name="span"):
- tags.append(tag.text)
- website = divs.find(name="p", attrs={"class": "text-sm"}).text
- description = divs.find(name="p", attrs={"class": "text-sm", "title": True}).text
- logo_url = divs.find(name="img").get("src")
- resources = []
- for anchor in divs.find_all(name="a"):
- resources.append(Resource(url=anchor.get("href")))
- return LLMsTxtHubItem(title=title, website=website, description=description, logo=logo_url, tags=tags, resources=resources)
+def card_to_model(card) -> LLMsTxtHubItem:
+ try:
+ divs = card.find(name="div")
+ if not divs:
+ raise ValueError("Card structure doesn't contain expected div element")
+
+ title_elem = divs.find(name="h3")
+ if not title_elem:
+ raise ValueError("Missing title element (h3)")
+ title = title_elem.text
+
+ tags = []
+ for tag in divs.find_all(name="span"):
+ tags.append(tag.text)
+
+ website_elem = divs.find(name="p", attrs={"class": "text-sm"})
+ if not website_elem:
+ raise ValueError("Missing website element (p.text-sm)")
+ website = website_elem.text
+
+ desc_elem = divs.find(name="p", attrs={"class": "text-sm", "title": True})
+ if not desc_elem:
+ raise ValueError("Missing description element (p.text-sm[title])")
+ description = desc_elem.text
+
+ logo_elem = divs.find(name="img")
+ logo_url = logo_elem.get("src") if logo_elem else None
+
+ resources = []
+ for anchor in divs.find_all(name="a"):
+ href = anchor.get("href")
+ if href:
+ resources.append(Resource(url=href))
+
+ return LLMsTxtHubItem(
+ title=title,
+ website=website,
+ description=description,
+ logo=logo_url,
+ tags=tags,
+ resources=resources
+ )
+ except Exception as e:
+ logger.error(f"Failed to parse card: {e}")
+ raise ValueError(f"Failed to parse card: {e}") from e📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @staticmethod | |
| def card_to_model(card): | |
| divs = card.find(name="div") | |
| title = divs.find(name="h3").text | |
| tags = [] | |
| for tag in divs.find_all(name="span"): | |
| tags.append(tag.text) | |
| website = divs.find(name="p", attrs={"class": "text-sm"}).text | |
| description = divs.find(name="p", attrs={"class": "text-sm", "title": True}).text | |
| logo_url = divs.find(name="img").get("src") | |
| resources = [] | |
| for anchor in divs.find_all(name="a"): | |
| resources.append(Resource(url=anchor.get("href"))) | |
| return LLMsTxtHubItem(title=title, website=website, description=description, logo=logo_url, tags=tags, resources=resources) | |
| @staticmethod | |
| def card_to_model(card) -> LLMsTxtHubItem: | |
| try: | |
| divs = card.find(name="div") | |
| if not divs: | |
| raise ValueError("Card structure doesn't contain expected div element") | |
| title_elem = divs.find(name="h3") | |
| if not title_elem: | |
| raise ValueError("Missing title element (h3)") | |
| title = title_elem.text | |
| tags = [] | |
| for tag in divs.find_all(name="span"): | |
| tags.append(tag.text) | |
| website_elem = divs.find(name="p", attrs={"class": "text-sm"}) | |
| if not website_elem: | |
| raise ValueError("Missing website element (p.text-sm)") | |
| website = website_elem.text | |
| desc_elem = divs.find(name="p", attrs={"class": "text-sm", "title": True}) | |
| if not desc_elem: | |
| raise ValueError("Missing description element (p.text-sm[title])") | |
| description = desc_elem.text | |
| logo_elem = divs.find(name="img") | |
| logo_url = logo_elem.get("src") if logo_elem else None | |
| resources = [] | |
| for anchor in divs.find_all(name="a"): | |
| href = anchor.get("href") | |
| if href: | |
| resources.append(Resource(url=href)) | |
| return LLMsTxtHubItem( | |
| title=title, | |
| website=website, | |
| description=description, | |
| logo=logo_url, | |
| tags=tags, | |
| resources=resources, | |
| ) | |
| except Exception as e: | |
| logger.error(f"Failed to parse card: {e}") | |
| raise ValueError(f"Failed to parse card: {e}") from e |
🧰 Tools
🪛 Ruff (0.11.9)
68-68: Line too long (131 > 100)
(E501)
🤖 Prompt for AI Agents
In src/cratedb_about/hub/model.py around lines 55 to 68, the card_to_model
method lacks error handling for potential missing HTML elements and is missing a
return type annotation. Add appropriate try-except blocks to catch and handle
exceptions during HTML parsing, returning None or raising a clear error if
required elements are missing. Also, add a return type annotation indicating the
method returns an instance of LLMsTxtHubItem or None to improve code clarity and
robustness.
| import logging | ||
| import typing as t | ||
| from pathlib import Path | ||
| from pprint import pprint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Use dedicated formatters instead of pprint
The pprint module is flagged by the linter as it's generally intended for debugging rather than production code.
Apply this diff to use structured formatters instead:
-from pprint import pprint
+import dataclasses
+import jsonCommittable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/cratedb_about/cli.py at line 4, replace the use of the pprint module with
a dedicated structured formatter suitable for production, such as the logging
module's formatters or JSON formatting. Remove the import of pprint and update
the code to output data using the chosen formatter to ensure cleaner, more
maintainable, and production-appropriate output.
About
Inquire information from https://llmtxt.dev/hub, in order to ...
about/docs/backlog.md
Lines 4 to 6 in 419a85c