
Improve Per-extractor Instructions #170

Merged
elzik merged 23 commits into main from improve-per-extractor-instructions on Nov 29, 2025

Conversation


@elzik elzik commented Nov 1, 2025

Summary by CodeRabbit

  • New Features

    • Reddit summaries now produce strict HTML outputs with per-post sections and HTML links to posts and top replies.
  • Updates

    • Extraction now captures and exposes original post URLs across posts, comments and extracted records.
    • Subreddit content titles may include an “as of” timestamp using an injected time provider.
  • Documentation

    • Summarisation instructions replaced with Reddit- and HTML-focused specs; README adds Time Zone guidance.
  • Chores

    • Added a local VS Code launch configuration.
  • Tests

    • Tests updated to validate URL, timestamp and PostUrl behavior; Docker tests grouped for serial execution.


Copilot AI review requested due to automatic review settings on November 1, 2025 at 10:02

coderabbitai bot commented Nov 1, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Propagates OriginalUrl/PostUrl through extractors and models, injects IOptions into RawRedditPostTransformer to derive host-based comment URLs, adds TimeProvider-based instance URLs in the subreddit extractor, updates UntypedExtract/Extract signatures, and adjusts tests, tooling, and docs accordingly.
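
For orientation, a minimal sketch of the extract shapes implied by this walkthrough; whether they are records or classes, and their exact member lists, are assumptions based on the change summary and the updated tests rather than the literal code in the PR.

    // Assumed shapes only: the real definitions live in src/Elzik.Breef.Domain
    // and src/Elzik.Breef.Infrastructure/ContentExtractors.
    public record UntypedExtract(string Title, string Content, string OriginalUrl, string? PreviewImageUrl);

    public record Extract(string Title, string Content, string OriginalUrl, string? PreviewImageUrl, string ExtractType);

    // BreefGenerator now builds the Breef from extract.OriginalUrl rather than the URL
    // the caller supplied, so extractors such as SubRedditContentExtractor can substitute
    // an instance-specific URL (webPageUrl#{timestamp}).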

Changes

  • Summarisation instructions (src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md, src/Elzik.Breef.Infrastructure/SummarisationInstructions/RedditPostContent.md, src/Elzik.Breef.Infrastructure/SummarisationInstructions/HtmlContent.md): Replace generic instruction blocks with structured, HTML-only summarisation specs defining the task, input schema, output requirements, HTML link formatting, length rules, and explicit exclusions.
  • Reddit transformer & URL propagation (src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs): Add an IOptions<RedditOptions> dependency; validate and derive the host from DefaultBaseAddress; add a Transform(RawRedditPost) entry point; propagate (subreddit, postId, host) through the TransformComments overloads; compute and populate PostUrl for posts, comments, and replies; integrate image extraction.
  • Models — Reddit post/comment (src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/RedditPost.cs): Add a PostUrl property to RedditPostContent and RedditComment and populate it during transformation.
  • UntypedExtract / Extract / extractor call sites (src/Elzik.Breef.Infrastructure/ContentExtractors/UntypedExtract.cs, src/Elzik.Breef.Domain/Extract.cs, src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/RedditPostContentExtractor.cs, src/Elzik.Breef.Infrastructure/ContentExtractors/HtmlContentExtractor.cs, src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/SubRedditContentExtractor.cs): Add an OriginalUrl parameter to UntypedExtract and Extract and thread it through the HTML and Reddit extractors; SubRedditContentExtractor now accepts a TimeProvider, appends "as of {date/time}" to titles, and composes instance-specific original URLs (webPageUrl#{timestamp}).
  • Application wiring (src/Elzik.Breef.Application/BreefGenerator.cs): Construct the Breef using extract.OriginalUrl instead of the input URL previously passed through.
  • Exception style (src/Elzik.Breef.Infrastructure/CallerFixableHttpRequestException.cs): Convert the class to C# primary-constructor form while preserving the HttpRequestException base type and the ICallerFixableException interface.
  • Tests — unit & integration (tests/.../Reddit/*, tests/.../ContentExtractors/*, tests/Elzik.Breef.Infrastructure.Tests.*, tests/Elzik.Breef.Api.Tests.*): Update tests to supply IOptions<RedditOptions> (DefaultBaseAddress) to transformers, assert PostUrl and OriginalUrl, inject FakeTimeProvider from Microsoft.Extensions.TimeProvider.Testing, adapt helpers to primary-constructor style, and update sample/test data to include PostUrl/OriginalUrl.
  • Dev tooling / samples / docs (.vscode/launch.json, tests/SampleRequests/Elzik.Breef.Api.http, README.md): Add a VS Code CoreCLR launch configuration, normalize the HTTP sample host variable to {{host}}, and add Time Zone guidance with a Docker TZ example to the README.
  • Test orchestration (tests/Elzik.Breef.Api.Tests.Functional/DockerTestsCollection.cs, tests/Elzik.Breef.Api.Tests.Functional/*Docker.cs): Add an xUnit Docker test collection and annotate Docker-based functional tests with [Collection("Docker Tests")] so they run serially (sketched just below).
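
As a rough illustration of that test grouping, here is a sketch assuming standard xUnit v2 attributes; the second class name is hypothetical, the real classes are the *Docker.cs functional tests.

    using Xunit;

    // Defining a named collection and putting every Docker-based test class in it
    // means xUnit never runs those classes in parallel with one another.
    [CollectionDefinition("Docker Tests")]
    public class DockerTestsCollection { }

    [Collection("Docker Tests")]
    public class ExampleDockerTests // hypothetical name, for illustration only
    {
        [Fact]
        public void StartsContainerWithoutClashingWithOtherDockerTests() { /* ... */ }
    }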

Sequence Diagram(s)

sequenceDiagram
    participant Client as Caller / Tests
    participant RawClient as RawRedditClient
    participant Transformer as RawRedditPostTransformer
    participant Models as RedditPost Models
    participant Extractor as SubredditContentExtractor / UntypedExtract

    Client->>RawClient: request subreddit/post JSON
    RawClient->>Transformer: deliver RawRedditPost payload
    Transformer->>Transformer: validate/derive host from RedditOptions.DefaultBaseAddress
    Transformer->>Transformer: parse listing → postId, subreddit, comments
    Transformer->>Transformer: set Post.PostUrl = webPageUrl
    Transformer->>Transformer: select best image
    Transformer->>Transformer: recursively TransformComments(replies, subreddit, postId, host)
    note right of Transformer: build per-comment PostUrl using host/subreddit/postId/commentId
    Transformer->>Models: construct RedditPost (post + comments with PostUrl)
    Transformer-->>Extractor: return transformed RedditPost
    Extractor->>Extractor: compose UntypedExtract(title, content, OriginalUrl, image)
    Extractor->>Client: forward extract (OriginalUrl included)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay special attention to:
    • Recursive comment transformation overloads and correct propagation/formatting of PostUrl
    • Validation/parsing of RedditOptions.DefaultBaseAddress (absolute URI expectations)
    • Signature changes to UntypedExtract and Extract and all affected call sites
    • Tests injecting FakeTimeProvider and added time-testing package references
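
A hedged sketch of the time-testing setup mentioned in the bullet above; the package and type names are real, but how the values are handed to the components under test is an assumption, so see the updated test files for the exact constructors.

    using Microsoft.Extensions.Options;
    using Microsoft.Extensions.Time.Testing;

    // Pin "now" so the "as of {date/time}" title suffix and the
    // webPageUrl#{timestamp} original URL are deterministic in assertions.
    var fakeTime = new FakeTimeProvider(new DateTimeOffset(2025, 11, 1, 10, 0, 0, TimeSpan.Zero));

    // The transformer receives its host configuration the same way production wiring supplies it
    // (assuming DefaultBaseAddress is a settable string property on RedditOptions).
    var redditOptions = Options.Create(new RedditOptions { DefaultBaseAddress = "https://www.test-reddit.com" });

    // Both values are then passed to the extractor/transformer constructors under test.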

Possibly related PRs

Poem

I hop through JSON, stitch each link with care,
I tuck PostUrl in posts and replies there,
I chase the host and mark each comment's place,
Time-stamped titles trimmed with gentle pace,
A rabbit leaves a tidy HTML trace 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): The title 'Improve Per-extractor Instructions' clearly and concisely describes the main focus of the changeset, which centers on updating and restructuring summarization instruction files for different content extractors (Reddit, HTML) and improving their formatting and clarity.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c5afe8a and 6f90e5d.

📒 Files selected for processing (1)
  • src/Elzik.Breef.Infrastructure/SummarisationInstructions/HtmlContent.md (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-ubuntu
  • GitHub Check: build-ubuntu
  • GitHub Check: Analyze (csharp)
🔇 Additional comments (1)
src/Elzik.Breef.Infrastructure/SummarisationInstructions/HtmlContent.md (1)

11-19: Markdown linting issues resolved and structure improved.

The restructuring successfully addresses the previous MD007 violations from lines 12–14 and 16–19:

  • Sequential numbering: 1, 2 (previously jumped to 1, 5)
  • Proper 3-space indentation for nested bullets under numbered items (previously 2-space)
  • Root-level list items positioned at column 0

The new task-template structure with explicit Input Structure, Requirements, and Output Formatting sections improves clarity and aligns with the PR objective to enhance per-extractor instructions. Functional requirements remain intact: HTML entity inclusion, 10%/200-word limits, accurate attribution, and exclusions of links, title, metadata, and code blocks.




Copilot AI left a comment


Pull Request Overview

This PR updates the SubredditContent.md summarization instructions to provide more detailed and structured guidance for summarizing Reddit subreddit data. The update transforms the brief instructions into a comprehensive guide with clear sections for task description, input structure, requirements, and output format.

Key Changes:

  • Restructured instructions from a simple bullet list to a well-organized document with markdown sections
  • Added detailed specifications for handling Reddit-specific data structures (posts, comments, nested replies)
  • Introduced HTML link formatting requirements for post titles and comment author attributions
  • Clarified summary length constraints and high-scoring content prioritization



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a69e6ae and 941f61e.

📒 Files selected for processing (1)
  • src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md

[grammar] ~14-~14: Ensure spelling is correct
Context: ...e/themes of the subreddit 2. Posts: Sumarise every post with a thematic summary of i...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~30-~30: Use a hyphen to join words.
Context: ...reddit itself - Summaries of the highest scoring top-level posts

(QB_NEW_EN_HYPHEN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-ubuntu
  • GitHub Check: Analyze (csharp)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md (2)

14-14: Fix spelling: "Sumarise" → "Summarise".

This spelling error was flagged in previous reviews and remains unaddressed. Correct to use consistent UK spelling throughout.

-2. **Posts**: Sumarise every post with a thematic summary of its comments
+2. **Posts**: Summarise every post with a thematic summary of its comments

29-31: Complete the final bullet-point instruction.

Lines 29–31 contain bullet points that provide guidance but lack imperative verbs to form complete instructions. Line 31 ("Summaries of the highest scoring top-level posts") is particularly unclear—it reads as a noun phrase rather than a directive.

Consider revising to clarify intent:

  - Strictly well-formatted HTML output
  - Do not include any markdown notation nor put the summary in a codeblock
  - Brief overview of themes covered in this specific JSON document
  - Do not include a general description of the subreddit itself
- - Summaries of the highest scoring top-level posts
+ - Prioritize summaries of the highest-scoring top-level posts in the output
🧹 Nitpick comments (4)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (1)

239-247: Silent exception swallowing may hide deserialization issues.

The empty catch block discards all exceptions during JSON deserialization, making it difficult to diagnose issues with malformed Reddit API responses.

Consider logging the exception or at least catching a specific exception type:

             try
             {
                 var deserializedListing = JsonSerializer.Deserialize<RawRedditListing>(jsonElement.GetRawText());
                 return TransformComments(deserializedListing, subreddit, postId, host);
             }
-            catch
+            catch (JsonException)
             {
+                // Replies may have unexpected format in some edge cases
                 return [];
             }
tests/Elzik.Breef.Infrastructure.Tests.Integration/ContentExtractors/Reddit/RedditPostContentExtractorTests.cs (1)

66-66: LGTM!

Good validation of the new PostUrl property.

For stronger validation, consider asserting the actual URL value matches the input:

         redditPost.Post.PostUrl.ShouldNotBeNullOrWhiteSpace();
+        redditPost.Post.PostUrl.ShouldStartWith("https://");
+        redditPost.Post.PostUrl.ShouldContain("1kqiwzc");
tests/SampleRequests/Elzik.Breef.Api.http (1)

35-36: Unnecessary Content-Type header on GET request.

The Content-Type header is only meaningful for requests with a body. GET requests don't have a request body, so this header can be removed.

 GET {{host}}/health
-Content-Type: application/json
tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RawNewInSubredditTransformerTests.cs (1)

326-327: Consider a more specific assertion for consistency.

The test data sets PostUrl to a specific value at line 327, but line 360 only asserts it's not null or empty. For consistency with the other tests (lines 97, 105) that assert exact values, consider using a specific assertion here as well.

-        post.Post.PostUrl.ShouldNotBeNullOrEmpty();
+        post.Post.PostUrl.ShouldBe("https://reddit.com/r/test/single_post");

Also applies to: 360-360

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 941f61e and e5506c6.

📒 Files selected for processing (11)
  • .vscode/launch.json (1 hunks)
  • src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (5 hunks)
  • src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/RedditPost.cs (2 hunks)
  • src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/RedditPostContentExtractor.cs (1 hunks)
  • src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md (1 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Integration/ContentExtractors/Reddit/Client/RawRedditPostClientTests.cs (1 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Integration/ContentExtractors/Reddit/RedditPostContentExtractorTests.cs (2 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RawNewInSubredditTransformerTests.cs (5 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RedditPostClientTests.cs (4 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/RedditPostContentExtractorTests.cs (4 hunks)
  • tests/SampleRequests/Elzik.Breef.Api.http (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .vscode/launch.json
🧰 Additional context used
🧬 Code graph analysis (3)
tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RawNewInSubredditTransformerTests.cs (1)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RedditDateTimeConverter.cs (1)
  • DateTime (8-19)
tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/RedditPostContentExtractorTests.cs (4)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (1)
  • RedditPost (147-190)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/RedditPost.cs (1)
  • RedditPost (3-7)
tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RedditPostClientTests.cs (4)
  • RedditPost (249-287)
  • RedditPost (289-315)
  • RedditPost (317-343)
  • RedditPost (345-371)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/IRawRedditPostTransformer.cs (1)
  • RedditPost (5-5)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (4)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/RedditOptions.cs (1)
  • List (24-32)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/RedditPost.cs (3)
  • RedditComment (22-31)
  • RedditPost (3-7)
  • RedditPostContent (9-20)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPost.cs (3)
  • RawRedditChild (30-37)
  • RawRedditListing (9-16)
  • RawRedditPost (5-7)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/IRawRedditPostTransformer.cs (1)
  • RedditPost (5-5)
🪛 LanguageTool
src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md

[uncategorized] ~18-~18: Do not mix variants of the same word (‘summarise’ and ‘summarize’) within a single text.
Context: ...post's replies are highly scoring, also summarise them and include author attribution wit...

(EN_WORD_COHERENCY)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-ubuntu
  • GitHub Check: build-ubuntu
  • GitHub Check: Analyze (csharp)
🔇 Additional comments (13)
src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md (2)

18-18: Ensure consistent spelling variant for "summarise".

Line 18 uses "summarise" (UK spelling), which is correct. However, once line 14 is corrected from "Sumarise" to "Summarise", verify that no other variants like "summarize" (US spelling) appear elsewhere in this file to maintain coherence.


8-9: Verify PostUrl propagation aligns with schema documentation.

The instructions now reference PostUrl in the input structure (lines 8–9) and specify HTML link generation using RedditPostContent.PostUrl (line 17). Ensure that the actual post and comment objects in the transformer and extractor consistently populate this property before these instructions are applied. Per the enriched summary, PostUrl was introduced in the refactoring, so confirm it is reliably available in both top-level posts and nested replies.

Also applies to: 17-18

tests/Elzik.Breef.Infrastructure.Tests.Integration/ContentExtractors/Reddit/Client/RawRedditPostClientTests.cs (1)

47-47: LGTM!

Good addition to verify that the raw Reddit API response contains a non-empty URL, which is essential for the downstream PostUrl propagation.

src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/RedditPostContentExtractor.cs (1)

63-63: LGTM!

Correctly assigns the web page URL to PostUrl, ensuring it always points to the Reddit post page regardless of whether the post is a link post or self post. This appropriately overrides the transformer's value.

tests/Elzik.Breef.Infrastructure.Tests.Integration/ContentExtractors/Reddit/RedditPostContentExtractorTests.cs (1)

140-140: LGTM!

Consistent validation of PostUrl in the comprehensive structure test.

src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/RedditPost.cs (2)

19-19: LGTM!

The new PostUrl property follows the same pattern as other string properties in the model with a sensible empty default.


30-30: LGTM!

Consistent addition of PostUrl to RedditComment, enabling direct links to individual comments.

src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (1)

72-145: Dead code: Duplicate method overloads that are never called.

Lines 72-145 define three TransformComments overloads without the host parameter, but the Transform method at line 186 calls the 4-parameter overload (with host). These 3-parameter overloads appear to be leftover from a refactor and are unreachable.

Remove the unused overloads:

-    private List<RedditComment> TransformComments(List<RawRedditChild> children, 
-        string subreddit = "", string postId = "")
-    {
-        var comments = new List<RedditComment>();
-
-        foreach (var child in children)
-        {
-            if (child.Kind == "t1")
-            {
-                var commentUrl = $"https://www.reddit.com/r/{subreddit}/comments/{postId}/comment/{child.Data.Id}/";
-                var comment = new RedditComment
-                {
-                    Id = child.Data.Id ?? string.Empty,
-                    Author = child.Data.Author ?? string.Empty,
-                    Score = child.Data.Score,
-                    Content = child.Data.Content ?? string.Empty,
-                    CreatedUtc = child.Data.CreatedUtc,
-                    PostUrl = commentUrl,
-                    Replies = TransformComments(child.Data.Replies, subreddit, postId)
-                };
-
-                comments.Add(comment);
-            }
-        }
-
-        return comments;
-    }
-
-    private List<RedditComment> TransformComments(object? replies, string subreddit = "", string postId = "")
-    {
-        if (replies == null)
-            return [];
-
-        if (replies is string stringReply && stringReply == "")
-            return [];
-
-        if (replies is JsonElement jsonElement)
-        {
-            if (jsonElement.ValueKind == JsonValueKind.Null)
-                return [];
-
-            if (jsonElement.ValueKind == JsonValueKind.String && jsonElement.GetString() == "")
-                return [];
-
-            try
-            {
-                var deserializedListing = JsonSerializer.Deserialize<RawRedditListing>(jsonElement.GetRawText());
-                return TransformComments(deserializedListing, subreddit, postId);
-            }
-            catch
-            {
-                return [];
-            }
-        }
-
-        if (replies is RawRedditListing listing)
-            return TransformComments(listing, subreddit, postId);
-
-        return [];
-    }
-
-    private List<RedditComment> TransformComments(RawRedditListing? replies, string subreddit = "", string postId = "")
-    {
-        if (replies == null)
-            return [];
-
-        if (replies.Data == null)
-            return [];
-
-        if (replies.Data.Children == null)
-            return [];
-
-        return TransformComments(replies.Data.Children, subreddit, postId);
-    }
-

Likely an incorrect or invalid review comment.

tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/RedditPostContentExtractorTests.cs (2)

218-229: LGTM! Good test coverage for PostUrl propagation.

The test correctly verifies that the URL passed to ExtractAsync is propagated through to the PostUrl property in the serialized content, ensuring end-to-end validation of the new feature.


353-365: Good approach with the optional parameter and sensible default.

The helper method maintains backward compatibility with existing tests while allowing explicit PostUrl values when needed. The default URL pattern correctly follows Reddit's URL structure.

tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RawNewInSubredditTransformerTests.cs (1)

59-60: LGTM! Comprehensive test coverage for PostUrl.

The test data and assertions properly verify that PostUrl is populated and propagated correctly through the transformation pipeline.

Also applies to: 74-75, 97-97, 105-105

tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RedditPostClientTests.cs (2)

58-58: LGTM! Good assertions for comment and reply PostUrl format.

The assertions correctly verify that the PostUrl property follows the expected Reddit URL format: https://www.reddit.com/r/{subreddit}/comments/{postId}/comment/{commentId}/

Also applies to: 67-67


289-315: Verify if missing PostUrl in edge case factory methods is intentional.

The CreateExpectedTransformedResult method now includes PostUrl for comments/replies (lines 271, 280), but other factory methods like CreateExpectedResultWithEmptyReplies, CreateExpectedResultWithSingleComment, and CreateExpectedResultWithNullFields don't set PostUrl on their comments. If the production transformer always populates PostUrl, these expected results may need updating for consistency.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md (1)

29-31: Complete the Output Format bullet points as imperative instructions.

Lines 29 and 31 are noun phrases that lack clear imperative verbs, making the requirements ambiguous. These should be explicit instructions stating whether to include, focus on, or filter content.

Apply this diff to clarify the instructions:

-- Brief overview of themes covered in this specific JSON document
-- Do not include a general description of the subreddit itself
-- Summaries of the highest scoring top-level posts
+- Include a brief overview of themes covered in this specific JSON document
+- Do not include a general description of the subreddit itself
+- Include summaries of the highest scoring top-level posts
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5506c6 and 9bdb8ef.

📒 Files selected for processing (1)
  • src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md

[uncategorized] ~14-~14: Do not mix variants of the same word (‘summarise’ and ‘summarize’) within a single text.
Context: ...e/themes of the subreddit 2. Posts: Summarise every post with a thematic summary of i...

(EN_WORD_COHERENCY)


[uncategorized] ~18-~18: Do not mix variants of the same word (‘summarise’ and ‘summarize’) within a single text.
Context: ...post's replies are highly scoring, also summarise them and include author attribution wit...

(EN_WORD_COHERENCY)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-ubuntu
  • GitHub Check: build-ubuntu
  • GitHub Check: Analyze (csharp)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md (2)

3-3: Fix inconsistent English spelling: use either "Summarise" (British) or "Summarize" (American) consistently.

The document mixes American English "Summarize" (line 3) with British English "Summarise" (lines 14, 18). Choose one variant and apply it throughout for consistency.

-Summarize the provided Reddit subreddit JSON data containing posts and nested comments.
+Summarise the provided Reddit subreddit JSON data containing posts and nested comments.

Alternatively, if your project defaults to American English, change lines 14 and 18 to "Summarize".

Also applies to: 14-14, 18-18


31-31: Complete the incomplete instruction on line 31.

Line 31 ("Summaries of the highest scoring top-level posts") is a noun phrase without an action verb, making it an incomplete instruction. Add a verb to clarify the requirement.

-- Summaries of the highest scoring top-level posts
+- Include summaries of the highest scoring top-level posts
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9bdb8ef and 339b0ce.

📒 Files selected for processing (1)
  • src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md

[uncategorized] ~14-~14: Do not mix variants of the same word (‘summarise’ and ‘summarize’) within a single text.
Context: ...e/themes of the subreddit 2. Posts: Summarise every post with a thematic summary of i...

(EN_WORD_COHERENCY)


[uncategorized] ~18-~18: Do not mix variants of the same word (‘summarise’ and ‘summarize’) within a single text.
Context: ...post's replies are highly scoring, also summarise them and include author attribution wit...

(EN_WORD_COHERENCY)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-ubuntu
  • GitHub Check: build-ubuntu
  • GitHub Check: Analyze (csharp)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (3)

120-123: Silent exception swallowing may hide deserialization issues.

The bare catch block swallows all exceptions without logging or distinguishing exception types. This could mask legitimate bugs or unexpected data shapes from the Reddit API.

Consider catching JsonException specifically and optionally logging:

-            catch
+            catch (JsonException)
             {
+                // Reddit API sometimes returns unexpected reply formats; gracefully degrade
                 return [];
             }

The same applies to the duplicate catch block at lines 194-197.


117-118: Consider deserializing directly from JsonElement.

Using GetRawText() creates an intermediate string. JsonSerializer can deserialize directly from JsonElement:

-                var deserializedListing = JsonSerializer.Deserialize<RawRedditListing>(jsonElement.GetRawText());
+                var deserializedListing = JsonSerializer.Deserialize<RawRedditListing>(jsonElement);

72-144: Refactor the 3-parameter TransformComments overloads to simple delegates to eliminate dead code duplication.

The 3-parameter overloads (lines 72–144) are confirmed to be unused. All three are private methods that form a closed recursive chain, never called from the Transform method (the public entry point) or any other code in the repository. The 4-parameter versions (lines 146–218) handle all actual usage via the Transform method.

Consolidating these into delegates is appropriate:

-    private List<RedditComment> TransformComments(List<RawRedditChild> children, string subreddit = "", string postId = "")
-    {
-        var comments = new List<RedditComment>();
-
-        foreach (var child in children)
-        {
-            if (child.Kind == "t1")
-            {
-                var commentUrl = $"https://www.reddit.com/r/{subreddit}/comments/{postId}/comment/{child.Data.Id}/";
-                var comment = new RedditComment
-                {
-                    Id = child.Data.Id ?? string.Empty,
-                    Author = child.Data.Author ?? string.Empty,
-                    Score = child.Data.Score,
-                    Content = child.Data.Content ?? string.Empty,
-                    CreatedUtc = child.Data.CreatedUtc,
-                    PostUrl = commentUrl,
-                    Replies = TransformComments(child.Data.Replies, subreddit, postId)
-                };
-
-                comments.Add(comment);
-            }
-        }
-
-        return comments;
-    }
-
-    private List<RedditComment> TransformComments(object? replies, string subreddit = "", string postId = "")
-    {
-        if (replies == null)
-            return [];
-
-        if (replies is string stringReply && stringReply == "")
-            return [];
-
-        if (replies is JsonElement jsonElement)
-        {
-            if (jsonElement.ValueKind == JsonValueKind.Null)
-                return [];
-
-            if (jsonElement.ValueKind == JsonValueKind.String && jsonElement.GetString() == "")
-                return [];
-
-            try
-            {
-                var deserializedListing = JsonSerializer.Deserialize<RawRedditListing>(jsonElement.GetRawText());
-                return TransformComments(deserializedListing, subreddit, postId);
-            }
-            catch
-            {
-                return [];
-            }
-        }
-
-        if (replies is RawRedditListing listing)
-            return TransformComments(listing, subreddit, postId);
-
-        return [];
-    }
-
-    private List<RedditComment> TransformComments(RawRedditListing? replies, string subreddit = "", string postId = "")
-    {
-        if (replies == null)
-            return [];
-
-        if (replies.Data == null)
-            return [];
-
-        if (replies.Data.Children == null)
-            return [];
-
-        return TransformComments(replies.Data.Children, subreddit, postId);
-    }
+    private List<RedditComment> TransformComments(List<RawRedditChild> children, string subreddit = "", string postId = "")
+        => TransformComments(children, subreddit, postId, "www.reddit.com");
+
+    private List<RedditComment> TransformComments(object? replies, string subreddit = "", string postId = "")
+        => TransformComments(replies, subreddit, postId, "www.reddit.com");
+
+    private List<RedditComment> TransformComments(RawRedditListing? replies, string subreddit = "", string postId = "")
+        => TransformComments(replies, subreddit, postId, "www.reddit.com");
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 339b0ce and 5784130.

📒 Files selected for processing (1)
  • src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (2)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPost.cs (3)
  • RawRedditChild (30-37)
  • RawRedditListing (9-16)
  • RawRedditPost (5-7)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/IRawRedditPostTransformer.cs (1)
  • RedditPost (5-5)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-ubuntu
  • GitHub Check: build-ubuntu
  • GitHub Check: Analyze (csharp)
🔇 Additional comments (2)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (2)

238-243: Verify PostUrl semantics for link posts vs self-posts.

mainPostData.Url returns the external link URL for link posts (e.g., https://example.com/article) but the Reddit permalink for self-posts. If PostUrl is intended to always be the Reddit permalink, you may want to use the Permalink field or construct it similarly to comment URLs:

var postUrl = $"https://{host}/r/{subreddit}/comments/{postId}/";

If the current behavior (external URL for link posts, permalink for self-posts) is intentional, this is fine as-is.

Also applies to: 257-257


220-263: Transform method implementation looks solid.

Good input validation with ArgumentNullException.ThrowIfNull and meaningful exception messages. The host derivation with fallback to www.reddit.com handles edge cases well. Required vs optional field handling is consistent.
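
For readers skimming the thread, one way the guard clauses and host fallback praised here could look; this is a sketch of the pattern, not the PR's literal code, and later commits reviewed further down appear to validate DefaultBaseAddress and throw rather than fall back.

    ArgumentNullException.ThrowIfNull(rawRedditPost);

    // Derive the host for comment permalinks from configuration, falling back to
    // the public Reddit host if the configured value is not an absolute URI.
    var host = Uri.TryCreate(_options.DefaultBaseAddress, UriKind.Absolute, out var baseUri)
        ? baseUri.Host
        : "www.reddit.com";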


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5784130 and 94365e9.

📒 Files selected for processing (1)
  • src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (4)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/RedditOptions.cs (1)
  • List (24-32)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/RedditPost.cs (3)
  • RedditComment (22-31)
  • RedditPost (3-7)
  • RedditPostContent (9-20)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPost.cs (3)
  • RawRedditChild (30-37)
  • RawRedditListing (9-16)
  • RawRedditPost (5-7)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/IRawRedditPostTransformer.cs (1)
  • RedditPost (5-5)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-ubuntu
  • GitHub Check: build-ubuntu
  • GitHub Check: Analyze (csharp)
🔇 Additional comments (4)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (4)

71-96: LGTM!

The context propagation pattern is well-implemented. Comment URLs are correctly constructed following Reddit's standard URL structure, and the recursive call properly passes context to nested replies.


98-129: LGTM!

The overload correctly handles Reddit's polymorphic reply format (null, empty string, JsonElement, or RawRedditListing). The silent exception handling at line 119 provides resilient fallback behavior, which is appropriate for parsing external API responses.


131-143: LGTM!

Proper null-guarding before delegating to the list-based overload.


145-160: Clean validation and transformation logic.

Input validation is thorough with clear exception messages. The object initialization is well-structured, and the null-coalescing for empty children list at line 184 handles edge cases gracefully.

Also applies to: 170-188


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (2)

11-18: Fix indentation on line 17.

Line 17 has inconsistent leading whitespace (two spaces instead of proper indentation). While this doesn't affect functionality, it impacts code readability.

Apply this diff to fix the indentation:

     if (options.Value == null)
         throw new InvalidOperationException("RedditOptions configuration is missing or not bound.");
-      _options = options.Value;
+    _options = options.Value;
 }

173-180: Consider using Permalink instead of Url for semantic correctness.

Line 175 assigns mainPostData.Url to postUrl, which is later used for the PostUrl property (line 194). For link posts, Url contains the external link (e.g., imgur.com, youtube.com) rather than the Reddit post permalink. While RedditPostContentExtractor overwrites this value with the actual Reddit URL (line 63), using Url here is semantically incorrect and could cause issues if Transform is called directly.

Consider using the Permalink field instead, which always contains the Reddit post URL:

 var subreddit = mainPostData.Subreddit ?? string.Empty;
 var postId = mainPostData.Id ?? string.Empty;
-var postUrl = mainPostData.Url ?? string.Empty;
+var postUrl = mainPostData.Permalink ?? string.Empty;
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 94365e9 and a6e5da5.

📒 Files selected for processing (5)
  • src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (6 hunks)
  • src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md (1 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Integration/ContentExtractors/Reddit/Client/RedditPostClientTests.cs (2 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Integration/ContentExtractors/Reddit/RedditPostContentExtractorTests.cs (3 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RawRedditPostTransformerTests.cs (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RawRedditPostTransformerTests.cs (2)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (2)
  • RawRedditPostTransformer (7-201)
  • RawRedditPostTransformer (11-18)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/RedditOptions.cs (1)
  • RedditOptions (5-33)
tests/Elzik.Breef.Infrastructure.Tests.Integration/ContentExtractors/Reddit/RedditPostContentExtractorTests.cs (2)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/RedditOptions.cs (1)
  • RedditOptions (5-33)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (2)
  • RawRedditPostTransformer (7-201)
  • RawRedditPostTransformer (11-18)
🪛 LanguageTool
src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md

[grammar] ~31-~31: Use a hyphen to join words.
Context: ...reddit itself - Summaries of the highest scoring top-level posts

(QB_NEW_EN_HYPHEN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-ubuntu
  • GitHub Check: build-ubuntu
  • GitHub Check: Analyze (csharp)
🔇 Additional comments (7)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (1)

83-155: LGTM!

Context propagation for subreddit, postId, and host is implemented correctly across all TransformComments overloads. The use of the configured host (from DefaultBaseAddress) for constructing comment URLs properly resolves the previous issue where the host was incorrectly extracted from mainPostData.Url.

src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md (1)

1-31: LGTM!

The refactored instructions provide a clear, structured specification for Reddit subreddit summarization. The HTML-centric format with explicit link examples (lines 17-18) aligns well with the PostUrl changes introduced across the codebase.

tests/Elzik.Breef.Infrastructure.Tests.Integration/ContentExtractors/Reddit/RedditPostContentExtractorTests.cs (2)

22-26: LGTM!

The test correctly instantiates the transformer with injected configuration via Options.Create. Using a distinct test base address (https://www.test-reddit.com) ensures test isolation.


70-70: LGTM!

PostUrl assertions validate that the new property is properly populated through the extraction pipeline.

Also applies to: 144-144

tests/Elzik.Breef.Infrastructure.Tests.Integration/ContentExtractors/Reddit/Client/RedditPostClientTests.cs (1)

22-26: LGTM!

Consistent with other test files, the transformer is now instantiated with injected configuration. The test base address ensures that comment URLs are constructed using the configured host.

tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RawRedditPostTransformerTests.cs (2)

10-19: LGTM!

The test constructor properly initializes the transformer with test configuration, enabling controlled validation of the host-based URL generation.


126-126: LGTM!

The PostUrl assertions validate that comment URLs are correctly constructed using the configured host from RedditOptions.DefaultBaseAddress, including for nested reply chains. This confirms that context propagation works as expected throughout the transformation hierarchy.

Also applies to: 135-135


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RawRedditPostTransformerTests.cs (1)

119-137: Consider also asserting the main post PostUrl in the happy‑path test

The new PostUrl assertions for the top‑level comment and its reply look correct and validate the expected Reddit URL pattern; as a small enhancement, you might also assert result.Post.PostUrl here to keep all primary URL expectations in a single canonical test.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6e5da5 and 134b56e.

📒 Files selected for processing (1)
  • tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RawRedditPostTransformerTests.cs (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-ubuntu
  • GitHub Check: build-ubuntu
  • GitHub Check: Analyze (csharp)
🔇 Additional comments (6)
tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RawRedditPostTransformerTests.cs (6)

1-19: Fixture construction with IOptions<RedditOptions> is clear and realistic

Using Options.Create with a test base address in the test fixture nicely mirrors production wiring and keeps individual tests focused on behavior rather than setup noise. Please just confirm the Options pattern here matches how the transformer is built in your real DI configuration.


450-466: Constructor guard‑rail tests for options are well targeted

The tests for null options and null options.Value exercise the key failure modes for misconfigured RedditOptions and pin down both exception type and diagnostic text, which is valuable for catching configuration regressions early.


469-565: Good coverage of malformed post/listing and invalid base‑address scenarios

The new tests around empty post listings, null titles, and an invalid DefaultBaseAddress provide solid negative‑path coverage and clearly document the invariants the transformer expects from both Reddit payloads and configuration.


567-783: Image selection and HTML‑decoding tests look accurate and comprehensive

The gallery and preview tests, including HTML‑encoded URLs and multi‑image selection by size, tightly match the expected behavior and should guard well against regressions in image URL resolution logic.


785-966: Robust handling of gallery metadata and “thumbnail only” posts

Tests that skip invalid gallery metadata, ignore default/nsfw thumbnails, and treat non‑image URLs as null image URLs are all consistent and give a clear contract for what “valid image” means in this transformer.


968-1218: Comment/reply edge‑case tests are thorough and align with Reddit’s odd shapes

The scenarios for empty‑string replies, null listing data, null children, and non‑t1 kinds demonstrate careful handling of real‑world Reddit comment trees and ensure the transformer produces a clean, predictable comment list.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RawRedditPostTransformerTests.cs (1)

22-137: Consider adding Post.PostUrl assertions.

The test verifies comment PostUrl values (lines 126, 135) but doesn't assert on result.Post.PostUrl. While ImageUrl is checked, confirming PostUrl behavior for the post itself would complete the validation coverage.

Add an assertion like:

 result.Post.CreatedUtc.ShouldBe(new DateTime(2025, 1, 1, 12, 0, 0, DateTimeKind.Utc));
+result.Post.PostUrl.ShouldNotBeNullOrEmpty();
 
 result.Post.ImageUrl.ShouldBeNull();
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (1)

21-71: Consider consistent HTML decoding across all image sources.

HTML decoding is applied to gallery URLs (line 35) and preview URLs (line 49), but not to direct URLs (line 57) or thumbnails (line 67). If Reddit's API can return HTML-encoded characters in any URL field, applying HttpUtility.HtmlDecode consistently would prevent potential issues.

Apply decoding to all image URLs:

 var directUrl = postData.UrlOverriddenByDest ?? postData.Url;
 if (IsImageUrl(directUrl))
 {
-    return directUrl;
+    return HttpUtility.HtmlDecode(directUrl);
 }

And for thumbnails:

 if (!string.IsNullOrEmpty(postData.Thumbnail) && 
     postData.Thumbnail != "self" && 
     postData.Thumbnail != "default" && 
     postData.Thumbnail != "nsfw" &&
     IsImageUrl(postData.Thumbnail))
 {
-    return postData.Thumbnail;
+    return HttpUtility.HtmlDecode(postData.Thumbnail);
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 134b56e and b0f0d7a.

📒 Files selected for processing (2)
  • src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (6 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RawRedditPostTransformerTests.cs (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RawRedditPostTransformerTests.cs (3)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (2)
  • RawRedditPostTransformer (7-202)
  • RawRedditPostTransformer (11-19)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/RedditOptions.cs (1)
  • RedditOptions (5-33)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPost.cs (11)
  • RawRedditPost (5-7)
  • RawRedditListing (9-16)
  • RawRedditListingData (18-28)
  • RawRedditChild (30-37)
  • RawRedditCommentData (39-93)
  • RawRedditGalleryData (140-144)
  • RawRedditGalleryItem (146-154)
  • RawRedditMediaMetadata (125-138)
  • RawRedditImageSource (113-123)
  • RawRedditPreview (95-102)
  • RawRedditPreviewImage (104-111)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-ubuntu
  • GitHub Check: build-ubuntu
  • GitHub Check: Analyze (csharp)
🔇 Additional comments (8)
tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/Client/RawRedditPostTransformerTests.cs (4)

1-19: LGTM! Constructor properly initializes options for dependency injection.

The test setup correctly creates IOptions<RedditOptions> with a configured DefaultBaseAddress. This aligns with the transformer's new constructor signature and enables proper PostUrl generation throughout the tests.


450-467: Excellent validation coverage for null options scenarios.

These tests properly verify that the transformer rejects both null options and unbound options.Value, ensuring configuration is always present at instantiation time.


527-565: Good validation test for invalid base address.

This test confirms that an invalid DefaultBaseAddress is caught during transformation when attempting to extract the host, not at construction time. This is appropriate since the validation attribute on RedditOptions would catch this during app startup in production.


568-1218: Comprehensive edge case coverage for image extraction and comment handling.

The test suite thoroughly covers:

  • HTML decoding for gallery and preview URLs
  • URL override behavior
  • Largest image selection from multiple sources
  • Invalid gallery metadata filtering
  • Thumbnail validation (default/nsfw/self ignored)
  • Non-image URL handling
  • Various reply edge cases (empty string, null data, null children, non-t1 kinds)

This level of coverage significantly improves confidence in the transformer's robustness.

src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (4)

9-19: LGTM! Robust constructor validation ensures configuration is always present.

The constructor properly validates both the options parameter and its Value property, throwing appropriate exceptions with clear messages. This prevents runtime errors from missing configuration.


84-109: Correct context propagation for comment URL generation.

The method now receives subreddit, postId, and host parameters and constructs proper Reddit comment URLs using the configured host. This addresses the previous review concern about extracting host from post URLs (which could be external domains for link posts).


111-156: Excellent handling of polymorphic replies with context threading.

These overloads gracefully handle various reply formats (null, empty string, JsonElement, RawRedditListing) while consistently propagating the context parameters throughout the transformation chain. The defensive null checks prevent errors from malformed data.


158-201: Well-structured Transform method with proper validation.

The method validates input, extracts necessary data, confirms DefaultBaseAddress is a valid URI, and correctly uses the configured host for comment URLs. The PostUrl assignment on line 195 captures the post's linked content URL (which may be external for link posts), while comments receive Reddit permalink URLs.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/RedditPostContentExtractorTests.cs (1)

233-245: Test currently can’t detect regressions in PostUrl assignment

By passing url into CreateTestRedditPost as postUrl, the stub’s Post.PostUrl already equals the expected value. If RedditPostContentExtractor ever stops overriding Post.PostUrl with webPageUrl, this test would still pass, so it doesn’t truly guard the behavior.

To ensure the test fails when the extractor stops setting PostUrl, let the helper use its fallback URL (different from url) and keep asserting against url:

@@ public async Task ExtractAsync_ValidUrl_ReturnsSerializedPostAsContent()
-            var url = "https://www.reddit.com/r/programming/comments/abc123/title";
-            var testPost = CreateTestRedditPost("abc123", "Test Title", "https://example.com/image.jpg", url);
+            var url = "https://www.reddit.com/r/programming/comments/abc123/title";
+            // Do not pass `url` as postUrl so we can detect that the extractor overwrites it
+            var testPost = CreateTestRedditPost("abc123", "Test Title", "https://example.com/image.jpg");
@@
-            deserializedPost.Post.PostUrl.ShouldBe(url);
+            deserializedPost.Post.PostUrl.ShouldBe(url);

The updated helper with optional postUrl and a sensible fallback is otherwise fine and keeps existing call sites compatible.

Also applies to: 368-381

🧹 Nitpick comments (4)
tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/ContentExtractorBaseTests.cs (2)

40-54: Consider adding assertion for OriginalUrl.

The test validates Title, Content, and PreviewImageUrl but doesn't assert OriginalUrl. For completeness, consider adding:

         result.Title.ShouldBe("Test Title");
         result.Content.ShouldBe("Test Content");
         result.PreviewImageUrl.ShouldBe("https://example.com/image.jpg");
+        result.OriginalUrl.ShouldBe("https://original.url.com");

107-110: Consider using webPageUrl parameter for more realistic test behavior.

The test extractors use a hardcoded URL instead of the webPageUrl parameter. Using the parameter would better mirror production extractors and allow tests to verify URL propagation:

         protected override Task<UntypedExtract> CreateUntypedExtractAsync(string webPageUrl)
         {
-            return Task.FromResult(new UntypedExtract("Test Title", "Test Content", "https://original.url.com", "https://example.com/image.jpg"));
+            return Task.FromResult(new UntypedExtract("Test Title", "Test Content", webPageUrl, "https://example.com/image.jpg"));
         }

This would apply similarly to the other test extractors.

tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/ContentExtractorStrategyTests.cs (1)

11-13: Align Extract constructor argument order with domain model

Extract is constructed elsewhere as (Title, Content, OriginalUrl, PreviewImageUrl, ExtractType), but _extractedByExtractor1 and _extractedByExtractor2 currently pass the image URL as the third argument and the original URL as the fourth, whereas _extractedByDefaultExtractor already follows the expected order.

Consider swapping the 3rd and 4th arguments on the first two instances to avoid confusion and future copy-paste errors:

-    private readonly Extract _extractedByExtractor1 = new("Title1", "Content1", "Image1", "https://original.url.com", "Extractor1Type");
-    private readonly Extract _extractedByExtractor2 = new("Title2", "Content2", "Image2", "https://original.url.com", "Extractor2Type");
-    private readonly Extract _extractedByDefaultExtractor = new("DefaultTitle", "DefaultContent", "https://original.url.com", "DefaultImage", "DefaultExtractorType");
+    private readonly Extract _extractedByExtractor1 = new("Title1", "Content1", "https://original.url.com", "Image1", "Extractor1Type");
+    private readonly Extract _extractedByExtractor2 = new("Title2", "Content2", "https://original.url.com", "Image2", "Extractor2Type");
+    private readonly Extract _extractedByDefaultExtractor = new("DefaultTitle", "DefaultContent", "https://original.url.com", "DefaultImage", "DefaultExtractorType");
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/SubRedditContentExtractor.cs (1)

11-12: Timestamped title and instance-specific OriginalUrl look consistent

Using TimeProvider to derive dateAndTime and then reusing it in both the title and instanceSpecificoriginalUrl cleanly guarantees a unique, time‑scoped extract per run, and the UntypedExtract call matches the new (Title, Content, OriginalUrl, PreviewImageUrl) shape.

If you touch this again, you might consider renaming instanceSpecificoriginalUrl to instanceSpecificOriginalUrl for readability, but behaviorally this looks solid.

Also applies to: 47-52
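
A condensed sketch of that construction, reusing the variable names quoted in this review; surrounding locals such as subredditName, content and previewImageUrl are assumptions:

// One timestamp feeds both the human-readable title and the instance-specific URL.
var dateAndTime = timeProvider.GetLocalNow().ToString("yyyy-MM-dd HH:mm");
var title = $"New in r/{subredditName} as of {dateAndTime}";
var instanceSpecificoriginalUrl = $"{webPageUrl}#{dateAndTime}";

return new UntypedExtract(title, content, instanceSpecificoriginalUrl, previewImageUrl);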

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b0f0d7a and 1b7a14e.

📒 Files selected for processing (19)
  • README.md (1 hunks)
  • src/Elzik.Breef.Application/BreefGenerator.cs (1 hunks)
  • src/Elzik.Breef.Domain/Extract.cs (1 hunks)
  • src/Elzik.Breef.Infrastructure/CallerFixableHttpRequestException.cs (1 hunks)
  • src/Elzik.Breef.Infrastructure/ContentExtractors/HtmlContentExtractor.cs (1 hunks)
  • src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/RedditPostContentExtractor.cs (2 hunks)
  • src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/SubRedditContentExtractor.cs (2 hunks)
  • src/Elzik.Breef.Infrastructure/ContentExtractors/UntypedExtract.cs (1 hunks)
  • tests/Elzik.Breef.Api.Tests.Functional/Breefs/BreefTestsBase.cs (1 hunks)
  • tests/Elzik.Breef.Api.Tests.Integration/Elzik.Breef.Api.Tests.Integration.csproj (1 hunks)
  • tests/Elzik.Breef.Api.Tests.Integration/FileBasedContentSummarisationInstructionProviderTests.cs (1 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Integration/ContentExtractors/HtmlContentExtractorTests.cs (1 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Integration/ContentExtractors/Reddit/RedditPostContentExtractorTests.cs (5 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Integration/Elzik.Breef.Infrastructure.Tests.Integration.csproj (1 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/ContentExtractorBaseTests.cs (4 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/ContentExtractorStrategyTests.cs (1 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/RedditPostContentExtractorTests.cs (4 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/SubRedditExtractorTests.cs (10 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Unit/Elzik.Breef.Infrastructure.Tests.Unit.csproj (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/RedditPostContentExtractor.cs
🧰 Additional context used
🧬 Code graph analysis (6)
tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/ContentExtractorStrategyTests.cs (1)
src/Elzik.Breef.Infrastructure/ContentExtractors/UntypedExtract.cs (1)
  • Extract (7-8)
tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/RedditPostContentExtractorTests.cs (2)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/RedditPostContentExtractor.cs (1)
  • Task (34-74)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (1)
  • RedditPost (158-201)
src/Elzik.Breef.Domain/Extract.cs (1)
src/Elzik.Breef.Infrastructure/ContentExtractors/UntypedExtract.cs (1)
  • Extract (7-8)
tests/Elzik.Breef.Infrastructure.Tests.Integration/ContentExtractors/Reddit/RedditPostContentExtractorTests.cs (4)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/RedditOptions.cs (1)
  • RedditOptions (5-33)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/RawRedditPostTransformer.cs (3)
  • RawRedditPostTransformer (7-202)
  • RawRedditPostTransformer (11-19)
  • RedditPost (158-201)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/SubRedditContentExtractor.cs (1)
  • SubredditContentExtractor (8-92)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/Client/Raw/IRawRedditPostTransformer.cs (1)
  • RedditPost (5-5)
tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/ContentExtractorBaseTests.cs (4)
src/Elzik.Breef.Infrastructure/ContentExtractors/HtmlContentExtractor.cs (1)
  • Task (15-35)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/RedditPostContentExtractor.cs (1)
  • Task (34-74)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/SubRedditContentExtractor.cs (3)
  • Task (37-52)
  • Task (54-58)
  • Task (60-91)
src/Elzik.Breef.Infrastructure/ContentExtractors/ContentExtractorBase.cs (2)
  • Task (22-35)
  • Task (37-37)
tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/SubRedditExtractorTests.cs (1)
src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/SubRedditContentExtractor.cs (3)
  • Task (37-52)
  • Task (54-58)
  • Task (60-91)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-ubuntu
  • GitHub Check: build-ubuntu
  • GitHub Check: Analyze (csharp)
🔇 Additional comments (18)
tests/Elzik.Breef.Api.Tests.Integration/Elzik.Breef.Api.Tests.Integration.csproj (1)

20-20: ✓ Good alignment with existing dependencies.

The addition of Microsoft.Extensions.TimeProvider.Testing 10.0.0 aligns well with the PR's time-based test updates and matches the version strategy of other Microsoft.Extensions packages in the file.

README.md (1)

148-159: Documentation additions are well-integrated and follow existing patterns.

The new Time Zone section is clearly written, logically positioned, and maintains consistency with the existing documentation structure. The example is practical and the Wikipedia reference provides a helpful resource for users seeking additional time zone identifiers.

src/Elzik.Breef.Infrastructure/CallerFixableHttpRequestException.cs (1)

5-6: LGTM!

Clean refactoring to C# 12 primary constructor syntax. The behavior remains identical—passing message and innerException to the base HttpRequestException constructor.

tests/Elzik.Breef.Api.Tests.Integration/FileBasedContentSummarisationInstructionProviderTests.cs (1)

99-105: Collection expression for TheoryData<string[]> looks correct; just confirm toolchain/xUnit versions

The new return [ [] ]; is functionally equivalent to returning a TheoryData<string[]> with a single row containing an empty string[], and matches modern xUnit guidance for using collection/array-style initialization with TheoryData<T>. (xunit.net)

Please just confirm:

  • The test project is compiling with C# 12 / .NET 8 (so collection expressions are supported), and
  • You’re on an xUnit version that supports these initializers and is not stuck on 2.9.1, which briefly had a bug with TheoryData<string[]> and arrays, fixed in 2.9.2. (xunit.net)
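
For reference, the member shape implied above might look like this; the method name and containing class are hypothetical:

public static TheoryData<string[]> EmptyInstructionFileSets()
{
    // One theory row whose single argument is an empty string[].
    return [ [] ];
}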
tests/Elzik.Breef.Api.Tests.Functional/Breefs/BreefTestsBase.cs (1)

47-47: No changes to this file in the current PR.

The file tests/Elzik.Breef.Api.Tests.Functional/Breefs/BreefTestsBase.cs was not modified in this PR. Line 47 already contains http://example.com in the baseline code, so there is no URL scheme change to verify. The review comment is based on incorrect information.

Likely an incorrect or invalid review comment.

tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/RedditPostContentExtractorTests.cs (1)

213-226: Good coverage for OriginalUrl propagation

This test cleanly verifies that ExtractAsync preserves the requested URL into result.OriginalUrl, independent of the Reddit client stub, and is a solid addition around URL handling.
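
A condensed, hypothetical form of such a test (field and helper names are assumptions; assertions use Shouldly as elsewhere in the suite):

[Fact]
public async Task ExtractAsync_ValidUrl_PreservesOriginalUrl()
{
    // Arrange: any post URL the stubbed Reddit client will accept.
    var url = "https://www.reddit.com/r/programming/comments/abc123/title";

    // Act
    var result = await _extractor.ExtractAsync(url);

    // Assert: the requested URL is carried through to the extract unchanged.
    result.OriginalUrl.ShouldBe(url);
}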

tests/Elzik.Breef.Infrastructure.Tests.Unit/Elzik.Breef.Infrastructure.Tests.Unit.csproj (1)

23-23: LGTM!

The package addition is appropriate for enabling deterministic time-based testing with FakeTimeProvider. Version 10.0.0 is consistent with other Microsoft.Extensions.* packages in this project.

src/Elzik.Breef.Application/BreefGenerator.cs (1)

19-19: LGTM!

Using extract.OriginalUrl instead of the input url allows extractors to provide a canonical or instance-specific URL (e.g., subreddit extracts with timestamps). This aligns well with the OriginalUrl propagation introduced across the extraction pipeline.

tests/Elzik.Breef.Infrastructure.Tests.Integration/Elzik.Breef.Infrastructure.Tests.Integration.csproj (1)

28-28: LGTM!

Consistent with the unit test project, this enables FakeTimeProvider usage for deterministic time-based testing in integration tests.

tests/Elzik.Breef.Infrastructure.Tests.Integration/ContentExtractors/HtmlContentExtractorTests.cs (1)

44-44: LGTM!

Good addition to validate that OriginalUrl is correctly propagated from the input URL through the extraction process.

src/Elzik.Breef.Infrastructure/ContentExtractors/HtmlContentExtractor.cs (1)

34-34: LGTM!

The webPageUrl is correctly passed to UntypedExtract to populate the new OriginalUrl field. This ensures the original request URL is preserved through the extraction pipeline.
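
In other words, the extractor's return now roughly takes this shape, with the other arguments shown as placeholders:

// Sketch: the page URL requested by the caller becomes the extract's OriginalUrl.
return new UntypedExtract(title, content, webPageUrl, previewImageUrl);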

src/Elzik.Breef.Domain/Extract.cs (1)

3-8: LGTM!

The OriginalUrl field is a sensible addition to track the source URL through the domain model. Making it non-nullable is appropriate since all extractors should provide an original URL.

Verification confirmed: All usages of the Extract record have been properly updated. The record is instantiated exclusively through UntypedExtract.WithType() (src/Elzik.Breef.Infrastructure/ContentExtractors/UntypedExtract.cs:8), which correctly passes parameters in the new order: Title, Content, OriginalUrl, PreviewImageUrl, extractType. All extractor implementations and test cases properly construct UntypedExtract with the required four parameters in the correct sequence.

src/Elzik.Breef.Infrastructure/ContentExtractors/UntypedExtract.cs (1)

5-8: UntypedExtract → Extract mapping is consistent

The new OriginalUrl parameter on UntypedExtract and its propagation via WithType to Extract(Title, Content, OriginalUrl, PreviewImageUrl, extractType) are consistent with the rest of the changes.
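
Put together with the Extract.cs note above, the two shapes are roughly as follows (a sketch; the real records may differ in nullability or defaults):

public record Extract(string Title, string Content, string OriginalUrl, string PreviewImageUrl, string ExtractType);

public record UntypedExtract(string Title, string Content, string OriginalUrl, string PreviewImageUrl)
{
    public Extract WithType(string extractType) =>
        new(Title, Content, OriginalUrl, PreviewImageUrl, extractType);
}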

tests/Elzik.Breef.Infrastructure.Tests.Integration/ContentExtractors/Reddit/RedditPostContentExtractorTests.cs (1)

5-6: Good coverage of new URL-related behavior in integration tests

The updated setup—constructing RawRedditPostTransformer from IOptions<RedditOptions> and injecting a FakeTimeProvider into SubredditContentExtractor—looks correct and keeps host derivation and time behavior under test control. The added assertions on result.OriginalUrl and redditPost.Post.PostUrl nicely exercise the new URL fields without making the tests brittle against remote data shape.

Also applies to: 19-20, 24-29, 44-47, 69-70, 76-77, 150-151

tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/Reddit/SubRedditExtractorTests.cs (4)

1-6: Deterministic time and safe defaults for subreddit extraction tests

Injecting a FakeTimeProvider with a fixed instant and wiring _extractor through it makes the new time‑dependent behavior fully deterministic, and defaulting GetNewInSubreddit to new NewInSubreddit { Posts = [] } avoids null/empty ambiguity in content assertions. This setup matches the production extractor’s expectations and keeps the tests robust.

Also applies to: 17-28, 34-45
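
For illustration, the deterministic setup can be as small as this (FakeTimeProvider from the newly added Microsoft.Extensions.TimeProvider.Testing package; the fixed instant matches the 2015-10-21 07:28 timestamps asserted later, and the extractor's other constructor arguments are elided):

// Fixed clock shared by the extractor under test and the assertions.
private readonly FakeTimeProvider _fakeTimeProvider =
    new(new DateTimeOffset(2015, 10, 21, 7, 28, 0, TimeSpan.Zero));

// The extractor under test then receives the same clock, e.g.
// _extractor = new SubredditContentExtractor(..., _fakeTimeProvider);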


77-88: Custom/unknown Reddit instance tests correctly exercise AllDomains logic

Re‑instantiating SubredditContentExtractor with updated RedditOptions.AdditionalBaseAddresses and the shared _fakeTimeProvider cleanly validates both acceptance and rejection of non‑default Reddit hosts via AllDomains. The wiring looks correct and mirrors the production configuration path.

Also applies to: 90-92, 100-121


189-205: Title and OriginalUrl assertions accurately cover the new timestamped behavior

The updated expectations:

  • result.Title.ShouldBe("New in r/subreddit as of 2015-10-21 07:28");
  • result.Title.ShouldBe("New in r/dotnet as of 2015-10-21 07:28");
  • result.OriginalUrl.ShouldBe($"{url}#{_fakeTimeProvider.GetLocalNow():yyyy-MM-dd HH:mm}");

match the extractor’s construction logic and directly pin the contract for both the human‑readable title and the instance‑specific OriginalUrl. Using Posts = [] in the NewInSubreddit setups keeps the focus of these tests squarely on URL/title behavior.

Also applies to: 483-497, 500-515, 227-230, 540-541


568-590: HTTP mock handlers nicely encapsulate success/failure behavior

Both MockHttpMessageHandler and ThrowingMockHttpMessageHandler are straightforward and make the tests’ HTTP expectations explicit (success with optional failUrl override vs. hard failure with a specific exception), without leaking that logic into the test bodies. This keeps the subreddit image URL tests clear and focused.

Also applies to: 594-600
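
A bare-bones version of the success-path handler might look like this; the real one also supports an optional failUrl, and names here are assumptions:

// Minimal success-only handler; the test's real handler can additionally fail a specific URL.
internal sealed class MockHttpMessageHandler : HttpMessageHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        return Task.FromResult(new HttpResponseMessage(System.Net.HttpStatusCode.OK)
        {
            Content = new ByteArrayContent([])
        });
    }
}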

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b7a14e and 49ef7db.

📒 Files selected for processing (8)
  • src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/SubRedditContentExtractor.cs (2 hunks)
  • src/Elzik.Breef.Infrastructure/SummarisationInstructions/RedditPostContent.md (1 hunks)
  • src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md (1 hunks)
  • tests/Elzik.Breef.Api.Tests.Functional/Breefs/BreefTestsDocker.cs (1 hunks)
  • tests/Elzik.Breef.Api.Tests.Functional/DockerTestsCollection.cs (1 hunks)
  • tests/Elzik.Breef.Api.Tests.Functional/Health/HealthTestsDocker.cs (1 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/ContentExtractorBaseTests.cs (5 hunks)
  • tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/ContentExtractorStrategyTests.cs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/ContentExtractorStrategyTests.cs
🧰 Additional context used
🧬 Code graph analysis (2)
tests/Elzik.Breef.Api.Tests.Functional/Health/HealthTestsDocker.cs (1)
tests/Elzik.Breef.Api.Tests.Functional/Breefs/BreefTestsDocker.cs (1)
  • Collection (8-205)
tests/Elzik.Breef.Api.Tests.Functional/Breefs/BreefTestsDocker.cs (1)
tests/Elzik.Breef.Api.Tests.Functional/Health/HealthTestsDocker.cs (1)
  • Collection (8-172)
🪛 LanguageTool
src/Elzik.Breef.Infrastructure/SummarisationInstructions/RedditPostContent.md

[grammar] ~7-~7: Use a hyphen to join words.
Context: ...ts. ## Input Structure JSON with a top level "Post" item containing: - Metadata...

(QB_NEW_EN_HYPHEN)


[grammar] ~32-~32: Use a hyphen to join words.
Context: ...maries of the top-level post and highest scoring replies

(QB_NEW_EN_HYPHEN)

src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md

[grammar] ~31-~31: Use a hyphen to join words.
Context: ...reddit itself - Summaries of the highest scoring top-level posts

(QB_NEW_EN_HYPHEN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-ubuntu
  • GitHub Check: build-ubuntu
  • GitHub Check: Analyze (csharp)
🔇 Additional comments (9)
tests/Elzik.Breef.Api.Tests.Functional/Breefs/BreefTestsDocker.cs (1)

8-8: Aligns Breef Docker tests with the shared serialized collection

Adding [Collection("Docker Tests")] on Line 8 correctly groups these Docker-based functional tests into the shared, non-parallelized collection, consistent with DockerTestsCollection. No further changes needed here.

tests/Elzik.Breef.Api.Tests.Functional/Health/HealthTestsDocker.cs (1)

8-8: Health Docker tests correctly added to the Docker test collection

Applying [Collection("Docker Tests")] on Line 8 brings these tests into the same serialized Docker collection as BreefTestsDocker, which matches the DockerTestsCollection definition and avoids concurrent container runs. Looks good.

tests/Elzik.Breef.Api.Tests.Functional/DockerTestsCollection.cs (1)

1-13: DockerTestsCollection correctly defines a non-parallel Docker test collection

The DockerTestsCollection class with [CollectionDefinition("Docker Tests", DisableParallelization = true)] is the standard xUnit pattern for grouping and serializing Docker-based tests. The name matches the [Collection("Docker Tests")] usages, and the documentation comment is clear. No changes needed.
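
The pattern being confirmed is simply the following; each Docker test class then opts in with [Collection("Docker Tests")]:

// Groups all Docker-based functional tests and prevents them running in parallel.
[CollectionDefinition("Docker Tests", DisableParallelization = true)]
public class DockerTestsCollection
{
}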

src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md (1)

1-31: LGTM!

The instruction file is well-structured with clear requirements for summarizing Reddit subreddit content. The HTML link examples correctly reference RedditPostContent.PostUrl for posts and RedditComment.PostUrl for comments, aligning with the code changes in this PR.

src/Elzik.Breef.Infrastructure/ContentExtractors/Reddit/SubRedditContentExtractor.cs (2)

47-49: Consider time zone implications for "as of" timestamp.

The code uses timeProvider.GetLocalNow() to generate a timestamp for the title and URL. This means the timestamp will reflect the server's local time zone, which may vary depending on deployment environment.

Consider whether:

  1. GetUtcNow() would be more consistent across deployments
  2. The time zone should be explicitly included in the format string (e.g., "yyyy-MM-dd HH:mm zzz")
  3. The current behavior is intentional for user-facing display

If local time is intended for user convenience, consider documenting this design decision.
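
For example, options 1 and 2 combined would be a one-line change (an illustration of the suggestion, not the current behaviour):

// Deployment-independent UTC timestamp with an explicit offset.
var dateAndTime = timeProvider.GetUtcNow().ToString("yyyy-MM-dd HH:mm zzz");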


51-51: LGTM!

The updated UntypedExtract constructor call correctly passes the instance-specific original URL and aligns with the signature changes across the codebase.

tests/Elzik.Breef.Infrastructure.Tests.Unit/ContentExtractors/ContentExtractorBaseTests.cs (3)

54-54: LGTM!

The new assertion correctly validates that OriginalUrl is preserved through the extraction pipeline, aligning with the expanded UntypedExtract constructor signature.


110-111: LGTM!

The UntypedExtract constructor call correctly includes webPageUrl as the third parameter (originalUrl), ensuring consistency with the test assertion on line 54.


121-121: LGTM!

All test helper classes have been consistently updated to use the four-parameter UntypedExtract constructor, properly passing the originalUrl parameter.

Also applies to: 131-131, 141-141

…tPostContent.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
src/Elzik.Breef.Infrastructure/SummarisationInstructions/RedditPostContent.md (3)

7-7: Add hyphen to compound adjective "top-level".

When used attributively before a noun, "top level" should be hyphenated.

-JSON with a top level "Post" item containing:
+JSON with a top-level "Post" item containing:

15-15: Add hyphen to compound adjective "highest-scoring".

When used attributively before a noun, "highest scoring" should be hyphenated.

-3. **Replies**: Summarise only the highest scoring replies in lower detail
+3. **Replies**: Summarise only the highest-scoring replies in lower detail

32-32: Add hyphen to compound adjective "highest-scoring".

When used attributively before a noun, "highest scoring" should be hyphenated.

-- Summaries of the top-level post and highest scoring replies
+- Summaries of the top-level post and highest-scoring replies
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 49ef7db and ad66d25.

📒 Files selected for processing (1)
  • src/Elzik.Breef.Infrastructure/SummarisationInstructions/RedditPostContent.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
src/Elzik.Breef.Infrastructure/SummarisationInstructions/RedditPostContent.md

[grammar] ~7-~7: Use a hyphen to join words.
Context: ...ts. ## Input Structure JSON with a top level "Post" item containing: - Metadata...

(QB_NEW_EN_HYPHEN)


[grammar] ~15-~15: Use a hyphen to join words.
Context: ... Replies: Summarise only the highest scoring replies in lower detail 4. **Sum...

(QB_NEW_EN_HYPHEN)


[grammar] ~32-~32: Use a hyphen to join words.
Context: ...maries of the top-level post and highest scoring replies

(QB_NEW_EN_HYPHEN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-ubuntu
  • GitHub Check: build-ubuntu
  • GitHub Check: Analyze (csharp)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7441b5b and c5afe8a.

📒 Files selected for processing (3)
  • src/Elzik.Breef.Infrastructure/SummarisationInstructions/HtmlContent.md (1 hunks)
  • src/Elzik.Breef.Infrastructure/SummarisationInstructions/RedditPostContent.md (1 hunks)
  • src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/Elzik.Breef.Infrastructure/SummarisationInstructions/RedditPostContent.md
  • src/Elzik.Breef.Infrastructure/SummarisationInstructions/SubredditContent.md
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
src/Elzik.Breef.Infrastructure/SummarisationInstructions/HtmlContent.md

12-12: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


13-13: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


14-14: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


16-16: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


17-17: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


18-18: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


19-19: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@sonarqubecloud

@elzik elzik merged commit ce5b950 into main Nov 29, 2025
14 checks passed
@elzik elzik deleted the improve-per-extractor-instructions branch November 29, 2025 15:37