
Add extraction filter #182

Merged: 7 commits merged into master from add_extraction_filter on May 3, 2023
Conversation

MaxDall (Collaborator) commented Apr 26, 2023

This PR implements an extraction filter, not to be confused with the article filters of #181.

Problem statement:
With #155, the completed attribute of Article was removed, leaving Fundus without any functionality to filter on extraction results.

Solution:
This PR addresses this by adding a new parameter to the Crawler.crawl() method and to the Scraper, named only_complete and extraction_filter respectively.

  • only_complete acts as both a boolean flag and a filter parameter, letting users retrieve only those articles whose extracted attributes (all of them, or a required subset) evaluate to True. The default value is False.
  • extraction_filter, the parameter of the slightly more low-level Scraper, only accepts callables conforming to an ExtractionFilter protocol (sketched below).
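
For illustration, a minimal sketch of what such a callable protocol could look like, inferred from the description above (an assumption for readability, not necessarily the exact definition added by this PR):

from typing import Any, Dict, Protocol

class ExtractionFilter(Protocol):
    # A filter is any callable that receives the raw extraction dictionary
    # and returns whether the article passes.
    def __call__(self, extracted: Dict[str, Any]) -> bool:
        ...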

Example usage:

from typing import Dict, Any

from fundus import PublisherCollection, Crawler, Requires

crawler = Crawler(PublisherCollection.de.MDR)

# extracted values for all retrieved articles evaluate to True
for article in crawler.crawl(max_articles=2, only_complete=True):
    print(article)

# values named `body` and `title` are required in the extraction and have to evaluate to True
for article in crawler.crawl(max_articles=2, only_complete=Requires("body", "title")):
    assert article.body and article.title

# default case, no restrictions given, all articles are retrieved
for article in crawler.crawl(max_articles=2, only_complete=False):
    print(article)


# custom filter function: reject extractions that contain both an
# `author` and an `authors` key
def custom_filter(extracted: Dict[str, Any]) -> bool:
    if "author" in extracted and "authors" in extracted:
        return False
    return True


for article in crawler.crawl(max_articles=2, only_complete=custom_filter):
    print(article)
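
For context, Requires (used above) could plausibly be implemented along these lines, assuming it simply checks that every named attribute is present and truthy in the extraction; this is a sketch, not the PR's actual code:

from typing import Any, Dict

class Requires:
    # Passes only extractions in which every required attribute evaluates to True.
    def __init__(self, *required_attributes: str) -> None:
        self.required_attributes = set(required_attributes)

    def __call__(self, extracted: Dict[str, Any]) -> bool:
        return all(bool(extracted.get(attribute)) for attribute in self.required_attributes)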

MaxDall mentioned this pull request May 2, 2023
dobbersc (Collaborator) left a comment


Cool feature and very flexible. I like how you have implemented this :).

Review threads on src/fundus/parser/data.py, src/fundus/scraping/scraper.py, and src/fundus/scraping/pipeline.py (all resolved).
dobbersc (Collaborator) commented May 2, 2023

Also, since you have given some nice examples in the PR description, wouldn't it be good to add them as tests as well?

MaxDall and others added 2 commits May 3, 2023 13:53
MaxDall (Collaborator, Author) commented May 3, 2023

> Also, since you have given some nice examples in the PR description, wouldn't it be good to add them as tests as well?

Hmm. Adding them as they are would bring external dependencies back into the unit tests, which I want to avoid. Adding them without the external dependency would be quite complicated, at the least.

dobbersc (Collaborator) commented May 3, 2023

> > Also, since you have given some nice examples in the PR description, wouldn't it be good to add them as tests as well?
>
> Hmm. Adding them as they are would bring external dependencies back into the unit tests, which I want to avoid. Adding them without the external dependency would be quite complicated, at the least.

Yes, that is a problem. One way would be to mock a custom Source for the Scraper that yields articles from a resource folder.
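
A rough sketch of that idea, where MockSource, resource_dir, and the iteration interface are illustrative assumptions rather than the actual fundus API:

from pathlib import Path
from typing import Iterator

class MockSource:
    # Hypothetical stand-in for a fundus Source: yields stored HTML from a
    # resource folder instead of fetching from the network, so the Scraper
    # can be exercised offline in unit tests.
    def __init__(self, resource_dir: Path) -> None:
        self.resource_dir = resource_dir

    def __iter__(self) -> Iterator[str]:
        for html_file in sorted(self.resource_dir.glob("*.html")):
            yield html_file.read_text(encoding="utf-8")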

MaxDall merged commit 762f6a5 into master May 3, 2023
MaxDall deleted the add_extraction_filter branch May 3, 2023 12:52