
Add extraction filter #182

Merged: 7 commits merged into master from add_extraction_filter on May 3, 2023
Conversation

MaxDall (Collaborator) commented Apr 26, 2023

This PR implements an extraction filter, not to be confused with the article filters of #181.

Problem statement:
With #155, the completed attribute of Article was removed, leaving Fundus without any functionality to filter on extraction results.

Solution:
This PR addresses this by adding a new parameter to the Crawler.crawl() method and to the Scraper, named only_complete and extraction_filter respectively.

  • only_complete acts as both a boolean flag and a filter parameter, letting users retrieve only those articles whose extracted attributes (all of them, or a required subset) evaluate to True. The default value is False.
  • extraction_filter, the parameter of the slightly more low-level Scraper, only accepts callables conforming to an ExtractionFilter protocol (sketched below).
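
For illustration, a minimal sketch of what such a callable protocol could look like, inferred from the description above (an assumption for readability, not necessarily the exact definition added by this PR):

from typing import Any, Dict, Protocol

class ExtractionFilter(Protocol):
    # A filter is any callable that receives the raw extraction dictionary
    # and returns whether the article passes.
    def __call__(self, extracted: Dict[str, Any]) -> bool:
        ...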

Example usage:

from typing import Dict, Any

from fundus import PublisherCollection, Crawler, Requires

crawler = Crawler(PublisherCollection.de.MDR)

# extracted values for all retrieved articles evaluate to True
for article in crawler.crawl(max_articles=2, only_complete=True):
    print(article)

# values named `body` and `title` are required in the extraction and have to evaluate to True
for article in crawler.crawl(max_articles=2, only_complete=Requires("body", "title")):
    assert article.body and article.title

# default case, no restrictions given, all articles are retrieved
for article in crawler.crawl(max_articles=2, only_complete=False):
    print(article)


# custom filter function: reject extractions that contain both an
# `author` and an `authors` key
def custom_filter(extracted: Dict[str, Any]) -> bool:
    if "author" in extracted and "authors" in extracted:
        return False
    return True


for article in crawler.crawl(max_articles=2, only_complete=custom_filter):
    print(article)
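
For context, Requires (used above) could plausibly be implemented along these lines, assuming it simply checks that every named attribute is present and truthy in the extraction; this is a sketch, not the PR's actual code:

from typing import Any, Dict

class Requires:
    # Passes only extractions in which every required attribute evaluates to True.
    def __init__(self, *required_attributes: str) -> None:
        self.required_attributes = set(required_attributes)

    def __call__(self, extracted: Dict[str, Any]) -> bool:
        return all(bool(extracted.get(attribute)) for attribute in self.required_attributes)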

MaxDall mentioned this pull request May 2, 2023
dobbersc (Collaborator) left a comment


Cool feature and very flexible. I like how you have implemented this :).

Review threads on src/fundus/parser/data.py, src/fundus/scraping/scraper.py, and src/fundus/scraping/pipeline.py (all resolved).
dobbersc (Collaborator) commented May 2, 2023

Also, since you have given some nice examples in the PR description, wouldn't it be good to add them as tests as well?

MaxDall and others added 2 commits May 3, 2023 13:53
MaxDall (Collaborator, Author) commented May 3, 2023

> Also, since you have given some nice examples in the PR description, wouldn't it be good to add them as tests as well?

Hmm. Adding them as they are would bring external dependencies back into the unit tests, which I want to avoid. Adding them without the external dependency would be quite complicated, at the least.

dobbersc (Collaborator) commented May 3, 2023

> > Also, since you have given some nice examples in the PR description, wouldn't it be good to add them as tests as well?
>
> Hmm. Adding them as they are would bring external dependencies back into the unit tests, which I want to avoid. Adding them without the external dependency would be quite complicated, at the least.

Yes, that is a problem. One way would be to mock a custom Source for the Scraper that yields articles from a resource folder.
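
A rough sketch of that idea, where MockSource, resource_dir, and the iteration interface are illustrative assumptions rather than the actual fundus API:

from pathlib import Path
from typing import Iterator

class MockSource:
    # Hypothetical stand-in for a fundus Source: yields stored HTML from a
    # resource folder instead of fetching from the network, so the Scraper
    # can be exercised offline in unit tests.
    def __init__(self, resource_dir: Path) -> None:
        self.resource_dir = resource_dir

    def __iter__(self) -> Iterator[str]:
        for html_file in sorted(self.resource_dir.glob("*.html")):
            yield html_file.read_text(encoding="utf-8")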

MaxDall merged commit 762f6a5 into master May 3, 2023
MaxDall deleted the add_extraction_filter branch May 3, 2023 12:52