Add extraction filter #182
Cool feature and very flexible. I like how you have implemented this :).
Also, since you have given some nice examples in the PR description, wouldn't it be good to add them as tests as well?
Co-authored-by: Conrad Dobberstein <[email protected]>
Hmm. Adding them as they are would bring external dependencies back into the unit tests, and I want to avoid this. Adding them without the external dependency would be really complicated, at the very least.
Yes, that is a problem. One way would be to mock a custom |
This PR implements an extraction filter, not to be confused with article filters (#181).
Problem statement:
With #155 the `completed` attribute of `Article` got removed, leaving Fundus without the functionality to filter on extraction results.
Solution:
This PR addresses this issue by adding a new parameter to the `Crawler.crawl()` method and to the `Scraper`, called `only_complete` and `extraction_filter` respectively. `only_complete` acts as a boolean flag and a filter parameter at the same time, giving users the possibility to retrieve only those articles where all or some of the extracted attributes evaluate to `True` as booleans. The default value is `False`. `extraction_filter`, being a parameter for a slightly more low-level object, only accepts callables as a value, specified through an `ExtractionFilter` protocol.
Example usage:
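As a rough, self-contained illustration of the idea (not Fundus's actual code), the sketch below models a parameter that is either a boolean flag or a callable filter. The call signature of `ExtractionFilter`, the convention that returning `True` means "keep the article", and the helpers `resolve_filter`, `filter_extractions` and `requires` are all assumptions made for this example; the real protocol may differ.

```python
from typing import Any, Dict, Iterable, Iterator, Protocol, Union

Extraction = Dict[str, Any]


class ExtractionFilter(Protocol):
    """Assumed shape: a callable that inspects extracted attributes and returns a bool."""

    def __call__(self, extraction: Extraction) -> bool: ...


def all_attributes_truthy(extraction: Extraction) -> bool:
    # Keep only extractions where every attribute evaluates to True.
    return all(bool(value) for value in extraction.values())


def keep_everything(extraction: Extraction) -> bool:
    # Filtering disabled: every extraction passes.
    return True


def requires(*names: str) -> ExtractionFilter:
    # Hypothetical helper: keep extractions where the named attributes are truthy.
    def _filter(extraction: Extraction) -> bool:
        return all(bool(extraction.get(name)) for name in names)

    return _filter


def resolve_filter(only_complete: Union[bool, ExtractionFilter]) -> ExtractionFilter:
    # Map the flag-or-callable parameter onto a single callable.
    if only_complete is True:
        return all_attributes_truthy
    if only_complete is False:
        return keep_everything
    return only_complete


def filter_extractions(
    extractions: Iterable[Extraction],
    only_complete: Union[bool, ExtractionFilter] = False,
) -> Iterator[Extraction]:
    # Yield only the extractions accepted by the resolved filter.
    keep = resolve_filter(only_complete)
    for extraction in extractions:
        if keep(extraction):
            yield extraction


samples = [
    {"title": "A", "body": "text"},
    {"title": "B", "body": ""},
    {"title": "", "body": "more text"},
]
complete_only = list(filter_extractions(samples, only_complete=True))
with_title = list(filter_extractions(samples, only_complete=requires("title")))
```

Resolving the flag into a callable up front keeps the filtering loop identical for all three cases (`False`, `True`, custom callable), which is one plausible reason to let a single parameter serve as both flag and filter.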