Closed
Description
In #206, a new beautiful_soup_parser configuration option was added that lets Beautiful Soup's HTML parser be specified:
MarkdownConverter(beautiful_soup_parser="lxml")
The specified parser keyword is passed to the features parameter of the BeautifulSoup constructor, which is the second parameter, after the HTML markup:
class BeautifulSoup(Tag):
    # ...
    def __init__(
        self,
        markup="",
        features=None,
        builder=None,
        parse_only=None,
        from_encoding=None,
        exclude_encodings=None,
        element_classes=None,
        **kwargs,
    ):
        # ...
But as shown above, the BeautifulSoup constructor provides other options that might be useful, such as providing hints to the text encoding detection for the HTML document.
Before #206 ships in a production release, perhaps we could extend its implementation to generally support all Beautiful Soup configuration options using a kwargs-based approach (including new options in the future). For example,
MarkdownConverter(beautiful_soup_options={"features": "lxml"})
MarkdownConverter(beautiful_soup_options={"exclude_encodings": ["iso-8859-7"]})
Or perhaps with a shorter option name, to make up for the extra length of the kwargs dictionary:
MarkdownConverter(bs4_options={"features": "lxml"})
MarkdownConverter(bs4_options={"exclude_encodings": ["iso-8859-7"]})
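To make the proposal concrete, here is a minimal sketch of how a converter might forward such an options dictionary to the parser constructor. Everything in it is an assumption for illustration: SketchConverter, make_soup, soup_factory, and the "html.parser" default are hypothetical names and choices, not the project's actual MarkdownConverter code, and stand_in_beautiful_soup is only a placeholder for the real BeautifulSoup constructor so that the example runs without bs4 installed.

```python
from typing import Any, Callable, Optional


def stand_in_beautiful_soup(markup: str = "", features: Optional[str] = None,
                            **kwargs: Any) -> dict:
    """Placeholder for the BeautifulSoup constructor (assumption for this sketch).

    It only records the arguments it receives so the forwarding can be inspected.
    """
    return {"markup": markup, "features": features, **kwargs}


class SketchConverter:
    """Hypothetical converter; not the project's real MarkdownConverter."""

    # Assumed default; the real project may choose a different parser.
    DEFAULT_BS4_OPTIONS = {"features": "html.parser"}

    def __init__(self, bs4_options: Optional[dict] = None,
                 soup_factory: Callable[..., Any] = stand_in_beautiful_soup):
        # Caller-supplied options override the defaults key by key.
        self.bs4_options = {**self.DEFAULT_BS4_OPTIONS, **(bs4_options or {})}
        self._soup_factory = soup_factory

    def make_soup(self, html: str) -> Any:
        # Forward every option as a keyword argument, so constructor options
        # added in the future need no change on the converter side.
        return self._soup_factory(html, **self.bs4_options)


if __name__ == "__main__":
    converter = SketchConverter(bs4_options={"exclude_encodings": ["iso-8859-7"]})
    print(converter.make_soup("<p>hi</p>"))
```

With Beautiful Soup installed, passing soup_factory=BeautifulSoup should forward the same keywords to the real constructor. Merging over a default keeps the parser choice stable when a caller sets only encoding-related options.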