Add generalized support for specifying Beautiful Soup options #223

@chrispy-snps

In #206, a new beautiful_soup_parser configuration option was added that lets Beautiful Soup's HTML parser be specified:

MarkdownConverter(beautiful_soup_parser="lxml")

The specified parser name is passed to the features parameter of the BeautifulSoup constructor, which is the second parameter, after the HTML markup:

class BeautifulSoup(Tag):
    # ...
    def __init__(
        self,
        markup="",
        features=None,
        builder=None,
        parse_only=None,
        from_encoding=None,
        exclude_encodings=None,
        element_classes=None,
        **kwargs,
    ):
        # ...

But as the signature above shows, the BeautifulSoup constructor accepts other options that might be useful, such as from_encoding and exclude_encodings, which give hints to the text encoding detection for the HTML document.

Before #206 ships in a production release, perhaps we could extend its implementation to support all Beautiful Soup configuration options (including any added in the future) using a kwargs-based approach. For example:

MarkdownConverter(beautiful_soup_options={"features": "lxml"})

MarkdownConverter(beautiful_soup_options={"exclude_encodings": ["iso-8859-7"]})

Or perhaps a shorter option name, to make up for the extra length of the kwargs dictionary:

MarkdownConverter(bs4_options={"features": "lxml"})

MarkdownConverter(bs4_options={"exclude_encodings": ["iso-8859-7"]})
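To make the kwargs-based idea concrete, here is a minimal sketch of how such an option could be forwarded to the constructor. The names build_soup_kwargs, make_soup, and DEFAULT_PARSER are hypothetical and used only for illustration, and the "html.parser" default is an assumption rather than the library's confirmed behavior.

```python
from typing import Any, Dict, Mapping, Optional

# Assumed default parser for this sketch; the library's real default may differ.
DEFAULT_PARSER = "html.parser"


def build_soup_kwargs(bs4_options: Optional[Mapping[str, Any]] = None) -> Dict[str, Any]:
    """Hypothetical helper: merge user options over the default parser choice."""
    kwargs: Dict[str, Any] = {"features": DEFAULT_PARSER}
    # User-supplied keys win, so {"features": "lxml"} overrides the default.
    kwargs.update(bs4_options or {})
    return kwargs


def make_soup(html, bs4_options: Optional[Mapping[str, Any]] = None):
    """Hypothetical helper: parse html, forwarding every option to BeautifulSoup."""
    # Imported lazily so build_soup_kwargs can be used without bs4 installed.
    from bs4 import BeautifulSoup

    return BeautifulSoup(html, **build_soup_kwargs(bs4_options))
```

Because the dictionary is unpacked straight into the constructor call, any keyword the BeautifulSoup constructor accepts now or gains later would pass through without further changes to the converter.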
