html5lib_to_markdown

This package offers a way to convert HTML to the Markdown format.

This package is currently targeting a SUBSET of full HTML->Markdown conversion to address internal needs. More functionality will be added as needed. Pull requests are welcome.

There are many packages that convert HTML to Markdown. Why create another package to do this task?

Licensing. This package is available via the permissive MIT license. There are no GPL restrictions, which affect about a third of the other libraries that perform this task.
Tests. This package ships with many tests to ensure things keep working as desired. Several existing libraries do not have tests or adequate test coverage.
Customized Feature: Use HTML for certain elements instead of Markdown syntax. Sometimes we WANT to use and tags, and not turn them into Markdown syntax.
Customized Feature: Clean up common html issues and make pretty Markdown. This library doesn't just create Markdown, but optimized/pretty Markdown. This library attempts to optimize-away extra newlines and spaces, creating a concise and readable Markdown version.
Customized Feature: ignore unwanted html tags and attributes.
Customized Feature: Idempotent when possible. This is more of a goal than a guarantee, but text that is processed through this library should not change if re-processed through this library whenever possible. In other words, we're aiming for this:

as_markdown = to_markdown(html) == to_markdown(as_markdown) == to_markdown(to_markdown(html)) == to_markdown(to_markdown(as_markdown))

This can't be guaranteed in all situations because of how Markdown and HTML work, but it is a goal. This library should not add artifacts.

At a minimum, our goal is this

as_markdown = to_markdown(html) == to_markdown(to_html(as_markdown))
as_html = to_html(as_markdown) == to_html(to_markdown(as_html))

Customized feature: A departure from the core Markdown specification was needed for a few elements:
1. Render img tags, not Markdown format
2. Render a tags, not Markdown format
Customized feature: Python2 and Python3 compatibility. This shouldn't be a feature, but it is. Some excellent packages in this space stopped supporting Python2 already. This package aims to keep Python2 support around a bit longer than the official cutoff date, because legacy systems exist.
Core Implementation Detail. This package is implemented as a htmllib5 "tree adapter", which means it can be potentially be layered into many htm5lib processing routines. Other packages use BeautifulSoup, lxml or HTMLParser. These other projects are all great, but require re-processing if you are already doing things with html5lib.

Unsupported Features

Angled links are not currently supported, for example:

<http://example.com>

They are not compatible with the html5lib parser, and trying to support them will require a lot of work.

Pretty Markdown?

What is pretty Markdown?

There is a max of 2 newlines (1 blank line) between elements.
blank lines are dropped to the lowest allowable nesting of blockquotes or lists
whitespace is shown via HTML rendering rules

Other Libraries

The interface was inspired by the bleach user-input sanitization library, which relies on html5lib

If you just need a standard and pure "HTML to Markdown" convertor, I recommend the following libraries:

antimarkdown https://github.com/Crossway/antimarkdown/
- built on lxml
- MIT license
markdownify https://github.com/matthewwithanm/python-markdownify
- built on BeautifulSoup
- MIT license

License & Copyright

Additional code and tests are isolated in the tests_working directory (and soon to be tests), with copyright attributed to the antimarkdown and markdownify projects. both are used under the MIT License and credited in the source

Environment Variables

MD_DEBUG_TOKENS - will use string representation for tokens (human readable!) instead of optimizing with ints
MD_DEBUG_STACKS - will print() the tokens during
MD_DEBUG_STACKS_SIMPLE - will print() the tokens in a simplified form

TODO

See TODO.txt

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
.github/workflows		.github/workflows
src/html5lib_to_markdown		src/html5lib_to_markdown
tests		tests
tests_working		tests_working
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGES.txt		CHANGES.txt
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
TESTING.txt		TESTING.txt
TODO.txt		TODO.txt
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg
setup.py		setup.py
tox.ini		tox.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

html5lib_to_markdown

There are many packages that convert HTML to Markdown. Why create another package to do this task?

Unsupported Features

Pretty Markdown?

Other Libraries

License & Copyright

Environment Variables

TODO

About

Releases

Packages

Languages

License

jvanasco/html5lib_to_markdown

Folders and files

Latest commit

History

Repository files navigation

html5lib_to_markdown

There are many packages that convert HTML to Markdown. Why create another package to do this task?

Unsupported Features

Pretty Markdown?

Other Libraries

License & Copyright

Environment Variables

TODO

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages