Update 1.0.2
dfop02 committed Feb 20, 2024
1 parent 450db53 commit 1e2106e
Showing 9 changed files with 148 additions and 42 deletions.
64 changes: 58 additions & 6 deletions README.md
@@ -1,22 +1,74 @@
# HTML FOR DOCX
Convert HTML to docx. This project is a fork of the discontinued [pqzx/html2docx](https://github.com/pqzx/html2docx).

### To install
### How to install

`pip install html-for-docx`

### Usage

The basic is
Basic usage: add formatted HTML to an existing Docx

```python
from html4docx import HtmlToDocx

parser = HtmlToDocx()
html_string = '<h1>Hello world</h1>'
parser.add_html_to_document(html_string, filename_docx)
```

Take a look on [pqzx/html2docx](https://github.com/pqzx/html2docx) project to see examples of usage.
You can use `python-docx` to manipulate the file as well. Here is an example:

```python
from docx import Document
from html4docx import HtmlToDocx

document = Document()
new_parser = HtmlToDocx()

html_string = '<h1>Hello world</h1>'
new_parser.add_html_to_document(html_string, document)

document.save('your_file_name')
```

Convert files directly

```python
from html4docx import HtmlToDocx

new_parser = HtmlToDocx()
new_parser.parse_html_file(input_html_file_path, output_docx_file_path)
```

Convert files from a string

```python
from html4docx import HtmlToDocx

new_parser = HtmlToDocx()
docx = new_parser.parse_html_string(input_html_file_string)
```

Change table styles

Tables are not styled by default. Use the `table_style` attribute on the parser to set a table style. The style is used for all tables.

```python
from html4docx import HtmlToDocx

new_parser = HtmlToDocx()
new_parser.table_style = 'Light Shading Accent 4'
```

To add borders to tables, use the `TableGrid` style:

```python
new_parser.table_style = 'TableGrid'
```

Default table styles can be found
here: https://python-docx.readthedocs.io/en/latest/user/styles-understanding.html#table-styles-in-default-template

### Why

@@ -28,17 +80,17 @@ My goal to fork and fix/update this package was to complete my current task at w
- Handle missing run for leading br tag | [dashingdove](https://github.com/dashingdove) from [PR](https://github.com/pqzx/html2docx/pull/53)
- Fix base64 images | [djplaner](https://github.com/djplaner) from [Issue](https://github.com/pqzx/html2docx/issues/28#issuecomment-1052736896)
- Handle img tag without src attribute | [johnjor](https://github.com/johnjor) from [PR](https://github.com/pqzx/html2docx/pull/63)
- Fix text-align bug when `!important` | [Dfop02](https://github.com/dfop02)
- Fix background-color always set default color | [Dfop02](https://github.com/dfop02)
- Fix bug when any style has `!important` | [Dfop02](https://github.com/dfop02)
- Fix 'style lookup by style_id is deprecated.' | [Dfop02](https://github.com/dfop02)

**New Features**
- Add Width/Height style to images | [maifeeulasad](https://github.com/maifeeulasad) from [PR](https://github.com/pqzx/html2docx/pull/29)
- Support px, cm and % for style margin-left to paragraphs | [Dfop02](https://github.com/dfop02)
- Support px, cm, pt and % for style margin-left to paragraphs | [Dfop02](https://github.com/dfop02)
- Improve performance on large tables | [dashingdove](https://github.com/dashingdove) from [PR](https://github.com/pqzx/html2docx/pull/58)
- Support for HTML Pagination | [Evilran](https://github.com/Evilran) from [PR](https://github.com/pqzx/html2docx/pull/39)
- Support Table style | [Evilran](https://github.com/Evilran) from [PR](https://github.com/pqzx/html2docx/pull/39)
- Support alternative encoding | [HebaElwazzan](https://github.com/HebaElwazzan) from [PR](https://github.com/pqzx/html2docx/pull/59)
- Refactor tests to be more consistent and rely less on 'human validation' | [Dfop02](https://github.com/dfop02)

## License

54 changes: 28 additions & 26 deletions html4docx/h4d.py
@@ -21,19 +21,17 @@
from io import BytesIO
from html.parser import HTMLParser

import docx
import docx.table

from html4docx import utils
from html4docx.colors import Color
from bs4 import BeautifulSoup

import docx
from docx import Document
from docx.shared import RGBColor, Pt, Cm, Inches
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml import OxmlElement
from docx.oxml.ns import qn

from bs4 import BeautifulSoup
from html4docx import utils
from html4docx.colors import Color

# values in inches
INDENT = 0.25
@@ -65,10 +63,7 @@ def set_initial_attrs(self, document=None):
'span': [],
'list': [],
}
if document:
self.doc = document
else:
self.doc = Document()
self.doc = document if document else Document()
self.bs = self.options['fix-html'] # whether or not to clean with BeautifulSoup
self.document = self.doc
self.include_tables = True # TODO add this option back in?
@@ -91,7 +86,8 @@ def get_cell_html(self, soup):

def add_styles_to_paragraph(self, style):
if 'text-align' in style:
align = style['text-align']
align = re.sub('!important', '', style['text-align'], flags=re.IGNORECASE)

if 'center' in align:
self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
elif 'right' in align:
@@ -102,25 +98,28 @@ def add_styles_to_paragraph(self, style):
if 'margin-left' in style and 'margin-right' in style:
margin_left = style['margin-left']
margin_right = style['margin-right']
if "auto" in margin_left and "auto" in margin_right:
if 'auto' in margin_left and 'auto' in margin_right:
self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
elif 'margin-left' in style:
margin = style['margin-left']
margin = re.sub('!important', '', style['margin-left'], flags=re.IGNORECASE)
units = re.sub(r'[0-9]+', '', margin)
margin = int(float(re.sub(r'[a-zA-Z\!]+', '', margin)))

if units == 'px':
self.paragraph.paragraph_format.left_indent = Inches(min(margin // 10 * INDENT, MAX_INDENT))
elif units == 'cm':
self.paragraph.paragraph_format.left_indent = Cm(min(margin // 10 * INDENT, MAX_INDENT) * 2.54)
elif units == 'pt':
self.paragraph.paragraph_format.left_indent = Pt(min(margin // 10 * INDENT, MAX_INDENT) * 72)
elif units == '%':
self.paragraph.paragraph_format.left_indent = MAX_INDENT * (units / 100)
# TODO handle more units
else:
# When unit is not supported
self.paragraph.paragraph_format.left_indent = None

def add_styles_to_table(self, style):
if 'text-align' in style:
align = style['text-align']
align = re.sub('!important', '', align, flags=re.IGNORECASE)
align = re.sub('!important', '', style['text-align'], flags=re.IGNORECASE)

if 'center' in align:
self.table.alignment = WD_ALIGN_PARAGRAPH.CENTER
@@ -135,26 +134,30 @@ def add_styles_to_table(self, style):
if 'auto' in margin_left and 'auto' in margin_right:
self.table.alignment = WD_ALIGN_PARAGRAPH.CENTER
elif 'margin-left' in style:
margin = style['margin-left']
margin = re.sub('!important', '', style['margin-left'], flags=re.IGNORECASE)
units = re.sub(r'[0-9]+', '', margin)
margin = int(float(re.sub(r'[a-zA-Z\!]+', '', margin)))

if units == 'px':
self.table.left_indent = Inches(min(margin // 10 * INDENT, MAX_INDENT))
elif units == 'cm':
self.table.left_indent = Cm(min(margin // 10 * INDENT, MAX_INDENT) * 2.54)
elif units == 'pt':
self.table.left_indent = Pt(min(margin // 10 * INDENT, MAX_INDENT) * 72)
elif units == '%':
self.table.left_indent = MAX_INDENT * (units / 100)
# TODO handle more units
else:
# When unit is not supported
self.table.left_indent = None
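
The `margin-left` branches above strip `!important` and then split the value into a number and a unit with two separate substitutions. A minimal sketch of the same idea with a single pattern is shown here; `parse_css_length` is a hypothetical helper for illustration and is not part of html4docx. It also trims the whitespace that remains once `!important` is removed, so `'10px !important'` yields the unit `'px'` rather than `'px '`. Note that in the `%` branches `units` holds the unit string, so the numeric value is presumably the intended operand there.

```python
import re

# Number (optionally negative or decimal) followed by an optional unit.
_LENGTH_RE = re.compile(r'^\s*(-?\d+(?:\.\d+)?)\s*([a-z%]*)\s*$', re.IGNORECASE)

def parse_css_length(value):
    """Split a CSS length such as '10px !important' into (10.0, 'px').

    Hypothetical helper, not part of html4docx. Returns None when the
    value is not a plain number-plus-unit length (for example 'auto').
    """
    cleaned = re.sub(r'!important', '', value, flags=re.IGNORECASE)
    match = _LENGTH_RE.match(cleaned)
    if match is None:
        return None
    number, unit = match.groups()
    return float(number), unit.lower()

print(parse_css_length('10px !important'))
print(parse_css_length('50%'))
print(parse_css_length('auto'))
```

Returning `None` for unsupported values mirrors the `else` branch in the diff, which resets `left_indent` to `None` when the unit is not recognised.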

def add_styles_to_run(self, style):
if 'font-size' in style:
font_size = style['font-size']
font_size = re.sub('!important', '', style['font-size'], flags=re.IGNORECASE)
units = re.sub(r'[0-9]+', '', font_size)
font_size = int(float(re.sub(r'[a-zA-Z\!]+', '', font_size)))

if units == 'px':
font_size_unit = Inches(font_size)
font_size_unit = Inches(utils.px_to_inches(font_size))
elif units == 'cm':
font_size_unit = Cm(font_size)
elif units == 'pt':
@@ -168,8 +171,7 @@ def add_styles_to_run(self, style):
run.font.size = font_size_unit

if 'color' in style:
font_color = style['color'].lower()
font_color = re.sub('!important', '', font_color, flags=re.IGNORECASE)
font_color = re.sub('!important', '', style['color'].lower(), flags=re.IGNORECASE)

if 'rgb' in font_color:
color = re.sub(r'[a-z()]+', '', font_color)
@@ -187,7 +189,7 @@ def add_styles_to_run(self, style):
self.run.font.color.rgb = RGBColor(*colors)
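
The colour handling above lower-cases the value, removes `!important`, and strips letters and parentheses from an `rgb(...)` string. The lines that turn the remainder into channel values are elided in this hunk, so the following is only a sketch of one way to finish the job, assuming comma-separated integer channels. `parse_rgb` is a hypothetical name; `rgb_to_hex` is copied from `html4docx/utils.py` as shown later in this commit.

```python
import re

def parse_rgb(value):
    """Turn 'rgb(255, 0, 0) !important' into (255, 0, 0).

    Hypothetical helper for illustration; assumes comma-separated
    integer channels.
    """
    cleaned = re.sub(r'!important', '', value.lower(), flags=re.IGNORECASE)
    digits = re.sub(r'[a-z()]+', '', cleaned)  # same character strip as in the diff
    return tuple(int(part) for part in digits.split(','))

def rgb_to_hex(rgb):
    # Copied from html4docx/utils.py in this commit.
    return '#' + ''.join(f'{i:02X}' for i in rgb)

colors = parse_rgb('rgb(255, 0, 0) !important')
print(colors)               # (255, 0, 0)
print(rgb_to_hex(colors))   # #FF0000
```

`int()` tolerates surrounding whitespace, so the space left behind by the removed `!important` does not need separate trimming here.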

if 'background-color' in style:
background_color = style['background-color'].lower()
background_color = re.sub('!important', '', style['background-color'].lower(), flags=re.IGNORECASE)

if 'rgb' in background_color:
color = re.sub(r'[a-z()]+', '', background_color)
@@ -361,7 +363,7 @@ def handle_table(self, current_attrs):

def handle_div(self, current_attrs):
# handle page break
if 'style' in current_attrs and "page-break-after: always" in current_attrs['style']:
if 'style' in current_attrs and 'page-break-after: always' in current_attrs['style']:
self.doc.add_page_break()

def handle_link(self, href, text):
@@ -435,7 +437,7 @@ def handle_starttag(self, tag, attrs):
elif tag == 'li':
self.handle_li()

elif tag == "hr":
elif tag == 'hr':
# This implementation was taken from:
# https://github.com/python-openxml/python-docx/issues/105#issuecomment-62806373
self.paragraph = self.doc.add_paragraph()
@@ -473,7 +475,7 @@ def handle_starttag(self, tag, attrs):
self.handle_table(current_attrs)
return

elif tag == "div":
elif tag == 'div':
self.handle_div(current_attrs)

# set new run reference point in case of leading line breaks
3 changes: 3 additions & 0 deletions html4docx/utils.py
@@ -44,6 +44,9 @@ def is_url(url):
parts = urlparse(url)
return all([parts.scheme, parts.netloc, parts.path])

def px_to_inches(px):
return px * 0.0104166667

def rgb_to_hex(rgb):
return '#' + ''.join(f'{i:02X}' for i in rgb)
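
The constant in `px_to_inches` is 1/96 rounded to ten decimal places: CSS defines one pixel as 1/96 of an inch. A small sketch that spells the ratio out is shown below; `PX_PER_INCH` is a name introduced here for illustration and does not appear in the commit.

```python
# CSS reference pixel: 96 px per inch, so 1 px = 1/96 in (about 0.0104166667).
PX_PER_INCH = 96

def px_to_inches(px):
    """Convert CSS pixels to inches using the 96 px-per-inch convention."""
    return px / PX_PER_INCH

print(px_to_inches(96))  # 1.0
print(px_to_inches(48))  # 0.5
```

Dividing by 96 avoids the small rounding error carried by the truncated decimal, although for font sizes the difference is negligible.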

4 changes: 2 additions & 2 deletions setup.cfg
@@ -1,6 +1,6 @@
[metadata]
name = html-for-docx
version = 1.0.1
version = 1.0.2
url = https://github.com/dfop02/html4docx
project_urls =
Bug Tracker = https://github.com/dfop02/html4docx/issues
@@ -38,6 +38,6 @@ install_requires =
beautifulsoup4 >= 4.12.2

[flake8]
exclude = build,.git,tests/,html4docx/__init__.py,setup.py
exclude = build,.git,html4docx/__init__.py,setup.py
ignore = W504,W601,W292,E261,E302,E305
max-line-length = 127
2 changes: 1 addition & 1 deletion setup.py
@@ -3,7 +3,7 @@

here = os.path.abspath(os.path.dirname(__file__))
README = open(os.path.join(here, 'README.md')).read()
VERSION = '1.0.1'
VERSION = '1.0.2'

setup(
name = 'html-for-docx',
2 changes: 0 additions & 2 deletions tests/context.py
@@ -1,6 +1,4 @@
import os
import sys
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

from html4docx import HtmlToDocx
test_dir = 'tests'