Closed
Labels
- is-robustness-issue: From a user's perspective, this is about robustness
- workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
Description
See #1269 for further details.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-5.4.0-122-generic-x86_64-with-glibc2.29
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3
Code + PDF
This is a minimal, complete example that shows the issue:
import PyPDF2
with open("Nano Energy 2022_2D Mosaic Heterostructure Overall Water Splitting.pdf", "rb") as f:
    pdfreader = PyPDF2.PdfFileReader(f, strict=False)
    content = " ".join([page.extractText() for page in pdfreader.pages])
PDF used above: Nano Energy 2022_2D Mosaic Heterostructure Overall Water Splitting.pdf
More examples:
- Vigneau et al (2022) Food Qual Prefer.pdf
- Hossain et al. - 2021 - Marine Policy.pdf
- Cardello et al (2022) Food Qual Prefer.pdf
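Until the underlying parsing problem is fixed, one possible workaround is to guard extraction page by page so that a single failing page does not abort the whole document. The following is a minimal sketch, not part of PyPDF2: the helper name `extract_text_per_page` is hypothetical, and it only assumes that each page object exposes an `extract_text()` method.

```python
from typing import Iterable, List, Tuple


def extract_text_per_page(pages: Iterable) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Hypothetical helper: extract text from each page, recording failures.

    Assumes every page object has an ``extract_text()`` method.
    Returns the per-page texts (empty string for pages that failed) and a
    list of (page_index, error_description) tuples.
    """
    texts: List[str] = []
    failures: List[Tuple[int, str]] = []
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except Exception as exc:  # deliberately broad: any parsing error is recorded
            texts.append("")
            failures.append((index, f"{type(exc).__name__}: {exc}"))
    return texts, failures
```

With the reproduction script above, this would be called as `texts, failures = extract_text_per_page(pdfreader.pages)` inside the `with` block. It hides the symptom only; the affected pages still yield no text.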
Traceback
This is the complete Traceback I see:
Traceback (most recent call last):
  File "test4.py", line 4, in <module>
    content = " ".join([page.extractText() for page in pdfreader.pages])
  File "test4.py", line 4, in <listcomp>
    content = " ".join([page.extractText() for page in pdfreader.pages])
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1538, in extractText
    return self.extract_text()
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1510, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1146, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_cmap.py", line 22, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_cmap.py", line 185, in parse_to_unicode
    process_rg, process_char = process_cm_line(
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_cmap.py", line 245, in process_cm_line
    parse_bfrange(l, map_dict, int_entry)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_cmap.py", line 275, in parse_bfrange
    unhexlify(fmt % a).decode(
binascii.Error: Odd-length string
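The final frame shows `unhexlify` being called on a value formatted with `fmt % a`. `binascii.unhexlify` requires an even number of hex digits, so a format width that produces an odd number of digits raises exactly this error. The sketch below reproduces the error in isolation. The helper name `try_unhexlify` and the way the format string is built from a width are assumptions made for illustration, not a copy of the PyPDF2 source.

```python
import binascii


def try_unhexlify(value: int, width: int) -> str:
    """Format ``value`` as zero-padded uppercase hex of ``width`` digits,
    then decode it with ``binascii.unhexlify``.

    Returns the decoded bytes as a hex string, or the error message if
    ``unhexlify`` rejects the input.
    """
    # Assumption: the width comes from the length of a hex token, which may
    # be odd in a malformed CMap. For width=4 this builds b"%04X".
    fmt = b"%%0%dX" % width
    hex_digits = fmt % value
    try:
        return binascii.unhexlify(hex_digits).hex()
    except binascii.Error as exc:
        return f"binascii.Error: {exc}"


print(try_unhexlify(0x41, 4))  # even width: b"0041" decodes fine
print(try_unhexlify(0x41, 3))  # odd width: b"041" is rejected
```

If an odd-length hex token is indeed the cause for these PDFs, padding the width up to the next even number before formatting would avoid the exception. Whether that yields the correct character mapping would need to be checked against the affected files.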