Ligature issue when converting PDF to text #1351

Open
gargarvin opened this issue Sep 16, 2022 · 5 comments
Labels
help wanted: We appreciate help everywhere - this one might be an easy start!
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow

Comments

@gargarvin

gargarvin commented Sep 16, 2022

I am having a ligature issue with this PDF.
The 'fi', 'fl' and 'ff' ligatures are extracted as NULL characters.

#598 is similar to this issue.

MVCE: Code + PDF

from PyPDF2 import PdfReader

reader = PdfReader("Inspection_redacted.pdf")
for page in reader.pages:
    print(page.extract_text())

PDF

@pubpub-zz
Collaborator

I did a quick analysis of the first page.
Using some debug traces, I analysed the line starting with PLUMBING SYSTEM - FAUCETS, VALVES AND CONNECTED FIXTURES:
looking at the sequence: ut off ha
The font I identified is F1.
Its transcoding table is the following:

8 beginbfchar
<03> <0020>
<05> <0022>
<18> <0035>
<1B> <0038>
<1D> <003A>
<62> <00A0>
<E9> <0000>
<EA> <0000>
endbfchar
6 beginbfrange
<09> <16> <0026>
<24> <2C> <0041>
<2E> <3D> <004B>
<44> <4C> <0061>
<4E> <53> <006B>
<55> <5C> <0072>
endbfrange

The following codes are transcoded and added (ut o\x00 ha):

b'\x00X' -> u
b'\x00W' -> t
b'\x00\x03' -> (space)
b'\x00R' -> o
b'\x00\xe9' -> (\x00)
b'\x00\x03' -> (space)
b'\x00K' -> h
b'\x00D' -> a
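The lookup traced above can be reproduced outside the library. The following is a minimal sketch, not PyPDF2's actual implementation: `parse_to_unicode` and `decode_codes` are hypothetical helper names, and the sketch assumes the character codes are two bytes wide with a zero high byte, as in the debug trace. It shows that code E9 is mapped to U+0000 by the table itself, so the NUL in the output comes from the font's mapping rather than from a decoding error.

```python
import re

# Transcoding table of font F1, copied from the comment above.
CMAP = """
8 beginbfchar
<03> <0020>
<05> <0022>
<18> <0035>
<1B> <0038>
<1D> <003A>
<62> <00A0>
<E9> <0000>
<EA> <0000>
endbfchar
6 beginbfrange
<09> <16> <0026>
<24> <2C> <0041>
<2E> <3D> <004B>
<44> <4C> <0061>
<4E> <53> <006B>
<55> <5C> <0072>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_to_unicode(cmap_text):
    """Build a {character code: unicode string} dict (hypothetical helper)."""
    mapping = {}
    # bfchar: one source code maps to one UTF-16BE encoded string.
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, section):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange: consecutive source codes map to consecutive code points.
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, section):
            start = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping


def decode_codes(raw, mapping):
    """Decode 2-byte character codes (assumption: fixed width of two bytes)."""
    codes = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    return "".join(mapping.get(code, "\ufffd") for code in codes)


to_unicode = parse_to_unicode(CMAP)
raw = b"\x00X\x00W\x00\x03\x00R\x00\xe9\x00\x03\x00K\x00D"
decoded = decode_codes(raw, to_unicode)
print(repr(decoded))
```

The plain letter f (code 49) is covered by the `<44> <4C> <0061>` range, so only the ligature glyph is affected.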

When using SumatraPDF and pdfminer.six, I get the same result with '\x00'. The only tool that seems to report the text properly (using copy-paste) is Acrobat Reader, but I don't know where it gets the result from.

Help analysing this case would be welcome (@MartinThoma, can you set the labels accordingly?)

@MartinThoma MartinThoma added help wanted We appreciate help everywhere - this one might be an easy start! workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow labels Sep 17, 2022
@gargarvin
Author

Also of note - this tool seems to be able to convert the PDF successfully without using any sort of OCR.

@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Sep 24, 2022
@PavelHightTower

I worked around it like this. The 'ff' case does not behave like the others, which is why I replace chr(0) with 'ff'.

page.extract_text().translate(str.maketrans({chr(0): 'ff', 0xFB01: 'fi', 0xFB02: 'fl', 0xFB03: 'ffi', 0xFB04: 'ffl'}))

@gargarvin
Author

The above method seems to replace every ligature with 'ff'. I also noticed my original PDF does not load so here it is again.
Inspection_redacted.pdf
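The behaviour reported here follows from the mapping itself: once several ligatures have been extracted as the same NUL character, a translation table has no way to tell them apart. A small sketch with a made-up string (the text is hypothetical, and it assumes 'fi' also comes out as NUL, as the original report describes):

```python
# Hypothetical extracted text: "off" and "fixture" where both ligatures
# were returned as chr(0), plus one real U+FB01 ligature for comparison.
extracted = "shut o\x00 the \x00xture, \ufb01ne"

table = str.maketrans({
    chr(0): "ff",   # every NUL becomes 'ff', whatever ligature it stood for
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
})

fixed = extracted.translate(table)
print(fixed)
```

Only the real U+FB01 character is restored correctly; the NUL that stood for 'fi' becomes 'ff' as well.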

@ssjkamei
Contributor

ssjkamei commented Oct 2, 2024

ActualText

I followed the code here, and it seems that this is not a mapping issue; the data needs to be retrieved separately.

I think the problem can be solved by replacing the 0000 with the /ActualText value of the enclosing marked-content operator, which appears as operator = BDC, operands = ['/Span', {'/ActualText': b'\xfe\xff\x00f\x00i'}].
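The idea can be sketched as follows. This is an illustration only, not the library's code: `decode_actual_text` and `text_with_actual_text` are hypothetical names, the operation list is invented data shaped like the operands quoted above (the bytes type of the operators is an assumption), and nested marked-content sections are not handled.

```python
def decode_actual_text(raw):
    """Decode an /ActualText byte string (hypothetical helper)."""
    if raw.startswith(b"\xfe\xff"):
        # A leading FE FF byte order mark signals UTF-16BE.
        return raw[2:].decode("utf-16-be")
    # Assumption: latin-1 as a rough stand-in for PDFDocEncoding.
    return raw.decode("latin-1")


def text_with_actual_text(operations):
    """Join shown text, preferring /ActualText inside BDC ... EMC spans."""
    out = []
    replacement = None  # ActualText of the currently open span, if any
    for operands, operator in operations:
        if operator == b"BDC":
            props = operands[1] if len(operands) == 2 else None
            if isinstance(props, dict) and "/ActualText" in props:
                replacement = decode_actual_text(props["/ActualText"])
                out.append(replacement)
        elif operator == b"EMC":
            replacement = None
        elif operator == b"Tj" and replacement is None:
            # Text inside a span that has ActualText is skipped, because the
            # replacement was already emitted when the span opened.
            out.append(operands[0])
    return "".join(out)


# Invented operations: each ligature is shown as NUL but wrapped in a /Span.
operations = [
    (["shut o"], b"Tj"),
    (["/Span", {"/ActualText": b"\xfe\xff\x00f\x00f"}], b"BDC"),
    (["\x00"], b"Tj"),
    ([], b"EMC"),
    ([" the "], b"Tj"),
    (["/Span", {"/ActualText": b"\xfe\xff\x00f\x00i"}], b"BDC"),
    (["\x00"], b"Tj"),
    ([], b"EMC"),
    (["xture"], b"Tj"),
]

result = text_with_actual_text(operations)
print(result)
```

Unlike the translate workaround, this keeps 'ff' and 'fi' distinct, because each span carries its own replacement text.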

stefan6419846 pushed a commit that referenced this issue Oct 4, 2024
This is a fix for the problem that occurred when #2882 was changed.

The string length was checked after the cmap conversion, but a single character code can convert to a string of more than one character, so the length cannot be measured accurately at that point.

This matters, for example, when deciding whether to measure the distance from the ligature or from the base character corresponding to the ligature in fixing #1351.

The change in handle_tj is because it cannot pass Ruff's check.
Error: PLR0915 Too many statements (nnn > 176)

The following code is only used to get the character code for a space.
However, I think it would be better to split the code into parts for obtaining the character code.
Style changes are considered in another PR.
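The length problem described in the commit message can be illustrated with a toy mapping. The table below is hypothetical (it is not the table from the PDF in this issue); it only shows that the number of character codes and the length of the converted string can differ.

```python
# Hypothetical cmap in which one character code expands to two characters.
cmap = {0x58: "u", 0xE9: "ff"}

codes = [0x58, 0xE9]                      # two glyphs shown on the page
decoded = "".join(cmap[code] for code in codes)

print(len(codes), len(decoded))           # glyph count vs. string length
```

Any width or spacing estimate based on `len(decoded)` would count three characters where only two glyphs were drawn.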