Ligature issue when converting PDF to text #1351

gargarvin · 2022-09-16T17:29:47Z

I am having a ligature issue with this PDF.
The 'fi', 'fl' and 'ff' ligatures are returned as NULL characters.

#598 is similar to this issue.

MVCE: Code + PDF

from PyPDF2 import PdfReader

reader = PdfReader("Inspection_redacted.pdf")
for page in reader.pages:
    print(page.extract_text())

PDF

pubpub-zz · 2022-09-17T08:21:28Z

I did a quick analysis of the first page.
Using some debug traces, I analysed the line starting with PLUMBING SYSTEM - FAUCETS, VALVES AND CONNECTED FIXTURES:
looking at the sequence: ut oﬀ ha
The font I identified is F1.
Its transcoding table is the following:

8 beginbfchar
<03> <0020>
<05> <0022>
<18> <0035>
<1B> <0038>
<1D> <003A>
<62> <00A0>
<E9> <0000>
<EA> <0000>
endbfchar
6 beginbfrange
<09> <16> <0026>
<24> <2C> <0041>
<2E> <3D> <004B>
<44> <4C> <0061>
<4E> <53> <006B>
<55> <5C> <0072>
endbfrange

The following codes are transcoded and added (ut oﬀ ha):

b'\x00X' -> u
b'\x00W' -> t
b'\x00\x03' -> (space)
b'\x00R' -> o
b'\x00\xe9' -> (\x00)
b'\x00\x03' -> (space)
b'\x00K' -> h
b'\x00D' -> a
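
To make the trace above reproducible without the PDF, here is a minimal sketch that parses the transcoding table quoted in this comment and decodes the same eight codes. It is not PyPDF2's implementation: `parse_cmap` and `decode` are hypothetical helper names, only the simple `<lo> <hi> <start>` form of `bfrange` is handled, and the two-byte codes from the trace are treated as plain integers.

```python
import re

# Transcoding table for font F1, copied from the comment above.
CMAP = """
8 beginbfchar
<03> <0020>
<05> <0022>
<18> <0035>
<1B> <0038>
<1D> <003A>
<62> <00A0>
<E9> <0000>
<EA> <0000>
endbfchar
6 beginbfrange
<09> <16> <0026>
<24> <2C> <0041>
<2E> <3D> <004B>
<44> <4C> <0061>
<4E> <53> <006B>
<55> <5C> <0072>
endbfrange
"""


def parse_cmap(text):
    """Build a {character code: unicode string} dict (hypothetical helper)."""
    mapping = {}
    section = None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith("beginbfchar"):
            section = "char"
        elif line.endswith("beginbfrange"):
            section = "range"
        elif line.startswith("end"):
            section = None
        elif section:
            values = [int(h, 16) for h in re.findall(r"<([0-9A-Fa-f]+)>", line)]
            if section == "char" and len(values) == 2:
                mapping[values[0]] = chr(values[1])
            elif section == "range" and len(values) == 3:
                lo, hi, start = values
                for offset in range(hi - lo + 1):
                    mapping[lo + offset] = chr(start + offset)
    return mapping


def decode(codes, mapping):
    """Map each code through the table; unknown codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


mapping = parse_cmap(CMAP)
# b'\x00X', b'\x00W', b'\x00\x03', ... from the trace, as integers.
codes = [0x58, 0x57, 0x03, 0x52, 0xE9, 0x03, 0x4B, 0x44]
decoded = decode(codes, mapping)
print(repr(decoded))
```

Code 0xE9 is mapped to U+0000 by the table itself, so the NULL comes from the font's ToUnicode data rather than from the decoding step.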

When using SumatraPDF and pdfminer.six, I get the same results with '\x00'. The only tool that seems to report the text properly (using copy-paste) is Acrobat Reader, but I don't know where it gets its results from.

Help analysing this case would be welcome (@MartinThoma, can you set the labels accordingly?)

gargarvin · 2022-09-19T16:00:03Z

Also of note - this tool seems to be able to convert the PDF successfully without using any sort of OCR.

PavelHightTower · 2023-11-28T18:47:57Z

I resolved it like this. The 'ff' case does not work like the others, which is why I map chr(0) to 'ff'.

page.extract_text().translate(str.maketrans({chr(0): 'ff', 0xFB01: 'fi', 0xFB02: 'fl', 0xFB03: 'ffi', 0xFB04: 'ffl'}))

gargarvin · 2023-11-28T22:32:52Z

The above method seems to replace every ligature with 'ff'. I also noticed my original PDF does not load so here it is again.
Inspection_redacted.pdf
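
A possible reading of the behaviour reported here: if several ligatures in this PDF all come back as chr(0), as the original report says, a single chr(0) replacement cannot tell them apart. The sketch below is a minimal illustration under that assumption. It expands only the real Unicode ligature code points and leaves chr(0) untouched; `LIGATURES` and `expand_ligatures` are hypothetical names, not part of PyPDF2.

```python
# Unicode "Alphabetic Presentation Forms" ligatures and their expansions.
LIGATURES = {
    0xFB00: "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
}


def expand_ligatures(text: str) -> str:
    """Expand Unicode ligatures; NUL characters are deliberately left alone."""
    return text.translate(LIGATURES)


# Made-up sample: two real ligature code points and one NUL placeholder.
sample = "shut o\ufb00, \ufb01rst \ufb02oor, o\x00ice"
expanded = expand_ligatures(sample)
print(repr(expanded))
```

The NUL in the sample survives the translation, so this does not fix the affected PDF; it only avoids turning every NUL into 'ff'.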

ssjkamei · 2024-10-02T18:54:31Z

I followed the code here and it seems that it is not a mapping issue, but that the data needs to be retrieved separately.

I think the problem can be solved by replacing the 0000 with the part that appears as operator = BDC, operands = ['/Span', {'/ActualText': b'\xfe\xff\x00f\x00i'}].
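
The idea above can be sketched as follows. This is an illustration only, not the code in PyPDF2: the `operations` list is a hand-written, simplified stand-in for a parsed content stream (pairs of operands and operator), `decode_actual_text` and `extract` are hypothetical helpers, and nested marked-content sequences are not handled.

```python
def decode_actual_text(raw: bytes) -> str:
    """Decode an /ActualText value; a leading FE FF marks UTF-16BE."""
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Simplification: treat anything else as Latin-1.
    return raw.decode("latin-1")


def extract(operations):
    """Join shown text, preferring /ActualText inside BDC ... EMC spans."""
    parts = []
    in_replaced_span = False
    for operands, operator in operations:
        if operator == b"BDC":
            props = operands[1] if len(operands) > 1 else None
            raw = props.get("/ActualText") if isinstance(props, dict) else None
            if raw is not None:
                parts.append(decode_actual_text(raw))
                in_replaced_span = True
        elif operator == b"EMC":
            in_replaced_span = False
        elif operator == b"Tj" and not in_replaced_span:
            parts.append(operands[0])
    return "".join(parts)


# Hypothetical stream: the ligature glyph decodes to "\x00" via the CMap,
# but the enclosing span carries the real text in /ActualText.
operations = [
    (["/Span", {"/ActualText": b"\xfe\xff\x00f\x00i"}], b"BDC"),
    (["\x00"], b"Tj"),
    ([], b"EMC"),
    (["rst"], b"Tj"),
]
text = extract(operations)
print(repr(text))
```

Here the text shown inside the span is dropped in favour of the `/ActualText` value, so the NUL placeholder never reaches the output.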

This is a fix for a problem introduced by the change in #2882. The string length of characters was checked after CMap conversion, but after conversion a string can be more than one character long, so the length cannot be measured accurately. This matters, for example, when deciding whether to measure the distance from the ligature or from the base characters corresponding to the ligature when fixing #1351.

handle_tj was changed because it otherwise fails Ruff's check: PLR0915 Too many statements (nnn > 176). The following code is only used to get the character code for a space. However, I think it would be better to split out the code for obtaining the character code. Style changes are considered in another PR.

MartinThoma added the help wanted and workflow-text-extraction labels Sep 17, 2022

pubpub-zz mentioned this issue Sep 18, 2022

extract_text doesn't extract ligatures correctly #598

Closed

MartinThoma added the is-bug label Sep 24, 2022

ssjkamei mentioned this issue Sep 28, 2024

BUG: Issue in text extraction (spaces) (#1153) #2882

Merged

This was referenced Oct 2, 2024

Spaces (that do not exist in the original PDF) appear in the output of extract_text() #2336

Closed

MAINT: Unnecessary character mapping process #2888

Merged
