Incomplete text values of ligatures #2787

pulsar314 · 2023-11-06T13:56:45Z

pulsar314
Nov 6, 2023

Description

Text extracted from PDF files may contain incomplete values for ligatures (and probably other glyphs) that have composite names. More precisely, only the first Unicode character is present. In the file skakvattenbad_bs_milmedtek.pdf there are problems with fi in "Specifications & Ordering Information" (extracted as "Specifcations"), fl in "the bath fluid" ("fuid"), and tt in "Multiple LED displays for setting various values" / "various ways of glassware settings" ("seting" / "setings").

Apparently, the issue is with decoding non-standard glyph names (names not in the Adobe Glyph List). The examples above are the /f_i, /f_l and /t_t glyphs, respectively. According to the procedure described at https://github.com/adobe-type-tools/agl-specification#2-the-mapping, they should be mapped to the sequences /f/i, /f/l and /t/t.
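To make the referenced procedure concrete, here is a minimal sketch of the AGL mapping steps (drop everything from the first period, split on underscores, map each component, concatenate). The names `AGL_SUBSET`, `component_to_text` and `glyph_name_to_text` are made up for illustration, the dictionary is only a tiny stand-in for the full glyph list, and the Zapf Dingbats branch of the specification is left out. It is not a description of what PyMuPDF or MuPDF does internally.

```python
import re

# Tiny illustrative subset of the Adobe Glyph List (an assumption for this
# sketch); a real implementation would load the complete glyphlist.txt.
AGL_SUBSET = {
    "f": "f", "i": "i", "l": "l", "t": "t",
    "fi": "\uFB01", "fl": "\uFB02", "space": " ",
}

_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")   # uniXXXX[XXXX...]
_U = re.compile(r"u([0-9A-F]{4,6})")          # uXXXX to uXXXXXX


def _valid_scalar(cp):
    """True for code points outside the surrogate range."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def component_to_text(component):
    """Map one underscore-separated component to a string ('' if unknown)."""
    if component in AGL_SUBSET:
        return AGL_SUBSET[component]
    m = _UNI.fullmatch(component)
    if m:
        digits = m.group(1)
        cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        return "".join(map(chr, cps)) if all(map(_valid_scalar, cps)) else ""
    m = _U.fullmatch(component)
    if m:
        cp = int(m.group(1), 16)
        return chr(cp) if _valid_scalar(cp) else ""
    return ""


def glyph_name_to_text(name):
    """Apply the AGL steps: strip suffix after '.', split on '_', map, join."""
    base = name.lstrip("/").split(".", 1)[0]
    return "".join(component_to_text(c) for c in base.split("_"))


for glyph in ("/f_i", "/f_l", "/t_t"):
    print(glyph, "->", glyph_name_to_text(glyph))
```

With this sketch the three glyph names from the file come out as the two-letter sequences "fi", "fl" and "tt" rather than only their first letter.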

This behavior does not seem to depend on the presence or absence of explicit /ToUnicode mappings.

How to reproduce

Read text from the PDF using any method: get_text, get_texttrace, etc.

Expected behavior

Text extracted from the PDF contains all Unicode characters.

Configuration

Linux MINT
Python 3.8
PyMuPDF 1.23.5, installed via pip

JorjMcKie · 2023-11-07T08:40:43Z

JorjMcKie
Nov 7, 2023
Maintainer

This example file contains non-standard encodings, which makes it impossible for any text extraction tool to extract text satisfactorily.
I tried Adobe, PDF XChange, Nitro, XPDF, Foxit: none of them is able to correctly extract "glassware settings" or "bath fluid".

So this is a problem of this file, not an extraction tool issue.

0 replies

pulsar314 · 2023-11-07T09:53:56Z

pulsar314
Nov 7, 2023
Author

Hi @JorjMcKie. I used the poppler tools for comparison and they print the text properly. Firefox also lets you copy the correct text (MS Edge does not, though).

0 replies

JorjMcKie · 2023-11-07T10:32:12Z

JorjMcKie
Nov 7, 2023
Maintainer

Whatever - the fact remains that the PDF uses non-standard encoding for some fonts. So there is no basis for claiming a specific text extraction behavior.

Some tools may be successful by applying guesswork - potentially based on the name of the font or whatever.

Which browser appears to be doing things right or "wrong" depends on which PDF plugin it uses - not on the browser itself.

0 replies

pulsar314 · 2023-11-07T11:55:33Z

pulsar314
Nov 7, 2023
Author

I based my assumptions about the behavior on the PDF specification (PDF 32000-1:2008 (1.7)). Section 9.10.2, "Mapping Character Codes to Unicode Values", option 2 refers to the Adobe Glyph List, which I found here: https://github.com/adobe-type-tools/agl-specification.

If this is not supposed to be mandatory, would it be possible to add a specific font identifier (the name from the page's /Fonts or its xref) to the text data? Perhaps to the output of the get_texttrace method. Since the given names are not unique and can be omitted completely, it is impossible to recover these details outside of the library.

0 replies

JorjMcKie · 2023-11-07T12:03:24Z

JorjMcKie
Nov 7, 2023
Maintainer

No font is required to have this mapping - at all or for all the glyphs it contains.
A font may be a sloppily made thing: the glyph mapping back to Unicode may be incomplete.
Some creators may even deliberately omit that mapping, to make text extraction impossible.

If ToUnicode is missing or incomplete, nothing on earth can be done - except OCR.
If a PDF creator does not want to use a ToUnicode, they may specify a standard encoding instead (e.g. WinAnsiEncoding), which is often done. But here, too, special glyphs are often used that are not covered by the chosen standard encoding - and you end up with this type of problem.

0 replies

JorjMcKie · 2023-11-07T12:17:11Z

JorjMcKie
Nov 7, 2023
Maintainer

BTW the font name is already part of page.get_texttrace() and also appears in page.get_text("dict"). By default this name omits the subset font identifier (like "ABCDEF+"). Setting a global option will include it, however. Based on this info, the xref of the font can be determined.
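A rough sketch of that lookup, assuming the span font name is compared against the fourth item of each tuple returned by page.get_fonts(). The helper name `candidate_font_xrefs` is made up for illustration, and the sample tuples are copied from the page.get_fonts() listing that appears later in this thread, so no PDF file is needed to run it.

```python
def candidate_font_xrefs(fonts, span_font):
    """Return the xrefs of all font entries whose basefont equals span_font.

    `fonts` is a list of tuples shaped like page.get_fonts() output:
    (xref, ext, type, basefont, name, encoding).
    """
    return [entry[0] for entry in fonts if entry[3] == span_font]


# Sample data copied from the get_fonts() listing later in this thread.
fonts = [
    (478, "n/a", "TrueType", "Calibri", "F1", "WinAnsiEncoding"),
    (480, "n/a", "TrueType", "SimHei", "F2", "WinAnsiEncoding"),
    (482, "ttf", "Type0", "LVJWVW+SimHei", "F3", "Identity-H"),
    (496, "n/a", "TrueType", "SimHei", "KSPF1", "WinAnsiEncoding"),
    (742, "n/a", "Type0", "SimHei", "KSPF466", "GBK-EUC-H"),
]

print(candidate_font_xrefs(fonts, "LVJWVW+SimHei"))  # [482]
print(candidate_font_xrefs(fonts, "SimHei"))         # [480, 496, 742]
```

The second call shows the limitation of a name-based lookup: when several font objects share a name, the result is a list of candidates, not a single xref.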

BTW I will move this post to "Discussions" - it is no issue in any sense of the word.

0 replies

pulsar314 · 2023-11-07T12:28:58Z

pulsar314
Nov 7, 2023
Author

For the example PDF (and another one I've seen), following the described procedure gives the desired result. It looks like part of the standard to me and could be included in the implementation.

The problem here is that you need the exact font object to get its encoding object (in this case WinAnsiEncoding, btw.) and its /Differences. This information is unavailable in Python code. It is not ideal, but having the xref of the font object would help people implement their own workarounds.

The available font names are not unique: several font objects can apparently have identical names (including the plus part). Besides, the specification allows the name to be omitted in some cases (for Type 3 fonts, as far as I can see).
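As an illustration of what such a workaround might do once the /Differences array is in hand, here is a small sketch that turns the array text into a code-to-glyph-name table. The function name `parse_differences` and the sample array are hypothetical (the real array of the example file is not quoted in this thread). Fetching the array text from a known font xref, for instance with something like doc.xref_get_key(xref, "Encoding"), is an assumption here and not part of the sketch.

```python
import re

# A token is either a glyph name (/name) or an integer start code.
_TOKEN = re.compile(r"/([^\s/\[\]]+)|(\d+)")


def parse_differences(array_text):
    """Parse a PDF /Differences array into {character code: glyph name}.

    An integer sets the current code; each following name is assigned to
    the current code, which then advances by one.
    """
    mapping = {}
    code = None
    for name, number in _TOKEN.findall(array_text):
        if number:
            code = int(number)
        elif code is not None:
            mapping[code] = name
            code += 1
    return mapping


# Hypothetical array, only for demonstration.
sample = "[ 28 /f_i /f_l 30 /t_t ]"
print(parse_differences(sample))  # {28: 'f_i', 29: 'f_l', 30: 't_t'}
```

The resulting glyph names could then be decoded with the AGL procedure linked in the first post.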

1 reply

heweisheng Dec 5, 2023

I am unable to copy some of the text from this PDF document, but if I use WPS's PDF-to-Word function I can copy the text from the resulting Word file. How can I handle this kind of problematic file?
乱码.pdf

JorjMcKie · 2023-12-05T11:36:49Z

JorjMcKie
Dec 5, 2023
Maintainer

The page uses multiple fonts, one of which causes those results:

In [5]: page.get_fonts()
Out[5]:
[(478, 'n/a', 'TrueType', 'Calibri', 'F1', 'WinAnsiEncoding'),
 (480, 'n/a', 'TrueType', 'SimHei', 'F2', 'WinAnsiEncoding'),
 (482, 'ttf', 'Type0', 'LVJWVW+SimHei', 'F3', 'Identity-H'),
 (496, 'n/a', 'TrueType', 'SimHei', 'KSPF1', 'WinAnsiEncoding'),
 (742, 'n/a', 'Type0', 'SimHei', 'KSPF466', 'GBK-EUC-H')]
In [6]: fitz.TOOLS.set_subset_fontnames(True)
Out[6]: True
In [7]: for b in page.get_text("dict",flags=fitz.TEXTFLAGS_TEXT)["blocks"]:
   ...:     for l in b["lines"]:
   ...:         for s in l["spans"]:
   ...:             print(s["font"],":",s["text"])
   ...:
Calibri :
Calibri :
Calibri :
Calibri :
Calibri :
Calibri :
SimHei :
SimHei : 2020
LVJWVW+SimHei :  �������
SimHei : P200
LVJWVW+SimHei : �������
SimHei : ：
SimHei : 含驾驶人员工资、车辆油费等所有费用

The � characters occur when the character mapping table (CMap, pointed to by the PDF key /ToUnicode) does not contain a translation for a given glyph number.
Apart from OCR-ing the page, nothing can be done here.
Other tools may try some guesswork and be even successful with ...
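To find out which fonts on a page are affected, one possible approach is to count the U+FFFD replacement characters per font in the structure returned by page.get_text("dict"). The sketch below works on such a dictionary; the helper name `count_unmapped_by_font` and the sample dictionary are made up for illustration and only mimic the keys used in the loop above ("blocks", "lines", "spans", "font", "text").

```python
from collections import Counter

REPLACEMENT = "\ufffd"  # the character shown when no Unicode value is known


def count_unmapped_by_font(page_dict):
    """Count replacement characters per font name in a get_text('dict') result."""
    counts = Counter()
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                n = span["text"].count(REPLACEMENT)
                if n:
                    counts[span["font"]] += n
    return dict(counts)


# Hand-built sample shaped like get_text("dict") output (not real data).
sample = {
    "blocks": [
        {"lines": [{"spans": [
            {"font": "SimHei", "text": "2020"},
            {"font": "LVJWVW+SimHei", "text": "\ufffd\ufffd\ufffd"},
        ]}]},
        {"type": 1},  # stands in for an image block without text lines
    ]
}
print(count_unmapped_by_font(sample))  # {'LVJWVW+SimHei': 3}
```

Fonts that show up in the result are the ones for which OCR or some other fallback would be needed.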

0 replies
