Replies: 8 comments 1 reply
-
This example file contains non-standard encodings, which makes it impossible for any text extraction tool to extract text satisfactorily. So this is a problem of this file, not an extraction tool issue.
-
Hi @JorjMcKie. I used the poppler tools for comparison, and they print the text properly. Firefox also lets you copy the correct text (MS Edge does not).
-
Whatever - the fact remains that the PDF uses non-standard encoding for some fonts. So there is no basis for claiming a specific text extraction behavior. Some tools may be successful by applying guesswork - potentially based on the name of the font or whatever. Which browser appears to be doing things right or "wrong" depends on which PDF plugin it uses - not on the browser itself.
-
I based my assumptions for the behavior on PDF specification (PDF 32000-1:2008 (1.7)). If this is not supposed to be mandatory, would it be possible to add the specific font identifier (name from page's |
-
No font is required to have this mapping - at all or for all the glyphs it contains. If |
-
BTW the font name already is part of BTW I will move this post to "Discussions" - it is no issue in any sense of the word. |
-
For the PDF from the example (and another one I've seen), following the described procedures gives the desired result. They seem to be part of the standard to me and could be included in the implementation. The problem here is that you need the exact font object to get the encoding object (in this case WinAnsiEncoding, btw.) and its Available font names are not unique: several font objects can apparently have identical names (including the plus part). Besides, the specification allows not specifying any name in some cases (for Type 3 fonts, as far as I can see).
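To illustrate the point about non-unique names, here is a minimal sketch that groups font entries by their base font name and reports names shared by more than one font object. It assumes tuples laid out as (xref, ext, type, basefont, reference name, encoding), which is the shape of the `page.get_fonts()` output posted in this thread; the helper names `fonts_by_basename` and `ambiguous_names` are made up for this example and are not part of PyMuPDF.

```python
from collections import defaultdict

# Sample data copied from the page.get_fonts() output shown in this thread.
FONTS = [
    (478, "n/a", "TrueType", "Calibri", "F1", "WinAnsiEncoding"),
    (480, "n/a", "TrueType", "SimHei", "F2", "WinAnsiEncoding"),
    (482, "ttf", "Type0", "LVJWVW+SimHei", "F3", "Identity-H"),
    (496, "n/a", "TrueType", "SimHei", "KSPF1", "WinAnsiEncoding"),
    (742, "n/a", "Type0", "SimHei", "KSPF466", "GBK-EUC-H"),
]


def fonts_by_basename(fonts):
    """Group font tuples by base font name (hypothetical helper)."""
    groups = defaultdict(list)
    # *_rest tolerates extra trailing fields if the tuples are longer.
    for xref, _ext, _ftype, basefont, refname, encoding, *_rest in fonts:
        groups[basefont].append((xref, refname, encoding))
    return dict(groups)


def ambiguous_names(fonts):
    """Return only the names that map to more than one font object."""
    return {
        name: entries
        for name, entries in fonts_by_basename(fonts).items()
        if len(entries) > 1
    }


if __name__ == "__main__":
    for name, entries in ambiguous_names(FONTS).items():
        print(name, entries)
```

With the sample data, the name alone cannot tell the three SimHei objects apart, while the xref (first tuple item) can.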
-
The page uses multiple fonts, one of which causes those results:
In [5]: page.get_fonts()
Out[5]:
[(478, 'n/a', 'TrueType', 'Calibri', 'F1', 'WinAnsiEncoding'),
(480, 'n/a', 'TrueType', 'SimHei', 'F2', 'WinAnsiEncoding'),
(482, 'ttf', 'Type0', 'LVJWVW+SimHei', 'F3', 'Identity-H'),
(496, 'n/a', 'TrueType', 'SimHei', 'KSPF1', 'WinAnsiEncoding'),
(742, 'n/a', 'Type0', 'SimHei', 'KSPF466', 'GBK-EUC-H')]
In [6]: fitz.TOOLS.set_subset_fontnames(True)
Out[6]: True
In [7]: for b in page.get_text("dict",flags=fitz.TEXTFLAGS_TEXT)["blocks"]:
   ...:     for l in b["lines"]:
   ...:         for s in l["spans"]:
   ...:             print(s["font"],":",s["text"])
   ...:
Calibri :
Calibri :
Calibri :
Calibri :
Calibri :
Calibri :
SimHei :
SimHei : 2020
LVJWVW+SimHei : �������
SimHei : P200
LVJWVW+SimHei : �������
SimHei : :
SimHei : 含驾驶人员工资、车辆油费等所有费用
The � characters occur when the character mapping table (CMAP, pointed to by PDF key |
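As a rough way to see which fonts are behind the � characters in output like the listing above, the sketch below counts U+FFFD occurrences per span font. It works on a plain dictionary and assumes only the "blocks" / "lines" / "spans" nesting with "font" and "text" keys that the loop above relies on; `count_unmapped_by_font` is a name invented for this example, not a PyMuPDF function.

```python
from collections import Counter

REPLACEMENT = "\ufffd"  # U+FFFD, printed as � when a glyph has no Unicode


def count_unmapped_by_font(page_dict):
    """Count replacement characters per font name (hypothetical helper)."""
    counts = Counter()
    for block in page_dict.get("blocks", []):
        # Blocks without text (e.g. images) are assumed to lack "lines".
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                n = span["text"].count(REPLACEMENT)
                if n:
                    counts[span["font"]] += n
    return counts


if __name__ == "__main__":
    # Hand-made sample mimicking two spans from the listing above.
    sample = {
        "blocks": [
            {"lines": [{"spans": [
                {"font": "SimHei", "text": "2020"},
                {"font": "LVJWVW+SimHei", "text": REPLACEMENT * 7},
            ]}]}
        ]
    }
    print(count_unmapped_by_font(sample))
```

In a real session the argument would presumably be the result of `page.get_text("dict", ...)` as used above, with subset font names enabled so that the subset prefix shows up in the counts.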
-
Description
Text extracted from PDF files may contain incomplete values of ligatures (and probably other glyphs) if they have composite names. More precisely, only the first Unicode symbol is present. In this file, skakvattenbad_bs_milmedtek.pdf, there are problems with fi in "Specifications & Ordering Information" ("Specifcations"), fl in "the bath fluid" ("fuid"), and tt in "Multiple LED displays for setting various values" / "various ways of glassware settings" ("seting" / "setings").
Apparently, the issue is with decoding non-standard glyph names (names not in the Adobe Glyph List). The examples above are the /f_i, /f_l and /t_t glyphs, respectively. According to the procedure described at https://github.com/adobe-type-tools/agl-specification#2-the-mapping, they should be mapped as the sequences /f/i, /f/l and /t/t.
This behavior does not seem to depend on the presence or absence of explicit /ToUnicode mappings.
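A minimal sketch of the glyph-name mapping from the linked AGL specification, as I read it: drop everything from the first period, split the rest on underscores, and map each component via the glyph list, the `uniXXXX` form, or the `uXXXX` form. The `AGL_SUBSET` table is a tiny illustrative stand-in for the real glyph list, and the function names are made up for this example; this is not how PyMuPDF or MuPDF implement it.

```python
import re

# Tiny illustrative subset; a real implementation would load the full AGL.
AGL_SUBSET = {"f": "f", "i": "i", "l": "l", "t": "t", "space": " "}

_UNI_RE = re.compile(r"uni((?:[0-9A-F]{4})+)")  # uni + groups of 4 hex digits
_U_RE = re.compile(r"u([0-9A-F]{4,6})")         # u + 4 to 6 hex digits


def _is_scalar(cp):
    """True for code points outside the surrogate range and within Unicode."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def component_to_text(component):
    """Map one glyph-name component to text ("" if it cannot be mapped)."""
    if component in AGL_SUBSET:
        return AGL_SUBSET[component]
    m = _UNI_RE.fullmatch(component)
    if m:
        digits = m.group(1)
        cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        return "".join(map(chr, cps)) if all(map(_is_scalar, cps)) else ""
    m = _U_RE.fullmatch(component)
    if m:
        cp = int(m.group(1), 16)
        return chr(cp) if _is_scalar(cp) else ""
    return ""


def glyph_name_to_text(name):
    """Map a glyph name such as "/f_i" to its character sequence."""
    base = name.lstrip("/").split(".", 1)[0]  # drop suffix after first period
    return "".join(component_to_text(c) for c in base.split("_"))


if __name__ == "__main__":
    for glyph in ("/f_i", "/f_l", "/t_t"):
        print(glyph, "->", glyph_name_to_text(glyph))
```

Under this reading, /f_i yields the two characters "f" and "i" rather than only the first one, which matches the expected text in the examples above.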
How to reproduce
Read text from the PDF using any method (get_text, get_texttrace, etc.).
Expected behavior
Text extracted from the PDF contains all Unicode symbols.
Configuration