Get text from a pdf text stream object #2793

Mark-Joy · 2023-11-09T03:27:09Z

Mark-Joy
Nov 9, 2023

Is there a way to turn the bytes object below into a useful type in PyMuPDF and extract text from it?
I got it from a text stream object in pikepdf.

b'BT\n1 0 0 1 0 0 Tm\n/F1 23.00093 Tf\n27.60112 TL\n0.99996 0.009 -0.009 0.99996 1110 1326 Tm\n0 0 Td\n135.997 Tz\n<702e20> Tj\n47 0 Td\n138.3741 Tz\n<3234382e20> Tj\nET'

Answered by JorjMcKie

JorjMcKie · 2023-11-09T09:12:59Z

JorjMcKie
Nov 9, 2023
Maintainer

Interesting.
You could try this:

import fitz  # PyMuPDF

doc = fitz.open()  # new, empty PDF
page = doc.new_page()
page.insert_font()  # register the default font (Helvetica, name "helv") on the page
xref = doc.get_new_xref()  # create a new xref
doc.update_object(xref, "<<>>")  # make it a PDF dictionary

# your bytes string
stream = b"BT\n1 0 0 1 0 0 Tm\n/F1 23.00093 Tf\n27.60112 TL\n0.99996 0.009 -0.009 0.99996 1110 1326 Tm\n0 0 Td\n135.997 Tz\n<702e20> Tj\n47 0 Td\n138.3741 Tz\n<3234382e20> Tj\nET"

# the font F1 is unknown to the new page, so refer to Helvetica instead
stream = stream.replace(b"/F1", b"/helv")  # "helv" is the standard name for Helvetica

doc.update_stream(xref, stream)  # insert into our new object
page.set_contents(xref)  # define this to be the page's /Contents

# the text is positioned outside the page rectangle, so extract from everywhere
text = page.get_text(clip=fitz.INFINITE_RECT())
print(text)

This prints "p. 248."

This works because the original (unknown) font F1 is a "simple" font, which means the glyph number equals the Unicode code point. In more complex cases, the content in "<>" brackets contains only glyph numbers, for which you need the font file to translate back to Unicode.
We have been fortunate here.
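
To illustrate the difference, here is a minimal sketch in plain Python. The first part decodes the two hex strings from the stream above byte by byte, which is only valid under the assumption that the font is simple and its codes line up with Latin-1. The second part uses a made-up table (`HYPOTHETICAL_GLYPH_TO_UNICODE`, not taken from any real font) and a made-up hex string to show why 2-byte glyph numbers are unreadable without a mapping that has to come from the font.

```python
# Simple font: every byte of the hex string is a character code.
simple_hex = ["702e20", "3234382e20"]  # the two <...> strings from the stream
simple_text = "".join(bytes.fromhex(h).decode("latin-1") for h in simple_hex)
print(repr(simple_text))

# Hypothetical glyph table, invented for this example only.
# A real one would have to be derived from the font file.
HYPOTHETICAL_GLYPH_TO_UNICODE = {0x0033: "p", 0x0011: ".", 0x0003: " "}


def decode_glyphs(hex_string, table):
    """Read 2-byte big-endian glyph numbers and map them through `table`."""
    raw = bytes.fromhex(hex_string)
    glyph_ids = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    # Glyph numbers missing from the table become U+FFFD (replacement character).
    return "".join(table.get(g, "\ufffd") for g in glyph_ids)


# Made-up glyph string: without the table these numbers mean nothing.
complex_text = decode_glyphs("003300110003", HYPOTHETICAL_GLYPH_TO_UNICODE)
print(repr(complex_text))
```

In the second case, decoding the same bytes as Latin-1 would produce control characters and unrelated symbols, which is the situation where the font file is needed.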

2 replies

Mark-Joy Nov 10, 2023
Author

Wow, thank you for the swift answer!
I suppose if the font were included as part of the stream (bytes string), we could get the text without replacing /F1 with /helv.
I happen to know that for a simple font we can use something like the line below to get the text:
bytes.fromhex("702e203234382e20").decode(encoding="Latin1")

Is there a way to call insert_font when the font is known only in its stream form (bytes string)?

JorjMcKie Nov 10, 2023
Maintainer

Sure. page.insert_font(fontname="F1", fontbuffer=stream), with stream here being the font's binary content, would do the job and remove the need to replace b"/F1".

Depending on your ultimate goal (which I don't know, but maybe it is just getting the text), you could do your own string analysis and interpret what comes before the Tj/TJ operators.
But that can get arbitrarily tedious, and it will still only work for simple fonts (glyph number = Unicode).
If you do have the font binary, the approach above would let you decipher non-simple fonts too and spare you the string parsing effort.
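
For reference, a rough sketch of the "own string analysis" route using only the standard library. It is not a PDF parser: it assumes a simple font whose codes match Latin-1, handles only the Tj operator (no TJ arrays), and ignores escaped or nested parentheses in literal strings. The names `TJ_PATTERN` and `extract_simple_text` are made up for this example.

```python
import re

# A hex string <...> or a plain literal string (...) directly followed by Tj.
TJ_PATTERN = re.compile(rb"<([0-9A-Fa-f\s]*)>\s*Tj\b|\(([^()\\]*)\)\s*Tj\b")


def extract_simple_text(stream: bytes) -> str:
    """Collect the strings shown by Tj, assuming glyph number = Unicode."""
    parts = []
    for match in TJ_PATTERN.finditer(stream):
        hex_body, literal = match.group(1), match.group(2)
        if hex_body is not None:
            digits = re.sub(rb"\s", b"", hex_body).decode("ascii")
            if len(digits) % 2:  # PDF treats a missing final digit as 0
                digits += "0"
            parts.append(bytes.fromhex(digits).decode("latin-1"))
        else:
            parts.append(literal.decode("latin-1"))
    return "".join(parts)


stream = b"BT\n1 0 0 1 0 0 Tm\n/F1 23.00093 Tf\n27.60112 TL\n0.99996 0.009 -0.009 0.99996 1110 1326 Tm\n0 0 Td\n135.997 Tz\n<702e20> Tj\n47 0 Td\n138.3741 Tz\n<3234382e20> Tj\nET"
print(repr(extract_simple_text(stream)))
```

This ignores positioning operators such as Td and Tm entirely, so it returns the strings in stream order rather than in reading order on the page.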
