Is there a way to turn the bytes object below into a useful type in PyMuPDF and extract text from it?
Answered by
JorjMcKie
Nov 9, 2023
Interesting. You could try this:

import fitz
doc = fitz.open()
page = doc.new_page()
page.insert_font()  # register the default font (Helvetica, "helv") on the page
xref = doc.get_new_xref() # create a new xref
doc.update_object(xref, "<<>>") # make it a PDF dictionary
# your bytes string
stream = b"BT\n1 0 0 1 0 0 Tm\n/F1 23.00093 Tf\n27.60112 TL\n0.99996 0.009 -0.009 0.99996 1110 1326 Tm\n0 0 Td\n135.997 Tz\n<702e20> Tj\n47 0 Td\n138.3741 Tz\n<3234382e20> Tj\nET"
# replace the fontname by Helvetica
stream = stream.replace(b"/F1", b"/helv")  # change the font name to Helvetica's standard name
doc.update_stream(xref, stream) # insert into our new object
page.set_contents(xref) # define this to be the page's /Contents
text = page.get_text(clip=fitz.INFINITE_RECT())  # extract the text wherever it has been written
print(text)

This delivers "p. 248." It works because the original (unknown) font F1 is a "simple" font, which means that the glyph number equals the Unicode code point. In more complex cases, the content in "<>" brackets only contains glyph numbers, for which you need the font file to translate them back to Unicode.
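
To illustrate the "simple font" point without PyMuPDF: below is a minimal sketch that reads the hex strings in "<>" brackets directly and treats every byte as a character code. The helper name `decode_hex_strings` is made up for this example, and the one-byte-per-character, Latin-1 mapping is an assumption that only holds for simple fonts like the one in this stream. It also only looks at the `Tj` operator, not `TJ` arrays or literal "( )" strings.

```python
import re


def decode_hex_strings(stream: bytes) -> str:
    """Decode <...> Tj hex strings, assuming one byte per character.

    Hypothetical helper: only valid when the glyph code equals the
    character code, as with the simple font discussed above.
    """
    parts = []
    # Find each hex string that is shown with the Tj operator.
    for match in re.finditer(rb"<([0-9A-Fa-f\s]*)>\s*Tj", stream):
        # Whitespace inside a PDF hex string is ignored.
        hex_digits = re.sub(rb"\s", b"", match.group(1)).decode("ascii")
        # An odd number of digits is padded with a trailing 0.
        if len(hex_digits) % 2:
            hex_digits += "0"
        parts.append(bytes.fromhex(hex_digits).decode("latin-1"))
    return "".join(parts)


stream = (
    b"BT\n1 0 0 1 0 0 Tm\n/F1 23.00093 Tf\n27.60112 TL\n"
    b"0.99996 0.009 -0.009 0.99996 1110 1326 Tm\n0 0 Td\n135.997 Tz\n"
    b"<702e20> Tj\n47 0 Td\n138.3741 Tz\n<3234382e20> Tj\nET"
)
print(repr(decode_hex_strings(stream)))
```

For fonts with a custom encoding or multi-byte glyph IDs this shortcut would produce garbage, which is where the font file (or its ToUnicode mapping) is needed.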
Answer selected by
Mark-Joy