Better ways to manipulate PDF raw object? IMPLEMENTED #841
Replies: 15 comments
-
I have thought about that a couple of times. It is actually not all too difficult. Obviously, such a function would have to be recursive.
>>> print(doc.xref_object(1))
<<
/Lang (de-DE) # '()' indicates a string, '<>' a hex string
/Metadata 2 0 R # 'nnn 0 R' is a pointer to xref nnn
/OutputIntents [ <<
/DestOutputProfile 14 0 R
/Info (U.S. Web Coated \(SWOP\) v2)
/OutputConditionIdentifier (CGATS TR 001)
/RegistryName (http://www.color.org)
/S /GTS_PDFX
/Type /OutputIntent
>> ]
/Pages 3 0 R # resolve this???
/Type /Catalog
/ViewerPreferences <<
/Direction /L2R # '/L2R' is a name, not a string!
>>
>>
>>>
Once a way is found to convert this to a nested Python dict, the truly interesting part is converting such a beast back to a correct (!) PDF object definition string.
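A minimal sketch of such a recursive conversion, in both directions, might look like the following. Everything in it (`parse_pdf_object`, `dump_pdf_object`, the `Ref`, `Name` and `HexString` helper types) is hypothetical and not part of PyMuPDF. It assumes clean PDF object syntax as input, so the `#` annotations in the listing above would have to be removed first, and it keeps string escapes exactly as written instead of decoding them.

```python
import re
from typing import NamedTuple


class Ref(NamedTuple):
    """Indirect reference such as '3 0 R' (hypothetical helper type)."""
    num: int
    gen: int


class Name(str):
    """PDF name such as /Catalog, stored without the leading slash."""


class HexString(str):
    """Hex string such as <48656C>, stored without the angle brackets."""


TOKEN = re.compile(r"""\s*(?:
      (?P<ref>\d+\s+\d+\s+R\b)
    | (?P<dopen><<) | (?P<dclose>>>)
    | (?P<aopen>\[) | (?P<aclose>\])
    | (?P<name>/[^\s/\[\]<>()]*)
    | (?P<hex><[0-9A-Fa-f\s]*>)
    | (?P<num>[+-]?(?:\d+\.?\d*|\.\d+))
    | (?P<word>true\b|false\b|null\b)
    | (?P<str>\()
    )""", re.VERBOSE)


def _scan_string(text, pos):
    """Read a literal string body; pos points just behind the opening '('."""
    depth, start = 1, pos
    while pos < len(text):
        char = text[pos]
        if char == "\\":          # skip the escaped character
            pos += 2
            continue
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth == 0:
                return text[start:pos], pos + 1
        pos += 1
    raise ValueError("unterminated string")


def _parse_value(text, pos):
    match = TOKEN.match(text, pos)
    if match is None:
        raise ValueError(f"cannot parse PDF object at offset {pos}")
    kind, token, pos = match.lastgroup, match.group(match.lastgroup), match.end()
    if kind == "ref":
        num, gen, _ = token.split()
        return Ref(int(num), int(gen)), pos
    if kind == "name":
        return Name(token[1:]), pos
    if kind == "hex":
        return HexString(token[1:-1]), pos
    if kind == "num":
        return (float(token) if "." in token else int(token)), pos
    if kind == "word":
        return {"true": True, "false": False, "null": None}[token], pos
    if kind == "str":
        return _scan_string(text, pos)
    if kind == "aopen":           # array: parse items until ']'
        items = []
        while True:
            match = TOKEN.match(text, pos)
            if match and match.lastgroup == "aclose":
                return items, match.end()
            item, pos = _parse_value(text, pos)
            items.append(item)
    if kind == "dopen":           # dict: parse key/value pairs until '>>'
        result = {}
        while True:
            match = TOKEN.match(text, pos)
            if match and match.lastgroup == "dclose":
                return result, match.end()
            if match is None or match.lastgroup != "name":
                raise ValueError(f"expected a dictionary key at offset {pos}")
            value, pos = _parse_value(text, match.end())
            result[match.group("name")[1:]] = value
    raise ValueError(f"unexpected '{token}' before offset {pos}")


def parse_pdf_object(text):
    """Convert a PDF object definition string to nested Python objects."""
    value, pos = _parse_value(text, 0)
    if text[pos:].strip():
        raise ValueError(f"trailing data at offset {pos}")
    return value


def dump_pdf_object(value):
    """Convert the nested Python objects back to PDF object syntax."""
    if value is None:
        return "null"
    if isinstance(value, bool):   # must come before the int check
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)        # caveat: very large floats would need fixed notation
    if isinstance(value, Ref):
        return f"{value.num} {value.gen} R"
    if isinstance(value, Name):
        return "/" + value
    if isinstance(value, HexString):
        return "<" + value + ">"
    if isinstance(value, str):
        return "(" + value + ")"
    if isinstance(value, list):
        return "[" + " ".join(dump_pdf_object(v) for v in value) + "]"
    if isinstance(value, dict):
        return "<<" + " ".join(f"/{k} {dump_pdf_object(v)}" for k, v in value.items()) + ">>"
    raise TypeError(f"unsupported type {type(value).__name__}")
```

Keeping names, strings and references as distinct types is what allows the way back: a plain Python `str` would no longer tell whether `/L2R` or `(L2R)` was meant.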
-
Another approach could be to be a bit less ambitious.
-
What about handling only the string type? That should be easier. So [...] Even that would still simplify many simple tasks (at the moment, [...]). It will still be more verbose than pdfrw's notation in most cases. One of the advantages of PyMuPDF is its performance (most things are handled in C). Converting everything to Python would be slow.
-
Can you do me a favor and put this issue under "Discussions"? This would better reflect what we are actually doing here and hopefully also motivate others to join in. It would probably be fairly easy to make get/set functions for strings in object definitions ([...]).
-
Regarding the discussion: you should be able to do it... See [...]
Yes. That's probably easier (but it also depends on the C API of MuPDF); it's already in string format.
-
I will look into implementing a "GET" C function which does the following:
The "SET" C function would go like this:
-
This is what I have in terms of reading out the PDF keys of a given xref. I made 2 methods:
Together with page xref = [...]:
>>> for key in doc.xref_get_keys(doc.page_xref(0)):
...     print(key, "=", doc.xref_get_key(doc.page_xref(0), key))
...
Type = ('name', '/Page')
Annots = ('array', '[10 0 R]')
Parent = ('xref', '4 0 R')
Rotate = ('int', '0')
Contents = ('xref', '11 0 R')
MediaBox = ('array', '[0 0 595.32 841.92]')
Resources = ('dict', '<</Font<</R8 12 0 R/R10 13 0 R/R12 14 0 R/R14 15 0 R/R17 16 0 R/R20 17 0 R/R23 18 0 R/R27 19 0 R>>/ProcSet[/PDF/Text]/ExtGState<</R7 20 0 R>>>>')
Specifying PDF path hierarchies is also possible:
>>> doc.xref_get_key(6, "Resources/Font")
('dict', '<</R8 12 0 R/R10 13 0 R/R12 14 0 R/R14 15 0 R/R17 16 0 R/R20 17 0 R/R23 18 0 R/R27 19 0 R>>')
# and:
>>> doc.xref_get_key(6, "Resources/Font/R8")
('xref', '12 0 R')
so that a certain degree of recursion is also available.
>>> for key in doc.xref_get_keys(12):
...     print(key, "=", doc.xref_get_key(12, key))
...
Type = ('name', '/Font')
Widths = ('array', '[226 0 0 0 0 0 0 0 0 0 0 0 0 0 267 0 507 507 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 529 630 488 459 0 0 267 0 0 423 874 0 676 532 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 494 0 418 537 503 316 474 537 246 0 480 246 813 537 538 537 0 355 399 347 537 473 0 459 474]')
Subtype = ('name', '/TrueType')
BaseFont = ('name', '/FNUUTH+Calibri-Bold')
LastChar = ('int', '121')
FirstChar = ('int', '32')
ToUnicode = ('xref', '25 0 R')
FontDescriptor = ('xref', '26 0 R')
>>>
The opposite direction, updating one of those keys, will follow over the weekend, I think.
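As a small illustration of how the two methods shown above could be combined, here is a hedged sketch. The helpers `xref_to_dict` and `follow_reference` are hypothetical names, not part of PyMuPDF. They only assume that `doc` offers `xref_get_keys(xref)` and `xref_get_key(xref, key)` with the return shapes seen in the session above.

```python
def xref_to_dict(doc, xref):
    """Collect the top-level keys of one PDF object as {key: (type, value)}."""
    return {key: doc.xref_get_key(xref, key) for key in doc.xref_get_keys(xref)}


def follow_reference(doc, xref, path):
    """Return the xref number that a key path points to, or None.

    'path' may be a single key or a hierarchy like "Resources/Font/R8".
    Only values reported with type 'xref' (for example '12 0 R') are followed.
    """
    kind, value = doc.xref_get_key(xref, path)
    if kind != "xref":
        return None
    return int(value.split()[0])
```

With the page object from the session above, `follow_reference(doc, 6, "Resources/Font/R8")` would be expected to give 12, and `xref_to_dict(doc, 12)` would then list the font's keys.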
-
I have a first version of [...]
Please let me know if you can join in testing it. I will generate OSX and Linux wheels. If you need Windows, drop me a note.
-
Linux and OSX wheels can be found in branches of this repo: https://github.com/pymupdf/PyMuPDF-wheels
-
[...] causes a segmentation fault (an attempt to set an element of an array as if it were a dict).
By the way, since it's possible to get and set an inner dict, what about allowing the keys of an inner dict to be listed as well? What about setting from a PDF object given as a string too (without putting it in a document first)? Or setting an item in an array?
It adds another key [...] Or even [...]
This could be because of MuPDF's API, so one option is to warn the user to sanitize/escape the input before passing it to the function.
-
First of all, thank you very much for testing this!
This set of functions is meant for experts who know what they are doing. It should not be used by anyone who is unclear about how PDF object syntax works. It is a help for editing PDF object syntax - no more, no less. It is still basically string manipulation.
The basic inner mechanisms rely on "key" being a real PDF key. If a PDF path "key1/key2/key3..." is specified, all its components must be PDF dictionary keys to deliver anything other than "null". The set function executes these steps:
I need to check out your segfault. I hope it's not caused by MuPDF! That would be bad, because it most probably couldn't be fixed. I should probably also make sure that only numbers, letters and "/" are contained in the "key" argument.
Exactly. Useful - maybe for scrubbing off potentially confidential data like page thumbnails. Also, there have been multiple requests to remove image and font references from page definitions.
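The key check mentioned above (only numbers, letters and "/") could be sketched as follows. `check_key_path` is a hypothetical helper, not the actual PyMuPDF implementation. It additionally assumes that empty path components such as "a//b" or a leading "/" should be rejected.

```python
import re

# One or more alphanumeric components, separated by single slashes.
_KEY_PATH = re.compile(r"[A-Za-z0-9]+(?:/[A-Za-z0-9]+)*")


def check_key_path(key):
    """Validate a key or key path and return its components as a list."""
    if not isinstance(key, str) or _KEY_PATH.fullmatch(key) is None:
        raise ValueError(f"bad PDF key or key path: {key!r}")
    return key.split("/")
```

A call like `check_key_path("Resources/Font/R8")` returns the three components, while input such as `"A>>/B"` raises before any PDF object is touched.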
-
Parsing the string with Python isn't really reliable...
import fitz
doc=fitz.Document()
xref=doc.get_new_xref()
doc.update_object(xref, "<< /B << /A [] >> >>")
doc.xref_set_key(xref, "A", "2")
print(doc.xref_object(xref, compressed=1))
# wrong: (currently)
# <</B<</A 2>>/A[]>>
#
# correct:
# <</B<</A[]>>/A 2>>
You can add [...] Looks like there's [...]
-
True, but this can be healed: it all depends on inserting an intermediate value that is (1) recognizable, and (2) improbable enough not to already be part of the PDF.
I have also found the reason for your segfault: it comes from illegally trying to insert a new subdict key [...]
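One possible reading of the intermediate-value idea, as a sketch only: the key is first set, through the structured dictionary API, to a string placeholder that cannot occur anywhere else in the object, and only afterwards is that placeholder swapped for the desired value text in the object source. Both helper names are hypothetical, and this sketch covers only the string-handling part, not the C-level insertion.

```python
import uuid


def pick_placeholder(object_source):
    """Return a marker string that does not occur in the object source."""
    while True:
        candidate = "placeholder-" + uuid.uuid4().hex
        if candidate not in object_source:
            return candidate


def swap_placeholder(source_with_placeholder, placeholder, value):
    """Replace the placeholder string object by the final PDF value text.

    Assumes the placeholder was stored as a literal string, so it shows up
    as '(placeholder)' exactly once. The leading space keeps '/A' and '2'
    from merging into the name '/A2' in compressed output.
    """
    token = "(" + placeholder + ")"
    if source_with_placeholder.count(token) != 1:
        raise ValueError("placeholder not found exactly once")
    return source_with_placeholder.replace(token, " " + value, 1)
```

Because the placeholder is located by its unique text and not by its key name, a nested `/A` inside `/B` is left alone, which is the failure shown in the example above.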
-
Also found a serious error: If [...]
As part of the housekeeping, I now also exclude empty strings for key and value.
-
I will add the above to the coming version.
-
Is your feature request related to a problem? Please describe.
Currently it's necessary to manually manipulate the dict with string operations to modify the objects given the xref (get the string with xrefObject, modify it with regex/something, put it back with updateObject). The library could provide functions for that.
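For illustration, the manual workflow described above might look roughly like this. `set_int_key_by_regex` is a hypothetical helper. It assumes `doc` offers `xref_object(xref, compressed=True)` and `update_object(xref, text)`, the snake_case names used elsewhere in this thread for xrefObject and updateObject.

```python
import re


def set_int_key_by_regex(doc, xref, key, number):
    """Overwrite an existing integer value of 'key' by editing the source text.

    Fragile on purpose: the pattern has no notion of nesting, so it hits the
    first '/key <int>' it finds, even inside a sub-dictionary, and it would
    also clobber the first number of a reference like '/Parent 4 0 R'.
    """
    source = doc.xref_object(xref, compressed=True)
    pattern = re.compile(r"/%s\s+-?\d+" % re.escape(key))
    new_source, count = pattern.subn(
        lambda match: "/%s %d" % (key, number), source, count=1
    )
    if count == 0:
        raise KeyError(key)
    doc.update_object(xref, new_source)
    return new_source
```

The caveats in the docstring are exactly why dedicated get/set functions are preferable to this kind of text surgery.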
Describe the solution you'd like
Perhaps something like Document.{add,set,del}_object_dict(xref: int, key: str[, value: Any]); or make some function to convert the PDF object string to a Python object and back.
Describe alternatives you've considered
Additional context
(no context.)