Better ways to manipulate PDF raw object? IMPLEMENTED #841

user202729 · 2021-01-16T08:42:03Z

user202729
Jan 16, 2021

Is your feature request related to a problem? Please describe.
Currently, modifying an object given its xref means manipulating its dictionary source manually with string operations: get the string with xrefObject, modify it with a regex or similar, and put it back with updateObject.

The library could provide functions for that.
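
For illustration, here is a rough sketch of the kind of string surgery meant above. It works on the object source as plain text; in PyMuPDF that text would be read and written back with the object functions mentioned above. The helper name `replace_scalar` and its regex are illustrative assumptions, and the approach only copes with one-token values separated from the key by whitespace.

```python
import re


def replace_scalar(obj_source: str, key: str, new_value: str) -> str:
    """Replace the one-token value of /key in a PDF object source string.

    Hypothetical helper: it handles numbers, names and true/false/null only.
    Arrays, dicts and strings would need real bracket matching, and the
    first textual match wins even if it sits in a nested dictionary.
    """
    pattern = re.compile(r"(/" + re.escape(key) + r")\s+(/?[^\s/<>\[\]()]+)")
    new_source, count = pattern.subn(
        lambda m: f"{m.group(1)} {new_value}", obj_source, count=1
    )
    if count == 0:
        raise KeyError(key)
    return new_source


source = "<< /Type /Page /Rotate 0 /MediaBox [0 0 595.32 841.92] >>"
print(replace_scalar(source, "Rotate", "90"))
```

Even this small case already needs care with delimiters, which is why a library-level function would help.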

Describe the solution you'd like
Perhaps something like Document.{add,set,del}_object_dict(xref: int, key: str[, value: Any]); or make some function to convert the PDF object string to Python object and back.

Describe alternatives you've considered

Additional context
(no context.)

JorjMcKie · 2021-01-16T09:01:32Z

JorjMcKie
Jan 16, 2021
Maintainer

I have thought about that a couple of times. Actually, it is not all that difficult. Obviously, such a function would have to be recursive.
But the devil is in the details. Here is a PDF catalog dictionary:

>>> print(doc.xref_object(1))
<<
  /Lang (de-DE)  # '()' indicates a string, '<>' a hex string
  /Metadata 2 0 R  # 'nnn 0 R' is a pointer to xref nnn
  /OutputIntents [ <<
        /DestOutputProfile 14 0 R
        /Info (U.S. Web Coated \(SWOP\) v2)
        /OutputConditionIdentifier (CGATS TR 001)
        /RegistryName (http://www.color.org)
        /S /GTS_PDFX
        /Type /OutputIntent
      >> ]
  /Pages 3 0 R  # resolve this???
  /Type /Catalog
  /ViewerPreferences <<
    /Direction /L2R  # '/L2R' is a name, not a string!
  >>
>>
>>>

Once a way has been found to convert this to a nested Python dict, the truly interesting part is converting such a beast back to a correct (!) PDF object definition string.
Doable, but a lot of work.
Why don't you try and contribute something?
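
To make the recursion concrete, here is a minimal sketch of the reading direction only. It is not part of PyMuPDF: the names `TOKEN`, `tokenize` and `parse` are made up for this example, scalars are kept as `(kind, text)` tuples so that a name like /L2R stays distinguishable from a string, and literal strings with unescaped nested parentheses are not handled.

```python
import re

# One alternative per token kind; "<<" must be tried before a hex string "<...>".
TOKEN = re.compile(r"""
    (?P<dict_open><<) | (?P<dict_close>>>) |
    (?P<arr_open>\[)  | (?P<arr_close>\])  |
    (?P<ref>\d+\s+\d+\s+R\b) |
    (?P<name>/[^\s/\[\]<>()]*) |
    (?P<string>\((?:\\.|[^\\()])*\)) |
    (?P<hex><[0-9A-Fa-f\s]*>) |
    (?P<number>[+-]?(?:\d+\.?\d*|\.\d+)) |
    (?P<word>true|false|null)
""", re.VERBOSE)


def tokenize(src: str) -> list:
    """Split a PDF object source string into (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(src):
        if src[pos].isspace():
            pos += 1
            continue
        match = TOKEN.match(src, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse(tokens: list, i: int = 0):
    """Return (value, next_index); dicts and arrays recurse into parse()."""
    kind, text = tokens[i]
    if kind == "dict_open":
        result, i = {}, i + 1
        while tokens[i][0] != "dict_close":
            key_kind, key = tokens[i]
            if key_kind != "name":
                raise ValueError(f"dictionary key expected, got {key!r}")
            result[key], i = parse(tokens, i + 1)
        return result, i + 1
    if kind == "arr_open":
        items, i = [], i + 1
        while tokens[i][0] != "arr_close":
            value, i = parse(tokens, i)
            items.append(value)
        return items, i + 1
    return (kind, text), i + 1  # scalar: keep the type next to the raw text


src = (r"<< /Lang (de-DE) /Metadata 2 0 R /OutputIntents [ << /S /GTS_PDFX "
       r"/Info (U.S. Web Coated \(SWOP\) v2) >> ] "
       r"/ViewerPreferences << /Direction /L2R >> >>")
tree, _ = parse(tokenize(src))
print(tree["/ViewerPreferences"])
```

Whether indirect references such as `3 0 R` should be resolved is deliberately left open here; the sketch just reports them as `ref` tokens.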


JorjMcKie · 2021-01-16T09:22:53Z

JorjMcKie
Jan 16, 2021
Maintainer

Another approach could be to be a bit less ambitious:

- have a list of often-used PDF object keys, e.g. /Rect, /Matrix, /BBox, /Type, ...
- provide get / set functions for those


user202729 · 2021-01-16T09:43:14Z

user202729
Jan 16, 2021
Author

What about handling only string type? That should be easier.

So Document.get_dict(xref, key) will return the string representation, and there's Document.set_dict(xref, key, value: str).

Even that would still simplify many simple tasks (at the moment, <...>.OutputIntents=[] (just as an example, might not result in a valid PDF) would involve parsing the indentation or the brackets to determine where the list ends)

It will still be more verbose than pdfrw's notation in most cases.

One of the advantages of PyMuPDF is its performance (most things are handled in C). Converting everything to Python would be slow.


JorjMcKie · 2021-01-16T10:12:20Z

JorjMcKie
Jan 16, 2021
Maintainer

Can you do me a favor and put this issue under "Discussions"? This would better reflect what we actually are doing here and hopefully also motivate others to join in.

It would probably be fairly easy to make get/set functions for strings in object definitions (OutputIntents, by the way, is not a string). In the above example, only /Lang is a string.
Maybe you meant getting / setting any given xref key as a string?


user202729 · 2021-01-16T10:31:12Z

user202729
Jan 16, 2021
Author

Regarding discussion: You should be able to do it... See https://github.com/github/feedback/discussions/2952 (formatted as code to avoid cross-reference)

Yes. That's probably easier (but also depends on the C API of mupdf), it's already in string format.


JorjMcKie · 2021-01-16T11:29:16Z

JorjMcKie
Jan 16, 2021
Maintainer

The feature for converting an issue to a discussion has been removed - don't know why. Anyway.

I will try to implement a "GET" C function which does the following:

doc.xref_get_key(xref: int, key: str). It returns a tuple (type, content), where content is the PDF key's content formatted as a string in compressed (?) format, and type is the PDF object type of the key, i.e. one of: "bool", "int", "float", "str", "name", "array", "dict", "null". The type helps to interpret the content but, more importantly, is necessary for setting the value.
If a non-existent key is specified, either an exception could be raised or ("null", "") be returned.
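
As a rough illustration of what the type tag conveys, here is a small heuristic that guesses the type from the content string alone. This is only a sketch: the function name `classify_pdf_value` is made up, the C function would take the type from the parsed object rather than guess it from text, and the extra label "xref" for indirect references (`n 0 R`) is an assumption of this example.

```python
import re


def classify_pdf_value(text: str) -> str:
    """Guess the PDF object type of a value given as source text (heuristic)."""
    text = text.strip()
    if text in ("true", "false"):
        return "bool"
    if text == "null":
        return "null"
    if re.fullmatch(r"\d+\s+\d+\s+R", text):
        return "xref"  # indirect reference, e.g. "2 0 R"
    if re.fullmatch(r"[+-]?\d+", text):
        return "int"
    if re.fullmatch(r"[+-]?(\d+\.\d*|\.\d+)", text):
        return "float"
    if text.startswith("<<"):  # must be tested before the hex string "<...>"
        return "dict"
    if text.startswith("["):
        return "array"
    if text.startswith("/"):
        return "name"
    if text.startswith("(") or text.startswith("<"):
        return "str"
    raise ValueError(f"cannot classify {text!r}")


for sample in ["(de-DE)", "2 0 R", "/Catalog", "[0 0 595.32 841.92]", "<</Direction/L2R>>"]:
    print(sample, "->", classify_pdf_value(sample))
```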

The "SET" C function would go like this:

doc.xref_set_key(xref:int, key:str, type:str, content:str). Or maybe the last two again as a tuple (type, content). This would internally be executed as follows:
- replace any previously existing /key by its empty equivalent (or maybe /key null ?). Like for an array /key [] would be inserted in the xref object.
- internally make a string representation of the resulting xref object
- replace "/key []" by "/key [%s]" % content
- update xref object.


JorjMcKie · 2021-01-16T18:24:50Z

JorjMcKie
Jan 16, 2021
Maintainer

This is what I have in terms of reading out PDF keys of a given xref. I made 2 methods:

doc.xref_get_keys(xref) lists the dictionary keys of the xref
doc.xref_get_key(xref, key) returns the above-mentioned tuple

Together with page xref = doc.page_xref(pno) for pno=page.number, this is some example outcome for page 0 of some document:

>>> for key in doc.xref_get_keys(doc.page_xref(0)):
	print(key, "=", doc.xref_get_key(doc.page_xref(0), key))

	
Type = ('name', '/Page')
Annots = ('array', '[10 0 R]')
Parent = ('xref', '4 0 R')
Rotate = ('int', '0')
Contents = ('xref', '11 0 R')
MediaBox = ('array', '[0 0 595.32 841.92]')
Resources = ('dict', '<</Font<</R8 12 0 R/R10 13 0 R/R12 14 0 R/R14 15 0 R/R17 16 0 R/R20 17 0 R/R23 18 0 R/R27 19 0 R>>/ProcSet[/PDF/Text]/ExtGState<</R7 20 0 R>>>>')

Specifying PDF path hierarchies is also possible:

>>> doc.xref_get_key(6, "Resources/Font")
('dict', '<</R8 12 0 R/R10 13 0 R/R12 14 0 R/R14 15 0 R/R17 16 0 R/R20 17 0 R/R23 18 0 R/R27 19 0 R>>')
# and:
>>> doc.xref_get_key(6, "Resources/Font/R8")
('xref', '12 0 R')

so that a certain degree of recursion is also available.
When I return type = "xref", it actually means "indirect" object. Extracting the xref number allows following that chain like this:

>>> for key in doc.xref_get_keys(12):
	print(key, "=", doc.xref_get_key(12, key))

	
Type = ('name', '/Font')
Widths = ('array', '[226 0 0 0 0 0 0 0 0 0 0 0 0 0 267 0 507 507 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 529 630 488 459 0 0 267 0 0 423 874 0 676 532 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 494 0 418 537 503 316 474 537 246 0 480 246 813 537 538 537 0 355 399 347 537 473 0 459 474]')
Subtype = ('name', '/TrueType')
BaseFont = ('name', '/FNUUTH+Calibri-Bold')
LastChar = ('int', '121')
FirstChar = ('int', '32')
ToUnicode = ('xref', '25 0 R')
FontDescriptor = ('xref', '26 0 R')
>>>
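
Extracting the xref number from a returned "xref" content string can be sketched as follows. The helper names `xref_number` and `follow` are made up for this example; `follow` only assumes that `doc` offers `xref_get_key(xref, key)` returning a `(type, content)` tuple as shown above.

```python
import re


def xref_number(reference: str) -> int:
    """Return nnn from an indirect reference string of the form 'nnn g R'."""
    match = re.fullmatch(r"\s*(\d+)\s+(\d+)\s+R\s*", reference)
    if match is None:
        raise ValueError(f"not an indirect reference: {reference!r}")
    return int(match.group(1))


def follow(doc, xref: int, key: str):
    """Return the xref number that key points to, or None if it is not indirect."""
    kind, content = doc.xref_get_key(xref, key)
    return xref_number(content) if kind == "xref" else None


print(xref_number("12 0 R"))  # prints 12
```

With a real document, `follow(doc, 6, "Resources/Font/R8")` would then give the number to pass to `doc.xref_get_keys`.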

The opposite direction - updating one of those keys - will follow over the weekend I think.


JorjMcKie · 2021-01-17T14:26:05Z

JorjMcKie
Jan 17, 2021
Maintainer

I have a first version of doc.xref_set_key(xref, key, value).
It works as follows:

key: a PDF key like "Resources" or "Matrix", or a PDF key path like "Resources/ExtGState". The leading "/" must always be omitted. Other slashes indicate a hierarchy of objects.
value: always a string. Depending on the desired object type, different rules apply for formatting the string, here is a preliminary list:
- xref -- must be provided as "nnn 0 R" with a valid xref number of the PDF.
- array -- a string like "[a b c d e f ...]". The brackets are required, array members must be separated by at least one space. An empty array "[]" is possible, and equivalent to removing the key.
- dict -- a string like "<< ... >>". The brackets are required and must enclose a valid PDF dictionary definition. An empty dictionary "<<>>" is possible and equivalent to removing the key.
- int -- an integer formatted as a string.
- float -- a float formatted as a string. Scientific notation (with exponents) is not supported.
- null -- the string "null". This effectively will remove the key.
- bool -- one of the strings "true" or "false".
- name -- a valid PDF name with a leading slash: "/PageLayout".
- string -- a valid PDF string. Depending on content, it is enclosed in bracket types "(...)" or "<...>", and reserved PDF characters must be escaped. If in doubt, we strongly recommend using fitz.getPDFstr! This function automatically determines the required format.
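
To show what the two string formats look like, here is a simplified stand-in. It is not fitz.getPDFstr and may differ from it in detail; the name `pdf_string` is made up. The assumptions are that printable ASCII text goes into "(...)" with backslash and parentheses escaped, and everything else into a hex string of UTF-16BE bytes with a leading byte order mark.

```python
def pdf_string(text: str) -> str:
    """Format Python text as a PDF string value (simplified, hypothetical helper)."""
    if all(32 <= ord(ch) < 127 for ch in text):
        # Literal string: escape the backslash first, then both parentheses.
        escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return f"({escaped})"
    # Hex string: FEFF marks the bytes as UTF-16 big endian text.
    return "<FEFF" + text.encode("utf-16-be").hex().upper() + ">"


print(pdf_string("U.S. Web Coated (SWOP) v2"))
print(pdf_string("Größe"))
```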

Please let me know if you can join testing it. I will generate OSX and Linux wheels. If you need Windows, drop me a note.


JorjMcKie · 2021-01-17T15:05:37Z

JorjMcKie
Jan 17, 2021
Maintainer

Linux and OSX wheels can be found in branches of this repo: https://github.com/pymupdf/PyMuPDF-wheels


user202729 · 2021-01-18T06:30:15Z

user202729
Jan 18, 2021
Author

This code

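import faulthandler
from pathlib import Path

import fitz

faulthandler.enable()  # dump a Python traceback if the interpreter crashes
folderWithPDF = Path(".")  # placeholder: the directory that contains the PDF
doc = fitz.Document(filename=folderWithPDF / "PDF32000_2008.pdf")
doc.xref_set_key(1, "OpenAction/0", "")  # this call triggers the crash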

causes a segmentation fault. (try to set an element of an array as a dict)

By the way, since it's possible to get and set inner dict, what about allowing listing inner dict as well? What about setting from a PDF object from a string too? (without putting it on a document first) Or set an item in an array?

An empty dictionary "<<>>" is possible and equivalent to removing the key.

Documentation error? The key isn't removed.

Update: Okay, "equivalent to removing the key". That means "all PDF readers will not care about it"?

Even garbage=2, clean=1 doesn't remove the key. Is it ever useful for someone to want to remove a key?
I tried something else like this

doc.xref_set_key(1, "MarkInfo/Marked", "1 /Other 2")

It adds another key /Other into the dict (it doesn't look very type-safe...)

Or even

doc.xref_set_key(1, "MarkInfo/Marked", "1 >> /Other << ")

This could be because of MuPDF's API, so it's possible to warn the user to sanitize/escape the input before passing to the function.


JorjMcKie · 2021-01-18T08:55:28Z

JorjMcKie
Jan 18, 2021
Maintainer

First of all thank you very much for testing this!

This set of functions is meant for experts who know what they are doing. It should not be used by anyone who is unclear about how PDF object syntax works.
It will always be possible to render a PDF unusable by using them, because of either syntax or semantic errors. MuPDF does its own syntax checking before accepting the changes, so I will not even try to insert my own syntax validation.

It is a help to edit PDF object syntax - not more, not less. It is still string manipulation basically.

The basic inner mechanisms rely on "key" being a real PDF key. If a PDF path "key1/key2/key3..." is specified, all its components must be PDF dictionary keys to deliver anything other than "null".
This is also the reason, why getting / setting single array items is not possible.

The set function executes these steps:

- With the given "key" argument, insert key[] (empty array) into the xref object. This will remove any previous "key" value.
- Read the resulting xref object as a string => "objstring".
- Python split("/") the given key, take the last item of the resulting list and execute objstring.replace("/item[]", "/item value") with the "value" argument of the function.
- Hand the resulting objstring over to MuPDF and request replacing the xref content by this new object definition. MuPDF will reject it and raise an exception if it finds syntax problems at this point. But at present, in this case the object will remain changed because of the empty array insertion! ... Maybe I should add my own recovery code here ...

I need to check out your segfault. I hope it's not caused by MuPDF! That would be bad because it most probably can't be healed.

I should probably also make sure, that only numbers, letters and "/" are contained in the "key" argument.
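
Such a check could look roughly like this. It is a sketch only: the name `is_plain_key_path` is made up, and the rule is deliberately stricter than PDF name syntax, which allows further characters.

```python
import re

# One or more path components made of letters and digits, separated by single
# slashes; no leading or trailing slash and no empty component.
_KEY_PATH = re.compile(r"[A-Za-z0-9]+(?:/[A-Za-z0-9]+)*")


def is_plain_key_path(key: str) -> bool:
    """Return True if key contains only letters, digits and separating slashes."""
    return _KEY_PATH.fullmatch(key) is not None


for candidate in ["Resources/ExtGState", "/Resources", "MarkInfo/Marked >> /Other"]:
    print(repr(candidate), is_plain_key_path(candidate))
```

Note that this only guards the "key" argument; a "value" such as "1 /Other 2" would still pass through unchanged.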

Update: Okay, "equivalent to removing the key". That means "all PDF readers will not care about it"?

Exactly. Useful - maybe for scrubbing off potentially confidential data like page thumbnails. Also there have been multiple requests to remove image and font references from page definitions.


user202729 · 2021-01-18T10:40:50Z

user202729
Jan 18, 2021
Author

Parsing the string with Python isn't really reliable...

import fitz

doc=fitz.Document()

xref=doc.get_new_xref()

doc.update_object(xref, "<< /B << /A [] >> >>")

doc.xref_set_key(xref, "A", "2")
print(doc.xref_object(xref, compressed=1))
# wrong: (currently)
# <</B<</A 2>>/A[]>>
#
# correct:
# <</B<</A[]>>/A 2>>

You can add /key [] to both the beginning and the end of the string then replace the first occurrence, but that sounds like an ugly workaround (and is the key order even guaranteed by MuPDF?)
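
The failure can be reproduced without PyMuPDF on the plain string. This sketch assumes that the object source after the empty-array insertion is `<</B<</A[]>>/A[]>>` (inferred from the output above) and that only the first match is replaced; `naive_set` is a made-up name for that replacement step.

```python
def naive_set(objstring: str, key: str, value: str) -> str:
    """Replace the first '/item[]' placeholder, as the string-based step does."""
    item = key.split("/")[-1]
    return objstring.replace(f"/{item}[]", f"/{item} {value}", 1)


after_placeholder = "<</B<</A[]>>/A[]>>"
print(naive_set(after_placeholder, "A", "2"))  # prints <</B<</A 2>>/A[]>>
```

The nested /A in /B happens to come first in the text, so it is the one that gets changed.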

Looks like there's pdf_dict_get_<type> in MuPDF, but it's necessary to handle a lot of types (there are 11 functions that start with pdf_dict_get).


JorjMcKie · 2021-01-18T11:45:36Z

JorjMcKie
Jan 18, 2021
Maintainer

Parsing the string with Python isn't really reliable...

True, but this can be healed: it all depends on inserting an intermediate value that is (1) recognizable, and (2) improbable enough not to be part of the PDF already.
By doing so, your above example works on my machine now.
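
The idea can be sketched on plain strings like this. The actual intermediate value used in PyMuPDF is not shown in this thread; the random PDF string literal produced by the made-up helper `make_sentinel` is just one way to satisfy both conditions.

```python
import uuid


def make_sentinel() -> str:
    """Return a PDF string literal that is very unlikely to occur in a file."""
    return f"({uuid.uuid4().hex})"


def replace_sentinel(objstring: str, sentinel: str, value: str) -> str:
    """Swap the sentinel for the real value, refusing ambiguous matches."""
    if objstring.count(sentinel) != 1:
        raise ValueError("sentinel must occur exactly once")
    # The space keeps the key and the value apart, e.g. '/A 2' and not '/A2'.
    return objstring.replace(sentinel, " " + value)


sentinel = make_sentinel()
objstring = f"<</B<</A[]>>/A{sentinel}>>"  # state after inserting the sentinel
print(replace_sentinel(objstring, sentinel, "2"))  # prints <</B<</A[]>>/A 2>>
```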

I have also found the reason for your segfault: because of illegally trying to insert a new subdict key /0 into something that is not a dict but an array (/OpenAction), MuPDF raised an exception (you should have seen the error message). So no new object was created, which I did not handle correctly. Now that example works and correctly raises and handles an exception.


JorjMcKie · 2021-01-18T13:44:56Z

JorjMcKie
Jan 18, 2021
Maintainer

Also found a serious error:

If /OpenAction 1289 0 R exists and object 1289 0 R is a dict, then doc.xref_set_key(xref, "OpenAction/0", "(something)") will not fail, but insert a new subdict /0 with my eyecatcher string in xref 1289!!
So I had to introduce an additional check (sigh!) to prevent this from happening. This means: for a path of at least two levels "Key1/Key2/Key3.../Keyn" that does not yet exist, none of the sub-paths "Key1", "Key1/Key2", ... may point to an indirect object.
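
A sketch of that rule, independent of how it is actually implemented: `proper_sub_paths` and `crosses_indirect_object` are made-up names, and `lookup` stands for any callable that returns a `(type, content)` tuple for a key path, for example `lambda sub: doc.xref_get_key(xref, sub)`.

```python
from typing import Callable, Iterator, Tuple


def proper_sub_paths(key: str) -> Iterator[str]:
    """Yield 'Key1', 'Key1/Key2', ... but not the full path itself."""
    parts = key.split("/")
    for n in range(1, len(parts)):
        yield "/".join(parts[:n])


def crosses_indirect_object(key: str, lookup: Callable[[str], Tuple[str, str]]) -> bool:
    """Return True if any proper sub-path of key resolves to type 'xref'."""
    return any(lookup(sub)[0] == "xref" for sub in proper_sub_paths(key))


# Stand-in for a catalog where /OpenAction is an indirect reference.
fake_catalog = {"OpenAction": ("xref", "1289 0 R")}
lookup = lambda sub: fake_catalog.get(sub, ("null", "null"))
print(crosses_indirect_object("OpenAction/0", lookup))
```

A setter would refuse to create the new path when this returns True.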

As part of the housekeeping, I now also reject empty strings for key and value.


JorjMcKie · 2021-01-19T19:50:35Z

JorjMcKie
Jan 19, 2021
Maintainer

I will add the above to the coming version.
Thanks for coming up with this idea and your time to help testing it!

