-
Hi, firstly, thanks for a great project. I am looking for a way to remove specific images (not all of them) from a PDF. Is there a way to do that using PyMuPDF? Thanks.
Replies: 30 comments 3 replies
-
That is possible, but not easy, and it depends heavily on a few things:
An image leaves its mark at more than one place:
By comparison, just suppressing the display of the image (without physically removing the image object from the PDF) is fairly doable:
>>> import fitz
>>> doc=fitz.open("PyMuPDF.pdf")
>>> doc.getPageImageList(0) # images on page 0
[[270, 0, 261, 115, 8, 'DeviceRGB', '', 'Im1', 'DCTDecode']]
>>> # 270 is the image object xref
>>> # the page references it via name 'Im1'
>>> doc[0]._getContents()
[274]
>>> # xref 274 is the only /Contents object of the page (could be
>>> c = doc._getXrefStream(274) # read the stream source
>>> c.find(b"/Im1 Do") # try find the image display command
217
>>> cnew = c.replace(b"/Im1 Do", b"") # remove it
>>> doc._updateStream(274, cnew) # replace page's /Content object
Now the image should no longer be shown on that page.
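The plain `bytes.replace` above only matches the exact spelling `/Im1 Do`. As a minimal sketch (the helper name `drop_image_invocations` is made up for illustration, and it works on the raw stream bytes only), a regular expression can tolerate any whitespace between the name and the `Do` operator while still matching whole tokens:

```python
import re


def drop_image_invocations(stream: bytes, names) -> bytes:
    """Remove '/<name> Do' operators for the given XObject reference names."""
    for name in names:
        # (?<!\S) and (?!\S) require whitespace (or stream start/end) around
        # the match, so "/Im1 Do" is removed but "/Im10 Do" is left alone.
        pattern = rb"(?<!\S)/" + re.escape(name.encode()) + rb"\s+Do(?!\S)"
        stream = re.sub(pattern, b"", stream)
    return stream


# Made-up content stream fragment, modelled on the one shown in this thread
sample = b"q\n195.75 0 0 86.25 344.25 616.814 cm\n/Im1\n Do\nQ\nq\n/Im10 Do\nQ\n"
cleaned = drop_image_invocations(sample, ["Im1"])
print(cleaned)
```

In current PyMuPDF versions the stream would presumably be read with `doc.xref_stream(xref)` and written back with `doc.update_stream(xref, cleaned)`; the underscore-prefixed calls above are the older spellings.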
-
Thank you for your answer. Is it also possible to extract the location (bbox) of the image I am 'removing'? Currently I am playing around with two approaches:
-
Your comment is spot on: the only thing you can do is try some heuristics. Image display in a PDF is, in principle, coded like this:
In the page's object definition we will find
>>> doc = fitz.open("PyMuPDF.pdf")
>>> page=doc[0]
>>> page.xref
264
>>> print(doc._getXrefString(264))
<<
/Type /Page
/Contents 269 0 R
/Resources 268 0 R
/MediaBox [ 0 0 612 792 ]
/Parent 273 0 R
>>
>>> print(doc._getXrefString(268)) # print resources object
<<
/Font <<
/F38 271 0 R
/F39 272 0 R
>>
/XObject <<
/Im1 265 0 R
>>
/ProcSet [ /PDF /Text /ImageC ]
>>
>>> doc.getPageImageList(0) # compare with this output:
[[265, 0, 261, 115, 8, 'DeviceRGB', '', 'Im1', 'DCTDecode']]
>>> # now look at /Contents source of the page:
>>> cont = doc._getXrefStream(269).decode() # decode bytes to string
>>> print(cont[:500])
...
q
1 0 0 1 72 710.536 cm
[]0 d 0 J 0.996 w 0 0 m 468 0 l S
Q
0 g 0 G
0 g 0 G
q
195.75 0 0 86.25 344.25 616.814 cm % concatenate matrix
/Im1 Do % display image
Q
BT
...
To do the suggested heuristic comparison, try this ...
>>> # applying brute force ...
>>> img = doc.extractImage(265)
>>> img.keys()
dict_keys(['ext', 'smask', 'width', 'height', 'colorspace', 'xres', 'yres', 'cs-name', 'image'])
>>> img["cs-name"]
'DeviceRGB'
>>> img["ext"]
'jpeg'
>>> blks = page.getText("dict")["blocks"]
>>> blks[0].keys()
dict_keys(['type', 'bbox', 'width', 'height', 'ext', 'image'])
>>> blks[0]["image"] == img["image"]
True
>>> # voilà, found a match!
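The brute-force comparison can be wrapped in a small helper. This is only a sketch: `find_image_bboxes` is a hypothetical name, the block list below is made-up stand-in data shaped like `page.getText("dict")["blocks"]`, and byte equality only helps when both extraction paths return identical bytes, as they did in the JPEG case above.

```python
def find_image_bboxes(blocks, image_bytes):
    """Return the bbox of every image block whose binary content equals image_bytes."""
    return [
        tuple(b["bbox"])
        for b in blocks
        if b.get("type") == 1 and b.get("image") == image_bytes  # type 1 = image block
    ]


# Stand-in data; a real list would come from page.getText("dict")["blocks"]
blocks = [
    {"type": 0, "bbox": (72, 72, 540, 90), "lines": []},
    {"type": 1, "bbox": (344.25, 88.9, 540.0, 175.2), "image": b"\xff\xd8jpeg-a"},
    {"type": 1, "bbox": (72, 300, 200, 400), "image": b"\xff\xd8jpeg-b"},
]
print(find_image_bboxes(blocks, b"\xff\xd8jpeg-a"))
```

A list is returned because the same image may be displayed more than once on a page.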
-
Thanks a lot, this helped me to solve the problem.
-
In the meantime, I have developed a function that extracts the bbox of images on PDF pages without using The bbox calculator is pure Python, but uses PyMuPDF. It currently supports only images inserted with a rotation that is an integer multiple of 90 degrees. The following ZIP contains this function. Maybe you are interested in trying it out. So, this function helps solve your problem in the following way:
>>> import fitz
>>> from get_image_bbox import get_image_bbox
>>> doc=fitz.open("PyMuPDF.pdf")
>>> page=doc[0]
>>> imglist=doc.getPageImageList(page.number)
>>> imglist
[[266, 0, 261, 115, 8, 'DeviceRGB', '', 'Im1', 'DCTDecode']]
>>> bbox = get_image_bbox(page, imglist[0])
>>> print(bbox)
Rect(344.25, 88.93597412109375, 540.0, 175.18597412109375)
>>> # just as a cross check:
>>> blocks = [b["bbox"] for b in page.getText("dict")["blocks"] if b["type"] == 1]
>>> blocks[0]
(344.25, 88.93597412109375, 540.0, 175.18597412109375)
>>> # as can be seen: (practically) the same rectangle
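For the simple case shown earlier in this thread, the rectangle can also be reproduced by hand from the `cm` matrix that precedes `/Im1 Do`. The sketch below is not the `get_image_bbox` function from the ZIP; it assumes that only this one `cm` is in effect, that the MediaBox starts at (0, 0) and that the page is not rotated. The function name is made up for illustration.

```python
def image_bbox_from_cm(a, b, c, d, e, f, page_height):
    """Map the unit square through the PDF matrix [a b c d e f] and return its
    bounding box in top-left-origin coordinates (x0, y0, x1, y1)."""
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
    xs = [a * x + c * y + e for x, y in corners]
    ys = [b * x + d * y + f for x, y in corners]
    # PDF's y axis points up from the bottom edge; flip it to point down
    return (min(xs), page_height - max(ys), max(xs), page_height - min(ys))


# "195.75 0 0 86.25 344.25 616.814 cm" on a page with MediaBox [0 0 612 792]
bbox = image_bbox_from_cm(195.75, 0, 0, 86.25, 344.25, 616.814, 792)
print(bbox)
```

The result is approximately (344.25, 88.936, 540.0, 175.186), which agrees with the rectangle reported above.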
-
Advantages over my previous suggestion are:
-
Well, in the above case, it was even exactly the same rectangle.
-
I intend to do something similar, if you can help me out here. Using PyMuPDF, I want to extract all images from a PDF and save them separately, then replace each image with just its image name at the same position and save the result as another document. I can save all images with the following code
-
@ahmedalmuqri - with the new v1.17.0 this has become much simpler. Code snippet:
>>> import fitz
>>> doc=fitz.open("nur-ruhig.pdf")
>>> page = doc[0]
>>> doc.getPageImageList(0)
[(5, 0, 439, 501, 8, 'DeviceRGB', '', 'fzImg0', 'DCTDecode')]
>>> rect = page.getImageBbox("fzImg0")
>>> # colors
>>> blue = (0, 0, 1)
>>> gold = (1, 1, 0)
>>> page.addRedactAnnot(rect, "Here was fzImg0",
align=fitz.TEXT_ALIGN_CENTER, fill=gold, text_color=blue)
'Redact' annotation on page 0 of nur-ruhig.pdf
>>> page.apply_redactions()
True
>>> doc.save("image-removed.pdf", garbage=3, deflate=True)
Some notes:
-
Thanks for the above, Sir, you are definitely a pro.
-
Use the flags parameter of method
-
I am trying to do the same thing that @ahmedalmuqri mentioned above, but my task is to extract all images from a page and, if an image consists almost entirely of zero pixel values (in my case >90%), remove it from the PDF page; the remaining images have to be retained as they are. Here is my code snippet:
-
I don't understand. Is that what you want? Or do you want to change the PDF itself?
-
Oops, sorry - your script code is difficult to read; I didn't see all of it at first.
-
Something must be going wrong with your image selection. Are you sure it is filtering out the right images - did you debug it at all? Anyway - if you filter images correctly, removing them via redactions should work as you coded it. I would execute
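For debugging the selection step on its own, it can help to test the ">90% zero values" criterion separately from any PDF manipulation. The following is a sketch under assumptions, not the poster's code: both function names are hypothetical, the input is a raw sample buffer such as the bytes in `pix.samples`, zero-valued bytes (colour components) are counted rather than whole pixels, and no alpha channel is assumed.

```python
def zero_fraction(samples: bytes) -> float:
    """Fraction of sample bytes that are exactly zero (0.0 for empty input)."""
    if not samples:
        return 0.0
    return samples.count(0) / len(samples)


def is_mostly_zero(samples: bytes, threshold: float = 0.9) -> bool:
    """True if more than `threshold` of the sample bytes are zero."""
    return zero_fraction(samples) > threshold


# Made-up buffers standing in for pix.samples
dark = bytes(95) + bytes([255] * 5)    # 95% zero bytes
mixed = bytes(50) + bytes([128] * 50)  # 50% zero bytes
print(is_mostly_zero(dark), is_mostly_zero(mixed))
```

Printing the xref next to this result for every image on the page shows quickly whether the filter picks the intended images.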
-
Yes, I am extracting the right images, but along with them it also deletes other images from the page.
-
@JorjMcKie I have done it an alternate way. a) First I removed the images that are full of black pixels. On checking the output, all images had been removed instead of only those matching the logic. b) Then I re-inserted all the images I want to retain on the page. It is double work, but it behaves as expected.
-
@JorjMcKie one more challenge I have noticed here: I have some text over the image, which I want to retain. But when I remove the image rect, the text vanishes along with the image.
-
Sorry to say: this works as designed and as it should with redactions.
-
Redactions are supposed to remove everything having a non-empty overlap with the redaction rectangle. Please read the documentation.
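To illustrate why text lying on top of an image disappears together with it, here is a simplified pure-Python model of the overlap rule. The item names and rectangles are made up, and real redaction processing is more fine-grained than whole-rectangle tests, so take this only as a sketch of the principle:

```python
def overlaps(r1, r2):
    """True if two (x0, y0, x1, y1) rectangles share a non-empty area."""
    return (max(r1[0], r2[0]) < min(r1[2], r2[2])
            and max(r1[1], r2[1]) < min(r1[3], r2[3]))


def affected_items(redact_rect, items):
    """Names of all items whose rectangle overlaps the redaction rectangle."""
    return [name for name, rect in items.items() if overlaps(redact_rect, rect)]


# Hypothetical page layout
items = {
    "image Im1": (100, 100, 300, 200),
    "text on top of image": (120, 130, 280, 150),
    "caption below image": (100, 210, 300, 230),
}
hit = affected_items(items["image Im1"], items)
print(hit)
```

Using the image's own bbox as the redaction rectangle hits the image and the text placed over it, but not the caption underneath.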
-
Redactions are hence maybe the wrong approach altogether. The following function could be an idea for removing unwanted images:
def remove_images(doc, pno, unwanted):
    un_list = [b"/%s Do" % u.encode() for u in unwanted]
    page = doc[pno]  # read the page
    page.cleanContents()  # unify / format the commands
    xref = page.getContents()[0]  # get its XREF
    cont = page.readContents().splitlines()  # read commands as list of lines
    for i in range(len(cont)):  # walk thru the lines
        if cont[i] in un_list:  # invokes an unwanted image
            cont[i] = b""  # remove command
    doc.updateStream(xref, b"\n".join(cont))  # replace cleaned command object
    page.cleanContents()  # removes now unreferenced images from page definition
If an image should be removed, include its reference name in the
-
@JorjMcKie It works fine for the file that has text over the image rect, but in the case of 15-15.pdf (attached above) it brings some hidden images to the front! I am not sure where these are referenced from in this page.
-
I tried it out and removed all images. Worked just fine.
-
In the hope of convincing you that the problem is your image selection algorithm, here is an overview of the various images overlapping one another ...
-
To whom it may apply:
-
I have a signature image in a PDF that is transparent at the sides, but when I remove the image, the PDF contents under that image are removed as well.
-
You mean: using redactions?
-
There is a way now to remove images that can be identified by an xref - without impacting the rest of the page in any way. Have a look at this script.
-
First of all, thank you for your work on PyMuPDF; it helps us a lot. Many thanks!!
import fitz
import os
import io
from PIL import Image
fp = 'pdf_font_garbled.pdf'
doc = fitz.open(fp)
delete_xref_set = set()
for page in doc:
    img_list = page.get_images(full=True)
    for img in img_list:
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        img = Image.open(io.BytesIO(pix.tobytes()))
        width, height = img.size
        for i in range(width):
            for j in range(height):
                if sum(img.getpixel((i, j))) == 690:
                    if xref in delete_xref_set:
                        continue
                    delete_xref_set.add(xref)
                    page.delete_image(xref)  # AttributeError: 'Document' object has no attribute 'is_image'
        pix = None
else:
    print(f'delete xref set :{delete_xref_set}')
    # output: delete xref set :{35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 48, 53, 54, 55, 56, 57, 58, 59, 60, 62, 63, 64, 65, 70}
doc.save("output_rid_watermark.pdf")
doc.close()
Relevant PDF file: