How to remove an image from PDF? Updated: look at the bottom. #1667

chest3x · 2019-08-02T11:21:09Z

chest3x
Aug 2, 2019

Hi,

firstly, thanks for a great project.

I am looking for a way to remove specific images (not all of them) from a PDF.
Possibly also replacing them by text, but that seems doable, as the location of the image is exposed already.

Is there a way to do that using PyMuPDF?

Thanks.

Answered by JorjMcKie

Apr 7, 2022

There is a way now to remove images that can be identified by an xref - without impacting the rest of the page in any way. Have a look at this script.


JorjMcKie · 2019-08-02T12:46:24Z

JorjMcKie
Aug 2, 2019
Maintainer

That is possible, but it is not easy and depends heavily on a few things:

which software inserted the image
on how many pages the image is used (could be more than one!)
how much coding effort you are willing to invest

An image leaves its mark at more than one place:

in the /Resources object used by the page object(s)
in the /Contents object(s) of every page using it.

In comparison, just suppressing the display of the image (without physically removing the image object from the PDF) is fairly doable:

>>> import fitz
>>> doc=fitz.open("PyMuPDF.pdf")
>>> doc.getPageImageList(0)  # images on page 0
[[270, 0, 261, 115, 8, 'DeviceRGB', '', 'Im1', 'DCTDecode']]
>>> # 270 is the image object xref
>>> # the page references it via name 'Im1'
>>> doc[0]._getContents()
[274]
>>> # xref 274 is the only /Contents object of the page (could be 
>>> c = doc._getXrefStream(274) # read the stream source
>>> c.find(b"/Im1 Do") # try find the image display command
217
>>> cnew = c.replace(b"/Im1 Do", b"") # remove it
>>> doc._updateStream(274, cnew) # replace page's /Contents object
>>>

Now the image should no longer be shown on that page.
Possible complications:

There could be more than one /Contents object, not just a one-element list like [274].
The command /Im1 Do could contain an arbitrary number of spaces or \n between its two tokens, and the two tokens could even be in two separate /Contents objects (not to be expected, however).
If the same image appears on several pages, the xref 270 remains the same, but the name Im1 could be different.
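The whitespace complication can be handled with a regular expression instead of a literal replace. The following is only a sketch: strip_image_invocation is a hypothetical helper, not part of PyMuPDF, and it works on the raw bytes of one /Contents stream. It assumes the name is written without #xx escapes and that the text "/Im1 Do" does not also occur inside a string literal or inline image data.

```python
import re


def strip_image_invocation(stream: bytes, name: str) -> bytes:
    """Remove every '/<name> Do' command from a content stream.

    Any amount of whitespace (spaces, tabs, newlines) is accepted
    between the image name and the 'Do' operator.
    """
    pattern = rb"/" + re.escape(name.encode()) + rb"\s+Do(?=\s|$)"
    return re.sub(pattern, b"", stream)


# Hypothetical stream: '/Im1' and 'Do' are split across lines.
sample = b"q\n1 0 0 1 0 0 cm\n/Im1   \n Do\nQ\n/Im10 Do\n"
cleaned = strip_image_invocation(sample, "Im1")
```

Because whitespace is required directly after the name, /Im10 Do is left alone when Im1 is removed. For a page with several /Contents objects, the same function would have to be applied to each xref in the list.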

0 replies

chest3x · 2019-08-07T08:44:41Z

chest3x
Aug 7, 2019
Author

Thank You for your answer.

Is it also possible to extract location (BBox) of the image I am 'removing'?

Currently I am playing around with two approaches:
page.getText("dict")['blocks'] - here I am capable of extracting the BBox of the image, but I am not able to get image reference here
doc.getPageImageList(0) - here I am capable of getting the image reference, but not the BBox.

0 replies

JorjMcKie · 2019-08-07T10:01:20Z

JorjMcKie
Aug 7, 2019
Maintainer

Your comment is spot on:
Unfortunately, this is not possible right now :-(.
This is the reason:
The page.getText() methods work for all supported document types, not just for PDFs, which are the only ones that contain cross-reference numbers.

The only thing you can do is try some heuristics:
The image information in page.getText("dict")['blocks'] contains the bbox as location information, but also the original image information: width, height, type (extension), bpi, ...
doc.getPageImageList also provides some image information (not exactly the same, however).
If you compare this information, you may get a reliable cross-identification in many cases.
If you do doc.extractImage(xref) with the xref you find in doc.getPageImageList, the returned dictionary contains even more complete image information to match with that of page.getText("dict")['blocks'].

Image display in a PDF in principle is coded like this:
In the page's /Contents source we will find

a b c d e f cm % 'cm' = concatenate matrix, a,b,c,d,e,f are matrix elements (floats)
...            % any number of other PDF commands
/Im1 Do        % display an image named 'Im1'
...

In the page's object definition we will find

>>> doc = fitz.open("PyMuPDF.pdf")
>>> page=doc[0]
>>> page.xref
264
>>> print(doc._getXrefString(264))
<<
  /Type /Page
  /Contents 269 0 R
  /Resources 268 0 R
  /MediaBox [ 0 0 612 792 ]
  /Parent 273 0 R
>>
>>> print(doc._getXrefString(268)) # print resources object
<<
  /Font <<
    /F38 271 0 R
    /F39 272 0 R
  >>
  /XObject <<
    /Im1 265 0 R
  >>
  /ProcSet [ /PDF /Text /ImageC ]
>>
>>> doc.getPageImageList(0) # compare with this output:
[[265, 0, 261, 115, 8, 'DeviceRGB', '', 'Im1', 'DCTDecode']]
>>> # now look at /Contents source of the page:
>>> cont = doc._getXrefStream(269).decode() # decode bytes to string
>>> print(cont[:500])
...
q
1 0 0 1 72 710.536 cm
[]0 d 0 J 0.996 w 0 0 m 468 0 l S
Q
0 g 0 G
0 g 0 G
q
195.75 0 0 86.25 344.25 616.814 cm % concatenate matrix
/Im1 Do                            % display image
Q
BT
...
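The cm operands in front of /Im1 Do are what place the image: a PDF image fills the unit square, and the matrix maps that square onto the page. A small sketch of the arithmetic follows. bbox_from_matrix is a hypothetical helper; it assumes that this cm is the only transformation in effect, that the page is unrotated, and that the MediaBox starts at (0, 0). The y values are flipped because PDF measures from the bottom of the page while PyMuPDF rectangles measure from the top.

```python
def bbox_from_matrix(matrix, page_height):
    """Map the unit square through a PDF matrix (a, b, c, d, e, f).

    Returns (x0, y0, x1, y1) in top-left based coordinates.
    """
    a, b, c, d, e, f = matrix
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
    # PDF convention: x' = a*x + c*y + e, y' = b*x + d*y + f
    xs = [a * x + c * y + e for x, y in corners]
    # flip y: PDF origin is bottom-left, PyMuPDF origin is top-left
    ys = [page_height - (b * x + d * y + f) for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Values from the stream above, page height 792 from the /MediaBox
bbox = bbox_from_matrix((195.75, 0, 0, 86.25, 344.25, 616.814), 792)
```

For the example page this gives roughly (344.25, 88.936, 540.0, 175.186). Taking the min and max over all four corners keeps the result a valid rectangle when the matrix contains a rotation.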

To do the suggested heuristic comparison, try this:

>>> # applying brute force ...
>>> img = doc.extractImage(265)
>>> img.keys()
dict_keys(['ext', 'smask', 'width', 'height', 'colorspace', 'xres', 'yres', 'cs-name', 'image'])
>>> img["cs-name"]
'DeviceRGB'
>>> img["ext"]
'jpeg'
>>> blks = page.getText("dict")["blocks"]
>>> blks[0].keys()
dict_keys(['type', 'bbox', 'width', 'height', 'ext', 'image'])
>>> blks[0]["image"] == img["image"]
True
>>> # voilà, found a match!
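For pages with several images, the comparison can be put in a loop. This is only a sketch working on plain dictionaries with the keys shown above: match_blocks_to_xrefs is a hypothetical helper, blocks stands for page.getText("dict")["blocks"], and images is assumed to be a dictionary {xref: doc.extractImage(xref)} built beforehand. It tries binary equality first and falls back to width, height and extension.

```python
def match_blocks_to_xrefs(blocks, images):
    """Return {block index: xref} for image blocks with a unique match."""
    matches = {}
    for idx, blk in enumerate(blocks):
        if blk.get("type") != 1:  # only image blocks
            continue
        # first choice: identical image bytes
        candidates = [x for x, img in images.items()
                      if img["image"] == blk["image"]]
        if not candidates:
            # fallback: compare the descriptive fields only
            key = (blk["width"], blk["height"], blk["ext"])
            candidates = [x for x, img in images.items()
                          if (img["width"], img["height"], img["ext"]) == key]
        if len(candidates) == 1:  # ambiguous matches are skipped
            matches[idx] = candidates[0]
    return matches


# Made-up data shaped like the outputs above
blocks = [
    {"type": 1, "bbox": (344.25, 88.9, 540.0, 175.2), "width": 261,
     "height": 115, "ext": "jpeg", "image": b"JPEGDATA"},
    {"type": 0, "bbox": (72.0, 200.0, 540.0, 220.0)},
]
images = {
    265: {"ext": "jpeg", "width": 261, "height": 115, "image": b"JPEGDATA"},
    300: {"ext": "png", "width": 10, "height": 10, "image": b"X"},
}
result = match_blocks_to_xrefs(blocks, images)
```

The bbox of a matched image is then blocks[idx]["bbox"]. Blocks that match more than one xref are left out on purpose, since a heuristic cannot tell them apart.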

0 replies

chest3x · 2019-08-09T14:34:53Z

chest3x
Aug 9, 2019
Author

Thanks a lot, this helped me to solve the problem.

0 replies

JorjMcKie · 2019-08-13T13:27:35Z

JorjMcKie
Aug 13, 2019
Maintainer

In the meantime, I have developed a function which extracts the bbox of images on PDF pages without using page.getText("dict").
It does this by parsing the PDF commands defining a page's layout (/Contents and similar PDF objects).

The bbox calculator is pure Python, but uses PyMuPDF. It currently supports only images that are inserted unrotated or rotated by integer multiples of 90 degrees.
I am continuing to extend the solution to other rotations as well.

The following ZIP contains this function get_image_bbox.py and a test script, which can be used as a front-end for testing purposes.

Maybe you are interested in trying it out.

get-bbox.zip

So, this function helps solve your problem in the following way:
Given a PDF page n and the list of images on it like this:

>>> import fitz
>>> from get_image_bbox import get_image_bbox
>>> doc=fitz.open("PyMuPDF.pdf")
>>> page=doc[0]
>>> imglist=doc.getPageImageList(page.number)
>>> imglist
[[266, 0, 261, 115, 8, 'DeviceRGB', '', 'Im1', 'DCTDecode']]
>>> bbox = get_image_bbox(page, imglist[0])
>>> print(bbox)
Rect(344.25, 88.93597412109375, 540.0, 175.18597412109375)
>>> # just as a cross check:
>>> blocks = [b["bbox"] for b in page.getText("dict")["blocks"] if b["type"] == 1]
>>> blocks[0]
(344.25, 88.93597412109375, 540.0, 175.18597412109375)
>>> # as can be seen: (practically) the same rectangle
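To give an idea of what parsing the page's commands involves, here is a rough sketch. It is not the code in get_image_bbox.py. The names concat and image_matrices are made up for illustration, and the tokenizer is deliberately naive: it splits on whitespace, so string literals, comments and inline images are not handled. It tracks the current transformation matrix through q (save), Q (restore) and cm (concatenate), and records the matrix in effect at each /Name Do.

```python
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def concat(m, ctm):
    """Return m x ctm, which is what the 'cm' operator does."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def image_matrices(content):
    """Return {name: [matrix, ...]} for every '/name Do' in the stream."""
    tokens = content.decode("latin-1").split()
    ctm, stack, found = IDENTITY, [], {}
    for i, tok in enumerate(tokens):
        if tok == "q":  # save graphics state
            stack.append(ctm)
        elif tok == "Q" and stack:  # restore graphics state
            ctm = stack.pop()
        elif tok == "cm" and i >= 6:
            try:
                m = tuple(float(t) for t in tokens[i - 6:i])
            except ValueError:
                continue  # operands are not plain numbers: ignore
            ctm = concat(m, ctm)
        elif tok == "Do" and i >= 1 and tokens[i - 1].startswith("/"):
            found.setdefault(tokens[i - 1][1:], []).append(ctm)
    return found


stream = (b"q\n1 0 0 1 72 710.536 cm\n[]0 d 0 J 0.996 w 0 0 m 468 0 l S\nQ\n"
          b"0 g 0 G\nq\n195.75 0 0 86.25 344.25 616.814 cm\n/Im1 Do\nQ\n")
matrices = image_matrices(stream)
```

The matrix found for Im1 is the one from the stream shown earlier in this thread. Mapping the unit square through it and flipping y against the page height yields the bbox.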

0 replies

JorjMcKie · 2019-08-13T13:34:15Z

JorjMcKie
Aug 13, 2019
Maintainer

Advantages over my previous suggestion are:

very much faster
does not require allocation of (usually) large memory areas using getText and extractImage -- those are not used.
supports a wider range of images because it doesn't rely on binary equality of image streams delivered by getText and extractImage.

0 replies

JorjMcKie · 2019-08-13T15:20:34Z

JorjMcKie
Aug 13, 2019
Maintainer

Well, in the above case, it was even exactly the same rectangle.
That need not always be so, due to rounding.

0 replies

ahmedalmuqri · 2020-05-23T07:12:26Z

ahmedalmuqri
May 23, 2020

I intend to do something similar, if you can help me out here. Using PyMuPDF, I want to extract all images from a PDF, save them separately, replace each image with its image name at the same place, and save the result as another document. I can save all images with the following code:
import fitz
doc = fitz.open("Article_Example_1_2.pdf")
for i in range(len(doc)):
    print(doc[i]._getContents())
    for img in doc.getPageImageList(i):
        print(img)
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha < 4:  # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:  # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None
doc.save(filename=r"new.pdf")
doc.close()
But I am not sure how to replace them all in the PDF with the names of their stored images.

0 replies

JorjMcKie · 2020-05-23T09:26:29Z

JorjMcKie
May 23, 2020
Maintainer

@ahmedalmuqri - with the new v1.17.0 this has become much simpler.
You can use Redaction annotations to achieve this. Here is a basic example:
PDF page before:

Code snippet:

>>> import fitz
>>> doc=fitz.open("nur-ruhig.pdf")
>>> page = doc[0]
>>> doc.getPageImageList(0)
[(5, 0, 439, 501, 8, 'DeviceRGB', '', 'fzImg0', 'DCTDecode')]
>>> rect = page.getImageBbox("fzImg0")
>>> # colors
>>> blue = (0, 0, 1)
>>> gold = (1, 1, 0)
>>> page.addRedactAnnot(rect, "Here was fzImg0",
            align=fitz.TEXT_ALIGN_CENTER, fill=gold, text_color=blue)
'Redact' annotation on page 0 of nur-ruhig.pdf
>>> page.apply_redactions()
True
>>> doc.save("image-removed.pdf", garbage=3, deflate=True)
>>>

PDF page after:

Some notes:

Images handled like this will be physically removed from the PDF. If they are also used on other pages, they will no longer appear there either. If redactions do not cover the whole image, only the respective part will be blanked out.
Page method apply_redactions() applies all redactions on that page. It supports removal of text, images and links. It does not support removal of other embedded PDF pages (such as those inserted via page.showPDFpage()), even when images are contained therein.
By all means use deflate=True to regain space previously occupied by images which are now gone.

0 replies

ahmedalmuqri · 2020-05-23T09:55:31Z

ahmedalmuqri
May 23, 2020

Thanks for the above, Sir, you are definitely a pro.
Actually, I was aiming to convert PDF to HTML using PyMuPDF, but page.getText("html") produced an HTML file with the image data embedded in it. This also made the file larger.
I wanted to extract all images from the PDF, save them in another folder, replace each image (e.g. img_1) with its corresponding image filename (e.g. c:/foldername/img_1), save the result as a temporary PDF and then convert that to HTML, so that the HTML file would have image filenames as image sources instead of image data, like <img src="c:/foldername/img_1">.
Now I think the above code will not solve the problem as I initially thought. Could you please guide me here as well? I would be very grateful.

0 replies

JorjMcKie · 2020-05-23T10:07:13Z

JorjMcKie
May 23, 2020
Maintainer

Use the flags parameter of method page.getText(option, flags=n). With this you can exclude images from the text extraction.
There is a section in the documentation dealing specifically with this.
The simplest way: page.getText("html", flags=0)
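Note that flags=0 leaves the images out altogether. If the HTML should instead point to image files, one possible route is to post-process the HTML text. The following is a sketch under the assumption that the images appear as src="data:image/...;base64,..." attributes in the output of page.getText("html"). externalize_images and the img_ prefix are made-up names; the function only returns the image bytes, and writing them to disk is left to the caller.

```python
import base64
import re

# matches src attributes that carry a base64 data URI
DATA_URI = re.compile(
    r'src="data:image/(?P<ext>[a-zA-Z0-9.+-]+);base64,(?P<data>[^"]*)"'
)


def externalize_images(html, prefix="img_"):
    """Replace embedded image data by file names.

    Returns (new_html, {filename: image bytes}).
    """
    images = {}

    def repl(match):
        name = "%s%d.%s" % (prefix, len(images) + 1, match.group("ext"))
        images[name] = base64.b64decode(match.group("data"))
        return 'src="%s"' % name

    return DATA_URI.sub(repl, html), images


# Made-up HTML fragment with a tiny fake payload
payload = base64.b64encode(b"abc").decode()
html = '<p><img src="data:image/png;base64,%s"></p>' % payload
new_html, files = externalize_images(html)
```

Each entry of files can then be written to the target folder, with the folder path put into prefix if absolute paths are wanted in the HTML.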

0 replies

baleris · 2020-09-15T07:19:26Z

baleris
Sep 15, 2020

I am trying to do the same thing as @ahmedalmuqri mentioned above, but my task is to extract all images from a page and, if an image consists mostly of zero-valued pixels (in my case >90%), remove it from the PDF page; the remaining images have to be retained as they are.
When I tried this approach, I lost all images, including one which should not be removed (it does not have zero pixels). How can I retain the images that do not meet my zero-pixel condition?

Here is my code snippet:

    for each_img, each_rect in black_img_dict.items():
        print(each_rect)
        print(each_img)
        page.addRedactAnnot(each_rect," ")
        page.apply_redactions()
        
doc.save("image-removed.pdf", garbage=3)

@JorjMcKie

0 replies

JorjMcKie · 2020-09-15T09:05:44Z

JorjMcKie
Sep 15, 2020
Maintainer

I don't understand.
Your script extracts images to separate files. It does not change the PDF at all.

Is that what you want? Or do you want to change the PDF itself?

0 replies

JorjMcKie · 2020-09-15T09:08:22Z

JorjMcKie
Sep 15, 2020
Maintainer

Oops, sorry - your script code is difficult to read, I didn't see all of it in the first place.

0 replies

JorjMcKie · 2020-09-15T09:16:11Z

JorjMcKie
Sep 15, 2020
Maintainer

Something must be going wrong with your image selection. Are you sure it is filtering out the right images - did you debug it at all?
The code that is ill-formatted extracts images to separate files and does not seem to connect to the better readable part.

Anyway - if you filter images correctly, removing them via redactions should work as you coded it. I would execute page.apply_redactions() only once per page, and not for each image.
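The "only once per page" advice might look like the sketch below. redact_page is a hypothetical wrapper; the page argument is expected to offer addRedactAnnot and apply_redactions as used in this thread (newer PyMuPDF versions spell the first one add_redact_annot). All annotations are added first, then applied in a single call.

```python
def redact_page(page, rects, text=" "):
    """Add one redaction per rectangle, then apply them all at once.

    Returns the number of redaction annotations that were added.
    """
    count = 0
    for rect in rects:
        page.addRedactAnnot(rect, text)  # mark the area only
        count += 1
    if count:
        page.apply_redactions()  # one call removes all marked areas
    return count
```

In the loop from the question, this would be called once per page with the rectangles of the images selected for removal on that page.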

0 replies

baleris · 2020-09-15T10:14:32Z

baleris
Sep 15, 2020

Yes, i am extracting right image itself, along with the right images it also deletes other images from a page.
I have attached sample pdf here along with the code.
15-15.pdf

    print(black_img_dict)
    for each_img, each_rect in black_img_dict.items():
        print(each_rect)
        print(each_img)
        white=(0,1,1)
        page.addRedactAnnot(each_rect," ",fill=False)
        page.apply_redactions()

doc.save("image-removed.pdf", garbage=3, deflate=True)

0 replies

baleris · 2020-09-15T10:50:07Z

baleris
Sep 15, 2020

@JorjMcKie I have done it an alternative way: a) First I removed the images which are full of black pixels. Upon checking the output, all images had been removed instead of only those passing the filter. b) Then I re-inserted all the images which I want to retain on the page. It is double work, but it behaves as expected.

0 replies

baleris · 2020-09-15T12:25:47Z

baleris
Sep 15, 2020

@JorjMcKie One more challenge I have noticed here: I have some text over the image which I want to retain. But when I remove the image rect, the text vanishes along with the image.

0 replies

JorjMcKie · 2020-09-15T12:50:24Z

JorjMcKie
Sep 15, 2020
Maintainer

Sorry to say: this works as designed and as it should with redactions.

0 replies

JorjMcKie · 2020-09-15T12:52:29Z

JorjMcKie
Sep 15, 2020
Maintainer

Redactions are supposed to remove everything having a non-empty overlap with the redaction rectangle. Please read the documentation.
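To see in advance what a redaction rectangle would hit, the overlap rule can be checked on plain rectangles. This is a simplified sketch: overlaps and affected_items are made-up helpers, rectangles are (x0, y0, x1, y1) tuples, and real redactions may work at a finer granularity (for example per character) than whole item bboxes.

```python
def overlaps(r1, r2):
    """True if two (x0, y0, x1, y1) rectangles share a non-empty area."""
    x0, y0 = max(r1[0], r2[0]), max(r1[1], r2[1])
    x1, y1 = min(r1[2], r2[2]), min(r1[3], r2[3])
    return x0 < x1 and y0 < y1


def affected_items(redact_rect, items):
    """Names of all items whose bbox overlaps the redaction rectangle."""
    return [name for name, bbox in items.items()
            if overlaps(redact_rect, bbox)]


# Made-up page: an image with text on top of it, plus a footer
image_rect = (100, 100, 300, 200)
items = {
    "text over image": (120, 150, 280, 170),
    "footer": (100, 700, 300, 720),
}
hit = affected_items(image_rect, items)
```

Here the text lying on top of the image is reported and the footer is not, which is the behavior described in the question.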

0 replies

JorjMcKie · 2020-09-15T13:08:50Z

JorjMcKie
Sep 15, 2020
Maintainer

Redactions are hence maybe the wrong approach altogether.
How about removing the image display commands for the unwanted images?
Page appearance is built by commands in the /Contents object(s). Image display commands all look like /refname Do, where refname is item[7] in the list doc.getPageImageList(pno).

The following function could be an idea to remove unwanted images:

def remove_images(doc, pno, unwanted):
    un_list = [b"/%s Do" % u.encode() for u in unwanted]
    page=doc[pno]  # read the page
    page.cleanContents()  # unify / format the commands
    xref=page.getContents()[0]  # get its XREF
    cont=page.readContents().splitlines()  # read commands as list of lines
    for i in range(len(cont)):  # walk thru the lines
        if cont[i] in un_list:  # invokes an unwanted image
            cont[i] = b""  # remove command
    doc.updateStream(xref, b"\n".join(cont))  # replace cleaned command object
    page.cleanContents()  # removes now unreferenced images from page definition

If an image should be removed, include its reference name in the unwanted list. Then call the function with that list.

0 replies

baleris · 2020-09-15T13:40:54Z

baleris
Sep 15, 2020

@JorjMcKie It works fine for the page that has text over the image rect, but in the case of 15-15.pdf (attached above) it brings some hidden images to the front! I am not sure where these are referenced on this page.

0 replies

JorjMcKie · 2020-09-16T20:31:49Z

JorjMcKie
Sep 16, 2020
Maintainer

I tried it out and removed all images. Worked just fine.
Contrary to the visual impression (what we see looks like one or two images), the picture is made up of more than 10 single images. Heaven knows what happens if you delete some of them and let others survive just because they pass your filter ... whose reliability I cannot judge.

0 replies

JorjMcKie · 2020-09-16T20:50:57Z

JorjMcKie
Sep 16, 2020
Maintainer

In the hope of convincing you that the problem is your image selection algorithm, here is an overview of the various images overlapping one another ...

0 replies

JorjMcKie · 2020-10-27T12:50:26Z

JorjMcKie
Oct 27, 2020
Maintainer

To whom it may apply:
There is a GUI script (wxPython) which lets you insert, reposition or remove images on PDF pages under your visual control. This is the script and this is its documentation.

0 replies

vbgsmanzoor · 2021-09-22T16:08:39Z

vbgsmanzoor
Sep 22, 2021

I have a signature image on pdf and it is transparent from the sides but when I remove the image it removes PDF contents under that image

0 replies

JorjMcKie · 2021-09-22T16:39:43Z

JorjMcKie
Sep 22, 2021
Maintainer

"... but when I remove the image ..."

You mean: using redactions?
If yes: this is inevitable. It even belongs to the definition of redactions.

0 replies

JorjMcKie · 2022-04-07T13:46:57Z

JorjMcKie
Apr 7, 2022
Maintainer

There is a way now to remove images that can be identified by an xref - without impacting the rest of the page in any way. Have a look at this script.

0 replies

BriskyGates · 2023-04-17T06:35:05Z

BriskyGates
Apr 17, 2023

First of all, thank you for your work on PyMuPDF; it helps us a lot.
I aim to get rid of some specific watermarks.
Question 1:
When using page.delete_image(xref_no), an error occurs (AttributeError: 'Document' object has no attribute 'is_image').
Question 2:
Could you please give me some advice on improving the execution speed of the code (currently the time complexity is n^4)?
Question 3:
The resulting set of xrefs to delete is {35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 48, 53, 54, 55, 56, 57, 58, 59, 60, 62, 63, 64, 65, 70}, but I only need to delete the xrefs between 35 and 44. Are there other features I could use besides the grey value?
Clue for question 3:
Most watermark pictures have only two colors, grey and white; we could use this feature.

many thanks!!

import fitz
import os
import io
from PIL import Image
fp ='pdf_font_garbled.pdf'
doc = fitz.open(fp)
delete_xref_set=set()
for page in doc:
    img_list = page.get_images(full=True)
    for img in img_list:
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        img = Image.open(io.BytesIO(pix.tobytes()))
        width, height = img.size
        for i in range(width):
            for j in range(height):
                if sum(img.getpixel((i, j))) ==690: 
                    if xref in delete_xref_set:
                        continue
                    delete_xref_set.add(xref)
                    page.delete_image(xref)  # AttributeError: 'Document' object has no attribute 'is_image'
        pix = None
    else:
        print(f'delete xref set :{delete_xref_set}')
        # output: delete xref set :{35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 48, 53, 54, 55, 56, 57, 58, 59, 60, 62, 63, 64, 65, 70}

doc.save("output_rid_watermark.pdf")
doc.close()

relevant pdf file:
pdf_font_garbled.pdf

what images needed to get rid of:

3 replies

JorjMcKie Apr 17, 2023
Maintainer

The bug causing AttributeError: 'Document' object has no attribute 'is_image' has been fixed in v1.22.0.

JorjMcKie Apr 17, 2023
Maintainer

You can use a Pixmap method to check its color distribution, pix.color_count(). Try to play with this snippet - it already removes some of the stuff you seem to want to get rid of:

import fitz
doc=fitz.open("pdf_font_garbled.pdf")
for page in doc:
    # list of unique image xrefs on page
    xrefs = list(set([item[0] for item in page.get_images()]))
    for xref in xrefs:
        pix = fitz.Pixmap(doc, xref)
        colors = pix.color_count()  # number of unique colors in the image
        pix = None  # remove pixmap at earliest possible time
        if colors < 50:  # you may want to change this number: unique colors in image
            print("deleting image at xref", xref)
            page.delete_image(xref)
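As an illustration of the idea behind counting colors (not of how pix.color_count() is implemented), the unique pixel values can be collected from the raw sample bytes. count_colors is a made-up helper; samples stands for a bytes object like pix.samples, and n for the number of bytes per pixel (pix.n). The optional limit stops the scan as soon as the image is known to have too many colors, which avoids a full pass over large photos.

```python
def count_colors(samples, n, limit=None):
    """Count distinct pixel values in a flat bytes buffer.

    samples: raw pixel bytes, n bytes per pixel.
    limit: stop early once more than this many colors were seen.
    """
    seen = set()
    for i in range(0, len(samples) - n + 1, n):
        seen.add(samples[i:i + n])
        if limit is not None and len(seen) > limit:
            break  # enough evidence that this is not a watermark
    return len(seen)


# Made-up 4-pixel RGB image using only white and one grey tone
samples = bytes([255, 255, 255, 230, 230, 230,
                 255, 255, 255, 230, 230, 230])
colors = count_colors(samples, 3)
```

An image whose count stays at two or three would be a watermark candidate under the "grey and white only" clue from the question. The threshold remains something to tune per document.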

BriskyGates Apr 18, 2023

I really appreciate it! :>

AwesomeYuer · 2023-10-24T01:38:05Z

AwesomeYuer
Oct 24, 2023

@AwesomeYuer

0 replies
