RuntimeError: xref 732 is not an annot of this page #2066

Abh4git · 2022-11-19T08:43:48Z

Abh4git
Nov 19, 2022

I am using a PDF file and trying to extract the highlighted text.

Describe the bug (mandatory)

My code:

def main():
doc = fitz.open("ACMSurvey.pdf")
# Total page in the pdf
print(len(doc))
page = doc.load_page
# taking page for further processing
highlights = []
for page in doc:
for annot in page.annots():
highlight_text = page.get_textbox(annot.rect)
print(highlight_text)
highlights.append(highlight_text)
#print(highlights)
return

To Reproduce (mandatory)

I ran the above code against a PDF file to extract the highlighted text and got this traceback:

File "\pythonextractHighLightFromPdf\main.py", line 11, in main
for annot in page.annots():
File "\pythonextractHighLightFromPdf\venv\lib\site-packages\fitz\fitz.py", line 6698, in annots
annot = self.load_annot(xref)
File "\pythonextractHighLightFromPdf\venv\lib\site-packages\fitz\fitz.py", line 6147, in load_annot
val = self._load_annot(name, xref)
File "\pythonextractHighLightFromPdf\venv\lib\site-packages\fitz\fitz.py", line 6048, in _load_annot
return _fitz.Page__load_annot(self, name, xref)
RuntimeError: xref 732 is not an annot of this page

Your configuration (mandatory)

Operating system: Windows 10
Python version: Python 3.9
PyMuPDF version: 1.21.0, installed using pip


Answered by JorjMcKie

Nov 20, 2022


JorjMcKie · 2022-11-19T09:01:46Z

JorjMcKie
Nov 19, 2022
Maintainer

This issue is not reproducible because there is no reproducing document.
Unfortunately the meaning of your code must be guessed because you left out all indentations, but I re-checked what I concluded from it - and it does work!
So your file is probably damaged.

0 replies

JorjMcKie · 2022-11-19T19:00:54Z

JorjMcKie
Nov 19, 2022
Maintainer

Please confirm that you determined your file is corrupt or let us have it to reproduce the problem.

0 replies

Abh4git · 2022-11-20T06:10:43Z

Abh4git
Nov 20, 2022
Author

ACMSurvey.pdf

0 replies

Abh4git · 2022-11-20T06:11:47Z

Abh4git
Nov 20, 2022
Author

I uploaded the file used as the example. The PDF opens everywhere else as expected, and I don't see any sign of corruption. Thank you for your time and for looking into it.

0 replies

Abh4git · 2022-11-20T06:14:46Z

Abh4git
Nov 20, 2022
Author

main.txt
Code used added as .txt file

0 replies

JorjMcKie · 2022-11-20T06:55:30Z

JorjMcKie
Nov 20, 2022
Maintainer

This is no bug, therefore continuing research as "Discussions" item.

1 reply

Abh4git Nov 20, 2022
Author

ok. It was throwing me the error RuntimeError: xref 732 is not an annot of this page

JorjMcKie · 2022-11-20T07:14:51Z

JorjMcKie
Nov 20, 2022
Maintainer

The line page = doc.load_page makes no sense at all and can be deleted.
If you iterate with for annot in page.annots(): you will select every item in the page's annotations array, not only the highlight annotations that you actually want. Therefore, on page 3 (page.number = 2) you run into the array item 732 0 R, which is not an existing PDF object in this document. Clearly an invalid specification.

BTW, not every error inside a PDF makes it totally unusable. This specific error, for example, does not prevent most things, neither for PDF readers nor for (Py-)MuPDF. But iterating over annotations occurs under the assumption that we are in fact dealing with annotations, and not with all sorts of trash.

Modifying your iterator like this will make sure you are only dealing with annotations, and only dealing with annotation types you actually want.

for annot in page.annots(
    types=(fitz.PDF_ANNOT_HIGHLIGHT, fitz.PDF_ANNOT_UNDERLINE)  # add more types as you like
):
    ...

5 replies

Abh4git Nov 20, 2022
Author

Thank you for the pointers and suggested changes. I made them, but I still come across the same error. I am now curious why page.annots() lists an item that is not on the same page. Am I missing something?

JorjMcKie Nov 20, 2022
Maintainer

Hm???!!!
The changed script runs flawlessly on my computer.

import fitz
def main():
    doc = fitz.open("ACMSurvey.pdf")
    # Total page in the pdf
    print(len(doc))
    highlights = []
    for page in doc:
        for annot in page.annots(
            types=(fitz.PDF_ANNOT_HIGHLIGHT, fitz.PDF_ANNOT_UNDERLINE)
        ):
            highlight_text = page.get_textbox(annot.rect)
            highlights.append(highlight_text)
    for line in highlights:
        print(line)
    return


if __name__ == "__main__":
    main()

JorjMcKie Nov 20, 2022
Maintainer

The PyMuPDF code walks through a page's /Annots array, which consists of xref numbers. This array looks like this on page 3:

/Annots [ 14 0 R 15 0 R 16 0 R 17 0 R 18 0 R 19 0 R 20 0 R 21 0 R
      22 0 R 23 0 R 24 0 R 25 0 R 732 0 R ]

xref 732 is an invalid object in the file, referenced nowhere else except in this array.
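
To look for such a stray entry yourself, the array can be read from the page object's source. What follows is only a minimal sketch and not PyMuPDF's internal logic: annots_xrefs and report_annots are hypothetical helper names, and the sketch assumes /Annots is stored as a direct array (it may also be an indirect reference, which is simply skipped here).

```python
import re


def annots_xrefs(annots_value: str) -> list:
    """Return the xref numbers found in the source text of an /Annots array."""
    # Every indirect reference is written as "<xref> <generation> R".
    return [int(m.group(1)) for m in re.finditer(r"(\d+)\s+\d+\s+R\b", annots_value)]


def report_annots(path: str) -> None:
    """Print the /Subtype of every entry in each page's /Annots array."""
    import fitz  # PyMuPDF, imported here so the parser above has no dependency

    doc = fitz.open(path)
    for page in doc:
        kind, value = doc.xref_get_key(page.xref, "Annots")
        if kind != "array":  # no annotations, or an indirect array (not handled)
            continue
        for xref in annots_xrefs(value):
            try:
                subtype = doc.xref_get_key(xref, "Subtype")
            except Exception as exc:  # e.g. an xref outside the valid range
                subtype = ("error", str(exc))
            print(page.number, xref, subtype)


# The array quoted above for page 3 of ACMSurvey.pdf:
page3 = """[ 14 0 R 15 0 R 16 0 R 17 0 R 18 0 R 19 0 R 20 0 R 21 0 R
      22 0 R 23 0 R 24 0 R 25 0 R 732 0 R ]"""
xrefs = annots_xrefs(page3)
print(xrefs)
```

Running report_annots("ACMSurvey.pdf") should then show which entries do not resolve to an object with an annotation /Subtype; what exactly is reported for a broken entry depends on how it is broken.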

Abh4git Nov 20, 2022
Author

So looks like there is some corruption in the file. Thank you for your time and effort!

JorjMcKie Nov 20, 2022
Maintainer

your script works now?

JorjMcKie · 2022-11-20T07:20:51Z

JorjMcKie
Nov 20, 2022
Maintainer

Just as a hint: to include code on GitHub, enclose it between two lines that each consist of three backticks ("`"). This will preserve leading spaces, etc.

1 reply

Abh4git Nov 20, 2022
Author

Thank you!

pachhaipurna · 2023-10-18T04:58:43Z

pachhaipurna
Oct 18, 2023

@JorjMcKie I received "xref 25051 is not an annot of this page" while deleting an annotation using delete_annot(). Could you please suggest a solution?

8 replies

pachhaipurna Oct 19, 2023

@JorjMcKie This will delete all the annotations in the PDF, but I am only trying to delete specific annotations based on a condition.
My code is below. When I used your code, it deleted all annotations. Could you please help me resolve this?

annot = page.first_annot
while annot:
    annot = page.delete_annot(annot)

My code:

for annot_attr in page_obj.annots(types=(None)):
    annot_attr.info['subject']

    if annot_attr.rect:
        colors_val = annot_attr.colors
        if "stroke" in colors_val:

            if colors_val['stroke'] is None:

                if annot_attr.type == "FreeText":

                    if 'subject' in annot_attr.info:
                        subject_key = annot_attr.info['subject']
                        if subject_key.encode("utf-8") in japanese_unicode_lst:
                            pass
                        else:
                            if annot_attr.rect:
                                page_obj.delete_annot(annot_attr)

                    else:
                        if annot_attr.rect:
                            page_obj.delete_annot(annot_attr)
                            annot_attr.update()
                else:
                    if 'subject' in annot_attr.info:
                        subject_key = annot_attr.info['subject']
                        if subject_key.encode("utf-8") in japanese_unicode_lst:
                            pass

                        else:
                            if annot_attr.rect:
                                page_obj.delete_annot(annot_attr)
                    else:
                        if annot_attr.rect:
                            page_obj.delete_annot(annot_attr)

JorjMcKie Oct 19, 2023
Maintainer

If you want to reuse that while loop, you could modify it in the following way:

annot = page.first_annot
while annot:
    if <condition for deleting annot is true>:
        # delete_annot returns the annotation following the deleted one
        annot = page.delete_annot(annot)
    else:  # do not touch this annot
        annot = annot.next
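
As a rough sketch of how that loop could be wrapped for reuse: the helper name delete_annots_where and the example condition is_unstroked_freetext are made up for illustration and are not part of PyMuPDF. Only first_annot, delete_annot, next, type and colors are assumed from the library.

```python
def delete_annots_where(page, should_delete):
    """Delete every annotation on `page` for which should_delete(annot) is true.

    Returns the number of deleted annotations.
    """
    deleted = 0
    annot = page.first_annot
    while annot:
        if should_delete(annot):
            # delete_annot returns the annotation following the deleted one,
            # so the loop never touches an annotation that is already gone.
            annot = page.delete_annot(annot)
            deleted += 1
        else:
            annot = annot.next
    return deleted


def is_unstroked_freetext(annot):
    """Hypothetical condition: a FreeText annotation without a stroke color."""
    # annot.type is a tuple such as (2, 'FreeText'); compare the name part.
    return annot.type[1] == "FreeText" and not annot.colors.get("stroke")
```

With a real document this would be called as delete_annots_where(page, is_unstroked_freetext) for each page. Because the next annotation is always taken from delete_annot or annot.next, nothing is deleted while a page.annots() iterator is still running over the same page.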

pachhaipurna Oct 19, 2023

@JorjMcKie Will it handle the issue that I mentioned earlier, where the annotation is not an annotation of this page?

JorjMcKie Oct 19, 2023
Maintainer

If the while loop worked before, it should still work now.

pachhaipurna Oct 19, 2023

@JorjMcKie Thank you, the code is working now.
Once I have tested the output and all the logical conditions, I will message you if there is any issue.
Thank you for the help and support.
