page.extract_text() with visitor_text #1795

TunaFFish · 2023-04-16T12:35:08Z


I want to extract highlight annotations.

I am following the example at:
https://pypdf.readthedocs.io/en/latest/user/reading-pdf-annotations.html#highlights
How do I use the coordinates from its last line to extract the highlighted text?
x1, y1, x2, y2, x3, y3, x4, y4 = coords

I understand I should use visitor_text:
https://pypdf.readthedocs.io/en/latest/user/extract-text.html?highlight=extract_text#using-a-visitor
https://pypdf.readthedocs.io/en/latest/modules/PageObject.html?highlight=visitor_text#pypdf._page.PageObject.extract_text

But the use of this function is very confusing to me, and I can't seem to wrap my head around the two examples provided (Ignore header and footer, Extract rectangles and texts into a SVG-file).

Would anybody be so kind as to show me the link between the following code examples?

from pypdf import PdfReader

reader = PdfReader("commented.pdf")
for page in reader.pages:
    if "/Annots" in page:
        for annot in page["/Annots"]:
            subtype = annot.get_object()["/Subtype"]
            if subtype == "/Highlight":
                coords = annot.get_object()["/QuadPoints"]
                x1, y1, x2, y2, x3, y3, x4, y4 = coords
            # page.extract_text(visitor_text=visitor_body) ???

parts = []
def visitor_body(text, cm, tm, font_dict, font_size):
    # ???
    parts.append(text)
text_body = "".join(parts)
print(text_body)

TunaFFish · 2023-04-18T19:32:21Z

Author

Hello Martin Thoma, I see you edited my post almost immediately. Can you please help me with an example of how to extract the text at the coordinates with the visitor?
I see a lot of your activity on this functionality on Stack Overflow as well (you are the maintainer of both pypdf and pymupdf?), but you never seem to provide a working example...
I just wanted to know if it's possible with pypdf or should I look elsewhere.
Thanks.

3 replies

MartinThoma Apr 18, 2023
Maintainer

> I see a lot of your activity on this functionality also on Stackoverflow (you are the maintainer of both pypdf and pymupdf?)

I'm the maintainer of pypdf and PyPDF2.

> you never seem to provide a working example

https://pypdf.readthedocs.io/en/stable/user/extract-text.html#example-1-ignore-header-and-footer

lucasgadams Sep 28, 2024

The provided example does not work

stefan6419846 Sep 28, 2024
Maintainer

@lucasgadams Please avoid raising the same topic in multiple places. This just introduces overhead for us.

MartinThoma · 2023-04-18T19:51:34Z

Maintainer

The first snippet is a part of the annotations documentation, the second one is about visitor functions. They are completely different. I don't know why you think there is a link between the two.

0 replies

TunaFFish · 2023-04-18T19:57:19Z

Author

Can you please provide an example of how to EXTRACT the text from the annotation's coordinates?
The example is not complete:
coords = annot.get_object()["/QuadPoints"]
x1, y1, x2, y2, x3, y3, x4, y4 = coords

2 replies

MartinThoma Apr 18, 2023
Maintainer

That's simply not how it works:

- You set the annotation by its coordinates (a few annotation types need /QuadPoints, most don't).
- You can get the text, given the annotation. You can also get the coordinates (/Rect) of the annotation.

pypdf does not have an interface for extracting annotations at a specific coordinate. Just iterate over all annotations and check the intersection with your desired points - most documents have so few that this is very fast.
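
The "iterate and check the intersection" approach described above could look roughly like the following. This is a minimal sketch, not a pypdf interface: `rect_contains`, `annotations_at`, and `print_annotations_at` are hypothetical helper names, and the sketch assumes every annotation exposes `/Rect` as four numbers and, optionally, its own text under `/Contents`.

```python
def rect_contains(rect, x, y):
    """Return True if the point (x, y) lies inside rect = [x0, y0, x1, y1]."""
    x0, y0, x1, y1 = (float(v) for v in rect)
    # The corner order of a PDF rectangle is not guaranteed, so normalize it.
    left, right = sorted((x0, x1))
    bottom, top = sorted((y0, y1))
    return left <= x <= right and bottom <= y <= top


def annotations_at(annotations, x, y):
    """Yield (subtype, contents) for each annotation whose /Rect covers (x, y)."""
    for annot in annotations:
        # pypdf hands out indirect references; plain dicts have no get_object().
        obj = annot.get_object() if hasattr(annot, "get_object") else annot
        if "/Rect" in obj and rect_contains(obj["/Rect"], x, y):
            subtype = obj["/Subtype"] if "/Subtype" in obj else None
            contents = obj["/Contents"] if "/Contents" in obj else None
            yield subtype, contents


def print_annotations_at(pdf_path, x, y):
    """Usage sketch; pypdf is imported lazily so the helpers above stay standalone."""
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    for page_number, page in enumerate(reader.pages, start=1):
        if "/Annots" in page:
            for subtype, contents in annotations_at(page["/Annots"], x, y):
                print(page_number, subtype, contents)
```

Note that `/Contents` is the text stored in the annotation itself (for example a comment), which may be absent for a plain highlight.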

MartinThoma Apr 18, 2023
Maintainer

If you want to extract non-annotation text within a specific rectangle, it's way more complicated. pdfminer.six is way better suited for that.
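
For readers who still want to try this with pypdf, here is a rough sketch of a region filter built on `visitor_text`. It is an approximation under stated assumptions, not a recommended solution: `make_region_visitor` and `extract_text_in_region` are hypothetical names, the position is assumed to be the text matrix origin mapped through the current transformation matrix, and a text chunk is kept or dropped as a whole based only on where it starts.

```python
def make_region_visitor(bbox, parts):
    """Build a visitor_text callback that keeps text chunks starting inside bbox.

    bbox is (x_min, y_min, x_max, y_max) in PDF user-space units.
    Matching chunks are appended to the list `parts`.
    """
    x_min, y_min, x_max, y_max = bbox

    def visitor(text, cm, tm, font_dict, font_size):
        # cm and tm are six-number matrices [a, b, c, d, e, f].
        # Map the text matrix origin (tm[4], tm[5]) through cm.
        x = tm[4] * cm[0] + tm[5] * cm[2] + cm[4]
        y = tm[4] * cm[1] + tm[5] * cm[3] + cm[5]
        if x_min <= x <= x_max and y_min <= y <= y_max:
            parts.append(text)

    return visitor


def extract_text_in_region(page, bbox):
    """Return the concatenated text chunks that start inside bbox on this page."""
    parts = []
    page.extract_text(visitor_text=make_region_visitor(bbox, parts))
    return "".join(parts)
```

Because the callback only sees the start of each chunk, text that begins outside the rectangle and runs into it is missed, and the reverse is over-included. That limitation is one reason a layout-aware tool such as pdfminer.six may be the better fit here.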

TunaFFish · 2023-04-20T04:31:45Z

Author

OK, so the simple answer is: No, pypdf can NOT handle extracting text from highlight annotations.

This also answers #701.

Some libraries that CAN handle this:
Poppler-qt5 (if you manage to install it)
PyMuPDF
pdfminer.six
pdfannots (based on pdfminer.six)

1 reply

MartinThoma Apr 20, 2023
Maintainer

> OK, so the simple answer is: No, pypdf can NOT handle extracting text from highlight annotations.

That is wrong. You seem to be missing that an annotation can have its own text, and that an annotation can also sit on top of page text. Correct would be: it's hard to get the text within a region with pypdf.
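
To make the "region" of a highlight concrete: in the PDF format, `/QuadPoints` holds eight numbers per quadrilateral, so a highlight spanning several lines carries 16, 24, or more values, and unpacking it into exactly `x1 ... y4` only works for a single quadrilateral. A minimal sketch that turns the array into one bounding box per quadrilateral follows; `quadpoints_to_boxes` is a hypothetical helper name, not a pypdf function.

```python
def quadpoints_to_boxes(quad_points):
    """Convert a /QuadPoints array into (x_min, y_min, x_max, y_max) boxes.

    Each group of eight numbers describes one quadrilateral as four (x, y)
    corners. Taking min/max makes the result independent of corner order.
    """
    values = [float(v) for v in quad_points]
    if len(values) % 8 != 0:
        raise ValueError("expected a multiple of 8 numbers in /QuadPoints")
    boxes = []
    for i in range(0, len(values), 8):
        xs = values[i:i + 8:2]      # x coordinates of the four corners
        ys = values[i + 1:i + 8:2]  # y coordinates of the four corners
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

The resulting boxes are in the page's coordinate space and only say where the highlight lies. Getting the page text underneath them is the separate, harder problem discussed in this thread.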
