get table names #2735
-
How can I extract table names in addition to table contents? Currently only the header, columns, and data inside a table can be extracted.
Replies: 3 comments
-
I think you are referring to captions. Sorry, this is highly dependent on how each document is constructed. The caption may be missing altogether, or be above or below a table. But you are the one who knows the document(s) that you are processing. So you can add code that extracts text in the table's neighbourhood and analyzes whether it qualifies as a caption.
-
Thanks. Any suggestions on how to identify the text in the table's neighbourhood (right before or after the table)?
-
Not really failsafe ones. Maybe you try rectangle extractions above and below the table bbox.
For example, if `tbbox` is the table bbox, create a rectangle above it with some height `h`, like this: `ubbox = fitz.Rect(tbbox.x0, tbbox.y0 - h, tbbox.x1, tbbox.y0)`. Then extract the text in there with `caption = page.get_textbox(ubbox)` and see how far this takes you.
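To make the suggestion a bit more concrete, here is a minimal sketch that checks both sides of the table. The helper names (`neighbour_rects`, `caption_candidates`, `looks_like_caption`), the default height of 30 points, and the "Table 1" style pattern are illustrative assumptions, not part of PyMuPDF. The only library call relied on is `page.get_textbox()`, which accepts a rect-like sequence of four numbers.

```python
import re

# Hypothetical pattern: captions starting with "Table 3", "Tab. 3", etc.
# Adjust it to whatever your documents actually use.
CAPTION_RE = re.compile(r"^\s*(table|tab\.)\s*\d+", re.IGNORECASE)


def neighbour_rects(tbbox, h=30.0):
    """Return (above, below) rectangles of height h touching the table bbox.

    tbbox is rect-like: (x0, y0, x1, y1), with y growing downwards.
    """
    x0, y0, x1, y1 = tbbox
    above = (x0, y0 - h, x1, y0)  # strip directly above the table
    below = (x0, y1, x1, y1 + h)  # strip directly below the table
    return above, below


def caption_candidates(page, tbbox, h=30.0):
    """Extract the text found just above and just below the table."""
    above, below = neighbour_rects(tbbox, h)
    return {
        "above": page.get_textbox(above).strip(),
        "below": page.get_textbox(below).strip(),
    }


def looks_like_caption(text):
    """Crude check (an assumption): does the text start like 'Table 1'?"""
    return bool(CAPTION_RE.match(text))
```

With a real page, `tbbox` could come from `page.find_tables().tables[i].bbox`. The value of `h` will likely need tuning per document, and a candidate that fails `looks_like_caption()` may still be a caption in documents that label their tables differently.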