Skip to content

Florents-Tselai/pgpdf

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pgPDF: pdf type for Postgres

build GitHub Repo stars Docker Pulls

This extension for PostgreSQL provides a pdf data type and assorted functions.

You can create a pdf type, by casting either a text filepath or bytea column.

SELECT '/tmp/pgintro.pdf'::pdf;
                                       pdf                                        
----------------------------------------------------------------------------------
 PostgreSQL Introduction                                                         +
 Digoal.Zhou                                                                     +
 7/20/2011Catalog                                                                +
  PostgreSQL Origin 

If you don’t have the PDF file in your filesystem, but have already stored its content in a bytea column, you can just cast it to pdf.

SELECT pg_read_binary_file('/tmp/pgintro.pdf')::bytea::pdf;

Why?: This allows you to work with PDFs in an ACID-compliant way. The usual alternative relies on external scripts or services which can easily make your data ingestion pipeline brittle and leave your raw data out-of-sync.

The actual PDF parsing is done by poppler.

Also check blog:

Usage

Download some PDFs.

wget https://wiki.postgresql.org/images/e/ea/PostgreSQL_Introduction.pdf -O /tmp/pgintro.pdf
wget https://pdfobject.com/pdf/sample.pdf -O /tmp/sample.pdf

Create a table with a pdf column:

CREATE TABLE pdfs(name text primary key, doc pdf);

INSERT INTO pdfs VALUES ('pgintro', '/tmp/pgintro.pdf');
INSERT INTO pdfs VALUES ('pgintro', '/tmp/sample.pdf');

Parsing and validation should happen automatically. The files will be read from the disk only once!

Note

The filepath should be accessible by the postgres process / user! That's different than the user running psql. If you don't understand what this means, as your DBA!

Reading from URLs

You can combine pgpdf with pgsql-http to quickly grab remote PDFs into Postgres, by fetching the remote content as bytea and then treat it as a PDF.

CREATE EXTENSION pgpdf;
CREATE EXTENSION http;

SELECT pdf_read_bytes(text_to_bytea(content))
FROM http_get('https://wiki.postgresql.org/images/e/e3/Hooks_in_postgresql.pdf');

String Functions and Operators

Standard Postgres String Functions and Operators should work as usual:

SELECT 'Below is the PDF we received ' || '/tmp/pgintro.pdf'::pdf;
SELECT upper('/tmp/pgintro.pdf'::pdf::text);
SELECT name
FROM pdfs
WHERE doc::text LIKE '%Postgres%';

Full-Text Search (FTS)

You can also perform full-text search (FTS), since you can work on a pdf file like normal text.

SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('postgres');
 ?column? 
----------
 t
(1 row)
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('oracle');
 ?column? 
----------
 f
(1 row)

Document similarity with pg_trgm

You can use pg_trgm to get the similarity between two documents:

CREATE EXTENSION pg_trgm;

SELECT similarity('/tmp/pgintro.pdf'::pdf::text, '/tmp/sample.pdf'::pdf::text);

Metadata

The following functions are available:

  • pdf_title(pdf) → text

  • pdf_author(pdf) → text

  • pdf_num_pages(pdf) → integer

    Total number of pages in the document

  • pdf_page(pdf, integer) → text

    Get the i-th page as text

  • pdf_creator(pdf) → text

  • pdf_keywords(pdf) → text

  • pdf_metadata(pdf) → text

  • pdf_version(pdf) → text

  • pdf_subject(pdf) → text

  • pdf_creation(pdf) → timestamp

  • pdf_modification(pdf) → timestamp

SELECT pdf_title('/tmp/pgintro.pdf');
        pdf_title        
-------------------------
 PostgreSQL Introduction
(1 row)
SELECT pdf_author('/tmp/pgintro.pdf');
 pdf_author 
------------
 周正中
(1 row)

Getting a subset of pages

SELECT pdf_num_pages('/tmp/pgintro.pdf');
 pdf_num_pages 
---------------
            24
(1 row)
SELECT pdf_page('/tmp/pgintro.pdf', 1);
           pdf_page           
------------------------------
 Catalog                     +
  PostgreSQL Origin         +
  Layout                    +
  Features                  +
  Enterprise Class Attribute+Case
(1 row)
SELECT pdf_subject('/tmp/pgintro.pdf');
 pdf_subject 
-------------
 
(1 row)
SELECT pdf_creation('/tmp/pgintro.pdf');
       pdf_creation       
--------------------------
 Wed Jul 20 11:13:37 2011
(1 row)
SELECT pdf_modification('/tmp/pgintro.pdf');
     pdf_modification     
--------------------------
 Wed Jul 20 11:13:37 2011
(1 row)
SELECT pdf_creator('/tmp/pgintro.pdf');
            pdf_creator             
------------------------------------
 Microsoft® Office PowerPoint® 2007
(1 row)
SELECT pdf_metadata('/tmp/pgintro.pdf');
 pdf_metadata 
--------------
 
(1 row)
SELECT pdf_version('/tmp/pgintro.pdf');
 pdf_version 
-------------
 PDF-1.5
(1 row)

Installation

Install poppler dependencies

Linux

sudo apt install -y libpoppler-glib-dev pkg-config

Homebrew/MacOS

brew install poppler pkgconf
cd /tmp
git clone https://github.com/Florents-Tselai/pgpdf.git
cd pgpdf
make
make install # may need sudo

After the installation, in a session:

CREATE EXTENSION pgpdf;

Docker

Get the Docker image with:

docker pull florents/pgpdf:pg17

This adds pgpdf to the Postgres image (replace 17 with your Postgres server version, and run it the same way).

Run the image in a container.

docker run --name pgpdf -p 5432:5432 -e POSTGRES_PASSWORD=pass florents/pgpdf:pg17

Through another terminal, connect to the running server (container).

PGPASSWORD=pass psql -h localhost -p 5432 -U postgres

Warning

Reading arbitrary binary data (PDF) into your database can pose security risks. Only use this for files you trust.

About

pdf type for Postgres

Topics

Resources

License

Stars

Watchers

Forks

Sponsor this project

 

Packages

No packages published