There is a common view that extracting text from a PDF document should not be too difficult. After all, the text is right there in front of our eyes and humans consume PDF content all the time with great success. Why would it be difficult to automatically extract the text data?
It turns out that, much as working with human names is difficult due to numerous edge cases and incorrect assumptions, working with PDFs is difficult due to the extreme flexibility afforded by the PDF format.
The main problem is that PDF was never really designed as a data input format, but rather as an output format giving fine-grained control over the appearance of the resulting document.
At its core, the PDF format consists of a stream of instructions describing how to draw on a page. In particular, text data isn’t stored as paragraphs - or even words - but as characters which are painted at certain locations on the page. As a result, most of the content semantics are lost when a text or Word document is converted to PDF - all the implied text structure is converted into an almost amorphous soup of characters floating on pages.
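To make this concrete, here is a toy sketch of what those drawing instructions look like. The content stream below is hypothetical and heavily simplified, but `Tf`, `Td` and `Tj` are real operators: `Td` moves the text cursor and `Tj` paints a string at the current position. A naive extractor that just collects `Tj` strings illustrates why word boundaries are not given to you for free.

```python
import re

# A minimal, hypothetical PDF content stream: text arrives as positioned
# painting instructions, not as words or paragraphs.
content_stream = rb"""
BT
/F1 12 Tf
72 712 Td (Hel) Tj
14 0 Td (lo) Tj
ET
"""

# Naively collect every string painted by a Tj operator.
fragments = re.findall(rb"\((.*?)\)\s*Tj", content_stream)
print(fragments)            # [b'Hel', b'lo']
print(b"".join(fragments))  # b'Hello'
```

Joining the fragments happens to yield "Hello" here only because they were emitted in reading order and adjacent on the page; in general, an extractor must interpret the (x, y) positions to reassemble words, lines and paragraphs.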
As part of building FilingDB, we’ve extracted text data from tens of thousands of PDF documents. In the process, we have seen every single assumption we had about the structure of PDF files proven incorrect. Our mission was particularly difficult because we had to process PDF documents coming from a variety of sources, with wildly different styling, typesetting and presentation choices.
The list below documents some of the ways PDF files have made it difficult (or even impossible) to extract text contents.
PDF read protection
You may have come across PDF files that refuse to let you copy their text content. For example, here is what SumatraPDF shows when you attempt to copy text from a copy-protected document.
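For background, this kind of restriction is signalled by the PDF's encryption dictionary: with the standard security handler, a signed 32-bit integer `/P` encodes permission flags, and bit 5 (in the spec's 1-based numbering) governs copying and extracting text. The sketch below, with hypothetical `/P` values, shows how such a flag could be checked; note that in practice it is up to the viewer to honour these flags.

```python
# Permission bits from the PDF standard security handler's /P entry
# (1-based bit numbering as in the spec).
EXTRACT_BIT = 1 << 4  # bit 5: copy/extract text and graphics
PRINT_BIT = 1 << 2    # bit 3: printing

def allows_extraction(p: int) -> bool:
    """Return True if the /P permission flags allow text extraction."""
    # /P is stored as a signed 32-bit value; Python's & preserves the
    # relevant bit pattern regardless of sign.
    return bool(p & EXTRACT_BIT)

# Hypothetical /P values for illustration:
print(allows_extraction(-4))   # all permission bits set -> True
print(allows_extraction(-20))  # extraction bit cleared  -> False
```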