Jay Taylor's notes

What's so hard about PDF text extraction? | Hacker News

Original source (news.ycombinator.com)
Tags: parsers pdf content-extraction
Clipped on: 2020-09-14

Here where I work we are parsing PDFs with https://github.com/flexpaper/pdf2json. It works very well, and returns an array of {x, y, font, text}.
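
A minimal sketch of consuming that kind of output (the record shape {x, y, font, text} is taken from the comment above; the sorting heuristic and the y-tolerance are my own assumptions, not part of pdf2json): sort the text runs into top-to-bottom, left-to-right reading order and join runs that share a baseline.

```python
import json

def runs_to_lines(runs, y_tol=1.0):
    """Group pdf2json-style text runs into lines by y, then order by x."""
    ordered = sorted(runs, key=lambda r: (r["y"], r["x"]))
    lines, last_y = [], None
    for r in ordered:
        if last_y is None or abs(r["y"] - last_y) > y_tol:
            lines.append([])   # new baseline -> new output line
            last_y = r["y"]
        lines[-1].append(r["text"])
    return [" ".join(parts) for parts in lines]

runs = json.loads(
    '[{"x": 60, "y": 20, "font": 1, "text": "world"},'
    ' {"x": 10, "y": 20, "font": 1, "text": "Hello"},'
    ' {"x": 10, "y": 40, "font": 1, "text": "Bye"}]'
)
```

Note that this naive sort interleaves multi-column layouts; real documents need column detection before sorting.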

If you are familiar with Docker, here is how you can add it to your Dockerfile.

ARG PDF2JSON_VERSION=0.71
RUN mkdir -p $HOME/pdf2json-$PDF2JSON_VERSION \
    && cd $HOME/pdf2json-$PDF2JSON_VERSION \
    && wget -q https://github.com/flexpaper/pdf2json/releases/download/$PDF... \
    && tar xzf pdf2json-$PDF2JSON_VERSION.tar.gz \
    && ./configure > /dev/null 2>&1 \
    && make > /dev/null 2>&1 \
    && make install > /dev/null \
    && rm -Rf $HOME/pdf2json-$PDF2JSON_VERSION \
    && cd

Your command is cut off

PDFs are the bane of my existence as someone who relies on machine translation every day. The worst is that so many event flyers and things, even important local government information, will just be dumped online as a PDF without any other effort to make the contents available. I don't know how blind people are supposed to participate in civic life here.

The problem is accessibility features are totally invisible to normal users. Someone with good intentions creates a pdf, and it works for them. They don't use screen reader tools or know how they work so they don't even realize there is a problem.

And the problem with that is that the screen reader tools are $1200, because of the huge associated R&D costs and incredibly small target market.

The sad thing is that the very complexity required to implement a screen reader is solely because of the technical nightmare information accessibility currently is.

It's reasonable to think that "if only" everything could be made ubiquitous and everyone (= developers) could become collectively aware of these accessibility considerations, maybe there would be a shift towards greater semanticity.

Thing is, though, that NVDA is open source, and iOS has put a built-in free screen-reader in the hands of every iPhone user... and not much has changed.

So it's basically one of those technologies forever stuck in the "initial surge of moonshot R&D to break new ground and realign status quo" phase. :(

How much good would moonshot-level R&D even be capable of doing? Without realigning the world around a new portable format for printable documents, isn't this 99% Adobe's problem to solve? Or are the hooks there to make much more accessible PDFs, and the issue is that various popular generators of those documents (especially WYSIWYG ones like MS Word) either don't populate them, or perhaps don't even have the needed semantic data in the first place?

For my part, I would love to see PDFs which can seamlessly be viewed in continuous, unpaged mode (for example, for better consumption on a small-screen device like a phone or e-reader). Even just the minimal effort required to tag page-specific header/footer type data could make a big difference here, and I expect that type of semantic information would be useful for a screen reader also.

I thought flowed text in PDFs was possible, but rarely used because it removes half of the benefit of PDFs (that is, page-exact references and rendering).

There already is a solution, based on html and very easy to deal with: epub. Publishers are gradually shifting to it, even for things like science textbooks, as they realize that PDFs might look pretty but they're a usability nightmare. There are still some stupid things publishers do with epubs, but they'll get it right eventually; meanwhile the main text and font settings can be overridden by ereaders.

It's been a while since I've looked at the spec but I don't remember anything like that.

> Or are the hooks there to make much more accessible PDFs, and the issue is that various popular generators of those documents (especially WYSIWYG ones like MS Word) either don't populate them, or perhaps don't even have the needed semantic data in the first place?

Tagged PDF is a thing, and MS Word supports it. The problem is the very long tail of programs that generate PDFs and don't support tagged PDF. Even some widely used PDF generators, like pdflatex, don't generate tagged PDF, at least not by default.

Could governments insist that PDF software they buy be screen-reader friendly? If this were rigorously done, you'd have all government documents be readable by default, and then anyone else who ran the same software commercially would, too.

You could also impose requirements on public companies to provide corporate documents in accessible formats- these sorts of documents are already regulated.

There's various levers that could be pulled, maybe those aren't the right ones. But you could do it.

To do that you would have to identify a "pdf - the accessible parts" of the spec, or perhaps "pdf - possible accessible layouts" and the implementer would have to stick to that. This might come into conflict with government regulations regarding how particular pdfs should be laid out - that is to say if there was a layout that would break natural screen reading flow it would be inaccessible and thus not allowed by Law 2, but be required by Law 1.

This becomes difficult when writing Law 2 because probably you don't know all variations that can be required by Law 1 (where Law 1 is actually a long list of laws and regulations)

Depending on the legal system of a particular country writing Law 2 might not be actually feasible unless you know what Law 1 entails.

Why not just mandate that the content must be accessible to the blind/deaf/etc and let the implementors figure out how best to make that true? For example some municipalities might just choose to provide alternate formats in addition to PDF and that might be fine for them.

as per my original post, you might not be able to mandate something like that given your legal system.

If I have a law saying that a layout needs to look like X layout and X layout is not achievable if it also needs to accessible then depending on the type of legal system you are in you can say Law Y supersedes all previous laws and requires you to make all PDFs accessible. If a legally mandated layout cannot be made accessible then the closest similar layout that still meets accessibility requirements should be used instead (long descriptions of how closest is determined in legal format would of course follow)

I think this will work in the common law system, but I don't think it would work in a Napoleonic system of law (could be wrong, just think you would need to specify exactly what laws it superseded)

As an example when I was part of the efaktura project for the Danish Government https://en.wikipedia.org/wiki/OIOXML when the law was published mandating that all invoices to the government be sent as oioxml efaktura we ran into the problem that there was a previous law requiring that all telecommunications invoices (or maybe utility invoices, I forget) be sent with information that there was no provision for in the UBL standard that efaktura was based on.

Luckily I had introduced an extension method that we could use to put the extra data in with, but otherwise we would have had two competing laws.

As far as mandating layouts in government PDFS, you normally see that kind of thing in military documents, but laws and bureaucracies are such that there might be any number of rules and regulations that cannot be overwritten in a system by drafting an overarching accessibility bill.

Theoretically this would be covered under federal law, but it needs a lawsuit.

The problem is the PDF spec itself is not screen reader friendly.

The spec isn't, but you can make accessible PDFs. https://www.adobe.com/accessibility/pdf/pdf-accessibility-ov...

Accessible PDFs are also more easily machine readable.

It most certainly is!

ISO 32000-1 (PDF 1.7) and ISO 32000-2 (PDF 2.0) are both compliant with PDF/UA (ISO 14289) - the standard for PDF accessibility

PDF is mostly just a wrapper around Postscript, isn't it?

You could just put the original text in comments or something, wrapped in more tags to say what it is.

PDF is a document format, Postscript is a Turing complete programming language (and rather a fun one IMHO).

No, the ideas are the same, but it follows another implementation concept.

I'm not sure there isn't a middle solution. What about a screenreader emulator? I go to a website, it offers me the ability to upload a PDF or link to a website, and then it shows it to me like a screenreader would.

There may already be good tools that do this, but until it's super easy to see and "everyone" knows about it, then people won't think, "better just give it a quick look in the screenreader". Obviously a next good step to more adoption, is clear feedback about how to easily fix whatever issues the person is seeing...

This is one of the biggest reasons why I think that HTML is still important for general application development.

HTML isn't perfect, it could be a lot better. But it goes a long way towards forcing developers to put their interface in pure text, and to do the visual layout afterwards.

I think separating content from styling has gone a long way towards improving accessibility, precisely because it makes the accessibility features less invisible. I suspect there are additional gains we could make there, and other ways that HTML could force visual interfaces to be built on top of accessible ones.

> The worst is that so many event flyers and things, even important local government information will just be dumped online as a pdf

Well, they could have uploaded it to some online service that shows the document in the browser as an undownloadable but browsable thingy (I don't even know what to call that, and I am definitely not going give any publicity to the online service by spelling out the name)

One certain online service (the one that lets you look at advertisements and fake news while you talk to your Aunt) is very popular for sharing events. On mobile they go out of their way to disable the ability to copy text. Even with the "copy" app set as my assist app, it seems to block or muck up the scraping of text from the screen. I have to go to mbasic.thissite.com (which doesn't include all the same content) to get things in plain text. It's a real barrier to my participation in society when there are characters in the text I can't read, and I just want to copy them to a translate app.

It also made me late for a Teams based interview, as I foolishly tried to pass the join link between phones using this webpage, and couldn't copy it on the other end.

I pretty much only access said website in Firefox with the extension that puts it in a jail. The mobile app is useless for anything but posting photos of the kids for the grandparents and aunts to see.

Same app implements seriously dark patterns to get users to install their messenger app. Said chat works in the desktop web browser but not a mobile browser. I shall never install it on my phone. Mobile presents as if you have pending messages even when you don't (inbox empty on the web version).

You can use mbasic.thissite.com and it lets you use chat features from mobile browsers.

On android or at least pixels, go in to the app switcher mode and with the app still on screen, copy the text. This uses OCR to copy and you can even copy text from screenshots.

When I read the first half of your comment, I thought you were cleverly describing and would pivot at the end to suggest HTML and web servers.

There's probably a different format, facebook!

everything is on facebook instead of the real internet, personal pet peeve...

Is there a good library for decoding the object tree in a PDF document?

Depends on your programming language. xpdf does a pretty good job from what I’ve heard.

I hate it too. At work we have a solution that uses PDF text extraction software, and the result is sometimes not great. This in turn breaks the feature I own, making it hard to keep it working reliably. Of course users aren't aware of that, so I'm the one taking the blame :/

Always use PDF/A-1a (Tagged PDF), which contains the text in an accessible format. For many governments this is a legal requirement.

With tagged PDF it's easy to get the text out.

LaTeX does not support PDF/A btw.

Not by default, but the pdfx package enables this, Peter Selinger has a nice guide: https://www.mathstat.dal.ca/~selinger/pdfa/

The linked instructions cover PDF/A-1b (have plain text version of contents), but Tagged PDF is more than that -- it's about encoding the structure of the document.

There is a POC package for producing Tagged PDF here https://github.com/AndyClifton/accessibility but it's not a complete solution yet.

Here is an excellent review article that talks about all the other options for producing Tagged PDFs from LaTeX (spoiler — there is no solution currently): https://umij.wordpress.com/2016/08/11/the-sad-state-of-pdf-a... via https://news.ycombinator.com/item?id=24444427

It's not automatic, but also not that difficult to support


> For many governments this a legal requirement.

As in Healthcare.

I'm confused as to what this comment means. Would you please clarify it?

Is he implying that providing healthcare is a legal requirement for most governments? If so, that seems off-topic.

They are saying that, like many governments, healthcare requires the use of the PDF/A-1a standard.

You mean, like many (non-US) governments, (the US) healthcare requires the use of the PDF/A-1a? It's just a somewhat bizarre contraposition: "Like many animals, Helianthus annuus needs water to survive". Wait, but it's a plant, right? "Yes, and it needs water, just like animals".

I just assumed that they meant that there are many healthcare orgs that require this standard.

As (it also is) in healthcare.

I'm surprised no one ever uses XDP.

A flattened XFA/XDP PDF (no interactive elements) is more or less a PDF/A.

Do popular tools support creating XDP PDF?

A lot of people here seem to knock PDF, but I love it. Anyone who has tried to use OpenOffice full time probably does too. We have 'descriptive text' in various MS formats, or even html/css. The problem is every implementer does things in slightly different ways. So my beautiful OpenOffice resume renders with odd spacing and pages with 1 line of text in MS Office. With PDF, everyone sees the same thing.

> A lot of people here seem to knock PDF, but I love it.

People with disabilities who rely on screen readers don't love it. There is no such problem with HTML/CSS, which should be the norm for internet documents.

> With PDF, everyone sees the same thing.

Yes, provided you can see in the first place...

> There is no such a problem with HTML/CSS which should be the norm for internet documents.

It should be. But meanwhile everybody seems to think it is perfectly ok that there is a bunch of JavaScript that needs to run before the document will display any text at all and how that text makes it into the document is anybody's guess.

It was an amazing shift in priorities that I feel like I somehow missed the discussion for. We went from being worried about hiding content with CSS to sending nothing but script tags in the document body within 5 years or so. The only concern we had when making the change seemed to be "but can Google read it?". When the answer to that became "Uh maybe" we jumped the shark.

My bashful take is that nobody told the rest of the web development world that they aren't Facebook, and they don't need Facebook like technology. So everyone is serving React apps hosted on AWS microservices filled in by GraphQL requests in order to render you a blog article.

I am being hyperbolic of course, but I was taken completely off guard by how quickly we ditched years of best practices in favour of a few JS UI libraries.

This complaint can be applied to paper too. PDF is not much more than a precise document to be printed or exactly visually presented.

Is PDF not supposed to improve on paper...? I'm rather surprised at the revelation that PDF is not accessible.

PDF is tied to page layout. PDF is a way to digitally describe something that's intended to be printed to paper, on a sheet of a certain size.

And as a format, it's much more sane than, say, Word or Excel.

Even if the focus on "where do I put this glyph" means the original text isn't in there by default.

Yea, but that's a terrible explanation for lack of basic accessibility in 2020. Literally just laziness.

FWIW this is not a technical barrier; it would be absolutely trivial to associate blocks of non-flowed text with the laid-out text.

My use is exclusively "this will end up on paper in a minute", so any improvement would be irrelevant.

Why would I want document type that can't even refloat on display size to represent any longer written text that is supposed to be consumed on digital device?

I would ask a blind person.

Adobe Reader has an option to Read Aloud PDF files. I don't know how well or poor that works, but I'm writing this comment just in case you were not aware of that function.

The problem is that you can't parse PDFs reliably. Half of the time they are just bitmap images from a scanner.

Some Word documents are like that too! Can't blame the PDF format if the source material is a bunch of scans.

html/css has a similar problem. A lot of my email is just a collection of images that contain text.

Non-conforming email should be filtered by the scam filter before it ever gets to your MUA.

What is that email not conforming to?

I absolutely hate those, but the HTML spec does require the alt attribute which is actually used in practice quite commonly.

I'd never read the HTML spec directly until checking it now to verify your comment. I usually use MDN, which says that the alt attribute is not mandatory. [1]

[1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/Im...

PDF is digital paper.

You like it for that very quality: immutable, reproducible rendering.

Those who have to extract data from PDFs face nearly the same problem as those who have to deal with paper scans: no reliable structure in the data, the source of truth is the optical recognition, by human or by machine.

I agree, but my argument here is that it's up to the producer. They obviously wanted it to be a digital paper, for some reason, and not a data mining source. We should blame the producers, not the format. It's equivalent to saying it's hard to get the source code from the pesky exe files people distribute, so exe is a mess.

PDF is great for what it was originally designed for: a portable format for instructing printers on how to print a document. The problem is people using it in ways it wasn't designed for. Sharing a PDF of your document is about as useful as sharing an SVG export of your document (actually, an SVG probably has more semantic information). It is a vector image format, not a document format.

> They obviously wanted it to be a digital paper, for some reason, and not a data mining source.

"Because it's pretty" is it. 99% of people don't care about text being a data mining source.

The original goal of PDF was to create documents that could "view and print anywhere" (literally the original tagline of the Acrobat project), substantially the same as how the document creator intended them. What Adobe was trying to solve was the problem of sending someone a document that looked a particular way and when they rendered it on their printer or display, it looked different, e.g. having a different number of pages because subtle font differences caused word-wrapping to change the number of lines and thus the page flow. It wasn't about it being "pretty," it was about having functional differences due to local rendering and font availability. In this regard, the format is an emphatic success.

I do wish they had focused a bit more on non-visual aspects such as screen-reader data, but to say the whole point is "because it's pretty" is a bit uncharitable. The format doesn't solve the problem you wish it solved, but it does solve a problem other than making things "pretty."

Alternatively, "the journal only accepts LaTeX."

I quite like PDFs, but this thread has been an eye-opener.

> Anyone who has tried to use OpenOffice full time probably does too.

I agree, I do too (LibreOffice), but for the opposite reason. Even internally, the font rendering in LibreOffice with many fonts is often quite bad. This is especially noticeable for text inside graphs in Calc.

If I'm going to read something lengthy that's a LibreOffice document, I open it (in LibreOffice), and export it to a PDF. LibreOffice consistently exports beautiful PDFs (and SVG graphs), which tells me that it "knows" internally how to correctly render fonts, just that its actual renderer is quite bad.

Is the renderer dealing with the classic small text glyph hinting problem?

Could be. I'm not sure what the issue is. Firefox, Chrome, and basically every other thing I use works fine.

The audience here is developers and other geeks who get stuck dealing with PDFs. The issue when you read into it is usually about structured data delivered via PDF — which I would wholeheartedly agree is a monstrous and unnecessary misuse of the format.

The other thing that is unfair is assholes who deliver tabular data in PDF format usually don’t want you to have it. When your county clerk prints a report, photocopies it 30 times, crumples it and scans to PDF without OCR, that’s not a file format issue.

Yes, thank you! I have exactly the same feelings, because I like to write in old versions of iWork. With a PDF, I know that whatever I export will look the same for whoever I send it to.

I sometimes see people complain about how PDF sucks because it doesn't look quite the same everywhere (namely, in non-Adobe readers), but if you're not doing anything fancy it pretty much does. It is, at minimum, more reliable than any other "open" format I'm aware of, save actual images.

The problem you have is that it's likely, increasingly likely, that your CV is the exact document that will next be 'read' by a computer rather than a human.

I know something about that area. Today, perhaps a 10th of CVs are sorted and prescreened by software. That fraction will only increase.

> With PDF, everyone sees the same thing.

SVG does that too, but it can also have aria tags to improve accessibility and have text that can be extracted much more easily.

Ugh, yeah I use LibreOffice for all my internal stuff but I have to keep MS Office installed for editing externally-visible documents, so I can be (somewhat) sure the formatting isn't going to get screwed up.

PDF is very good for what it was designed to do, which is to represent pages for printing. It is not so good for use cases where parsing the text is more important than preserving layout.

Indeed, I often say there's a special place in Hell where there are programmers trying to extract data from PDFs.

The souls who labour there, in life, posted PDFs to websites when HTML would have sufficed.

> everyone sees the same thing

Mostly. I've seen issues where PDF looked fine on a Mac but not on Windows.

Also, the fact that you see the same thing everywhere is good if you have one context of looking at things - e.g. if everyone uses big screen or if everyone prints the document, that's fine. But reading PDFs on e-book readers or smartphones can be a nightmare.

If the PDF uses a font that the creator neglected to embed, the reader’s system will have to supply the font, which could be a substitute. This is the only case I’ve seen where the PDF did not render exactly the same on all systems.

I like PDF a lot. It's got its drawbacks but the universal format is really helpful for layout-driven stuff. Shrug. It gets it done.

As someone who actually used to program in PostScript, I am happy as a clam with PDFs!

There are two issues with parsing them however.

  1) PDF is an output format and was never intended to have the display text be parseable.
  2) PDF is PostScript++, which means that it is a programming language.
     This means that a PDF is also an input description to the output that we
     are all familiar with seeing on a page.

PS I don't know if it is the case anymore, but Macs used to have a display server that handled all screen images in PDF format. That was an optimization from the NeXT display server, which displayed using Display PostScript.

>PDF is PostScript++, which means that is is a programming language.

The big change that came with PDF was removing the programming capabilities. A PDF file is like an unrolled version of the same PostScript file. There is still a residue of PostScript left but in no way can it be described as a programming language.

PDF is absolutely a programming language. It is not a general purpose programming language but a page description language. You are referring to looping constructs and procedures being removed, but a loop does not a language make. Similarly, LaTeX and sed are programming languages.

What features do you think make it a programming language? Because I have spent quite a bit of time working with it and all I can see is a file format.

> PS I don't know if it is the case anymore, but Macs used to have a display server that handled all screen images in PDF format. That was an optimization from the NeXT display server, which displayed using Display PostScript.

Quartz! https://en.wikipedia.org/wiki/Quartz_(graphics_layer)#Use_of...

AHA! Thank you.

And yet everyone is different, on different devices.

Why should they all see the same thing?

Using PDF here is self serving. It’s actively user-hostile.

Don't get me started on PDF's obsession with nice ligatures. For the love of God, when you have a special ligature like "ti", please convert the text in the copy buffer to a "ti" instead of an unpasteable nothing.

Nothing is more annoying than having to manually search a document that has been exported from PDF and having to make sure you catch all the now-incorrect spellings when all the ligatures have just disappeared: "action" -> "ac on", "finish" -> " nish".
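
When the extractor does emit a ligature code point (rather than dropping the glyph entirely), Unicode compatibility normalization can expand it in post-processing. A minimal stdlib sketch; note this cannot recover font-internal ligatures like "ti" that have no Unicode code point, which is the disappearing-character case described above.

```python
import unicodedata

def expand_ligatures(text):
    # NFKC compatibility normalization decomposes ligature code points
    # such as U+FB01 (fi) and U+FB00 (ff) into their component letters.
    return unicodedata.normalize("NFKC", text)
```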

I'm a bit of a typography buff, so just chiming in to say that there is nothing inherently wrong with ligatures in PDF!

As far as I understand, PDFs can be generated such that ligatures can be correctly cut'n'pasted from most PDF readers. I have seen PDFs where ligatures in links (ending in ”.fi”) cause problems, and I believe that's just an incorrectly generated PDF; ligatures done wrong.

Considering that PDF is a programming language designed to draw stuff on paper, going backwards from the program to clean data is not something that one should expect to always work.

In case this helps, here is a mapping from Unicode ligature-->ascii for all the ligatures I know of (the ones supported by LaTeX fonts): https://github.com/ivanistheone/arXivLDA/blob/master/preproc...

This assumes you are cleaning up the output of `pdftotext`, which in my experience is the best command-line tool for extracting plain text.

Seven billion upvotes.

I play (tabletop) RPGs online, we use a simple, free rule system (Tiny Six) and any time that I have to copypaste a specific paragraph of the rules in chat I always discover that there are missing characters (so I have to reread the block and fix it, ... in the end it would be faster to just read it aloud in voice).

Same might happen with scene or room descriptions taken from modules etc.

In my case, I think the correct expression would be "what's so hard about meaningful PDF text extraction".

My company uses the services of, and has some sort of partnership with, a company that makes its business out of parsing CVs.

Recently we've seen a surge in CVs that after parsing return no name and / or no email, or the wrong data is being fetched (usually, from referees).

So, out of curiosity, I took one PDF (just one so far) and extracted the text with Python.

Besides the usual stuff that is already known (as in, the text is extracted as found, e.g., if you have 2 columns the lines of text will appear after the line at the same level in the document that is in the other column), what I found (obviously take this with a grain of salt, as this is all anecdotal so far) is that some parts of the document have spaces between the characters, e.g.:


P r e s i d e n t o f t h e U n i t e d S t a t e s o f A m e r i c a

These CVs share the characteristic of being highly graphical. Also anecdotally, the metadata in the CV I parsed stated it was from Canvas [1]
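
One cleanup heuristic for that artifact (my own sketch, not from any CV-parsing product): treat a run of single characters separated by single spaces as letter-spaced text and re-join it. This only recovers word boundaries when the source kept wider gaps (two or more spaces) between words; with uniform single spacing, as in the example above, the boundaries are simply gone.

```python
import re

def fix_letterspacing(line):
    # Split on runs of 2+ spaces (assumed word gaps), then re-join
    # groups that are entirely single characters (assumed letter-spacing).
    words = re.split(r" {2,}", line)
    out = []
    for w in words:
        toks = w.split(" ")
        if len(toks) > 1 and all(len(t) == 1 for t in toks):
            out.append("".join(toks))
        else:
            out.append(w)
    return " ".join(out)
```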


How meaningful the extracted text is will depend on how the PDF was generated.

Consider that creating a PDF is generally just the layout software rendering into a PDF context — no different as far as it is concerned than rendering to the screen.

Spaces are not necessary for display (although they might help with text selection, so they are often present). It is not important that headers are drawn first, or footers last, so these scraps of text will often appear in unexpected places....

PDF has support for screen readers, but of course very few PDFs in the wild were created with this extra feature.

You're completely correct but unfortunately, this doesn't matter in practice. It's true that thanks to formats like PDF/UA, PDFs can have decent support for accessibility features. Problem is, no one uses them. Even the barest minimum for accessibility provided by older formats like PDF/A, PDF/A-1a are rarely used. Heck, just something basic like correct meta-data is already asking for too much.

This means getting text out of PDFs requires rather sophisticated computational geometry and machine learning algorithms and is an active area of research. And yet, even after all that, it will always be the case that a fair few words end up mangled because trying to infer words, word order, sentences and paragraphs from glyph locations and size is currently not feasible in general.
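
To make the geometry problem concrete, here is a toy sketch (the glyph records with x, y, width w and character c are hypothetical, and the tolerances are arbitrary assumptions): cluster glyphs into lines by similar y, then insert spaces wherever the horizontal gap between consecutive glyphs is large relative to the glyph width. Real extractors need far more than this (rotated text, columns, variable tracking), which is why it remains a research problem.

```python
# Toy reconstruction of text from positioned glyphs. The glyph record
# shape is invented for illustration; real extractors work from the
# content stream's text-showing operators.

def glyphs_to_text(glyphs, y_tol=2.0, gap_factor=0.35):
    # 1) Cluster glyphs into lines: same line = y within y_tol.
    lines = []  # list of (baseline_y, [glyphs])
    for g in sorted(glyphs, key=lambda g: (g["y"], g["x"])):
        for y, row in lines:
            if abs(g["y"] - y) <= y_tol:
                row.append(g)
                break
        else:
            lines.append((g["y"], [g]))

    # 2) Within each line, order by x and infer word breaks from the
    #    gap between one glyph's right edge and the next glyph's left edge.
    out = []
    for y, row in sorted(lines, key=lambda t: t[0]):
        row.sort(key=lambda g: g["x"])
        text = row[0]["c"]
        for prev, cur in zip(row, row[1:]):
            gap = cur["x"] - (prev["x"] + prev["w"])
            if gap > gap_factor * prev["w"]:
                text += " "
            text += cur["c"]
        out.append(text)
    return "\n".join(out)
```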

Even if better authoring tools were to be released, it would still take a long time for these tools to percolate and then for the bulk of encountered material to have good accessibility.

This recent hn post is relevant: https://umij.wordpress.com/2016/08/11/the-sad-state-of-pdf-a...

Yes, I know that. And I take it a large percentage of the people on HN do too. But the question is: do the people who actually use the files know that? The example I'm giving is of people using PDFs with a certain format because they think the graphic appeal will make them stand out from the crowd in a very important matter (they are trying to land a new job, after all), and of people trying to find the right fit for their empty position. Neither of them knows about this, but it seriously affects the outcomes.

It wouldn't be the internet if Donald Trump wasn't dragged into it somehow

Sorry, couldn't help myself. It was totally benign, though. Maybe I should have used Boris instead as an example.

I'm parsing PDFs and extracting tabular data - I am using this library https://github.com/coolwanglu/pdf2htmlEX to convert the PDF into HTML and then parsing thereafter. It works reasonably well for my use-case but there are all kinds of hacks that I've had to put in place. The system is about 5-6 years old and has been running since.

The use-case is basically one where there is a tabular PDF uploaded every week and the script parses it to extract the data. Thereafter a human interacts with the data. In such a scenario, every ~100 parses it fails and I'll have to patch the parser.

Sometimes text gets split up, but as long as the parent DOM node is consistent I can pull the text of the entire node and it seems to work fine.
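
For what it's worth, the "pull the text of the entire parent node" approach can be done with nothing but the standard library. A hedged sketch (the class name "t" is hypothetical; pdf2htmlEX's actual class names depend on version and options):

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect the full text of every element with a given CSS class,
    re-joining text that the converter split across child spans.
    (Void elements like <br> would need extra handling in real use.)"""

    def __init__(self, cls):
        super().__init__()
        self.cls = cls
        self.depth = 0      # >0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif self.cls in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.chunks.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks[-1] += data
```

Feed it the converted HTML and read `chunks`, one string per matching node, with split-up spans already concatenated.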

Did you publish your fixes so this could help others too? it seems like this repo is unmaintained atm.

They are very specific changes for my use-cases unfortunately

Read all the comments here about people struggling with PDF. All the energy and code wasted! I have watched this madness for my entire career.

PDF is being used wrongly. For information exchange, data should be AUTHORED in a parseable format with a schema.

Then PDF should be generated from this as a target format.

There is a solution: xml, xsl-fo.
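As a hedged sketch of that workflow: the data is authored as plain structured records, and the XSL-FO document generated below would then be handed to a processor such as Apache FOP to produce the PDF. The record schema here is invented for illustration.

```python
import xml.etree.ElementTree as ET

FO = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO)

def records_to_fo(records):
    """Render (label, value) records as a minimal XSL-FO document.
    The XML string can be fed to a processor such as Apache FOP to
    produce the PDF, while the source data stays machine-readable."""
    root = ET.Element(f"{{{FO}}}root")
    layout = ET.SubElement(root, f"{{{FO}}}layout-master-set")
    page = ET.SubElement(layout, f"{{{FO}}}simple-page-master",
                         {"master-name": "page",
                          "page-width": "210mm", "page-height": "297mm"})
    ET.SubElement(page, f"{{{FO}}}region-body")
    seq = ET.SubElement(root, f"{{{FO}}}page-sequence",
                        {"master-reference": "page"})
    flow = ET.SubElement(seq, f"{{{FO}}}flow",
                         {"flow-name": "xsl-region-body"})
    for label, value in records:
        ET.SubElement(flow, f"{{{FO}}}block").text = f"{label}: {value}"
    return ET.tostring(root, encoding="unicode")

fo_doc = records_to_fo([("Invoice", "2020-0042"), ("Total", "99.00 EUR")])
```

The point of the exercise: the records stay the source of truth, and the PDF is just one disposable rendering of them.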

> There is a solution: xml, xsl-fo.

You're right about that, but sadly years of "xml-abuse" in the early aughts gave XML a bad reputation. So much so that other, inferior markups like JSON and YAML were created. We ain't ever going back.

Meanwhile, PDF just worked -- until the first time you crack it open and see what's inside the file. I'll never forget the horror after I committed to a time-critical project where I claimed... "Oh, I'll just extract data from the PDF, how bad could it possibly be!"

> xml-abuse

Bad programmers, as usual, frustrated by their own badness.

Today's coders want us to use Jackson Pollock Object Notation everywhere for everything.

> We ain't ever going back.

Not so, friend. ODF and DOCX are XML. And these formats won't become JPON anytime soon.

It is not uncommon for some (or all) of the PDF content to actually be a scan. In these cases, there is no text data to extract directly, so we have to resort to OCR techniques.

I've also seen a similar situation, but in some ways quite the opposite --- where all the text was simply vector graphics. In the limited time I had, OCR worked quite well, but I wonder if it would've been faster to recognise the vector shapes directly rather than going through a rasterisation and then traditional bitmap OCR.

Here's another 'opposite' - I had to process PDFs to find images in them.. and the PDFs were alternating scans of text + actual images.

Using ML it is possible to parse PDF pages and interpret their layout and contents. Coupled with summarisation and text-to-speech, it could make PDFs accessible to blind people. The parsing software would run on the user's system and be able to cover all sorts of inputs, even images and web pages, as it is just OCR, CV and NLP.

The advantage is that PDFs become accessible immediately as opposed to the day all PDF creators would agree on a better presentation.

Example: A layout dataset - https://github.com/ibm-aur-nlp/PubLayNet

and: SoundGlance: Briefing the Glanceable Cues of Web Pages for Screen Reader Users http://library.usc.edu.ph/ACM/CHI2019/2exabs/LBW1821.pdf

A related problem is invoice/receipt/menu/form information extraction. Also, image based ad filtering.

It seems like it should be doable to train a two-tower model, or something similar, that simultaneously runs OCR on the image and reads through the raw PDF, using the PDF data to improve the results of the OCR.

Does anyone know of any attempt at this?

Blah blah blah transformer something BERT handwave handwave. I should ask the research folks. :-)

Yes. But. The main problem is that in many cases the visual structure is some highly custom form that is hard to present to the user in text.

And on top of this, many times there is no text data in the PDF, just a JBIG2 image per page.

I remember a prior boss of mine was once asked if our reporting product could use PDF as an input. He chuckled and said "No, there is no returning from chaos"

I've come up with an idea of PDDF (Portable Data Document Format). The PDF format allows embedding files into documents. Why not embed an SQLite database file right in the PDF document with all the information nicely structured? Both formats are very well documented and there are lots of tools on every platform to deal with them. Humans see the visual part of the PDF, while machine processing works with the SQLite part.

Imagine that instead of parsing a PDF invoice, you just extract the SQLite file embedded in it, and it has a nice schema with invoice headers, details, customer details, etc.

Anything else in a PDF would work nicely too - vector graphics, long texts, forms - all of it can be described in an SQLite file embedded in the PDF.
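A minimal sketch of the machine-readable half, using only Python's standard library (the invoice schema is invented; the final embedding step would use a PDF library's attachment API, e.g. pypdf's PdfWriter.add_attachment, and is not shown):

```python
import os
import sqlite3
import tempfile

def build_invoice_db(path, header, line_items):
    """Write invoice data to an SQLite file; the schema is invented
    for illustration. The resulting file is what you would attach
    to the PDF so consumers never have to parse the rendered text."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE invoice (number TEXT, customer TEXT)")
    con.execute("CREATE TABLE line_item (description TEXT, amount REAL)")
    con.execute("INSERT INTO invoice VALUES (?, ?)", header)
    con.executemany("INSERT INTO line_item VALUES (?, ?)", line_items)
    con.commit()
    con.close()

path = os.path.join(tempfile.mkdtemp(), "invoice.sqlite")
build_invoice_db(path, ("2020-0042", "ACME GmbH"),
                 [("Widgets", 75.0), ("Shipping", 24.0)])

# A consumer queries the structured part instead of scraping glyphs:
con = sqlite3.connect(path)
total = con.execute("SELECT SUM(amount) FROM line_item").fetchone()[0]
con.close()
```

The consumer-side query is the whole payoff: one SELECT instead of a table-extraction pipeline.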

It seems the PDF standard is ahead of you: https://news.ycombinator.com/item?id=24467959 :-)

Haven't you heard, Mozilla Foundation? You can't just embed an SQLite database!!!1 It's all about those developer aesthetics! https://hacks.mozilla.org/2010/06/beyond-html5-database-apis...

Libreoffice can export hybrid PDFs containing the whole libreoffice document inside the PDF.

How do you validate that the machine readable sqlite db has the same content as the human readable?

Presumably you can't; you have to take it on trust. It's a neat idea. Getting structure from text dumps of PDFs is no fun.

I once took a job for $25 to replace some text on a set of PDFs programmatically assuming it would probably take an hour max, only to end up spending 8 hours before I found a decent enough solution. I have never touched PDF manipulation tasks after that.

If you are interested in extracting PDF tables, I recommend Tabula; here is an example: https://www.ikkaro.net/convert-pdf-to-excel-csv/

Also, tetpdf (paid) works really well (I used it for extracting transactions from account statement PDFs; the demo works for 2-3 pages). I actually used a combination of tabula and tetpdf.

> Turns out, much how working with human names is difficult due to numerous edge cases and incorrect assumptions

I think a good interview question would be, given a list of full names, return each person’s first name. It’s a great little problem where you can show meaningful progress in less than a minute, but you could also work on it full time for decades.

If I got that question in an interview, I would write a program that asks the user what their first name is. That's the only correct solution to that problem.

The fact that it’s unsolvable is what makes it a good interview problem. Seeing someone solve a problem with a correct solution doesn’t really tell you anything about the person or their thought process.

This sounds like the analogue of "I would offer the barometer to the building manager if he tells me how tall the building is."

Okay, but - "programmatically determine the users name" is one of the classic foibles of inexperienced programmers. It's not that it's a hard problem, it's an impossible problem that people shouldn't be attempting, yet somehow still do, not unlike validating an email address with regex.

> it's an impossible problem that people shouldn't be attempting, yet somehow still do

You see exactly this kind of probabilistic algorithm being used every day in huge production apps. E.g. how do you think Gmail shows something like "me .. Vince, Tim" in the left column of your email inbox?

Correctly 100% of the time, as long as all of your contacts have names that match "Firstname Lastname". Google Contacts has separate fields, which enables "Vince" but at the expense of assuming that his full name is "Vince Sato" when he may write it "Sato Vince".

The problem with probabilistic algorithms is that you trade predictability for getting it right more frequently (but not 100% of the time). Eg. I could match common Japanese family names or look for a .jp TLD, but neither of these are guarantees that the family name comes first, and even less does their absence imply that they lead with the first name.

I imagine Google's algorithm is no more sophisticated than:

1. Are they in your contacts? Use their first name from your contacts.

2. Are they a Google user with a profile you can view? Use the first name they provided.

3. Use your locale to make a wild guess.
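That fallback chain can be sketched as follows; every function and field name here is invented, and this is a guess at the kind of cascade an inbox UI might use, not Google's actual code:

```python
def display_first_name(email, contacts, profiles):
    """Guess a short display name for an email sender by falling back
    through increasingly unreliable sources. All field names are
    invented for illustration."""
    if email in contacts:                      # 1. explicit contact entry
        return contacts[email]["first_name"]
    if email in profiles:                      # 2. name the user provided
        return profiles[email]["given_name"]
    # 3. Wild guess: take the local part and assume Western name order.
    parts = email.split("@")[0].replace(".", " ").split()
    return parts[0].title() if parts else email

contacts = {"vince@example.com": {"first_name": "Vince"}}
profiles = {"tim@example.com": {"given_name": "Tim"}}
print(display_first_name("sato.kenji@example.jp", contacts, profiles))
# prints "Sato" -- a family name, the exact failure mode discussed above
```

Step 3 is where the silent mistakes live: the guess looks just as confident as the two reliable branches.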

Right exactly, and that's basically the correct approach. But imho that's a super interesting problem.

I have a very similar project where I’ve extracted text and tables from over 1mm PDF filings in large bankruptcy cases - bankrupt11.com

I still haven’t found a good way of paragraph detection. Court filings are double spaced, and the white space between paragraphs is the same as the white space between lines of text. I also can’t use tab characters because of chapter headings and lists which don’t start with a <tab>. I was hoping to get some help from anyone who has done it before.

I imagine looking for lines that end prematurely would get you pretty far. Not all the way, since some last lines of paragraphs go all the way to the right margin, but combined with other heuristics it would probably work pretty well, especially if the page is justified.
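That "line ends prematurely" idea can be sketched as a small heuristic; the line-width numbers below are made up, and a real implementation would take coordinates from the extractor:

```python
def paragraph_breaks(lines, right_margin, slack=0.15):
    """Flag probable paragraph-ending lines: in justified or near-justified
    text, a line that stops well short of the right margin usually closes
    a paragraph. A heuristic sketch only; it misses paragraphs whose
    final line happens to run full width."""
    threshold = right_margin * (1 - slack)
    return [i for i, (_text, width) in enumerate(lines) if width < threshold]

# Hypothetical (text, line-width) pairs from a double-spaced filing:
lines = [
    ("The debtor filed a motion for relief from the", 480),
    ("automatic stay on March 3.", 210),  # short line: paragraph likely ends
    ("The court scheduled a hearing for April 1", 470),
]
print(paragraph_breaks(lines, right_margin=500))  # → [1]
```

Combining this with other signals (sentence-final punctuation, indentation of the following line) should cut the error rate further.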

Not a bad idea!

If you want to extract tabular data from a text-based PDF, check out the Tabula project: https://tabula.technology/

The core is tabula-java, and there are bindings for R in tabulizer, Node.js in tabula-js, and Python in tabula-py.

I have used Tabula and recommend Camelot over it. The Camelot folks even put together a head-to-head comparison page (on their website) that shows their results consistently coming out ahead of Tabula.

My other complaint with Tabula is total lack of metadata. It’s impossible to know even what page of the PDF the tables are located on! You either have to extract one page at a time or you just get a data frame with no idea which table is located on which page.

The best I've used is PDFPlumber. Camelot lists it on its comparison page[1] but I've had better results.

Both are better than Tabula though.

[1] https://github.com/camelot-dev/camelot/wiki/Comparison-with-...

Thanks - I cannot get Camelot to run in parallel (I use celery workers to process PDFs), there is some bug in Ghostscript that SEGFAULTS. I’ll try using PDFPlumber instead! By the way Apache Tika has been the best for basic text extraction - even outputs to HTML which is neat.

So interesting to see this just 2 days after this on HN:

"The sad state of PDF-Accessibility of LaTex Documents"


PDFs have accessibility features to make semantic text extraction easy... but it depends on the PDF creator to make it happen. It's crazy how best-case PDFs have identifiable text sections like headings... but worst-case is gibberish remapped characters or just bitmap scans...

PDFs are terrible for screen readers for very much all the reasons listed here.

This article does make me wonder though if we'll ever get to a point where OCR tech is sufficiently accurate and efficient that screen readers will start incorporating OCR in some form.

It will have to out-pace the DRM tech that will stop you capturing pixels from pdfs deemed "protected"

Would that apply here? You can still just point a camera at the screen. If the OCR works well, there's little concern for loss of quality/accuracy from doing this.

Any lawyers here able to comment on whether that would be a DMCA violation?

Can you link to something that describes this tech?

There are way too many approaches to enumerate in a reply, but the StackOverflow answer covers just a few of the approaches I’ve seen used on Windows: https://stackoverflow.com/a/22218857/17027

The GPU approach is considerably harder to work around, fwiw. Still possible, of course.

You will always be able to decode it; today, running a VM or just headful Chrome on a box is a simple way to bypass these techniques most of the time.

Even if DRM on video output becomes common (HDMI has it, unlike VGA), video protocols ultimately have to emit an analog signal; they have to render to photons your eye can see, and CAM rippers exploit this. It will always be possible to decode. [1][2]

[1] Until Neuralink-style tech becomes mainstream and every content owner requires the interface to consume the content, and the interface can biologically authenticate the user.

[2] They can trace a CAM ripper based on unique IDs, embedded watermarks, etc., but they can never "stop" them from actually ripping; they can only block/penalize the legal source the rip originated from.

I am mostly upset by whatever change chrome made to extensions that stopped Copyfish from being able to capture clipped areas from webpages.

As someone who knows nothing about OCR, aren't we already there?

I recall being very impressed when I first saw the OCR in the Google Translate mobile app, and that was about 6 years ago.

Can screen readers correctly parse clean PDFs? As in, if I'm able to copy paste text from a PDF and get clean results, are the current screen readers able to read from those types of PDFs or is it a completely unimplemented format?

I'm asking because I don't think the problem is with PDF itself as much as it is with certain PDF output programs.

Related: Why GOV.UK content should be published in HTML and not PDF (https://gds.blog.gov.uk/2018/07/16/why-gov-uk-content-should...)

In Linux, pdftotext works pretty well for the basic extraction needs I've had over the years. I have many times used this extremely simple Ruby wrapper https://gist.github.com/rietta/90ae2187606953bee9735c00f3a6e....
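For anyone without Ruby handy, a similarly thin Python wrapper is easy to sketch (assuming poppler's pdftotext is installed and on the PATH):

```python
import subprocess

def build_cmd(pdf_path, layout=True):
    """Assemble the pdftotext invocation; '-' writes the text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")   # preserve the visual column layout
    return cmd + [pdf_path, "-"]

def pdf_to_text(pdf_path, layout=True):
    """Shell out to poppler's pdftotext and return the extracted text."""
    result = subprocess.run(build_cmd(pdf_path, layout),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

The -layout flag is what makes the difference for anything with columns or tables; without it you get pure reading-order text.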

Isn't the reason extraction is almost impossible that everything in a PDF is placed inside a box? So every line has its own box and paragraphs don't exist; paragraphs are just multiple boxes placed together. The context of sentences is completely gone.

Basically. The text placement syntax is essentially "put this glyph or string of glyphs at these coordinates". So it could even be as granular as a textbox for each individual glyph.

Just last week I was trying to extract text from a PDF and it had extra spaces between all the characters. Eventually in frustration I exported to PNG, made a new PDF, and OCRed it.

It felt dirty.

pdfsandwich helps make this a one-command process


This looks quite useful but it appears it hasn’t been updated since 2010, and I’d imagine there’s been some advancements in this domain in the last decade. If anyone knows of similar tools for aligning and improving book scans, I’d love to hear of them. Thanks for this in any case!

For me, foxit reader solves this problem most of the time. Selecting in Sumatra I often get these annoying spaces, I guess Foxit filters them somehow...

I did this too. OCR is much better than text extraction from PDF.

In some sense it's also the correct solution to the problem. PDFs are programs that generate screen/print output. Trying to find the text in the program will always fail sometimes. The only general solution is to process the output directly.

PDFs are a pain, and Konfuzio is AI software to make PDF content machine-readable. On our journey to structure PDF content or even scans, we have been supported by large enterprises in the banking, insurance and audit industries. We are in closed beta for data scientists, so feel free to request a demo and free user account.


Disclaimer: I am the co-founder of Konfuzio, a start-up founded in 2016 based in Germany.

The same interpretation hassles crop up when trying to extract text from PCL documents. This happened to me when implementing a SaaS feature allowing customers to use a printer driver to insert documents into a document repository.

The weirdest one? An application where the PCL code points used the EBCDIC character set. For once, being an old-timer helped.

Going the other way is terrible too. I had to work on making text in PDFs that our software generated selectable and searchable. PDF simply doesn't have any way to communicate the semantics to the viewer, so things like columns, sidebars, drop caps, etc. would all confuse viewers. It didn't help that every PDF viewer uses its own heuristics to figure out how text is supposed to be ordered. Ironically, Acrobat was one of the worst at figuring out the intended layout.

I've worked on an HTML-to-PDF engine. I don't remember that many issues with the text itself, beyond making the PDF metrics the same as the HTML's so that line breaks occurred in the same places.

Did you write the columns one after another (first all of the left, then all of the right), or going left-right line by line, top down?

It was a few years ago, but I think it was all one column then another.

And it worked ok in some viewers, but not others.

Despite all recent OCR, AI and NLP technology, PDFs won't make "computational" sense unless you know in advance what kind of data you are looking for. The truth of the matter is that PDF was designed to convey pretty "printed" information in an era where "eyes only" was all that mattered. Today the PDF format just can't provide the throughput and reliability that interop between systems requires.

I really wish this company had some sort of product. However I spoke with them and they are essentially just a dev shop that specializes in PDF manipulation. Just to save other people some time if you think they simply have a product you can use.

While it still has issues, I have pretty good luck using pdftotext, then in the cases where the output isn't quite right, adding the --layout or --table options.

Yeah, I've had good luck with pdftotext -layout in the past. And in one case I've used pdftohtml -xml (plus some post-processing), which is useful if the layout is complex or you want to capture stuff like bold/italic.

Probably preaching to the choir, but for me step one is to use the poppler tools: pdftotext -layout

> Copying the text gives: “ch a i r m a n ' s s tat em en t” Reconstructing the original text is a difficult problem to solve generally.

Why not look for stretches of characters with spaces between them, then concatenate, check against a dictionary, and remove the spaces if a match is found?

> “On_April_7,_2013,_the_competent_authorities”

Same here.
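Both of these cleanups reduce to the classic word-break problem: drop the bogus separators, then re-segment against a wordlist. A small sketch (the dictionary is a stand-in for a real one):

```python
def resegment(garbled, dictionary):
    """Drop space/underscore separators, then re-split against a wordlist
    (classic word-break dynamic programming). Returns None when no full
    segmentation exists, so callers can fall back to the raw text."""
    s = garbled.replace(" ", "").replace("_", "").lower()
    best = [None] * (len(s) + 1)   # best[i]: a segmentation of s[:i]
    best[0] = []
    for i in range(1, len(s) + 1):
        for j in range(i):
            if best[j] is not None and s[j:i] in dictionary:
                best[i] = best[j] + [s[j:i]]
                break
    return " ".join(best[-1]) if best[-1] is not None else None

print(resegment("ch a i r m a n ' s s tat em en t",
                {"chairman's", "statement"}))
# → "chairman's statement"
```

As the replies note, this only handles the easy cases: with both "chairman" and "chair" + "man" in the wordlist, either segmentation is "valid", and nothing here fixes invisible characters or scrambled layout order.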

It's doable, but it's only the easiest cases. If it's a one off script, or some automation you give as a base to a human to review afterwards it can be a good first step.

If it's supposed to be a somewhat final result, run against a dataset you have little control over (people sending you PDFs they made, vs. PDFs coming from a known automated generator), you'll hit all the other not-so-edge cases very fast.

Like, otherwise invisible characters inserted in the middle of your text, layout that makes no logical sense and puts the text in a weird order but was fine when it was displayed on the page, characters missing because of weird typographic optimizations (ligatures, characters only in specific embedded fonts, etc.). Basically everything in the article is pretty easy to find in the wild.

But it might already be hard for a simple procedure to know if it should be chair man or chairman here.

What I hate the most are those job sites that extract the text from your PDFs. They have close to a 99% failure rate, requiring me to manually fix the content they extracted.

* Please upload an up-to-date resume detailing your education and job history.

* Please fill out this series of forms on our website detailing your education and job history.

* Congrats, you have been selected for phone screening! Please be prepared to discuss your education and job history with the recruiter.


I built and successfully sold an entire start up around solving all of these problems and more. I'm interested to learn about projects where I can contribute my knowledge. Hit me up if you need some help with something significant. contact details in my about.

"What's so hard about PDF text extraction?" - if Weird Al ever wrote an Elvis Costello parody

I've written a little overview of the open source options for text extraction available in C# https://dev.to/eliotjones/reading-a-pdf-in-c-on-net-core-43e...

At some point I need to port PdfTextStripper from PDFBox, it seems to be among the most reliable libraries for extracting text in a generic way.

>PDF read protection

Click print, copy all you want in the print preview window. Brilliant protection scheme.

Since there are so many books available both as PDF and in other formats, where it's easier to extract the text, shouldn't it be possible to train a neural network on them?

I wonder if there is a flag or feature one can set in {major-vendor text editor} when exporting to PDF to make it as accessible and machine-readable as possible?

Isn't the solution to render a PDF one wants to extract text from, with a high dpi, and then OCR the result?

Addressed in TFA under "Why not OCR all the time?".

There are drawbacks. Particularly at scale.

At our firm, evolution.ai, we prefer to always read directly from pixels. It gives you much more dependable results, for the reasons laid out here.

What software is the current leader in OCR solutions?

Look into the various paid offerings in the cloud. Azure, GCP, and AWS all have OCR-as-a-service products that outperform local software.

For anything long term and high volume it's far cheaper to use open source solutions and hire a manual review team in the developing world where English is a common second language. Unless you are comfortable with the high rate of errors there's still a need for review.

For small one-off tasks the cloud solutions do fine.

Be it handwritten or printed text, Microsoft's Azure OCR is by far the best at recognizing text.

For one-offs, you can use SensusAccess. I've found their quality to be pretty good.

Abbyy FineReader is generally better than any of the major cloud vendors in my testing.

See also this previous discussion: https://news.ycombinator.com/item?id=20470439

Also quite expensive, since it is at the head of the pack IIRC. There is probably some value in making a competitor with new deep learning techniques, provided you have a sufficiently diverse training set. It would take years to build, though.

I've done some work on deep learning approaches - specifically table extraction.

It's not at all obvious how to make this work - there is a lot of human judgement involved in deciding what is a header vs. what are values, especially with merged header columns/rows.

Yeah, I remember Abbyy also has an interface to define layouts for this kind of problem, i.e., this thing is a table and here are the headers, etc.

Sorry, I was not trying to say deep learning would be a substitute for all such issues, just that new approaches may help a smaller team build those tools more efficiently.

I don't know if Abbyy combines its layout tool with training a model for customers, but it seems like a reasonable thing to build and expose.

Thanks, I had looked that over but thought things may have changed in the last year.

I've (partially) done this (for ASX filings, not EU).

God, what a horrible mess it is. They are right about OCR being the best approach sometimes, but then there are tables. Tables in PDFs are... well, there is active academic research into the best way to extract the data.

PDF is so bad for representing text that when you tell people about it, they think you're trolling or wrong.


Representing text isn't PDF's job. PDF is meant to be "virtual paper".

Yup, the use case of PDFs is "export to PDF and check it immediately before sending to the printer".

All other use cases of PDF are better served by other formats.

Don't forget to preflight your PDF, so you actually get out what you put in, versus common issues like not embedding fonts.

> All other use cases of PDF are better served by other formats.

If that were the case, we would all be using PDF/A, or better still a subset of that. The fact that PDF has much more stuff in the spec than that suggests that a large number of people find PDF to be the best format for what they're trying to do.

> a large number of people find PDF to be the best format for what they're trying to do.

Or they just don't know any better. I could certainly see this being the case for many office workers.

I mean, I'm certainly an advocate for using PDF/A. I agree with you about what PDF should be used for, but it's hard to imagine the spec got that inflated even though nobody thought it needed all those features.

As we all know, it is an Adobe technology first and foremost. That could explain why somebody thought it needed those features, despite the fact that it really shouldn't.

More clarification of what I mean, in the form of a dangerous conversation:

OfficeCorp Exec: Hey adobe, you know what would be great! Fillable form fields! And dynamic stuff!.

Adobe Sales: Great! Wonderful feedback, we can do it!

Edit: Fix tone. Or try to. Maybe I'm just overthinking what I say?

Adobe was driven by upgrade revenue from new features. They could never resist putting in new features hence the mess it is now.

Compare it to XPS to see what it should have been.

Please tell that to the entire world, where the only other “standard” document format is Office '97 or OOXML.

Given that PDFs support fillable forms, scripts, video, 3D objects; it is a massive error that at no point did anyone think to have some plain text equivalent field or something like that. It would be pretty darn simple for most generating applications to produce one of those.

Yes, I know it was largely meant as a preprint/graphics format, but clearly it hasn't been just that for a long time.

Being a high-wire trapeze artist isn't my job. That doesn't mean I wouldn't be terrible at it.

Excellent text extraction using pdfbox.apache.org

TL;DR: text can be encoded as plaintext, an image, or binary.

There’s a little more nuance than that. Even if text is drawn using plaintext data there’s no guarantee that the characters/words appear in the correct order or have the proper white space between them.

The best method is probably to render the PDF and use OCR.

Unfortunately that's obnoxiously inefficient if you're trying to run it through text-to-speech in real time.

Much more than that. Even as text, the differences in character encodings in the file and the different types of line breaks make the whole thing very tedious and error-prone.

Not just line breaks. I've seen PDFs where each letter in a word is its own object, and I've seen PDFs where an object includes a hundred letters but only one of them is visible or not otherwise occluded.

Personally, I think PDF text should NOT be extractable. It is meant to show a printed document.
