What's so hard about PDF text extraction? | Hacker News

Jay Taylor's notes

Original source (news.ycombinator.com)

Tags: parsers pdf content-extraction news.ycombinator.com

Clipped on: 2020-09-14

This is why iPhone didn't initially ship with double-tap to zoom for PDF paragraphs (like it had for blocks on web pages). I know because I was assigned the feature, and I went over to the PDF guy to ask how I would determine on an arbitrary PDF what was probably a "block" (paragraph), and I got a huge explanation on how hard it would be. I relayed this to my manager and the bug was punted.

Edit: To add a little more color, given that none of us was (or at least certainly I wasn't) an expert on the PDF format, we had so far treated the bug like a bug of probably at most moderate complexity (just have to read up on PDF and figure out what the base unit is or whatever). After discovering what this article talks about, it became evident that any solution we cobbled together in the time we had left would really just be signing up for an endless stream of it-doesn't-work-quite-right bugs. So, a feature that would become a bug emitter. I remember in particular considering one of the main use cases: scientific articles that are usually in two columns, AND also use justified text. A lot of times the spaces between words could be as large as the spaces between columns, so the statistical "grouping" of characters to try to identify the "macro rectangle" shape could get tricky without severely special-casing for this. All this being said, as the story should make clear, I put about one day of thought into this before the decision was made to avoid it for 1.0, so for all I know there are actually really good solutions to this. Even writing this now I am starting to think of fun ways to deal with this, but at the time, it was one of a huge list of things that needed to get done and had been underestimated in complexity.
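To make the two-column problem concrete, here is a minimal sketch of one common heuristic, a vertical projection profile. It is not what the iPhone team built; the function name `find_gutters`, the integer page units, and the sample boxes are all assumptions for illustration. The idea is that a word gap in justified text can be as wide as the column gutter on a single line, but only the gutter stays empty on every line.

```python
from typing import List, Tuple

# One inner list per text line; each word is an (x0, x1) horizontal extent.
Line = List[Tuple[float, float]]


def find_gutters(lines: List[Line], page_width: int, min_width: int = 5) -> List[Tuple[int, int]]:
    """Return x-ranges between the outermost words that no word covers on any line."""
    covered = [False] * page_width
    for line in lines:
        for x0, x1 in line:
            for x in range(max(0, int(x0)), min(page_width, int(x1))):
                covered[x] = True

    covered_xs = [x for x, is_covered in enumerate(covered) if is_covered]
    if not covered_xs:
        return []

    gutters, start = [], None
    # Scan only between the first and last covered positions so page margins are ignored.
    for x in range(covered_xs[0], covered_xs[-1] + 1):
        if not covered[x]:
            if start is None:
                start = x
        elif start is not None:
            if x - start >= min_width:
                gutters.append((start, x))
            start = None
    return gutters


# Hypothetical page: on the first line the word gap (40-60) is exactly as wide
# as the column gutter (100-120), so a single line cannot tell them apart.
sample_lines = [
    [(10, 40), (60, 100), (120, 160), (180, 200)],
    [(10, 70), (85, 100), (120, 135), (150, 200)],
]
gutters = find_gutters(sample_lines, page_width=210)
```

On real pages this breaks down quickly (full-width headings, figures, and footnotes all cross the gutter), which is exactly the special-casing the comment describes.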

Alex3917 6 months ago [–]

> I know because I was assigned the feature, and I went over to the PDF guy to ask how I would determine on an arbitrary PDF what was probably a "block" (paragraph), and I got a huge explanation on how hard it would be.

The funny thing is that creating a universal algorithm to convert PDFs and/or HTML to plaintext is probably comparable in difficulty to building level 5 self-driving cars, and would accrue at least as much profit to any company that can solve it. But there are hundreds of billions of dollars going into self-driving cars, and like zero dollars going into this problem.

tolmasky 6 months ago [–]

What are the groups that would benefit most from the PDF-to-HTML conversion? Who are the customers that would drive this profit? I tried to make those sentences not sound contentious but unfortunately they do, but I am genuinely curious about this space and who is feeling the lack of this technology most.

greycol 6 months ago [–]

Almost any business that has physical suppliers or business customers.

PDF is de-facto standard for any invoicing, POs, quotes, etc.

If you solve the problem you can effectively programmatically deal with invoicing/payments/ large parts of ordering/dispensing. It's a no brainer to add it on to almost any financial/procurement software that deals with inter business stuff.

Any small-medium physical business can probably halve their financial department if you can dependably solve this issue.

ofrzeta 6 months ago [–]

Starting November 2020 in the EU machine-readable invoices will be mandatory in the public sector (https://eur-lex.europa.eu/eli/dir/2014/55/oj).

As far as I understand there are at least two standards (I know of in Germany): XRechnung and ZUGFeRD/Factur-X (which is PDF A/3 with embedded XML).

http://fnfe-mpe.org/factur-x/factur-x_en/

sabas_ge 6 months ago [–]

Electronic invoicing in Italy has been a thing for a couple of years, mandatory since Jan 1st 2019...

f3r3nc 6 months ago [–]

seems to be only for the electronic invoicing in public procurement

Fire-Dragon-DoL 6 months ago [–]

No, it's now for everyone. It was like that for a while

innomatics 6 months ago [–]

A business that invests in building a machine that reads data, produced by a 3rd party machine, using format intended for lay humans to read, is not investing in the right tech IMO.

Small-mediums should be looking to consolidate buying through a few good suppliers and working with them directly to automate process, or adopting interchange formats.

The problem for some small businesses is that the cost (process changes, licencing etc) of adopting interchange formats and working with large vendors is prohibitive at their scale, e.g. the airline BSP system.

I agree that solving the problem generally i.e. replacing an accounts payable staff person capable of processing arbitrary invoice documents will be comparable to self-driving in difficulty.

If a company deals with a lot of a single type of PDF, then the approach could be economical. I am actually involved in a project looking at doing this with AWS Textract.

speedplane 6 months ago [–]

> A business that invests in building a machine that reads data, produced by a 3rd party machine, using format intended for lay humans to read, is not investing in the right tech IMO.

Building machines that understand formats that are understood by humans is exactly what we should be doing. People should read, write, and process information in a format that is comfortable and optimized to them. Machines should bend to us, we should not bend to them.

If businesses only dealt with machine readable formats, everyone's computer would still be using the command line.

And there's real condescension in your post:

> Small-mediums should be looking to consolidate buying through a few good suppliers and working with them directly to automate process

You're saying that businesses need to change their business to accommodate data formats, but it should be the other way around.

kbutler 6 months ago [–]

There has to be compromise.

The proliferation of computers in business over the last 50 years is precisely because businesses can save money/expand capacity by adapting the business processes to the capabilities of the computers.

Over that time, computers have become more friendly to humans, but businesses have adapted and humans been trained to use what computers can do.

tastyminerals 6 months ago [–]

Yes, most invoices are in PDF but only about 40% of them are native PDF, meaning they are actual documents, not scanned images converted to PDFs. There are also compound PDF invoices which contain images. So, in order to extract data from them, one needs not only good PDF parser but an OCR engine too.

dreamcompiler 6 months ago [–]

This is a huge pet peeve of mine. Most invoices are generated on a computer (often in Word) but a huge fraction of the people who generate them don't know how to export to a PDF. So they print the invoice on paper, scan it back in to a PDF, and email that to you. Thus the proliferation of bitmap PDFs.

speedplane 6 months ago [–]

> So, in order to extract data from them, one needs not only good PDF parser but an OCR engine too.

You can go further. Invoices often contain block sections of text with important terms of the invoice, such as shipping time information, insurance, warranties, etc. To build something that works universally, you also need very good natural language processing.

thaumasiotes 6 months ago [–]

If you're using an OCR engine to understand PDFs that are nothing but a scanned image embedded in a PDF... what do you need a PDF parser for? You can always just render an image of a document and then use that.

speedplane 6 months ago [–]

> If you're using an OCR engine to understand PDFs that are nothing but a scanned image embedded in a PDF... what do you need a PDF parser for?

This should be obvious, but the answer is because OCR engines are not terribly accurate. If you have a native PDF, you're far better off parsing the PDF than converting to an image and OCRing. But if OCR ever becomes perfect, then sure.

tastyminerals 6 months ago [–]

For accuracy and speed. The market SOTA Abbyy is far from being accurate.

speedplane 6 months ago [–]

> The market SOTA Abbyy is far from being accurate.

While Abbyy is likely the best, it's also incredibly expensive. Roughly on the order of $0.01/page or maybe at best a tenth of that in high volume.

For comparison, I run a bunch of OCR servers using the open source tesseract library. The machine-time on one of the major cloud providers works out to roughly $0.01 for 100-1000 pages.

bhanhfo 6 months ago [–]

OCR.space charges only $10 for 100,000 conversions. The quality is good, but not as good as Abbyy.
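The prices quoted in the last few comments are easier to compare on a common scale. A small sketch of the arithmetic, using only the figures given above (the one-million-page volume is an arbitrary assumption, and prices are kept in whole cents to avoid floating-point noise):

```python
PAGES = 1_000_000  # arbitrary volume chosen for comparison

# name: (price in cents, number of pages that price buys), as quoted in the thread
QUOTES = {
    "Abbyy": (1, 1),                    # roughly $0.01 per page
    "Abbyy, high volume": (1, 10),      # "at best a tenth of that"
    "tesseract, slow end": (1, 100),    # $0.01 of machine time per 100 pages
    "tesseract, fast end": (1, 1000),   # $0.01 of machine time per 1000 pages
    "OCR.space": (1000, 100_000),       # $10 per 100,000 conversions
}


def cost_in_dollars(pages: int, cents: int, per_pages: int) -> float:
    """Total cost in dollars for `pages` pages at `cents` per `per_pages` pages."""
    return pages * cents / per_pages / 100


costs = {name: cost_in_dollars(PAGES, cents, per) for name, (cents, per) in QUOTES.items()}
```

Per million pages that works out to $10,000 for Abbyy at the quoted rate ($1,000 at the volume discount), $100 for OCR.space, and $10 to $100 of machine time for self-hosted tesseract, not counting the engineering time to run it.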

minerals29 6 months ago [–]

It is the best and this is one of the reasons why PDF extraction is hard :)

Alex3917 6 months ago [–]

So I have a lot of experience with basically the same problem just from working on this: https://www.prettyfwd.com. As an example of the opportunity size just in the email domain, the amount of personal non-spam email sent every day is like 100x the total size of Wikipedia, but nothing is really done with any of this information because of this challenge. Basically applications are things like:

- Better search engine results

- Identifying experts within a company

- Better machine translation

- Finding accounting fraud

- Automating legal processes

For context, the reason why Facebook is the most successful social network is that they're able to turn behavioral residue into content. If you can get better at taking garbage data and repackaging it into something useful, it stands to reason that there are lots of other companies the size of Facebook that can be created.

K0SM0S 6 months ago [–]

I often ponder how much of the "old world" will get "digitalized" — translated in numeric form, bits. And how much will just disappear. The question might seem trivial if you think of books, but now think of architecture, language itself (as it evolves), etc.

There's almost no question in my mind that most new data will endure in some form, by virtue of being digital from day 1.

The endgame for such a company, imho, is to become the "source entity" of information management (in abstracted form), whose two major products are one to express this information in the digital space, and the other in the analog/physical space. You may imagine variations of both (e.g. AR/VR for the former).

Kinda like language in the brain is "abstract" (A) (concept = pattern of neurons firing) and then speech "translates" into a given language, like English (B) or French (C) (different sets of neurons). So from A you easily go to either B or C or D... We've observed that Deep Learning actually does that for translation (there's a "new" "hidden" language in the neural net that expresses all human languages in a generic form of sorts, i.e. "A" in the above example).

The similarities of the ontology of language, and the ontology of information in a system (e.g. business) are remarkable, and what you want is really this fundamental object A, this abstract form which then generates all possible expressions of it (among which a little subset of ~1,000 gives you human languages, a mere 300 active iirc; and you might extend that into any formal language fitting the domain, like engineering math/physics, programming code, measurements/KPI, etc.).

It's a daunting task for sure but doable because the space is highly finite (nothing like behavior for instance; and you make it finite through formalization, provided your first goal is to translate e.g. business knowledge, not Shakespeare). It's also a one-off thing because then you may just iterate (refine) or fork, if the basis is sound enough.

I know it all sounds sci-fi but having looked at the problem from many angles, I've seen the PoC for every step (notably linguistics software before neural nets was really interesting, producing topological graphs in n dimensions of concepts e.g. by association). I'm pretty sure that's the future paradigm of "information encoding" and subsequent decoding, expression.

It's just really big, like telling people in the 1950's that because of this IBM thing, eventually everybody will have to get up to speed like it's 1990 already. But some people "knew", as in seeing the "possible" and even "likely". These were the ones who went on to make those techs and products.

IceKarma 6 months ago [–]

Digital data is arguably more fragile than analogue, offline, paper (or papyrus, or clay tablet) media. We have documents over 3000 years old that can still be read. Meanwhile, the proprietary software necessary to access many existing digital data formats is tied to obsolete hardware, working examples of which may no longer exist, emulators for which may not exist, and insufficient documentation may exist to even enable their creation. Just as one example, see the difficulty in enabling modern access to the BBC's 1986 Domesday Project.

bjonnh 6 months ago [–]

Academics and other people that rely on scientific publications. Most of the world's knowledge in science is locked into PDFs and screenshots (or even pictures) of manufacturers' (often proprietary) software... So extracting it in a more structured way would be a win (so HTML may not be best). On a related note, I've seen people using Okular to convert PDF tables to a usable form (to be honest its table extraction tool is one of the best I've seen despite being pretty manual).

speedplane 6 months ago [–]

> What are the groups that would benefit most from the PDF-to-HTML conversion? Who are the customers that would drive this profit? I tried to make those sentences not sound contentious but unfortunately they do, but I am genuinely curious about this space and who is feeling the lack of this technology most.

Legal technology. Pretty much everything a lawyer submits to a court is in PDF, or is physically mailed and then scanned in as PDF. If you want to build any technology that understands the law, you have to understand PDFs.

dscpls 6 months ago [–]

Organisations that have existing business processes to publish to print and pdf but now want to publish in responsive formats for mobile or even desktop web.

Changing their process might be more expensive than paying a lot of money for them to carry on as is for a few more years while getting the benefit of modern eyes on their content.

Edit: concrete example would be government publications like budget narrative documents.

tony0x02 6 months ago [–]

The patent office?

aidos 6 months ago [–]

I’ve done a bunch of this work myself and while it’s a bit of a pain to do in general, you can make some reasonable attempts at getting something workable for your use cases.

PDFs are incredibly flexible. Text can be specified in a bunch of ways. Glyphs can be defined to the nth degree. Text sometimes isn’t text at all. There’s no layout engine and everything is absolutely positioned. Fonts in PDF’s are insane because they’re often subset so they only include the required glyphs and the characters are remapped back to 1, 2, 3 etc instead of usual ascii codes.

kbenson 6 months ago [–]

> Fonts in PDF’s are insane because they’re often subset so they only include the required glyphs and the characters are remapped back to 1, 2, 3 etc instead of usual ascii codes.

I've actually seen obfuscation used in a PDF where they load in a custom font that changes the character mapping, so the text you get out of the PDF is gibberish, but the fonts displayed on rendering are correct (a simple character substitution cipher).

The important thing to remember whenever you think something should be simple, is that someone somewhere has a business need for it to be more complicated, so you'll likely have to deal with that complication at some point.

ccozan 6 months ago [–]

That's why the only "reliable" way to extract text is to perform an OCR of the pdf rendering which is exactly what ABBYY is doing.

siftrics 6 months ago [–]

I’m the founder of a startup that is doing this, as well. We strive to be as simple and easy as possible to use.

If you care to check us out: https://siftrics.com/

Defenestresque 6 months ago [–]

Your website demo video is impressive and I can imagine there are many businesses that would save a lot of time and man-hours by incorporating a solution like this.

I've often thought about creating products like these but as a one-man operation I am daunted by the "getting customers" part of the endeavour. How do you get a product like this into the hands of people who make the decisions in a business? (For anyone, not just OP). PPC AdWords campaigns? Cold-calling? Networking your ass off? Pay someone? Basically, how does one solve the "discoverability problem"?

siftrics 6 months ago [–]

Surprisingly, Hacker News has been our number one source of leads. We tried Google Ads and Reddit Ads, but the signup rate was literally three orders of magnitude lower than organic traffic from Hacker News and Reddit.

appleiigs 6 months ago [–]

Is your product only on the cloud? My privacy/internet security team won't let me use products that save customer or vendor data on the cloud because you might get hacked. Only giants, like Microsoft, have been approved after an evaluation.

siftrics 6 months ago [–]

More than half of our customers have asked to be able to skip our cloud and go directly to their database. We’re working on this right now. It’s scheduled to be released this week, so keep an eye open.

In the meantime, if you have any questions, feel free to send me an email at siftrics@siftrics.com. I’d love to hop on the phone or do a Zoom meeting or a Google Hangouts.

innomatics 6 months ago [–]

Will check it out thanks, we had evaluated ABBYY but it didn't suit. Does your product do something like key value or table detection?

siftrics 6 months ago [–]

We do table recognition and pride ourselves on being better at it than ABBYY. We can handle variable number of rows in a table and we take that into account when determining the position of other text on the page.

Feel free to email me at siftrics@siftrics.com with any questions. We can setup a phone call, zoom meeting, or google hangouts too, if you’d like.

72deluxe 6 months ago [–]

I like how your demo is clear and also in Linux! Surprising.

The page is clear and easy to understand, looks good. Well done.

siftrics 6 months ago [–]

Thank you for the kind words!

I like to say that anyone with a good old ThinkPad and an internet connection can mint fortunes and build empires :-)

malikolivier 6 months ago [–]

How is support for languages other than English?

I am especially thinking about Japanese. Our company could probably find good uses of such service if it had Japanese support.

siftrics 6 months ago [–]

Yes, Japanese is supported! Almost all languages are supported.

If you have any questions or need help trying it out, please email me at siftrics@siftrics.com. We can hop on the phone, too, if you'd like.

tastyminerals 6 months ago [–]

Gini GmbH performs document processing for almost all German banks and for many accounting companies. For banks it does realtime invoice photo processing -- OCR and extraction of amount, bank information, receiver etc. For accounting it extracts all kind of data from a PDF. Unfortunately, only for German language market. But here you go, ABBYY by far is not the only one. In fact ABBYY does only OCR and has some mediocre table detection. That's it.

zmix 6 months ago [–]

I do not remember which of the two it was, but 'poppler' or 'pdfbox' (they may use the same backend) created great HTML output, with absolute positions. They also have an XML mode, which is easily transformed.

Of course, there is absolutely no semantics, just display.

aidos 6 months ago [–]

That’s actually often just a consequence of the subsetting (I think). Believe it or not, you can often rebuild the cmaps using information in the pdf to fix the mapping and make the extraction work again.
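A toy illustration of the subsetting problem and of one way a mapping can sometimes be rebuilt. This is a sketch under stated assumptions, not a PDF parser: the dictionaries are hypothetical stand-ins for what would be read out of an embedded font, and recovering characters from glyph names is only one possible source of "information in the pdf" (the comment does not say which one was used).

```python
# Hypothetical subset font: character codes are simply 1, 2, 3, ... in order of
# first use, but the embedded font program still carries a name for each glyph.
code_to_glyph_name = {1: "H", 2: "e", 3: "l", 4: "o", 5: "space", 6: "W", 7: "r", 8: "d"}

# Tiny stand-in for a glyph-name table such as the Adobe Glyph List.
glyph_name_to_unicode = {"space": " "}

REPLACEMENT = "\ufffd"  # marks codes that could not be mapped


def rebuild_cmap(names: dict) -> dict:
    """Build a code -> character map from glyph names where that is possible."""
    cmap = {}
    for code, name in names.items():
        if name in glyph_name_to_unicode:
            cmap[code] = glyph_name_to_unicode[name]
        elif len(name) == 1:
            cmap[code] = name  # single-letter glyph names name themselves
        else:
            cmap[code] = REPLACEMENT
    return cmap


def decode(codes: bytes, cmap: dict) -> str:
    return "".join(cmap.get(code, REPLACEMENT) for code in codes)


raw = bytes([1, 2, 3, 3, 4, 5, 6, 4, 7, 3, 8])  # what the content stream stores
naive = raw.decode("latin-1")                   # control characters, i.e. gibberish
recovered = decode(raw, rebuild_cmap(code_to_glyph_name))
```

If the glyph names have been stripped or scrambled as well, this route is closed, which is one reason OCR on the rendered page keeps coming up in this thread.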

kbenson 6 months ago [–]

> That’s actually often just a consequence of the subsetting (I think).

I would believe that. It was a pretty poor obfuscation method as they go, if it was intended for that.

> Believe it or not, you can often rebuild the cmaps using information in the pdf to fix the mapping and make the extraction work again.

Oh, I did. That's the flip side of my second paragraph above. When there's a business need to work around complications or obfuscations, that will also happen. :)

speedplane 6 months ago [–]

> PDFs are incredibly flexible. Text can be specified in a bunch of ways. Glyphs can be defined to the nth degree. Text sometimes isn’t text at all. There’s no layout engine and everything is absolutely positioned.

Can't stress this enough. The next time you open a multi-column PDF in adobe reader and it selects a set of lines or a paragraph in the way you would expect, know that there is a huge amount of technology going on behind the scenes trying to figure out the start and end of each line and paragraph.

speedplane 6 months ago [–]

> The funny thing is that creating a universal algorithm to convert PDFs and/or HTML to plaintext is probably comparable in difficulty to building level 5 self-driving cars, and would accrue at least as much profit to any company that can solve it. But there are ... like zero dollars going into this problem.

Converting PDFs to HTML well is a very hard problem, but by itself it is hard to build a very big company on. When processing PDFs or documents generally, the value is not in the format, it's in the substantive content.

The real money is not going from PDF to HTML, but from HTML (or any doc format) into structured knowledge. There are plenty of companies trying to do this (including mine! www.docketalarm.com), and I agree it has the potential to be as big as self-driving cars. However, technology to understand human language and ideas is not nearly as well developed as technology to understand images, video, and radar (what self-driving cars rely on).

The problem is much more difficult to solve than building safer-than-human self-driving cars. If you can build a machine that truly understands text, you have built a general AI.

samplatt 6 months ago [–]

>like zero dollars going into this problem

There's a lot more than zero dollars going into this... it's just that the end result is universally something that's "good enough for this one use-case for this one company" and that's as far as it gets.

jimmaswell 6 months ago [–]

Rendering the document to an image and using OCR on it would bypass a lot of trouble trying to make sense of the source, wouldn't it?

Alex3917 6 months ago [–]

Not really, it's just a different set of challenges. The original article sums it up well, in terms of a lack of text-order hints. I haven't really tried incorporating OCR approaches at all, but I suspect they could probably be used to detect hidden text.

The basic issue imho is that NLP algorithms are very inaccurate even with perfect input. E.g. even with perfect input, they're maybe only 75% accurate. And even a text-processing algorithm that's like 99.9% accurate will yield input to your NLP algorithms that's like 50% accurate, so any results will be mostly unusable.
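One way to read the 99.9% versus 50% claim is as compounding error: if each character is extracted correctly with probability 0.999 and errors are independent (both are assumptions made here, not something the comment states), the chance that a whole passage comes through clean drops quickly with length.

```python
import math


def chance_all_correct(per_char_accuracy: float, n_chars: int) -> float:
    """Probability that n_chars independent characters are all extracted correctly."""
    return per_char_accuracy ** n_chars


def chars_until(per_char_accuracy: float, target: float) -> int:
    """Smallest passage length at which the all-correct probability falls to `target` or below."""
    return math.ceil(math.log(target) / math.log(per_char_accuracy))


break_even = chars_until(0.999, 0.5)
```

Under these assumptions `break_even` is 693, so a passage of roughly 700 characters (a paragraph or so) already has only a coin-flip chance of being error-free at 99.9% per-character accuracy.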

tastyminerals 6 months ago [–]

NLP algorithms are just fine. It is the combination of regexes, NLP and deep learning that allows you to achieve good extraction results. So, basically OCR / pdf parser -> jpeg/xml/json -> regexes + NLP / DL extractor.

bitL 6 months ago [–]

Semantic segmentation to identify blocks and OCR to convert to text - I think OneNote is already doing that. PDF is a horrible format for representing text, though PostScript is even worse.

Ididntdothis 6 months ago [–]

“The funny thing is that creating a universal algorithm to convert PDFs and/or HTML to plaintext is probably comparable in difficulty to building level 5 self-driving cars, ”

At least :)

perl4ever 6 months ago [–]

Since you can always print a PDF to a bitmap and use OCR, I assume you're implicitly asking for something that does substantially better. How much better, and why?

BigBubbleButt 6 months ago [–]

> The funny thing is that creating a universal algorithm to convert PDFs and/or HTML to plaintext...would accrue at least as much profit [as self-driving cars] to any company that can solve it.

Can you explain a bit more about why this is so valuable? I don't know anything about this industry.

dandelo1953 6 months ago [–]

Does this mean some abstraction is lost between the creation phase and final "save to pdf" phase? It'd seem ridiculous to not easily be able to track blocks while it's a WIP.....

dreamcompiler 6 months ago [–]

Except the stakes are lower. Nobody dies if a PDF extraction isn't perfect.

calf 6 months ago [–]

I don't know if PDF has truly evolved from its desktop publishing origins, but it is a terrible format because it no longer contains the higher level source information that you would have in an InDesign or a LaTeX file. PDF/Postscript were meant to represent optical fidelity and thus are too low-level abstractions for a lot of end-user, word processing tasks (such as detecting layout features), and thus trying to reverse engineer the "design intent" from them feels like doing work that is unnecessarily tedious. But that's the way it seems to be given the popularity of the format.

rbobby 6 months ago [–]

> double-tap to zoom

Why wouldn't you just zoom with the center point being where the tap occurred?

abiogenesis 6 months ago [–]

That's probably what they did for v1.0 after they saw it is not that easy to zoom such that the whole paragraph always fits into the visible area.

tolmasky 6 months ago [–]

Where to center to is only one vector, the other is how much to zoom: ideally it’s such that the text block fits on the screen. But again, that requires knowing the bounds of the text block. Zooming by a constant wherever you tap is a much less useful feature for text (vs. a map for instance), but I think it’s what we defaulted to (can’t remember if it was that or just nothing).
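The zoom half of the problem is the easy part once a block's bounds are known. A minimal sketch, assuming a hypothetical `Rect` type and that "fits on the screen" means the whole block is visible with its aspect ratio preserved:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Rect:
    x: float
    y: float
    width: float
    height: float


def zoom_to_block(block: Rect, screen_w: float, screen_h: float) -> Tuple[float, Tuple[float, float]]:
    """Return (scale, center): the largest scale at which the block still fits,
    and the point in page coordinates to center the view on."""
    scale = min(screen_w / block.width, screen_h / block.height)
    center = (block.x + block.width / 2, block.y + block.height / 2)
    return scale, center


# Hypothetical paragraph bounds and screen size, in the same units.
scale, center = zoom_to_block(Rect(x=50, y=100, width=200, height=400), screen_w=400, screen_h=800)
```

The hard part, as the comment says, is producing the `Rect` in the first place; with constant zoom the same function degenerates to a fixed `scale` and a `center` at the tap location.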

giovannibonetti 6 months ago [–]

One of the main features of the product I work on is data extraction from a specific type of PDF. If you want to build something similar these are my recommendations for you:

- Use https://github.com/flexpaper/pdf2json to convert the PDF in an array of (x, y, text) tuples

- Use a good text parsing library. Regexes are probably not enough for your use case. In case you are not aware of the limitations of regexes you may want to learn about Chomsky hierarchy of formal languages.

Here is the section of our Dockerfile that builds pdf2json for those of you that might need it:

    # Download and install pdf2json
    ARG PDF2JSON_VERSION=0.70
    RUN mkdir -p $HOME/pdf2json-$PDF2JSON_VERSION \
        && cd $HOME/pdf2json-$PDF2JSON_VERSION \
        && wget -q https://github.com/flexpaper/pdf2json/releases/download/$PDF... \
        && tar xzf pdf2json-$PDF2JSON_VERSION.tar.gz \
        && ./configure > /dev/null 2>&1 \
        && make > /dev/null 2>&1 \
        && make install > /dev/null \
        && rm -Rf $HOME/pdf2json-$PDF2JSON_VERSION \
        && cd
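Once a page has been reduced to (x, y, text) tuples as recommended above, the first step before any parsing is usually to reassemble reading order. A minimal sketch, assuming the tuples are already extracted (the real pdf2json output needs to be unpacked first and is not modeled here), that y grows downward, and that fragments on the same line differ in y by at most a small tolerance:

```python
from typing import List, Tuple

Fragment = Tuple[float, float, str]  # (x, y, text)


def group_into_lines(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Group fragments into lines by y, then order each line left to right."""
    lines = []  # each entry: (y of the line's first fragment, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    return [" ".join(text for _, text in sorted(parts)) for _, parts in lines]


# Hypothetical fragments, deliberately out of order and with slightly uneven baselines.
sample = [
    (120, 50.5, "Total"),
    (10, 50.0, "Invoice"),
    (10, 70.0, "Date:"),
    (60, 70.4, "2020-03-01"),
    (60, 50.2, "#42"),
]
lines = group_into_lines(sample)
```

Here `lines` comes out as `["Invoice #42 Total", "Date: 2020-03-01"]`. Keeping the x positions around (instead of joining with a single space) is what makes the later column and table heuristics possible.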

robinhowlett 6 months ago [–]

Thanks for the links - agree about the (x,y,text) callout but other metadata like font size can be useful too.

Regexes have limitations but I was able to leverage them sufficiently for PDFs from a single source.

I parsed over 1 million PDFs that had a fairly complex layout using Apache PDFBox and wrote about it here: https://www.robinhowlett.com/blog/2019/11/29/parsing-structu...

Defenestresque 6 months ago [–]

I thoroughly enjoyed both the blog post (as an accessible but thorough explanation of your experience with PDF data extraction) and the linked news article [0] as an all-too-familiar story of a company realizing that a creative person is using their freely-available data in novel and exciting ways and immediately requesting that they shut it down, because faced with the perceived dichotomy of maintaining control versus encouraging progress they will often play on the safe side.

[0] https://www.thoroughbreddailynews.com/getting-from-cease-and...

giovannibonetti 6 months ago [–]

Oh, yeah, pdf2json returns font sizes as well. I forgot to mention that.

pierre 6 months ago [–]

The pdf2json font names can sometimes be incorrect, as it only extracts them based on a preset collection of fonts. I suggest using this fork, which fixes it:

https://github.com/AXATechLab/pdf2json

Bounding boxes can also be off with pdf2json. Pdf.js does a better job but has a tendency to not handle some ligatures/glyphs well, sometimes transforming a word like "finish" into "f nish" (eating the i in this case). pdfminer (Python) is the best solution so far, but a thousand times slower....

hackcasual 6 months ago [–]

I worked on an online retailer's book scan ingestion pipeline. It's funny because we soon got most of our "scans" as print-ready PDFs, but we still ran them through the OCR pipeline (that would use the underlying pdf text) since parsing it any other way was a small nightmare.

minerals29 6 months ago [–]

I am an ML engineer in one of the PDF extraction companies processing thousands of invoices and receipts per day in realtime. Before we started adding ML, all our processing logic was built on top of hundreds of regexes and gazetteers. Even now, handcrafted rules are the backbone of our extraction system whereas ML is used as a fallback. Yes, regexes accumulate tech debt and become a maintenance blackhole, but if they work, they are faster and more accurate than any fancy DL tech out there.

serhart 6 months ago [–]

> Use a good text parsing library. Regexes are probably not enough for your use case. In case you are not aware of the limitations of regexes you may want to learn about Chomsky hierarchy of formal languages.

Most programming languages offer a regex engine capable of matching non-regular languages. I agree though, if you are actually trying to _parse_ text then a regex is not the right tool. It just depends on your use case.
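A small example of the point about regex engines going beyond regular languages, using Python's `re` module. The set of "squares" (some string repeated twice, like `abcabc`) is a textbook non-regular language, in fact not even context-free, yet a backreference matches it. The helper name `is_square` is just for illustration.

```python
import re

# \1 must match exactly the text captured by group 1, which a true
# regular expression (finite automaton) cannot express.
SQUARE = re.compile(r"(.+)\1")


def is_square(s: str) -> bool:
    """True if s consists of some non-empty string repeated exactly twice."""
    return SQUARE.fullmatch(s) is not None
```

For instance `is_square("abcabc")` is true and `is_square("abcabd")` is false. Being able to match such strings is still a long way from parsing nested structure, which is where a real parsing library earns its keep.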

dunham 6 months ago [–]

For simple cases, I've also found "pdftotext -layout" useful. For a quick one-off job, this would save someone the trouble of assembling the lines themselves.

I have used this in the past to extract tables, but it doesn't help much in cases where you need font size information.

daniel-levin 6 months ago [–]

I’m a contractor. One of my gigs involved writing parsers for 20-something different kinds of pdf bank statements. It’s a dark art. Once you’ve done it 20 times it becomes a lot easier. Now we simply POST a pdf to my service and it gets parsed and the data it contains gets chucked into a database. You can go extremely far with naive parsers. That is, regex combined with positionally-aware fixed-length formatting rules. I’m available for hire re. structured extraction from PDFs. I’ve also got a few OCR tricks up my sleeve (eg for when OCR thinks 0 and 6 are the same)
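A minimal sketch of what "regex combined with positionally-aware fixed-length formatting rules" can look like. The column offsets, the date format, and the sample lines are all invented for illustration; each real statement layout needs its own table of offsets, which is where the 20-something parsers come from.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Hypothetical fixed column positions for one statement layout.
COLUMNS = {"date": slice(0, 10), "description": slice(12, 40), "amount": slice(40, 52)}
DATE = re.compile(r"\d{2}/\d{2}/\d{4}")


def parse_line(line: str) -> Optional[dict]:
    """Cut a line at fixed positions and keep it only if the pieces validate."""
    fields = {name: line[pos].strip() for name, pos in COLUMNS.items()}
    if not DATE.fullmatch(fields["date"]):
        return None  # headers, footers, continuation lines
    try:
        amount = Decimal(fields["amount"].replace(",", ""))
    except InvalidOperation:
        return None
    return {"date": fields["date"], "description": fields["description"], "amount": amount}


# Sample lines built with explicit padding so the column positions are unambiguous.
header = "Date".ljust(12) + "Description".ljust(28) + "Amount".rjust(12)
row = "01/03/2020".ljust(12) + "CARD PAYMENT COFFEE".ljust(28) + "-1,012.50".rjust(12)
parsed = parse_line(row)
```

The regex does the validation and the slices do the splitting; lines that fail either check are simply skipped, which is crude but goes a long way on documents from a single source.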

pacoverdi 6 months ago [–]

Many years ago, I regularly had to parse specifications of protocols from various electronic exchanges. The general approach I used was to do a first pass using a Linux tool to convert it to text: pdftotext. Something like:

    pdftotext -layout -nopgbrk -eol unix -f $firstpage -l $lastpage -y 58 -x 0 -H 741 -W 596 "$FILE"

After that, it was a matter of writing and tweaking custom text parsers (in python or java) until the output was acceptable, generally an XML file consumed by the build (mainly to generate code).
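For the "custom text parsers in python" step, the same pdftotext call can be driven from Python. A sketch assuming poppler's `pdftotext` is on the PATH; the function names are made up for this example, the flags are the ones from the command above, and the trailing `-` asks pdftotext to write to standard output instead of a `.txt` file.

```python
import subprocess
from typing import List, Optional, Tuple


def pdftotext_command(pdf_path: str, first_page: int, last_page: int,
                      crop: Optional[Tuple[int, int, int, int]] = None) -> List[str]:
    """Build the argument list; crop is (x, y, width, height) of the area to keep."""
    cmd = ["pdftotext", "-layout", "-nopgbrk", "-eol", "unix",
           "-f", str(first_page), "-l", str(last_page)]
    if crop is not None:
        x, y, width, height = crop
        cmd += ["-x", str(x), "-y", str(y), "-W", str(width), "-H", str(height)]
    return cmd + [pdf_path, "-"]


def extract_text(pdf_path: str, first_page: int, last_page: int,
                 crop: Optional[Tuple[int, int, int, int]] = None) -> str:
    """Run pdftotext and return the extracted text; raises if the tool fails."""
    cmd = pdftotext_command(pdf_path, first_page, last_page, crop)
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

Passing the arguments as a list avoids shell quoting problems with file names, and cropping away headers and footers (as the `-x/-y/-W/-H` values in the original command do) removes a lot of noise before any parsing starts.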

A frequent need was to parse tables describing fields (name, id, description, possible values etc.). Unfortunately, sometimes tables spanned several pages and the column width was different on every page, which made column splitting difficult. So I annotated page jumps with markers (e.g. some 'X' characters indicating where to cut).

As someone else said, this is like black magic, but kind of fun :)

Edit: grammar

dredmorbius 6 months ago [–]

I've discovered page-oriented processing in awk, which is a godsend for parsing PDFs.

See:

https://news.ycombinator.com/item?id=22156456

In the GNU Awk User's Guide:

https://www.gnu.org/software/gawk/manual/html_node/Multiple-...

Tracking column and field widths across page breaks is ... interesting, but more tractable.
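For readers who don't use awk, roughly the same page-at-a-time idea in Python. This is only a sketch: it relies on pdftotext writing a form feed between pages (which it does unless `-nopgbrk` is passed), and the function names are made up here.

```python
def split_pages(text):
    """Split pdftotext output into pages on the form feed separator."""
    pages = text.split("\f")
    # The last page usually ends with a form feed too,
    # which leaves one empty trailing chunk.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def lines_with_page_numbers(text):
    """Yield (page_number, line) pairs so a parser can reset state per page."""
    for page_number, page in enumerate(split_pages(text), start=1):
        for line in page.splitlines():
            yield page_number, line


sample = "Name   Id\nfoo    1\n\fName      Id\nbar       2\n\f"
pages = split_pages(sample)
```

Keeping the page number next to each line is what makes it possible to recompute column widths whenever the page changes.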

mtlogstdo 6 months ago [–]

I worked for an epub firm that used a similar approach a while ago - we took PDFs and produced Flash (yes, that old) versions for online, and created iOS and Android apps for the publisher.

I've come across most of the problems in this post but the most memorable thing was when we were asked to support Arabic, when suddenly all your previous assumptions are backwards!

haberman 6 months ago [–]

Oh my goodness, this whole thread is deja vu from some code I wrote to parse my bank statements. I arrived at exactly the same solution of "pdftotext -layout" followed by a custom parser in Python. And ran into the same difficulty with tables: I wrote a custom table parser that uses heuristics to decide where column breaks are.
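One possible column-break heuristic, sketched under the assumption that the input is `pdftotext -layout` text for a single table on a single page: treat a character position as gutter when it is blank in every row, and split only on gutters at least two characters wide. The function names and sample table are hypothetical, not the parser described above.

```python
def find_column_spans(lines, min_gap=2):
    """Return (start, end) character spans of columns, end exclusive.

    A position counts as gutter when it is blank in every non-empty line;
    only gutters at least `min_gap` wide separate columns.
    """
    rows = [line for line in lines if line.strip()]
    if not rows:
        return []
    width = max(len(row) for row in rows)
    padded = [row.ljust(width) for row in rows]
    blank = [all(row[i] == " " for row in padded) for i in range(width)]

    spans = []
    start = None
    gap = 0
    for i, is_blank in enumerate(blank):
        if not is_blank:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap == min_gap:
                # Close the column at the first blank of this gutter.
                spans.append((start, i - min_gap + 1))
                start = None
    if start is not None:
        spans.append((start, width - gap))
    return spans


def split_row(line, spans):
    """Cut one line at the detected spans and trim each cell."""
    return [line[start:end].strip() for start, end in spans]


table = [
    "{:<14}{:<5}{}".format(*cells)
    for cells in [
        ("Name", "Id", "Description"),
        ("MsgType", "35", "Type of message"),
        ("SenderCompID", "49", "Sender id"),
    ]
]
spans = find_column_spans(table)
rows = [split_row(line, spans) for line in table]
```

Single spaces inside a cell ("Type of message") survive because they never form a wide all-blank gutter. The heuristic breaks down when a cell is wide enough to cover a gutter, and it has to be rerun per page when column widths change.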

hnick 6 months ago [–]

I work in the print industry and some clients have the naive idea they'll save money by formatting their own documents (naive because usually this just means a lot more work for us, which they end up paying for).

We need some metadata to rearrange and sort PDF pages for mailing and delivery (such as name, address, and start/end page for that customer).

Our general rule is you provide metadata in an external file to make it easy for us. Otherwise, we run pdftotext and hope there's a consistent formatting for the output (e.g. every first page has "Issue Date:", "Dear XYZ,", or something written on it).
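A rough sketch of that fallback, assuming the pdftotext output has already been split into one string per page and that a marker such as "Issue Date:" really does appear on every first page. The function name and sample pages are invented for illustration.

```python
def customer_page_ranges(pages, marker="Issue Date:"):
    """Return 1-based (start_page, end_page) ranges, one per customer.

    A page containing `marker` starts a new customer document; every page
    after it, up to the next marker page, belongs to the same customer.
    Pages before the first marker are ignored.
    """
    starts = [n for n, page in enumerate(pages, start=1) if marker in page]
    ends = [next_start - 1 for next_start in starts[1:]] + [len(pages)]
    return list(zip(starts, ends))


pages = [
    "Issue Date: 2020-03-01\nDear Alice,",
    "page 2 of Alice's letter",
    "Issue Date: 2020-03-01\nDear Bob,",
]
ranges = customer_page_ranges(pages)
```

If the marker is missing from even one first page, two customers silently merge into one range, which is why the external metadata file is the safer option.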

If that doesn't work then we're re-negotiating. It is not too difficult usually to build a parser for one family of PDF files based on a common setup as you've said and you get to learn various tricks. It is very difficult though to write a general parser.

Personally, I found parsing postscript easier since usually it was presented linearly.

just_myles 6 months ago [–]

I can cosign on this methodology. I used to work in an organization that used to build pdfs for accounting and licensing documentation. I used a proprietary tool (Planetpress :( ) to generate the documents using metadata from a separate input file (csv or xml) to determine what column maps to what field.

Good thing about this was as you have already outlined: It allowed for some flexibility in what was acceptable input data. For specific address formats or names we could accept multiple formats as long as they were consistent and in the proper position in the input file.

Regarding renegotiating: We didn't get that far. However, if a customer within our organization was enlisting our expertise and could not produce an acceptable input file, then we would go back to them and explain the format that we require in order to generate the necessary documents. Of course, creating our document through our data pipelines is obviously the better choice, but this was not an option in some cases at the time.

As far as doing the work of creating these documents in a tool like Planetpress is concerned, well, don't use Planetpress. You are better off doing it with the libraries of your favorite language, tbh. Nothing worse than having to use proprietary code (Presstalk/Postscript) that you have to learn and will never be able to use anywhere else.

hnick 6 months ago [–]

By re-negotiating I mean in terms of quoting billable hours. A rule of thumb for a typical Postscript scraper was around 20 hours end to end (dev, testing, and integration into our workflow system).

The problem we have with a lot of client files is that they look fine but printers don't care about "look fine", they crash hard when they run out of virtual memory due to poor structure. And usually without a helpful error message, so that's more billable hours to diagnose. The most common culprit is workflows that develop single document PDFs then merge them resulting in thousands of similar and highly redundant subset fonts.

Quarrelsome 6 months ago [–]

Any tricks for decimal points versus noise? It's a terrifying outcome and all I've got is doing statistical analysis on the data you've already got and highlighting "outliers".

kevin_thibedeau 6 months ago [–]

Change the decimal point in the font to something distinctive before rasterizing.

_0z6m 6 months ago [–]

For something like bank statements, I'd use the rigidly-defined formatting (both number formatting and field position) to inform how to interpret OCR misfires. My larger concern then would be missing a leading 1 (1500.00 v 500.00), but checking for dark pixels immediately preceding the number will flag those errors. And I suppose looking for dark pixels between numbers could help with missed decimals too.

mipmap04 6 months ago [–]

I've done this a bit. I define ranges per numeric field and if it exceeds or is below that range, I send it to another queue for manual review. Sometimes I'll write rules where if it's a dollar amount that usually ends ".00" and I don't read a decimal but I do have "00", then I'll just fix that automatically if it's outside my range.
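The range rule could be sketched like this. The status names, the bounds, and the function itself are assumptions made for the example, not the actual rules described above.

```python
from decimal import Decimal, InvalidOperation


def check_amount(raw, low, high):
    """Return (value, status); status is 'ok', 'fixed' or 'review'."""
    text = raw.replace(",", "").strip()
    try:
        value = Decimal(text)
    except InvalidOperation:
        # Not a number at all (e.g. letters misread for digits).
        return None, "review"
    if low <= value <= high:
        return value, "ok"
    # Out of range, no decimal point, ends in "00":
    # assume the OCR dropped the point before the cents.
    if "." not in text and text.endswith("00"):
        fixed = Decimal(text[:-2] + "." + text[-2:])
        if low <= fixed <= high:
            return fixed, "fixed"
    return value, "review"


results = [check_amount(raw, 0, 5000) for raw in ["1500.00", "150000", "150023"]]
```

Only the "looks like a dropped `.00`" case is repaired automatically; everything else out of range goes to the manual review queue.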

HPsquared 6 months ago [–]

(Novice speaking) Maybe there's something about looking for the spacing / kerning that is taken up by a decimal point? (Not sure if OCR tools have any way to look for this)

throwaway3157 6 months ago [–]

Do you have a blog? I'd enjoy reading some of your tricks.

Also, how do you manage things when one of those banks decides to change the layout/format?

saradhi 6 months ago [–]

Interestingly, I was doing similar stuff for 3 years for a US company. Curious, is your client a legal tech company? Mine was.

The experience helped me to roll out an API, as https://extracttable.com, for developers.

OCR tricks? Assuming that means post-processing dev stuff, may I know your OCR engine? We are supported with Kofax and openText along with cloud engines like GVision as a backup.

slig 6 months ago [–]

Maybe there's a SaaS opportunity for you to explore.

LeonM 6 months ago [–]

I built such a service, but it is impossible to guarantee any reliable result. I ended up shutting it down.

The PDF standard is a mess, and the number of 'tricks' I've seen done is astonishing.

Example: to add a shade or border effect to text, most PDF generators simply add the text twice with a subtle offset and different colors. Result: your SaaS service returns every sentence twice.

Of course there were workarounds, but at some point it became unmaintainable.
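One such workaround, as a sketch only: if the extractor gives each text run a position, drop any run whose text matches an earlier run at almost the same coordinates. The `Span` type and the 2-point tolerance are assumptions for the example, not any particular library's API.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical extractor output: a text run and its page position."""
    text: str
    x: float
    y: float


def drop_shadow_duplicates(spans, tolerance=2.0):
    """Keep the first of any spans with identical text at nearly the same spot."""
    kept = []
    for span in spans:
        is_duplicate = any(
            span.text == other.text
            and abs(span.x - other.x) <= tolerance
            and abs(span.y - other.y) <= tolerance
            for other in kept
        )
        if not is_duplicate:
            kept.append(span)
    return kept


spans = [
    Span("Invoice", 72.0, 700.0),
    Span("Invoice", 72.6, 699.4),  # shadow copy, offset by less than a point
    Span("Total", 72.0, 650.0),
    Span("Total", 400.0, 650.0),   # same word elsewhere on the page: kept
]
deduped = drop_shadow_duplicates(spans)
```

The pairwise comparison is quadratic in the number of spans, and a tolerance that suits one generator's shadow offset may merge legitimately repeated text from another.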

shim__ 6 months ago [–]

I'm actually surprised that PDF hasn't been superseded by some form of embedded HTML by now.

Fnoord 6 months ago [–]

It partly has: ePub [1], an open format for ebooks, contains HTML.

[1] https://en.wikipedia.org/wiki/EPUB

lmm 6 months ago [–]

I'd say exactly the opposite. PDF makes it easy to create a document that looks exactly the way you want it to, which seems to be all that most web designers want (witness all the sites that force a narrow column on a large screen and won't reflow their text properly on a small screen).

zo1 6 months ago [–]

In a way it has. In my experience, there have been multiple times where a "generate PDF" requirement has come up, with the best viable solution being "develop it in HTML using standard tech" followed by "and then convert it to PDF".

mLuby 6 months ago [–]

I blame CSS.

faeyanpiraat 6 months ago [–]

Why?

siftrics 6 months ago [–]

Hi! I’m the founder of a startup (https://siftrics.com) in this exact space.

The demand for automating text extraction is still very high — or at least it feels like it when you’re working around the clock to cater to 3 of your customers, only to wake up to 10 more the next day. We’re small but growing extremely quickly.

JoblessWonder 6 months ago [–]

As someone who works in aviation... what made you choose an avionics company as your demo business?

I've bookmarked your site for future research... but the aviation part has me curious!

kelvin0 6 months ago [–]

What space are your customers in? Healthcare? Government?

siftrics 6 months ago [–]

Everything. Insurance companies to fledgling AI startups.

It’s definitely harder to get government business because the sales process is so long and compliance is so stringent. That said, we are GDPR compliant.

slig 6 months ago [–]

Great demo video. Congrats on growing your startup!

kelvin0 6 months ago [–]

Well I am putting the finishing touches on a front end that allows extracting PDF text visually. It's also able to adjust when the PDF page size vary for a given document type. Once you build the extractor for a document type, it can run on a batch of PDFs and store to Excel or Database (or any other format). I sense this tool facilitates and automates a lot of the 'dark art' you mention. Of course there are always difficult documents that don't fit exactly in the initial extraction paradigm, for those I use the big guns ...

penagwin 6 months ago [–]

I'd also be interested in a blog or any basic tips/examples! I totally understand you don't want to give too much away, but I'm sure HN would love to see it!

just_myles 6 months ago [–]

I remember writing one of my first parsers was for a pdf and I had to employ a similar methodology where I had to rely on regex and "positionally-aware fixed-length" formatting rules. I would literally chunk specific groups by the number of spaces they contained lol. I had to do very little manual intervention but, damn it all, it worked :D .
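Chunking by runs of spaces can be as small as this sketch. The two-space threshold and the sample line are assumptions; layout-preserving text output tends to put wide gaps between fields and single spaces inside them.

```python
import re


def split_on_wide_gaps(line, min_spaces=2):
    """Split a line into fields wherever at least `min_spaces` spaces occur."""
    pattern = r" {%d,}" % min_spaces
    return [chunk for chunk in re.split(pattern, line.strip()) if chunk]


fields = split_on_wide_gaps("2020-03-01    ACME SUPPLIES LTD      1,250.00")
```

It falls over as soon as a long description leaves only one space before the next field, which is where the positional rules come in.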

malcolmhere 6 months ago [–]

I've written similar code for investment banks, to extract financial reporting data from PDFs. It's shocking to think how much of the financial world runs on this kind of tin-cans-on-a-piece-of-string solution.

BlueTemplar 6 months ago [–]

Do these pdfs even get printed ?

kabacha 6 months ago [–]

My first internship was at a small company that did PDF parsing and building for EU government agencies and it was really painful work but paid an absolute shitton.

sudhirj 6 months ago [–]

Are you open to doing more of this? Trying to do the same thing but I’d rather have an expert do it and focus on the app.

kelvin0 6 months ago [–]

Are you building an app?

sudhirj 6 months ago [–]

Building personal finance app to keep track of multiple bank accounts and investments, categorising spends, etc.

Parsing statement PDFs from every bank is pretty hellish.

wdb 6 months ago [–]

That’s why the open banking api is amazing these days

In the past we did purposely make it more difficult to parse our PDFs

fluidcruft 6 months ago [–]

Do you have any tricks for dealing with missing unicode character mapping tables for embedded fonts?

graeme 6 months ago [–]

What’s your contact info? Didn’t see any on your github.

daniel-levin 6 months ago [–]

dan at threeaddresscode dot com

teddyc 6 months ago [–]

Can I PM you?

daniel-levin 6 months ago [–]

Of course.

teddyc 6 months ago [–]

I don't think Hacker News supports PMs.

I managed to find your email address from your GitHub profile. Going to send you an old fashioned email.

aggerdom 6 months ago [–]

Are you me? Wish that I had known the insertion order trick, though it isn't straightforward to implement with the stack I was using at a previous gig (Tabula + naive parsing + Pandas data munging). I can expand on a few challenges I've run into when parsing PDFs:

# Parser drift and maintenance hell

Let's say that you receive 100 invoices a month from a company over the course of 3 months. You look over a handful of examples, pick features that appear to be invariant, and determine your parsing approach. You build your parser. You're associating charges from tables with the sections they're declared in, and possibly making some kind of classification to make sure everything is adding up right. It works for the one or two example pdfs you were building against. It goes live.

You get a call or bug report: it's not working. You try the new pdf they send you. It looks similar, but won't parse because it is--in fact--subtly different. It has a slightly different formatting of the phone number on the cover page, but is identical everywhere else. You change things to account for that. You retest your examples, they break. Ok, two different formats, same month, same supplier. You fix it. Chekhov's Gun has been planted.

A month passes, it breaks. You inspect the offending pdf. Someone racked up enough charges they no longer fit on a page. You alter the parser to check the next page. Sometimes their name appears again, sometimes not, sometimes their next page is 300 pages away. It works again.

A few more months later, a sense of deja-vu starts to set in. Didn't I fix this already? You start tracking three pdfs across 3 months:

pdf 1 : a -> b -> c (Starts with format a, change to be same as pdf 2, then changes again)

pdf 2 : b -> b -> c (Starts with one format, stays the same, changes the same way as pdf 1)

pdf 3 : b -> a -> b (Starts same as pdf 2, changes to same as pdf 1 first month, same as pdf 3)

What's the common factor between these version changes? The return address is determining the version.

PDFs are slightly different from office to office, with templates drifting slightly each month in diverging directions. You have to start reevaluating parsing choices and splitting up parsers. It's difficult to account for incurring linear maintenance cost for each new supplier and amortize that over a sizeable period of time. My arch nemesis is an intern who got put to work fixing the invoices at one office of one foreign supplier.
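Splitting up parsers by return address could look something like this sketch. The addresses, parser bodies, and registry are all hypothetical; the point is only that the dispatch key is the return address and that unknown templates fail loudly instead of being parsed by the wrong rules.

```python
PARSERS = {}


def parser_for(return_address):
    """Decorator registering a parser for one office's template."""
    def register(func):
        PARSERS[return_address] = func
        return func
    return register


@parser_for("1 Example Street")
def parse_office_a(text):
    # Placeholder body: a real parser would hold this office's rules.
    return {"template": "a", "lines": text.splitlines()}


@parser_for("2 Sample Road")
def parse_office_b(text):
    return {"template": "b", "lines": text.splitlines()}


def parse_invoice(text):
    """Pick the parser whose return address appears in the extracted text."""
    for return_address, parser in PARSERS.items():
        if return_address in text:
            return parser(text)
    raise ValueError("unknown template: send to manual review")


parsed = parse_invoice("ACME Corp\n1 Example Street\nTotal 5.00")
```

Each new office still costs a new parser, so this organizes the linear maintenance cost rather than removing it.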

# PDFs that aren't standards compliant

In this case, most pdf processing libraries will bail out. Pdf viewers on the other hand will silently ignore some corrupted or malformed data. I remember seeing one that would be consistently off by a single bit. Something like `\setfont !2` needed to have '!' swapped out for another syntactically valid character that would leave byte offsets for the pdf unchanged.

TLDR: If you can push back, push back. Take your data in any format other than PDF if there is any way that is possible.

Iwillgetby 6 months ago [–]

If you upload a pdf to google drive and download it 10 minutes later it will magically have BY FAR the best OCR results in the pdf. Note my pdf tests were fairly clean so your experience may not be the same.

I have used Google's fine OCR results to simulate a hacker.

- Download a youtube video that shows how to attack a server on the website hackthebox.eu

- Run ffmpeg to convert the video to images.

- Run a jpeg to pdf tool.

- Upload the pdf to google drive.

- Download the pdf from google drive.

- Grep for the command line identifiers "$" "#".

- Connect to hackthebox.eu vpn.

- Attack the same machine in the video.

DantesKite 6 months ago [–]

Right? I love the OCR for Google Drive. It's such a useful, hidden feature.

By the way, why do you wait 10 minutes? Is there a signal that the PDF is done processing?

Or is there just some kind of voodoo magic that seems to happen that just takes 10 minutes to do?

Iwillgetby 6 months ago [–]

2 minutes is probably long enough. I did notice that google drive doesn't seem to like it if you upload a lot of files. I have had files sit and never get OCR, but I forgot about them so they may have OCR on them now.

Also, I am not aware of a signal when it is done.

Ididntdothis 6 months ago [–]

You got to love modern software. It may do it or not. It may do it within an unknowable timeframe. But if it does it, it’s wonderful.

fireattack 6 months ago [–]

Google Drive can directly OCR jpeg or any image. Just upload and open it with Google Docs.

Now I think about it, I don't know what you mean by "upload a pdf to google drive and download it 10 minutes later".

Uploading and downloading a file shouldn't change it at all, at bit level.

anthk 6 months ago [–]

>- Run a jpeg to pdf tool.

ImageMagick. convert *.jpg out.pdf

ddeokbokki 6 months ago [–]

This solution is absolutely beautiful

Wiretrip 6 months ago [–]

PDF is, without a doubt, one of the worst file formats ever produced and should really be destroyed with fire... That said, as long as you think of PDF as an image format it's less soul destroying to deal with.

lm28469 6 months ago [–]

PDF is good at what it's supposed to be good. Parsing pdf to extract data is like using a rock as a hammer and a screw as a nail, if you try hard enough it'll eventually work but it was never intended to be used that way.

mumblemumble 6 months ago [–]

I think my fastener analogy would probably involve something more like trying to remove a screw that's been epoxied in. Or perhaps trying to do your own repairs on a Samsung phone.

It's not that the thing you're trying to do is stupid. It's probably entirely legitimate, and driven by a real need. It's just that the original designers of the thing you're trying to work on didn't give a damn about your ability to work on it.

Finnucane 6 months ago [–]

Actually, parsing text data from a pdf is more like using the rock to unscrew a screw, in that it was not meant to be done that way at all. But yeah, the pdf was designed to provide a fixed-format document that could be displayed or printed with the same output regardless of the device used.

I'm not sure (I haven't thought about it a lot) that you could come up with a format that duplicates that function and is also easier to parse or edit.

anoncake 6 months ago [–]

It's closer to using a screwdriver to screw in a rock. The task isn't supposed to be done in the first place but the tool is the least wrong one.

mark-r 6 months ago [–]

I would think any word processing document format would duplicate that function and be better.

bachmeier 6 months ago [–]

It's pretty silly when you think about it. There's an underlying assumption that you'll work with the data in the original format that you used to make the PDF.

hhas01 6 months ago [–]

“PDF is good at what it's supposed to be good.”

QFT. PDF should really have been called “Print Description Format”. At heart it’s really just a long list of non-linear drawing instructions for plotting font glyphs; a sort of cut-down PostScript.

https://en.wikipedia.org/wiki/PostScript

(And, yes, I have done automated text extraction on raw PDF, via Python’s pdfminer. Even with library support, it is super nasty and brittle, and very document specific. Makes DOCX/XLSX parsing seem a walk in the park.)

What’s really annoying is that the PDF format is also extensible, which allows additional capabilities such as user-editable forms (XFDF) and Accessibility support.

https://www.adobe.com/accessibility/pdf/pdf-accessibility-ov...

Accessibility makes text content available as honest-to-goodness actual text, which is precisely what you want when doing text extraction. What’s good for disabled humans is good for machines too; who knew?

i.e. PDF format already offers the solution you seek. Yet you could probably count on the fingers of one hand the PDF generators that write Accessible PDF as standard.

(As for who’s to blame for that, I leave others to join up the dots.)

ProZsolt 6 months ago [–]

PDF is great at what it's meant to be: digital printed paper, with its pros (it will look exactly the same anywhere) and cons (you can't easily extract data from it or modify it).

Currently, there is no viable alternative if you want the pros but not the cons

bobbylarrybobby 6 months ago [–]

For me, the biggest con of PDFs is that like physical books, the font family and size cannot be changed. This means you can't blow the text up without having to scroll horizontally to read each line or change the font to one you prefer for whatever reason. It boggles my mind that we accept throwing away the raw underlying text that forms a PDF. PDF is one step above a JPEG containing the same contents.

mumblemumble 6 months ago [–]

> Currently, there is no viable alternative if you want the pros but not the cons

I remember OpenXPS being much easier to work with. That might be due to cultural rather than structural differences, mind - fewer applications generate OpenXPS, so there's fewer applications to generate them in their own special snowflake ways.

ProZsolt 6 months ago [–]

This is the first time I heard of it. When I search for it I only find the Wikipedia article and 99 links to how to convert it to pdf.

The problem with this is that from an average person perspective it doesn't have the pros. There is no built-in or first-party app that can open this format on Mac and Linux. More than 99% of the users only want to read or print it. It's hard to convince them to use an alternative format when it's way more difficult to do the only thing they want to do.

degski 6 months ago [–]

It's a Windows-thing, since W7, IIRC. It's ok now, but it has been buggy for years, and yes, who eats xps-files, so better it is, but it's not more useful.

tonyedgecombe 6 months ago [–]

It was too late and probably too attached to Microsoft to succeed. It is still used as the spool file format for modern printer drivers on Windows.

carapace 6 months ago [–]

Screenshots of Smalltalk. (I'm joking.)

bob1029 6 months ago [–]

We have to fill existing PDFs from a wide range of vendors and clients. Our approach is to raster all PDFs to 300DPI PNG images before doing anything with them.

Once you have something as a PNG (or any other format you can get into a Bitmap), throwing it against something like System.Drawing in .NET(core) is trivial. Once you are in this domain, you can do literally anything you want with that PDF. Barcodes, images, sideways text, html, OpenGL-rendered scenes, etc. It's the least stressful way I can imagine dealing with PDFs. For final delivery, we recombine the images into a PDF that simply has these as scaled 1:1 to the document. No one can tell the difference between source and destination PDF unless they look at the file size on disk.

This approach is non-ideal if minimal document size is a concern and you can't deal with the PNG bloat compared to native PDF. It is also problematic if you would like to perform text extraction. We use this technique for documents that are ultimately printed, emailed to customers, or submitted to long-term storage systems (which currently get populated with scanned content anyways).

rahimnathwani 6 months ago [–]

You could probably reduce file size by generating your additions as a single PDF, and then combining that with the original 'form', using something like

pdftk form.pdf multibackground additions.pdf output output.pdf

mopsi 6 months ago [–]

> No one can tell the difference between source and destination PDF unless they look at the file size on disk.

Not even when they try to select and copy text?

hnick 6 months ago [–]

You can add PDF tag commands to make rasterised text selectable and searchable, though they probably aren't doing that.

equasar 6 months ago [–]

Any recommended library for .NET to extract text by coordinates?

Rury 6 months ago [–]

There's itext7 (also for java). Not sure how it compares with other libraries, but it will parse text along with coordinates. You just need to write your own extraction strategy to parse how you want.

From my experience, it seems to grab text just fine, the tricky part is identifying & grabbing what you want, and ignoring what you don't want... (for reasons mentioned in the article)

https://github.com/itext/itext7-dotnet

https://itextpdf.com/en/resources/examples/itext-7/parsing-p...

bob1029 6 months ago [–]

I don't know that this could exist for all PDFs.

Sounds like you are in need of OCR if you want to be able to use arbitrary screen coords as a lookup constraint.

eterps 6 months ago [–]

Lots of people doing their daily jobs are not aware of the information loss that occurs whenever they are saving/exporting as PDF.

totololo 6 months ago [–]

In the consulting industry I’ve seen PDF being used precisely because third parties couldn’t mess with the content anymore.

KineticLensman 6 months ago [–]

Yes, the company I once worked for used to supply locked PDF copies to make it slightly harder for casual readers to re-use / steal our text.

ldenoue 6 months ago [–]

That’s the approach I’m using to reformat “reflow” PDFs for mobile in my app https://readerview.app/

darknoon 6 months ago [–]

The first link on your demo gives me an error (mobile safari) https://www.appblit.com/pdfreflow/viewdoc?url=http://arxiv.o...

Gatsky 6 months ago [–]

I have been waiting for this for so long. It really works, well done.

stronglikedan 6 months ago [–]

Tell that to the entire commercial print industry, where they work very well.

Ididntdothis 6 months ago [–]

Yup. I still have PTSD from a project where I needed to extract text from millions of PDFs

adrianN 6 months ago [–]

What alternative do you propose? Postscript?

Koshkin 6 months ago [–]

Why not, .ps.gz works pretty well.

hamburglar 6 months ago [–]

... and is much more difficult to extract text from than PDF, given that it's Turing complete (hello halting problem) and doesn't even restrict your output to a particular bounding box.

goatlover 6 months ago [–]

It was never meant to be a data storage format. It's for reading and printing.

BlueTemplar 6 months ago [–]

Except it sucks for reading?

goatlover 6 months ago [–]

I haven't experienced problems reading articles and books in PDF format on my phone.

efreak 6 months ago [–]

I read ebooks on my Nintendo DSi for several years when I was in college; the low-resolution screen combined with my need for glasses (and dislike of wearing them) made reading PDF files unbearable. Later on I got a cheap android tablet and reading PDF files was easier, but still required constant panning and zooming. Today I use a more modern device (2013 Nexus 7 or 2014 NVidia Shield), and I still don't like PDF files. I usually open the PDF in word if possible, save it in another format, then convert to epub with calibre, and dump the other formats.

Epubs in comparison are easy, as all it takes is a single tap or button press to continue. When there's no DRM on the file (thanks HB, Baen) I read in FBReader with custom fonts, colors, and text size. It doesn't hurt any that the epub files I get are usually smaller than the PDF version of the same book.

Personally, I think the fact that Calibre's format converter has so many device templates for PDF conversion says a lot.

hhas01 6 months ago [–]

Try being visually impaired.

drdeadringer 6 months ago [–]

I have a doubt. What am I missing?

grishka 6 months ago [–]

You clearly haven't ever worked with MP3.

sixhobbits 6 months ago [–]

As a meta point, it's really nice to see such a well-written, well-researched article that is obviously used as a form of lead generation for the company, and yet no in your face "call to actions" which try to stop you reading the article you came for and get out your wallet instead.

jiveturkey 6 months ago [–]

i mean except for the banner at the top and bottom! but yeah, an SEO article with actual substance, well formatted, not grey-on-grey[1], no trackers[2], is rare these days.

[1] recently read an SEO post on okta's site. who can read that garbage?

[2] only GA ... which isn't a 3rd-party tracker.

duckmysick 6 months ago [–]

> GA ... which isn't a 3rd-party tracker.

Why not? It's not self-hosted and results are stored elsewhere.

jiveturkey 6 months ago [–]

it doesn't correlate across sites by default -- the reasonable definition of a 3rd party tracker. by your definition, everything not completely self-hosted is a 3rd-party tracker. eg, netlify, which uses server logs to "self"-analyze would be a 3rd party tracker. it is not self-hosted and the data is stored elsewhere.

some might add: for the purpose of resale of the data, but I don't think that's a requirement to be classified as 3rd party tracker. the mere act of correlation, no matter what you then do with the data, makes you a 3rd party tracker. in case you think that's just semantics, this is important for GDPR and the new california law.

you can turn on the "doubleclick" option, which does do said correlation and tracks you. but that's up to the site to decide. GA doesn't do it by default.

dwheeler 6 months ago [–]

The best technique for having a PDF with extractable data is to include the data within the PDF itself. That is what LibreOffice can do, it can slip in the entire original document within a PDF. Since a compressed file is quite small, the resulting files are not that much larger, and then you don't need to fuss with OCR or anything else.

wenc 6 months ago [–]

Yes to embedding. In Canada, folks have always been able to e-file tax returns, but the CRA (Canada Revenue Agency) also has fillable PDF forms for folks who insist on mailing in their returns (with their receipts and stuff so they don't have to store them and risk losing them).

When you're done filling the form, the PDF runs form validity checks and generates a 2D barcode [1] -- which stores all your field entry data -- on the first page. This 2D barcode can then be digitally extracted on the receiving end with either a 2D barcode scanner or a computer algorithm. No loss of fidelity.

Looks like Acrobat supports generation of QR, PDF417 and Data Matrix 2D barcodes.[2]

[1] https://www.canada.ca/en/revenue-agency/services/tax/busines...

[2] https://helpx.adobe.com/acrobat/using/pdf-barcode-form-field...

gruez 6 months ago [–]

>for folks who insist on mailing in their returns (with their receipts and stuff so they don't have to store them and risk losing them).

The Canadian tax agency offers free storage for whatever receipts you mail them? Sounds nifty. Does the IRS (or any other tax agency) do this?

wenc 6 months ago [–]

Just the receipts relevant to the tax return. If you e-file you're responsible for storing receipts up to 6 years in case of audit. (or something like that)

radarsat1 6 months ago [–]

As long as you can trust that the contents of the embedded document is the same as what is displayed.

dwheeler 6 months ago [–]

If you're worried about malicious differences, "regular" PDFs are worse.

As noted in the article, it is extremely difficult to figure out the original text given only a "normal" PDF, so you end up using a lot of heuristics that sometimes guess correctly. There's no guarantee that you'll be able to extract the "original text" when you start with an arbitrary PDF without embedded data. So if you're extracting text, neither way guarantees that you'll get "original text" that exactly matches the displayed PDF if an attacker created the PDF.

That said, there's more you can do if you have an embedded OpenDocument file. For example, you could OCR the displayed PDF, and then show the differences with the embedded file. In some cases you could even regenerate the displayed PDF & do a comparison. There are lots of advantages when you have the embedded data.
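A sketch of the "show the differences" step, assuming you already have the embedded document's text and the OCR text as plain strings (getting either one is outside this example). It uses Python's standard `difflib`; the function names and sample strings are made up.

```python
import difflib


def normalize(text):
    """Collapse whitespace so layout differences are not reported."""
    return " ".join(text.split())


def text_mismatches(embedded_text, ocr_text):
    """Return (embedded_words, ocr_words) pairs where the two texts disagree."""
    a = normalize(embedded_text).split(" ")
    b = normalize(ocr_text).split(" ")
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


embedded = "Pay the supplier 500.00 by Friday"
ocr = "Pay the supplier  1500.00 by\nFriday"
mismatches = text_mismatches(embedded, ocr)
```

OCR noise will produce false positives, so in practice the output is a list of spots for a human to look at, not proof of tampering.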

72deluxe 6 months ago [–]

How does LibreOffice include the entire document with the PDF?

Is there a special "data" section of the PDF that includes this? Can you point me to any documentation regarding this? It sounds quite good TBH.

bsdubernerd 6 months ago [–]

It's nice to note how several of these problems already exist in much more structured document types, such as HTML.

Using white-on-white black-hat SEO techniques for keyword boosting? Check. Custom fonts with random glyphs? Check. I didn't see custom encodings (yet).

We try to keep HTML semantic, but google has been interpreting pages to a much higher level in order to spot issues such as these. If you ever tried to work on a scraper, you know how it's very hard to get far nowadays without using a full-blown browser as a backend.

What worries me is that it's going to get massively worse. Despite me hating HTML/web interfaces, one big advantage for me is that everything which looks like text is normally selectable, as opposed to a standard native widget which isn't. It's just much more "usable", as a user, because everything you see can be manipulated.

We've seen already asm.js-based dynamic text layout inspired by tex with canvas rendering that has no selectable content and/or suffers from all the OP issues! Now, make it fast and popular with WASM...

"yay"

dredmorbius 6 months ago [–]

Hiding page content unless rendered via JS is the darkest dark pattern in HTML I've noted.

Though absolute-positioning of all text elements via CSS at some arbitrary level (I've seen it by paragraph), such that source order has no relationship to display order, is quite close.

ethanwillis 6 months ago [–]

I went down a rabbit hole while making a canvas-based UI library from scratch.. and started reading about the history of NeWS, Display PostScript, and PostScript in general.

I started reading the ISO spec for postscript used in modern PDFs. You can read it yourself here: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PD...

What actually needs to be done to extract text correctly is to be able to parse the postscript, have a way of figuring out how the raw text.. or the curves that draw the text.. are displayed (whether they are or not and in relation to each other) using information that the postscript gives you.

Edit: More than anything I think understanding deeply the class of PDFs you want to extract data from is the most important part. Trying to generalize it is where the real difficulty comes from.. as in most things.

mkjmkumar 6 months ago [–]

A couple of years ago I was working on a home project and used Tesseract and Leptonica for OCR, with HDFS, HBase and SolrCloud for storage and search over the extracted text. You can find the details here on my website. I was very impressed with the conversion of handwritten PDF docs, with 90% readable accuracy. I have named it Content Data Store (CDS): http://ammozon.co.in/headtohead/?p=153 . The source code is open, and you can find steps on installation and how to run it here: http://ammozon.co.in/headtohead/?p=129 http://ammozon.co.in/headtohead/?p=126 A short demo: http://ammozon.co.in/gif/ocr.gif

I did not get time to enhance it further, but I am planning to containerize the whole application. See if you find it useful in its current form.

hylian 6 months ago [–]

I had a similar problem and ended up using AWS' Textract tool to return the text as well as bounding box data for each letter, then overlaid that on a UI with an SVG of the original page, allowing the user to highlight handwritten and typed text. I plan to open source it so if anyone's interested let me know.

Not a fan of the potential vendor lock-in though, so it's only really suitable for those already in an AWS environment who aren't worried about AWS harvesting their data.

rwojo 6 months ago [–]

Very interested to see this as I was about to work on the same thing!

miki123211 6 months ago [–]

I use a screen reader, so of course some kind of text extraction is how I read PDFs all the time. There were some nice gotchas I've found.

* Polish ebooks, which usually use Watermarks instead of DRM, sometimes hide their watermarks in a weird way the screen reader doesn't detect. Imagine hearing "This copy belongs to address at example dot com: one one three a f six nine c c" at the end of every page. Of course the hex string is usually much longer, about 32 chars long or so.

* Some tests I had to take included automatically generated alt texts for their images. The alt text contained full paths to the JPG files on the designer's hard drive. For example, there was one exercise where we were supposed to identify a building. Normally, it would be completely inaccessible, but the alt was something like "C:\Documents and Settings\Aneczka\Confidential\tests 20xx\history\colosseum.jpg".

* My German textbook had a few conversations between authors or editors in random places. They weren't visible, but my screen reader still could read them. I guess they used the PDF or InDesign project files themselves as a dirty workaround for the lack of a chat / notetaking app, kind of like programmers sometimes do with comments. They probably thought they were the only ones that would ever read them. They were mostly right, as the file was meant for printing, and I was probably the only one who managed to get an electronic copy.

* Some big companies, mostly carriers, sometimes give you contract templates. They let you familiarize yourself with the terms before you decide to sign, in which case they ask you for all the necessary personal info and give you a real contract. Sometimes though, they're quite lazy, and the template contracts are actually real contracts. The personal data of people that they were meant for is all there, just visually covered, usually by making it white on white, or by putting a rectangle object that covers them. Of course, for a screen reader, this makes no difference, and the data are still there.

Similar issues happen on websites, mostly with cookie banners, which are supposed to cover the whole site and make it impossible to use before closing. However, for a screen reader, they sometimes appear at the very beginning or end of the page, and interacting with the site is possible without even realizing they're there.

tyingq 6 months ago [–]

I almost always have to resort to a dedicated parser for that specific PDF. I use it, for example, to ingest invoice data from suppliers that won't send me plain text. I always end up with a parser per supplier, and copious amounts of sanity checking to notify me when they break/change the format.
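
A rough sketch of what "a parser per supplier plus sanity checks" can look like, assuming the PDF has already been turned into text by some extractor. The supplier name, the field labels, and the `Invoice` fields are all made up for illustration:

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Invoice:
    number: str
    total: float


class FormatChanged(Exception):
    """Raised when extracted text no longer matches what a parser expects."""


def parse_acme(text: str) -> Invoice:
    # Hypothetical layout: "Invoice No: A-123" and "Total: 45.60".
    number = re.search(r"Invoice No:\s*(\S+)", text)
    total = re.search(r"Total:\s*([0-9]+\.[0-9]{2})", text)
    if not number or not total:
        raise FormatChanged("acme: expected fields not found")
    return Invoice(number.group(1), float(total.group(1)))


# One dedicated parser per supplier, looked up by name.
PARSERS: dict[str, Callable[[str], Invoice]] = {"acme": parse_acme}


def ingest(supplier: str, text: str) -> Invoice:
    invoice = PARSERS[supplier](text)
    # Sanity check on the parsed values, on top of the pattern matching above.
    if invoice.total <= 0:
        raise FormatChanged(f"{supplier}: implausible total {invoice.total}")
    return invoice


if __name__ == "__main__":
    print(ingest("acme", "Invoice No: A-123\nTotal: 45.60"))
```

The point of raising a dedicated exception is that a layout change fails loudly and can trigger a notification, instead of silently importing wrong numbers.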

saradhi 6 months ago [–]

I'm an ML engineer who worked for 3 years as a part-time data engineering consultant for a medical lines/claims extraction company, work that mostly involved extracting tabular data from PDFs and images. Developing rules or parsers as such is JUST no help. You end up creating a new rule every time an extraction misses data.

With that in mind, and since existing resources are of little help, especially with skewed, blurry, or handwritten input and with inputs containing 2 different table structures, I ended up creating an API service to extract tabular data from images and PDFs, hosted at https://extracttable.com . We took care to make it robust; average extraction time on images is under 5 seconds. On top of maintaining accuracy, a bad extraction is eligible for a credit usage refund, which literally no other service offers.

I invite HN users to give it a try, and feel free to email saradhi@extracttable.com for extra API credits for the trial.

jazzido 6 months ago [–]

Hi, author and maintainer of Tabula (https://github.com/tabulapdf/tabula). We've been trying to contact you about the "Tabula Pro" version that you are offering.

Feel free to reach me at manuel at jazzido dot com

staticautomatic 6 months ago [–]

Edit: See reply below

Am I reading the repos correctly? It looks like ExtractTable copied Tabula (MIT) to its own repo rather than forking it, removed the attribution, and then tried to re-license it as Apache 2.0. If so, that would be pretty fucked up.

https://github.com/tabulapdf

https://github.com/ExtractTable/tabulapro

jazzido 6 months ago [–]

Not really. They import tabula_py, which is a Python wrapper around tabula-java (the library of which I'm a maintainer).

Still, I would have loved at least a heads up from the team that sells Tabula Pro. I know they're not required to do so, but hey, they're kinda piggybacking on Tabula's "reputation".

bruckie 6 months ago [–]

If you control the Tabula trademark (which doesn't necessarily require a formal registration), you may be able to prohibit them from using the TabulaPro name. That's exactly what trademark law is for.

(IANAL)

wpietri 6 months ago [–]

You're being much more polite here than I would be. Even if it isn't illegal, what they've done is a giant dick move.

saradhi 6 months ago [–]

William, the intention of "TabulaPro" is to give the developers a chance to use a single library instead of switching ExtractTable for images and tabula-py for text PDFs.

What do you recommend we do so that you don't feel we made a dick move?

TIA

wpietri 6 months ago [–]

Well, let me ask a few questions:

Did you ask permission of the original author to use a derived name?

Did you discuss your plan to commercialize the original author's work with the author? Before starting out?

Since starting a commercial project, how much money have you given to the original author?

saradhi 6 months ago [–]

- No, No, Zero.

"commercialize the original author's work with the author" - No, but let me highlight this, any extraction with tabula-py is not commercialized - you can look into the wrapper too :) or even compare the results with tabula-py vs tabulaPro.

Copying the TabulaPro description here, "TabulaPro is a layer on tabula-py library to extract tables from Scan PDFs and Images." - we respect every effort of the contributors & author, never intended to plagiarize.

I understand the misinterpretation here is that we are charging for the open-sourced library because of the name. We already informed the author by email about unpublishing the library, this morning, I just deleted the project and came here to mention it is deleted :)

wpietri 6 months ago [–]

Sorry, Saradhi, I don't think you can reasonably claim there was no intention to plagiarize. Adding a "pro" to something is clearly meant to suggest it's the paid version of something. And it's equally clear that "TabulaPro" is derived from "Tabula".

It may be that you didn't realize that people would see your appropriation as wrong, although I have a hard time believing that as well given that the author tried to contact you and was ignored. As they say, "The wicked flee when no man pursueth."

So what I see here is somebody knowingly doing something dodgy and then panicking when getting caught. If you'd really like to make amends, I'd start with some serious introspection on what you actually did, and an honest conversation with the original author that hopefully includes a proper [1] apology.

[1] Meaning it includes an explicit recognition of your error and the harms done, a clear expression of regret, and a sincere offer to make amends. E.g., https://greatergood.berkeley.edu/article/item/the_three_part...

wpietri 6 months ago [–]

And I'm going to add that it's really weird that your answer ("No, No, Zero") is exactly the same as what the library author said [1] two hours before you posted. But you do that again without acknowledging the author, and with just enough format difference that it's not a copy-paste. It's extremely hard for me to imagine you didn't read what he said before writing that; it's just too similar.

[1] https://news.ycombinator.com/item?id=22483334

saradhi 6 months ago [–]

I understand how the whole thing was interpreted, and I agree with you. I'll do a note as suggested. Thanks for the guidance and the link too.

I would like to get the author's comment on "tried to contact you and was ignored", as I was the one who emailed yesterday.

jazzido 6 months ago [–]

Author here.

1) No. 2) No. 3) Zero.

staticautomatic 6 months ago [–]

Thanks for clarifying.

CaptArmchair 6 months ago [–]

I chuckled at your "the worst image" sample. Which still looked quite decent all things considered.

Your "handwritten" example looks a bit "too decent" as well. I can see how that works. You first look for the edges of the table, and then you evaluate the symbol in each cell as something that matches unicode.

So, how well does this cope with increasing degradation? i.e. pencil written notes that bleed outside cell borders, curve around borders, etc.? Stamps and symbols (watermarks) across tables?

saradhi 6 months ago [–]

"pencil written notes that bleed outside cell borders, curve around borders, etc.? Stamps and symbols (watermarks) across tables?"

"The Worst Image" is a close match to that, except it is a print.

Regarding increasing degradation - as stated above, the OCR engine is not proprietary - we confined ourselves to detect the structure at this moment, and started with the most common problems.

aasasd 6 months ago [–]

What a glorious format for storing mankind's knowledge. Consider that by now displays have arbitrary sizes and a variety of proportions, and that papers are often never printed but only read from screens. To reflow text for different screen sizes, you need its ‘semantic’ structure.

And meanwhile if you say on HN that HTML should be used instead of PDF for papers, people will jump on you insisting that they need PDF for precise formatting of their papers—which mostly barely differ from Markdown by having two columns and formulas. What exactly they need ‘precise formatting’ for, and why it can't be solved with MathML and image fallback, they can't say.

People feeling the urge to defend PDF might want to pick up at this point in the discussion: https://news.ycombinator.com/item?id=21454636

mLuby 6 months ago [–]

Not everything needs to look good on every screen size. I don't expect to be able to read academic papers on my smartwatch, and a simple alarm clock app looks kind of silly when it's fullscreen across a desktop monitor. Likewise, when layout makes the difference between reader understanding or confusion, it's hard to trust automatic reflowing on unknown screen sizes.

PDF is simply better than HTML when it comes to preserving layout as the author intended.

aasasd 6 months ago [–]

I wonder if you realize that both your points wildly miss what I said.

First, there's no need to stretch my argument to the point of it being ridiculous. I don't have to reach for a watch to suffer from PDF. Even a tablet is enough: I don't see many 14" tablets flying off the shelves. I also know for sure that the vast majority of ubiquitous communicator devices, aka smartphones, are about the same size as mine, so everyone with those is guaranteed to have the same shitty experience with papers on their communicators and will have to sedentary-lifestyle their ass off in front of a big display for barely any reason.

Secondly:

> Likewise, when layout makes the difference between reader understanding or confusion, it's hard to trust automatic reflowing on unknown screen sizes. PDF is simply better than HTML when it comes to preserving layout as the author intended.

As I wrote right there above, still no explanation of why the layout makes that difference and why preserving it is so important when most papers are just walls of text + images + some formulas. Somehow I'm able to read those very things off HTML pages just fine.

72deluxe 6 months ago [–]

Some people are very "funny" about the layout of items and text and want it to be preserved identically to their "vision" when they created it. For example, every "marketing" individual, when they see a webpage, seems to want it pixel-perfect.

I think it's the artist in them.

This is understandable in some instances:

a. Picasso's or Monet's works probably wouldn't be as good if you just roll them up into a ball. Sure, the component parts are still there (it's just paper/canvas and paint after all!) but the result isn't what they intended.

b. A car that has hit a tree is made up of the composite parts but isn't quite as appealing (or useful) as the car before hitting the tree.

c. A wedding cake doesn't look as good if the ingredients are just thrown over the wedding party's table. The ingredients are there, but it just isn't the same...

Presentation is sometimes important.

aasasd 6 months ago [–]

That's a tradeoff between complex formatting and accessibility of the result. Authors are making readers sit in front of desktops/laptops for some wins in formatting. Considering that papers, at least ones that I see, are all just columns of text, images and formulas, the win seems to be marginal, while the loss in accessibility is maddening with the current tech-ecosphere.

alkonaut 6 months ago [–]

A PDF isn’t for storage it’s for display. It’s the equivalent of a printout. You don’t delete your CAD drawing or spreadsheet after printing it out.

aasasd 6 months ago [–]

> A PDF isn’t for storage it’s for display. It’s the equivalent of a printout.

This conjecture would have some practical relevance if I had access to the same papers in other formats, preferably HTML. Yet I'm saddened time and again to find that I don't.

In fact, producing HTML or PDF from the same source was exactly my proposed route before I was told that apparently TeX is only good for printing or PDFs. I hope that this is false, but I'm not in a position to argue currently.

alkonaut 6 months ago [–]

But when you access a paper it’s for reading it, correct?

It is worrying if places that are “libraries” of knowledge aren’t taking the opportunity to keep searchable/parseable data, but it’s no worse than a library of books.

aasasd 6 months ago [–]

> but it’s no worse than a library of books

That's not my complaint in the first place. The problem is that while we progressed beyond books on the device side in terms of even just the viewport, we seemingly can't move past the letter-sized paged format. The format may be a bit better than books—what with it being easily distributed and with occasionally copyable text—but not enough so.

I'm not even touching the topic of info extraction here, since it's pretty hard on its own and despite it also being better with HTML.

floriol 6 months ago [–]

Yeah, it's better with HTML than with PDF, but it's still pretty terrible... Use some actually structured data format like XML (XHTML would be good), because you don't want to include a complete browser just to search for text

hnick 6 months ago [–]

PDF/A is intended for archival and long term storage.

The key difference is it contains no ambiguity (such as all fonts must be embedded).

lmm 6 months ago [–]

HTML has all the same problems and degrades over time. A PDF from 20 years ago will at least be readable by a human; a HTML page does not even guarantee that much.

You're right that most of the relevant semantics would fit into Markdown. So store the markdown! There are problems with PDF but HTML is the worst of all worlds.

aasasd 6 months ago [–]

What exactly degrades about HTML in twenty years? I can read pages from the 90s just fine: the main thing off is the font size due to the change in screen resolutions, but—surprise!—plain HTML scales and reflows beautifully on big and small screens. (Which is the complete opposite of ‘HTML has the same problems’.) I hope you're not lamenting the loss of the ‘blink’ tag.

If you're talking about images and whatnot falling off, that's a problem of delivery and not the format.

Markdown translates to HTML one-to-one, it's in the basic features of Markdown. For some reason I have to repeat time and again: use a subset of HTML for papers, not ‘glamor magazine’ formatting. The use of HTML doesn't oblige you to go wild with its features.

lmm 6 months ago [–]

> What exactly degrades about HTML in twenty years? I can read pages from the 90s just fine: the main thing off is the font size due to the change in screen resolutions, but—surprise!—plain HTML scales and reflows beautifully on big and small screens. (Which is the complete opposite of ‘HTML has the same problems’.) I hope you're not lamenting the loss of the ‘blink’ tag.

I am indeed, and of other tags that are no longer supported. Old sites are often impossible to render with the correct layout. Resources refuse to load because of mixed-content policy or because they're simply gone - which is a problem with the format because the format is not built for providing the whole page as a single artifact. And while the oldest generation of sites embraced the reflowing of HTML, the CSS2-era sites did not, so it's not at all clear that they will be usable on different-resolution screens in the future.

> Markdown translates to HTML one-to-one, it's in the basic features of Markdown. For some reason I have to repeat time and again: use a subset of HTML for papers, not ‘glamor magazine’ formatting. The use of HTML doesn't oblige you to go wild with its features.

This is one of those things that sounds easy but is impossible in practice. Unless you can clearly define the line between which features should be used and which should not, you'll end up with all of the features of HTML being used, and all of the problems that result.

aasasd 6 months ago [–]

> Unless you can clearly define the line between which features should be used and which should not, you'll end up with all of the features of HTML being used, and all of the problems that result.

You'll notice that I said in the top-level comment that my beef is with PDF papers (i.e. scientific and tech). I don't care about magazines and such, since they obviously have different requirements. So let's transfer your argument to current papers publishing:

“Since PDF can format text and graphics in arbitrary ways, you'll end up with papers that look like glamor and design magazines and laid out like Principia Discordia and Dada posters. You'll have embedded audio, video and 3D objects since PDF supports those, and since it can embed Flash apps, you'll have e.g. ‘RSS Reader, calculator, and online maps’ as suggested by Adobe, and probably also games. PDF also has Javascript and interactive input forms, so papers will be dynamic and interactive and function as clients to web servers.”

You can decide for yourself whether this corresponds to reality, and if the hijinks of CSS2-era websites are relevant.

What is it with people, one after another, jumping to the same argument of ‘if authors have HTML, they will immediately go bonkers’? It really looks like some Freudian transfer of innate tendencies. We have Epub, for chrissake, which is zipped HTML. What, have Epub books gone full Dada while I wasn't looking? Most trouble with Epub that I've had is inconvenience with preformatted code.

> Old sites are often impossible to render with the correct layout. Resources refuse to load because of mixed-content policy or because they're simply gone - which is a problem with the format because the format is not built for providing the whole page as a single artifact.

Yes, as I mentioned under the link provided in the top-level comment, the non-use of a packaged-HTML delivery is precisely my beef here. The entire idea of using HTML for papers implies employing a package format, since papers are usually stored locally. It's a chicken-and-egg problem. It's solved by the industry picking one of the dozen available package formats and some version of HTML for the content. Which would still mean that HTML is used for formatting. HTML could be embedded in PDF for all I care, if I can sanely read the damn thing on my phone.

lmm 6 months ago [–]

> “Since PDF can format text and graphics in arbitrary ways, you'll end up with papers that look like glamor and design magazines and laid out like Principia Discordia and Dada posters. You'll have embedded audio, video and 3D objects since PDF supports those, and since it can embed Flash apps, you'll have e.g. ‘RSS Reader, calculator, and online maps’ as suggested by Adobe, and probably also games. PDF also has Javascript and interactive input forms, so papers will be dynamic and interactive and function as clients to web servers.”

Those things don't happen in PDFs in the wild, or at least not to any great extent. It's not that technical paper authors have shown some special restraint and limited themselves to a subset of what the rest of the PDF world does. Technical papers look much like any other PDF and do much the same thing that any other PDF does; if they were authored in HTML, we should expect them to look much like any other HTML page and do much the same thing that any other HTML page does. Based on my experience of HTML pages, that would be a massive regression.

> Yes, as I mentioned under the link provided in the top-level comment, the non-use of a packaged-HTML delivery is precisely my beef here. The entire idea of using HTML for papers implies employing a package format, since papers are usually stored locally. It's a chicken-and-egg problem. It's solved by the industry picking one of the dozen available package formats and some version of HTML for the content. Which would still mean that HTML is used for formatting. HTML could be embedded in PDF for all I care, if I can sanely read the damn thing on my phone.

The details matter; you can't just handwave the idea of a sensible set of restrictions and a good packaging format, because it turns out those concepts mean very different things to different people. If you want to talk about, say, Epub, then we can potentially have a productive conversation about how practical it is to format papers adequately in the Epub subset of CSS and how useful Epub reflowing is versus how often a document doesn't render correctly in a given Epub reader. If all you can say about your proposal is "a subset of HTML" then of course people will assume you're proposing to use the same kind of HTML found on the web, because that's the most common example of what "a subset of HTML" looks like.

aasasd 6 months ago [–]

> It's not that technical paper authors have shown some special restraint and limited themselves to a subset of what the rest of the PDF world does. Technical papers look much like any other PDF and do much the same thing that any other PDF does.

This makes zero sense to me. You're saying that technical papers look the same as Principia Discordia or glamor/design magazines or advertising booklets, including those that just archive printed media. That technical papers include web-form functionality just like some PDFs do—advertising or whatnot, I'm not sure. If that's the reality for you then truly I would abhor living in it—guess I'm relatively lucky here in my world.

However, if you point me to where such papers hang out, I would at least finally learn what mysterious ‘complex formatting’ people want in papers and which can only be satisfied by PDF.

fermienrico 6 months ago [–]

Actually, no thanks. "Semantic" structure is how we got responsive web soup of ugly websites with hamburger menus.

We need the opposite, we need a format that stays the same size, same proportions and is vectorized so you can zoom to any size - however, the relationship of space between elements remains constant.

PDF is an amazing format IMO. Think of it like Docker - the designer knows exactly how it's going to appear on the user's device.

nradov 6 months ago [–]

The problems you describe have nothing to do with the semantic web. Those are orthogonal issues.

machinelabo 6 months ago [–]

No, not everything is orthogonal. When you have semantic structure, you're gonna display it in a responsive and adaptive way. That __breaks__ the design intent. For a designer, WYSIWYG is a godsend. The parent comment is right - PDF is like a Docker container for designers - people who work with media.

If you have opposing thoughts, please elaborate further instead of simply saying "nothing to do with it". HN works by explaining and arguing about issues to get to the bottom of something.

aasasd 6 months ago [–]

> For a designer, WYSIWYG is a godsend. The parent comment is right - PDF is like a Docker container for designers - people who work with media.

Sorta like this then, more rigidly conveying the designer's intended layout: https://bureau.rocks/projects/book-typography-en/ ?

(Though these ‘books’ are almost the opposite of what I'm advocating for, in terms of formatting.)

hgoury 6 months ago [–]

There is a fairly interesting library developed by the Stanford Team behind https://www.snorkel.org/ that takes structured documents, including PDF formatted as tables, and builds a knowledge base: https://github.com/HazyResearch/fonduer

It looks promising for these kinds of daunting tasks

lwhsiao 6 months ago [–]

One of the co-authors of Fonduer here. Just for reference the original paper for Fonduer is here:

https://dl.acm.org/doi/pdf/10.1145/3183713.3183729

And additional follow-up work on extracting data from PDF datasheets is here:

https://dl.acm.org/doi/pdf/10.1145/3316482.3326344

One thing to point out about our library is that while we do take PDF as input and use it to calculate visual features, we also rely on an HTML representation of the PDF for structural cues. In our pipeline this is typically done by using Adobe Acrobat to generate an HTML representation for each input PDF.

bhl 6 months ago [–]

What type of visual features are you looking at? I've been trying to find a web-clipper that uses both visual and structural cues from the rendered page and HTML, but have no luck finding a good starting point.

lwhsiao 6 months ago [–]

There are a handful. We look at bounding boxes to featurize which spans are visually aligned with other spans, which page a span is on, etc. You can see more in the code at [1]. In general, visual features seem to give some nice redundancy to some of the structural features of HTML, which helps when dealing with an input as noisy as PDF.

[1]: https://github.com/HazyResearch/fonduer/tree/master/src/fond...
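
To make the idea of alignment features concrete, here is a small sketch of comparing two spans by their bounding boxes. This is not Fonduer's actual API or feature set; the `Span` type, the feature names, and the 2-unit tolerance are assumptions chosen for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    text: str
    page: int
    left: float
    top: float
    right: float
    bottom: float


def alignment_features(a: Span, b: Span, tol: float = 2.0) -> dict[str, bool]:
    """Boolean visual features relating two spans on a rendered page."""
    same_page = a.page == b.page
    return {
        "same_page": same_page,
        # Edges within `tol` units of each other count as aligned.
        "left_aligned": same_page and abs(a.left - b.left) <= tol,
        "right_aligned": same_page and abs(a.right - b.right) <= tol,
        "same_row": same_page and abs(a.top - b.top) <= tol,
    }


if __name__ == "__main__":
    header = Span("Voltage", page=1, left=100.0, top=50.0, right=160.0, bottom=62.0)
    value = Span("3.3 V", page=1, left=100.5, top=80.0, right=140.0, bottom=92.0)
    print(alignment_features(header, value))
```

Features like these let a model guess that a value sitting under a column header belongs to it even when the HTML structure is noisy.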

p0nce 6 months ago [–]

The take-away is that PDF should not be an input to anything.

mikestew 6 months ago [–]

Except eyeballs and printers, and printers are just an eyeball abstraction.

hnick 6 months ago [–]

You'd hope so, but some printers run some very finicky software with less horsepower than your desktop machine so can fall over on complex PDF structures. I preferred Postscript!

72deluxe 6 months ago [–]

How does PCL cope with PDFs? Do you know if some conversion happens beforehand?

hnick 6 months ago [–]

Not sure if I'm misunderstanding but PCL is another page description language like PS, some printers can use both depending on the driver.

Most of our Xerox printers spoke Postscript natively, these days more printers can use PDF. We generally used a tool to convert PCL to PS to suit our workflow if that was the only option for the file, because being able to manipulate the file (reordering and applying barcodes or minor text modifications) was important. Likewise for AFP and other formats. PCL jobs were rare so I never worked on them personally.

LorenPechtel 6 months ago [–]

Unfortunately, when you need the output of program A as the input to B sometimes you have to jump through such hoops. I've never done it with .pdf but I've fought similar battles with .xps and never fully conquered them. (And the parser was unstable as hell, besides--it would break with every version and sometimes for far lesser reasons.)

dredmorbius 6 months ago [–]

Oughts isn't ises.

jatsign 6 months ago [–]

The relatively small company I work for makes me fill out some forms by hand, because they receive them from vendors as a PDF. So I print it out, sign it, and return it to my company by hand.

If someone could make a service that lets you upload a PDF that contains a form, and then let users fill out that form and e-sign it and collect the results, and then print them out all at once, it would be great.

It's not a billion dollar idea but there are a lot of little companies that would save a lot of time using it.

nathan_f77 6 months ago [–]

There are quite a few services that should be able to solve this problem (turning a PDF into a web form and collecting signatures.) Here's a few of the services I'm aware of:

* https://www.hellosign.com/products/helloworks

* https://www.useanvil.com

* https://www.pandadoc.com

* https://www.pdffiller.com

* https://www.platoforms.com

* JotForm (https://www.jotform.com/help/433-How-to-Add-an-E-Signature-t...)

* https://www.webmerge.me

(I know about all these because I'm working on a PDF generation service for developers called DocSpring [1]. I'm also working on e-signature support [2], but that's still under development, and still won't be a perfect fit for your use-case.)

[1] https://docspring.com

[2] https://docspring.com/docs/data_requests.html

edent 6 months ago [–]

I use Xournal - https://sourceforge.net/projects/xournal/

It lets me type into forms - or draw text over them if necessary. Then I paste in a scan of my signature. Then I save as a PDF and email it across.

I've been doing this for years. Job applications, mortgages, medical questionnaires. No one has ever queried it.

If you're hand delivering a printed PDF, it's just going to be copy-typed by a human into a computer. No need to make it too fancy.

moftz 6 months ago [–]

I used Xournal for a couple years in college. It was perfect in how simple it was to mix handwritten and typed notes or mark up documents. The only thing is that I wish it had some sort of notebook organization feature. It would have been nice keeping all of my course notes in one file, broken down by chapter or daily pages. Instead, I ended up with a bunch of individual .xoj files that did the job but made searching for material take longer.

coob 6 months ago [–]

macOS's Preview will let you do all of this also.

52-6F-62 6 months ago [–]

Coming from a documents format world (publishing), there are a lot of cases like this.

In theory it sounds like it should be straightforward but it hinges so much on how well the document is structured underneath the surface.

Since these tools were primarily designed for non-technical users first, the priority is the visual and printed outcome and not the underlying structure.

One document can look much the same as another in form—uses black borders to outline fields, similar or same field names, etc, but may be structured entirely differently and that can be a madhouse of frustrating problems.

It can be complex enough to write a solution for one specific document source. Writing a universal tool that could take in any form like that would probably be a pretty decent moneymaker.

My first intuition, though, would be that it may be more successful (though no simpler) to develop a model that can read from the visual of the document rather than trying to parse its underlying structure.

Open to learning something here, though!

aty268 6 months ago [–]

Out of curiosity, what exactly are non-technical people doing with PDFs, and why does there need to be a universal tool in the space? What would the tool do with the extracted data?

52-6F-62 6 months ago [–]

All kinds of things. PDF is the unifying data exchange format for a lot of businesses who use computers at some end to manage things and need to exchange documents of any kind without relying on the old "can you open Word files?" type problems.

There is a wide world outside of consumers of SaaS products for every little niche problem.

Sometimes they are baked-in processes that still use PDFs to share information, sometimes they're old forms of any kind, sometimes even old scanned docs that are still in use but shared digitally. A lot of the businesses that carry on that way are of the mind that "if it's not broke, don't fix it" which is quite rational for their problem areas and existing knowledge base. They might be a potential market at some point for a new solution, but good luck selling them on a web-based subscription SaaS solution when a simple form has been serving their needs for 30+ years.

OP's problem of the PDF being the go-between to digital endpoints is more common than you might think.

The universality I was referring to was the wide range of possibilities for how a given form might be laid out. And old documents contain a lot of noise when they've been added to or manipulated. Look inside an old PDF form from some small-medium sized business sometime. Now imagine 1000 variations of that form for one standard problem. Then multiply that by the number of potential problem areas the forms are managing.

Also like OP said—it's not sexy, but it's very real and having an intelligent PDF form reader and consumer would be a time-saver for those businesses who aren't geared to completely alter their workflow.

The tool could do anything with the extracted data. Ideally it would let you connect to any of your in-house services (like payroll or accounting), or Google Drive, or whatever, either with a quick config/API or a custom patch, and especially without complications like requiring an online connection or web accounts. No whole solution like that exists to my knowledge. At least nothing accessible to the wider market.

aty268 6 months ago [–]

Thanks for the comment, this is really interesting. I guess I'm still confused about what people actually do with these PDFs though. Are people looking at a PDF sent to them and manually entering that data somewhere else (like payroll or accounting), so this tool would take that data from the PDF and pump it in there automatically?

Thanks again, I just want to make sure I understand.

jagged-chisel 6 months ago [–]

I assume you mean a drawn form as opposed to a true PDF form. The former would be difficult to parse automatically into inputs.

OTOH, a PDF form works exactly the way you’d like. Maybe there’s a small market in helping convert one to the other for collecting input from old paper-ish forms.

travisporter 6 months ago [–]

Would DocuSign work? I’ve signed for lease documents several times that way.

jatsign 6 months ago [–]

Something like that would work for signing, but the hard part is "turn this pdf into an online form". That way after a user finishes a form, you can perform some basic error checking like, did they fill out everything, is this field a valid format, etc. After 100 employees turn in a multi-page printed out form, someone has to go through it and make sure they signed everywhere, filled out all the fields, etc.

Again, not sexy, but it is so stupid that I have to fill out a direct deposit form by hand and turn it in to my company, who checks it, then hands it off to the payroll vendor, who has to check it, just to enter the damn data into a form on their end.
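
For illustration, the "basic error checking" described above might look roughly like the sketch below once the form values exist as plain data. This is only a sketch under assumptions: the field names and format patterns are hypothetical, and it says nothing about how the values are extracted from the PDF in the first place.

```python
import re

# Hypothetical rules for a direct deposit form. The field names and
# patterns are assumptions made for this example only.
RULES = {
    "employee_name": re.compile(r".+"),
    "routing_number": re.compile(r"\d{9}"),
    "account_number": re.compile(r"\d{4,17}"),
    "signature": re.compile(r".+"),
}


def validate_form(values):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for field, pattern in RULES.items():
        value = (values.get(field) or "").strip()
        if not value:
            # Covers both "field absent" and "field left blank".
            problems.append(f"{field}: missing")
        elif not pattern.fullmatch(value):
            problems.append(f"{field}: invalid format")
    return problems
```

Running a check like this at submission time would catch unsigned or half-filled forms before anyone has to page through 100 printouts.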

izacus 6 months ago [–]

Well, PDF forms support all of this already, so why not just add validation inside the PDF?

pintxo 6 months ago [–]

iPad OS should also be able to do this. Especially if you have the pen.

coob 6 months ago [–]

hellosign.com does exactly this

Savageman 6 months ago [–]

> By looking at the content, understanding what it is talking about and knowing that vegetables are washed before chopping, we can determine that A C B D is the correct order. Determining this algorithmically is a difficult problem.

Sorry, this is a bit off-topic regarding PDF extraction, but it distracted me greatly while reading...

I'm pretty sure the intention was A B C D (cut then wash). Not sure why the author would not use alphabetical order for the recipe...

[edit] Sorry, I had a colleague read it and he mentioned the A B C D annotations were probably not in the original document. This was not clear at all to me while reading, and if they are not included it's indeed hard to find the correct paragraph order.

shawnz 6 months ago [–]

Even if the ABCD was in the original document, how would the computer figure out it's supposed to indicate the order?

And of course, even if the letters were there in the original document, it would be clear to a human that they're incorrect because it doesn't make sense to wash vegetables after cutting.
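
To make the ambiguity concrete, here is a minimal sketch. It assumes a hypothetical 2x2 layout with blocks labelled A and B on the top row and C and D on the bottom row; the coordinates are made up and this is not the article's actual page. Sorting by position alone yields two equally plausible reading orders, and nothing in the geometry says which one is right.

```python
from dataclasses import dataclass


@dataclass
class Block:
    label: str
    x: float  # left edge of the text block
    y: float  # top edge, increasing down the page


# Hypothetical layout (an assumption for this example):
#   A | B
#   C | D
BLOCKS = [
    Block("A", 0, 0),
    Block("B", 100, 0),
    Block("C", 0, 50),
    Block("D", 100, 50),
]


def row_major(blocks):
    """Read across each row first, then move down the page."""
    return [b.label for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_major(blocks):
    """Read down each column first, then move to the next column."""
    return [b.label for b in sorted(blocks, key=lambda b: (b.x, b.y))]


print(row_major(BLOCKS))     # ['A', 'B', 'C', 'D']
print(column_major(BLOCKS))  # ['A', 'C', 'B', 'D']
```

Both orderings are consistent with the coordinates, so picking between them takes either layout heuristics or some understanding of the content.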
