The Workings of PDFTextOnline

A bunch of top-notch software components (combined with a ton of hard work) make this site possible.

PDFTextStream

PDFTextStream is the ideal component for PDF text extraction when accuracy and performance matter. PDFTextOnline uses the Python build of the PDFTextStream library (Java and .NET builds are also available).

PDFTextStream is a very sophisticated piece of software -- the idiosyncrasies and deep knotted complexities of the PDF file format require it. However, it does a lot more than support the PDFTextOnline service.

Many large organizations use PDFTextStream to solve some of their most vexing and costly document management and extraction problems. If this sounds interesting to you, then head over to http://snowtide.com/PDFTextStream to read more about how it might make the critical difference for your enterprise.

AJAX and Dojo

AJAX was the obvious choice when we began to consider how to provide PDFTextStream's PDF text extraction capabilities as a service. The user-centric principles of interactivity and responsiveness paired well with PDFTextStream's inherent performance advantages.

We chose the Dojo Toolkit as our AJAX framework. It provides a lot of great baseline functionality, plus tools that make our own JavaScript code work better than it would otherwise.

AFLAX and Flash

One of the most frustrating and user-unfriendly aspects of Snowtide's original "Online Demo" of PDFTextStream was that it forced users to upload a PDF file using a 'regular' HTML form. There are two serious problems with this:

  • HTML form uploads are synchronous -- in other words, once the form is submitted, the user cannot do anything until the upload finishes.
  • It is impossible to provide any indication of progress during the course of an HTML form upload.

That second problem was particularly vexing. If someone wanted to extract content out of a 10MB PDF file, they would be waiting for that file to upload for some time. Because of that wait, it would be too easy for most people to assume that PDFTextOnline had "crashed" or otherwise failed.

Thankfully, Flash 8 (with its new file upload functionality) and the AFLAX project, which makes the Flash APIs available to JavaScript applications, provided a solution. Now, when you upload a file in PDFTextOnline, you get a very friendly progress bar: it's obvious that the upload is working, and you can guesstimate how long it will take to complete.
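
For the curious, the arithmetic behind a progress bar and a time-remaining guess is simple. The sketch below is only an illustration, not PDFTextOnline's actual code: the function name upload_progress and its parameters are hypothetical, and it assumes the upload component reports how many bytes have been sent so far and the total size of the file.

```python
def upload_progress(bytes_sent, bytes_total, elapsed_seconds):
    """Return (percent_complete, estimated_seconds_remaining).

    Hypothetical helper for illustration only. The estimate assumes the
    average transfer rate so far holds for the rest of the upload.
    """
    if bytes_total <= 0:
        raise ValueError("bytes_total must be positive")

    # Clamp so a stray report can never push the bar below 0% or past 100%.
    bytes_sent = max(0, min(bytes_sent, bytes_total))
    percent = 100.0 * bytes_sent / bytes_total

    # Without any data sent (or any time elapsed) there is no rate to use.
    if bytes_sent == 0 or elapsed_seconds <= 0:
        return percent, None

    rate = bytes_sent / elapsed_seconds          # average bytes per second
    remaining = (bytes_total - bytes_sent) / rate
    return percent, remaining


# A 10 MB file with 2.5 MB sent after 5 seconds.
percent, remaining = upload_progress(2_500_000, 10_000_000, 5.0)
print(f"{percent:.0f}% done, about {remaining:.0f} seconds to go")
```

Using the average rate since the start keeps the estimate steady; a rate measured over only the last few seconds would react faster to a changing connection, but would also jump around more.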

Pylons

Both snowtide.com and PDFTextOnline are deployed using the Pylons web application framework. It's a great piece of work, and generally stays out of our way, like a good web framework should!

Honorable Mentions

Thanks to the Tango/Freedesktop.org project for the very slick application icons.

Thanks to the Lighttpd project for a super-tight web server.