How does PDFTextOnline work?
The steps PDFTextOnline takes to process your documents are straightforward:
- You upload a PDF document
- PDFTextOnline passes the PDF file off to its version of PDFTextStream, Snowtide's PDF text extraction library
- PDFTextStream extracts the first 10 pages of the PDF, along with any bookmarks, document properties, or form data it finds in the PDF file.
- This data is returned to your browser using AJAX technologies, and is rendered in your browser in real time.
- If you then browse beyond the first 10 pages, PDFTextOnline uses PDFTextStream to extract the next 10 pages of text, which are returned to your browser and used to update the document view already in place.
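The ten-pages-at-a-time behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not PDFTextOnline's actual code: the names `batch_for_page`, `DocumentView`, and the `extract_pages` callback are hypothetical, and the callback stands in for whatever the server does with PDFTextStream.

```python
PAGE_BATCH_SIZE = 10  # the FAQ says pages are extracted ten at a time


def batch_for_page(page, total_pages, batch_size=PAGE_BATCH_SIZE):
    """Return the inclusive, 1-based (first, last) page range of the batch holding `page`."""
    if not 1 <= page <= total_pages:
        raise ValueError(f"page {page} is outside 1..{total_pages}")
    first = ((page - 1) // batch_size) * batch_size + 1
    last = min(first + batch_size - 1, total_pages)
    return first, last


class DocumentView:
    """Hypothetical browser-side view that fetches page text one batch at a time."""

    def __init__(self, total_pages, extract_pages):
        self.total_pages = total_pages
        # extract_pages(first, last) -> list of page texts; an assumed stand-in
        # for the request that asks the server to extract more pages.
        self._extract_pages = extract_pages
        self._pages = {}  # page number -> text already delivered to the view

    def page_text(self, page):
        # Only go back to the server when the page is not in the view yet.
        if page not in self._pages:
            first, last = batch_for_page(page, self.total_pages)
            texts = self._extract_pages(first, last)
            for number, text in zip(range(first, last + 1), texts):
                self._pages[number] = text
        return self._pages[page]
```

In this sketch, viewing any page from 1 to 10 triggers a single extraction request, and moving on to page 11 triggers a second request for pages 11 through 20 (or through the last page, if the document is shorter).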
As for how PDFTextStream does what it does -- suffice it to say that it is smart and fast about extracting content from your PDF documents. If you're interested, you can learn more about the PDFTextStream PDF conversion library.
What are PDFTextOnline's requirements?
- A browser: Internet Explorer 6, Firefox 1.5 (or similar), and Safari 2 are regularly verified as complying with PDFTextOnline's requirements.
- Flash 8
- JavaScript must be turned on in your browser
How can I save the text and data extracted by PDFTextOnline?
What will happen to the PDF documents I upload while using PDFTextOnline?
All PDF documents that are uploaded through PDFTextOnline are stored on our servers while you are accessing the content extracted from them. All documents are removed from our servers within 48 hours of being uploaded. If for some reason an error occurs while PDFTextOnline is processing a document you uploaded, we will refer to the uploaded document to reproduce the problem and eliminate its cause. These actions will only be taken by PDFTextOnline staff.
Under no circumstances will your documents or the content they contain be provided to a third party. Please see our privacy policy for details.
Does PDFTextOnline extract content from DRM-protected PDF documents?
No, and it never will. Most DRM-protected PDF documents (such as ebooks) use proprietary encryption handlers. Making it possible to 'crack' such documents would require us to circumvent that encryption, thereby placing us in violation of the U.S. DMCA.
Further, PDFTextOnline will not circumvent the encryption of any password-protected PDF document. In the future, we may provide an option for you to enter the password needed to open an encrypted PDF document.
Is there any way to get back to the documents I have previously uploaded into PDFTextOnline?
No, not yet. As you can imagine, it would require significant resources to keep each of your PDF documents on hand so that you could refer to them at a later date. However, we are working on extensions to PDFTextOnline that will allow you to return to previously-uploaded PDF documents and their extracted content.
What does the "beta!" underneath the PDFTextOnline logo mean?
PDFTextOnline is no longer a "beta". So, everyone who said that we'd never take the "beta!" tag off of the site's logo (unlike many other web applications) can now enjoy crow. :-)
PDFTextOnline is effective and useful, but I would like to convert some PDF documents to text that I simply cannot upload (due to security or absolute confidentiality policies). Is there any way I can use PDFTextOnline on my computer or within my company?
We are open to the notion of licensing PDFTextOnline for internal deployment and use by large enterprises and government agencies -- please email us if this interests you.
Otherwise, you can get all of the functionality provided by PDFTextOnline (and a lot more!) by using PDFTextStream. Please note that PDFTextStream is not a standalone application -- it is a component that needs to be integrated into a new or existing application. Your IT staff can do this for you, or Snowtide can provide the necessary services to integrate PDFTextStream into your environment so that you get the most out of PDFTextStream.
Is there an API available for PDFTextOnline?
No, not yet. We've been working on ways to provide such an API for some time, and it will become available eventually. Providing an API for PDFTextOnline is not trivial, for two reasons: shipping PDF documents and their text extracts back and forth would require significant bandwidth, and PDF text extraction is CPU-intensive, which presents infrastructure challenges. But don't worry, we're working on these issues. When we've settled on a solution, you can be sure we'll let you know.
In the meantime, you can get all of the functionality provided by PDFTextOnline (and a lot more!) by using PDFTextStream in your application. It's available for Java, .NET, and Python, and can be licensed per-CPU, per-server, or as an OEM component.
