Rather than provide you with a dry reference, we'll take you on a step-by-step tour through how to use PDFTextOnline.
Uploading Your PDF
The first thing you'll want to do after starting PDFTextOnline is to upload a PDF -- that is why you're here, right? When you first drop into PDFTextOnline, you will see the 'Open PDF' dialog:

It's as easy as 1-2-3:
- Click on the button that reads 'Select PDF Document'; a file browser will pop open. Use it to choose a PDF document on your computer.
- Type the text shown in the swirled image.
- Click the 'Start!' button.
(If you can't read the text in the swirled image, just click the 'Get a New Image' link below the image to get a different bit of swirled text.)
Your PDF document should now be on its way to PDFTextOnline. It will accept any PDF document up to 10MB in size.
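If you'd like to check a file before uploading it, the size limit above can be sketched as a small local script. This is only an illustration, not part of PDFTextOnline: the function name `check_upload` is hypothetical, and it assumes "10MB" means 10 × 1024 × 1024 bytes. The header check relies on the fact that PDF files normally begin with the bytes `%PDF-`.

```python
import os

# Assumption: the "10MB" limit is read as 10 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024


def check_upload(path: str) -> str:
    """Return "ok", or a short reason the file would likely be rejected."""
    size = os.path.getsize(path)
    if size > MAX_UPLOAD_BYTES:
        return f"too large: {size} bytes"
    with open(path, "rb") as handle:
        # PDF files normally start with the marker %PDF- followed by a version.
        if handle.read(5) != b"%PDF-":
            return "not a PDF: missing %PDF- header"
    return "ok"
```

Calling `check_upload("report.pdf")` on a small, well-formed PDF returns `"ok"`; anything else is a hint that the upload may not go through.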
There's nothing like the smell of fresh text in the morning...
Once it receives your PDF, PDFTextOnline will use PDFTextStream, our own PDF text extraction library, to extract text from the first 10 pages of your PDF document (to start). It then sends that text back to your browser along with the PDF document's bookmarks, document properties, and form data (if available). While it does this, PDFTextOnline shows a status bar near the top of your browser window.
When the text extracts arrive, you'll know it -- the status bar will disappear, and the first page of your PDF's text will show up in the main text area.
Getting Around
While using PDFTextOnline, you'll become great friends with the navigation widgets. Once PDFTextOnline has received some text from your PDF document, you'll be able to use them to move around its pages:

You should find the navigation controls very familiar -- click the left or right arrow to move to the previous or next page (respectively). Or, if you prefer, you can type a particular page number into the input box.
As you move around the pages of your PDF document's text, you will eventually attempt to view a page whose text has not yet been extracted -- remember that only the first 10 pages of the PDF document's text are extracted when you first upload the file. Each time you request a page that has not had its text extracted yet, PDFTextOnline will download a new set of pages from the server. The status bar at the top of the window will show up again, and when the new pages of text arrive, the text area will be updated accordingly.
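The fetch-on-demand behavior described above can be sketched as a small page cache. This is a rough illustration under stated assumptions, not PDFTextOnline's actual code: `PageCache` and the `fetch_batch` callback are hypothetical names, and the sketch assumes that later requests also arrive in aligned batches of 10 pages (the tour only states that the first 10 pages are extracted up front).

```python
from typing import Callable, Dict, List

# Assumption: every batch holds 10 pages, like the initial extract.
BATCH_SIZE = 10


class PageCache:
    """Keeps extracted page text and fetches missing pages in batches."""

    def __init__(self, page_count: int,
                 fetch_batch: Callable[[int, int], List[str]]):
        self.page_count = page_count
        # Hypothetical callback: (first_page, last_page) -> list of page texts.
        self.fetch_batch = fetch_batch
        self.pages: Dict[int, str] = {}

    def get(self, page: int) -> str:
        if not 1 <= page <= self.page_count:
            raise ValueError(f"page {page} is out of range")
        if page not in self.pages:
            # Ask for the whole batch that contains the requested page.
            first = ((page - 1) // BATCH_SIZE) * BATCH_SIZE + 1
            last = min(first + BATCH_SIZE - 1, self.page_count)
            for offset, text in enumerate(self.fetch_batch(first, last)):
                self.pages[first + offset] = text
        return self.pages[page]
```

With a cache like this, viewing page 3 after page 1 costs nothing extra, while jumping to page 23 of a 25-page document triggers a single request for pages 21 through 25.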
You can also navigate your PDF document's text using its bookmarks (if it has any).
Saving Your Text
Once you have uploaded your PDF document, and PDFTextOnline has delivered its initial extract, you can download the entire text extract to your computer in a single action. Just click on the 'Save All Text' button, and enter a name for the text extract file you are about to download. PDFTextOnline will show a small status frame, which will then be directed to the ZIP file containing your full PDF text extract.
Why is it a ZIP file? Two reasons:
- All of the text that PDFTextOnline produces is encoded using UTF-8; this ensures that your text extracts retain all of the special characters and diacritics that are found in the source PDF files. This is especially important when using PDFTextOnline to extract text from PDF documents containing Chinese, Japanese, or Korean text. If we did not provide the text extract download as a ZIP file, then your browser would probably not recognize the UTF-8 encoding, and any special characters found in the PDF text extract would display improperly.
- We plan on including extracted PDF metadata, form data, and bookmarks in the ZIP file in the future. This will round out the downloaded file to include everything that PDFTextOnline is capable of extracting.
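To make the UTF-8 point concrete, here is a minimal sketch of packing text into a ZIP file and reading it back with Python's standard `zipfile` module. The function names and the entry name `extract.txt` are assumptions made for illustration; the actual layout of PDFTextOnline's ZIP file is not described here.

```python
import io
import zipfile


def build_extract_zip(text: str, name: str = "extract.txt") -> bytes:
    """Pack text into an in-memory ZIP file as UTF-8 bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Encode explicitly so the stored bytes are UTF-8 on every platform.
        archive.writestr(name, text.encode("utf-8"))
    return buffer.getvalue()


def read_extract_zip(data: bytes, name: str = "extract.txt") -> str:
    """Read the entry back and decode it as UTF-8."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.read(name).decode("utf-8")
```

Because the ZIP entry stores raw bytes, nothing between the server and your text editor gets a chance to guess the wrong encoding; you only need to open the unpacked file as UTF-8.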
Other Data
The body text isn't the only type of PDF content to which PDFTextOnline provides access. At the top of the PDFTextOnline interface, you will find a set of tabs:

Document Properties
If your PDF document contains document properties (most do), clicking on the 'Document Properties' tab will show those properties to you:

You will find many of the more common PDF document properties familiar: creation date, modification date, author, title, etc. Others are custom, and used in specialized PDF processing environments.
Form Data
Clicking on the 'Form Data' tab will show any interactive form data fields that are available in the PDF document you uploaded.
Bookmarks
If your PDF document contains bookmarks, opening up the 'Bookmarks' view on the left side of your browser window will show its bookmark hierarchy (sometimes called the 'document outline').
This tree of bookmarks works just like the one in many PDF viewers (including Adobe Acrobat). When you click on a bookmark, PDFTextOnline will bring you to that page of text (retrieving it from the server first, if it hasn't been loaded yet). Also, if you 'hover' your cursor over a bookmark, a tooltip will pop up indicating to what page the bookmark refers.
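For readers who think in data structures, a bookmark hierarchy like the one described above can be modeled as a simple tree. The sketch below is illustrative only: the `Bookmark` class, `walk`, and `tooltip` are hypothetical names, and the tooltip wording is an assumption rather than what PDFTextOnline displays.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Bookmark:
    title: str
    page: int  # the page of text this bookmark points to
    children: List["Bookmark"] = field(default_factory=list)


def walk(bookmarks: List[Bookmark], depth: int = 0) -> Iterator[Tuple[int, Bookmark]]:
    """Yield (depth, bookmark) pairs in outline order, parents before children."""
    for bookmark in bookmarks:
        yield depth, bookmark
        yield from walk(bookmark.children, depth + 1)


def tooltip(bookmark: Bookmark) -> str:
    """Text to show when the cursor hovers over a bookmark (wording assumed)."""
    return f"{bookmark.title} (page {bookmark.page})"
```

Clicking a bookmark then amounts to showing the page stored in `bookmark.page`, loading it from the server first if it has not been extracted yet.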
Display Options
In the upper-left corner of your browser window, you will see the options view:

Font Chooser
You can choose which font is used to display PDF text. Since PDFTextOnline delivers pure text to your browser, without any style information, the font you choose will have a big impact on how each page's text looks.
Page Layout
PDFTextOnline provides two different modes that determine how text from a PDF page is extracted:
- Visual: The default mode. It attempts to extract text so that it matches the visual layout of each page, so table columns line up, headings stay centered over articles (assuming the headings are centered in the source PDF), and so on. Some variation is still likely, because plain text can only approximate the formatting of a rich PDF document, but in most cases this layout mode comes very close to the look of the original document.
- Semantic: This mode renders extracted text so that semantically significant boundaries in the content are preserved. For example, the semantic layout keeps separate columns of text separate, rather than interleaving them line by line.
If you are still unsure as to what these layout modes mean, here's a comparative example:
Visual:
But, in a larger sense, we can not dedicate --      our poor power to add or detract. The world will
we can not consecrate -- we can not hallow --       little note, nor long remember what we say here,
this ground. The brave men, living and dead, who    but it can never forget what they did here. It is
struggled here, have consecrated it, far above      for us the living, rather, to be dedicated here

Semantic:
But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here
Please note that the page layout setting also carries over to the extraction that is performed when you save all of a PDF document's text to disk.
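The difference between the two layout modes can also be sketched in a few lines of code. This is a toy model, not how PDFTextStream works internally: it assumes a page has already been split into rows of (left column, right column) text, and the function names are hypothetical.

```python
from typing import List, Tuple

# One printed line of a two-column page: (left column text, right column text).
Row = Tuple[str, str]


def visual_layout(rows: List[Row], gap: int = 4) -> str:
    """Keep the columns side by side, padded so the right column lines up."""
    if not rows:
        return ""
    width = max(len(left) for left, _ in rows)
    return "\n".join(
        f"{left.ljust(width + gap)}{right}".rstrip() for left, right in rows
    )


def semantic_layout(rows: List[Row]) -> str:
    """Emit the whole left column first, then the whole right column."""
    left = " ".join(text for text, _ in rows if text)
    right = " ".join(text for _, text in rows if text)
    return "\n\n".join(part for part in (left, right) if part)
```

The visual version preserves what the page looks like; the semantic version preserves reading order, which is usually what you want when the text will be searched, indexed, or read aloud.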
How was that?
We hope this has been helpful. If you have any additional questions or comments, please don't hesitate to email us.
