Extract Text

Extract text and more from any supported document type, including PDF, HTML, and DVI.


Because the extractors are built on the same engine that is used for rendering, they are not confused by markup or spacing: the HTML extractor joins words that are split by style markup, and the PDF extractor tries hard to reassemble words from fragments that may have been positioned with intervening kerning.
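To illustrate the kind of fragment reassembly described above, here is a minimal sketch. It is not Multivalent's actual algorithm: the FragmentJoiner class, the Fragment type, and the fixed gap threshold are all hypothetical, chosen only to show how a small kerning gap can be told apart from a real word space.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; these are not Multivalent classes.
class FragmentJoiner {
    /** A run of text positioned on a line: left edge and advance width. */
    static final class Fragment {
        final String text;
        final double x;
        final double width;

        Fragment(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins fragments that are already sorted left to right on one line.
     * A space is inserted only when the horizontal gap between the end of
     * the previous fragment and the start of the next exceeds the threshold.
     */
    static String join(List<Fragment> line, double spaceThreshold) {
        StringBuilder sb = new StringBuilder();
        double prevEnd = 0;
        boolean first = true;
        for (Fragment f : line) {
            if (!first && f.x - prevEnd > spaceThreshold) {
                sb.append(' ');
            }
            sb.append(f.text);
            prevEnd = f.x + f.width;
            first = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> line = Arrays.asList(
            new Fragment("Ker", 0.0, 18.0),
            new Fragment("ning", 17.5, 24.0),  // gap of -0.5: kerning, same word
            new Fragment("test", 46.0, 20.0)); // gap of 4.5: a real word break
        System.out.println(join(line, 2.0));   // prints "Kerning test"
    }
}
```

A real extractor would presumably derive the threshold from the font size and glyph metrics rather than use a constant, and would also have to group fragments into lines first.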


java tool.doc.ExtractText [options] URL|filename...


1. The default operation extracts text as Unicode, from any supported document format:

java tool.doc.ExtractText Annotations.pdf

java -classpath .../Multivalent.jar:.../DVI.jar tool.doc.ExtractText tripman.dvi ls.1

2. Extract links

java tool.doc.ExtractText -links index.html PDFReference16.pdf tripman.dvi

3. Dump document tree

java tool.doc.ExtractText -output doctree -span -link tripman.dvi

4. XML

java tool.doc.ExtractText -output xml tripman.dvi

5. XML with layout and style

java tool.doc.ExtractText -output xml -bbox -style tripman.dvi

Technical Notes

Garbage text

Sometimes text that appears fine when rendered by the browser extracts as apparent garbage. Here are some issues:

PDF, especially, is susceptible to the above, so before reporting a bug check against a copy-and-paste in Acrobat.


Performance is generally excellent due to various technical optimizations, ranging from buffered I/O to skipping non-text elements such as images. If you use this tool for mass text extraction, start Java with the -server flag (note that this flag is given to Java itself, not to this tool):
java -server tool.doc.ExtractText [options] files

While extraction on most formats is extremely fast, PDF takes more time because its text is stored in fragments, usually without spaces, and must be analyzed to reassemble words and lines. Options that request layout or other additional information also require more work and therefore more time. This tool collects all the text in a given document before returning it all at once; for complex PDFs with many pages, this can take a minute or more.
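For mass extraction, one possible way to drive the tool from another Java program is sketched below. The BatchExtract class and its buildCommand helper are hypothetical, and the classpath is left as a parameter because it depends on where the jars are installed. The point shown is that -server and -classpath go before the class name, while the tool's own options and the file names go after it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper; only the tool.doc.ExtractText command line comes from this page.
class BatchExtract {
    /** Builds the argument list: JVM flags first, then the class, then tool options and files. */
    static List<String> buildCommand(String classpath, List<String> toolOptions, List<String> files) {
        List<String> cmd = new ArrayList<String>();
        cmd.add("java");
        cmd.add("-server");            // JVM flag, so it precedes the class name
        cmd.add("-classpath");
        cmd.add(classpath);            // assumption: caller supplies the jar locations
        cmd.add("tool.doc.ExtractText");
        cmd.addAll(toolOptions);       // tool options follow the class name
        cmd.addAll(files);
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: java BatchExtract <classpath> <file>...");
            return;
        }
        List<String> files = Arrays.asList(args).subList(1, args.length);
        List<String> cmd = buildCommand(args[0], new ArrayList<String>(), files);
        // A single JVM handles the whole batch, so startup cost is paid once.
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        System.exit(p.waitFor());
    }
}
```

Passing all files to one invocation relies on the usage line accepting multiple file names. Since the tool returns each document's text only after the whole document has been collected, output for a large PDF may appear after a noticeable pause.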