Extract text and more from any supported document type, including PDF, HTML, and DVI.
Extract:
- text, optionally normalized to Unicode
- document structure
- hyperlinks
- layout (bounding boxes)
- style (font, colors)
Because the extractors are built on the same engine used for rendering, they are not confused by markup or spacing:
HTML extraction joins words that are split across style markup,
and PDF extraction works hard to reassemble words from fragments that
may have been positioned with intervening kerning adjustments.
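As a rough illustration of what that reassembly involves, the following small Java sketch merges positioned text fragments into words using a simple gap threshold. It is illustrative only, not the engine's actual algorithm; the Fragment record, the threshold value, and the sample coordinates are assumptions made for the example.

// Illustrative only: a minimal sketch of the kind of heuristic involved in
// reassembling words from positioned PDF text fragments. The real engine is
// more sophisticated; the gap threshold here is an assumption.
import java.util.ArrayList;
import java.util.List;

public class FragmentMergeSketch {
    /** A text fragment with its horizontal start position and width, in points. */
    record Fragment(String text, double x, double width) {}

    /** Joins fragments into one word while the gap to the next fragment stays small. */
    static List<String> mergeIntoWords(List<Fragment> fragments, double maxGap) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Fragment previous = null;
        for (Fragment f : fragments) {
            if (previous != null && f.x() - (previous.x() + previous.width()) > maxGap) {
                words.add(current.toString());   // large gap: start a new word
                current.setLength(0);
            }
            current.append(f.text());
            previous = f;
        }
        if (current.length() > 0) words.add(current.toString());
        return words;
    }

    public static void main(String[] args) {
        // "Ke" and "rning" were positioned separately (kerning), then a real space follows.
        List<Fragment> line = List.of(
            new Fragment("Ke", 100, 12), new Fragment("rning", 112.4, 30),
            new Fragment("matters", 150, 44));
        System.out.println(mergeIntoWords(line, 2.0));  // prints [Kerning, matters]
    }
}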
Options
java tool.doc.ExtractText [options] URL|filename...
- -output [format]
  - unicode -- Extract text, normalized to Unicode. (Default.)
  - doctree -- Display the document tree, the primary internal runtime data structure for the document, as structure and text in XML syntax (though not necessarily well nested, as valid XML requires).
  - xml -- Output text content normalized to Unicode, and document structure as XML tags. Note that this does not convert the source document to XML; it reports extraction results in XML syntax.
- -bbox -- Report layout bounding boxes for structure and for each word, in coordinates relative to a tag's parent tag. This requires document tree or XML output (see -output above). The bounding box for the enclosed subtree is given in the bbox attribute, in the form width x height @ x, y. (A parsing sketch follows the Examples below.)
- -links -- In Unicode text mode, output only the hyperlink URIs. In document tree and XML modes, also report hyperlinks, normalized to HTML <a href=...> syntax.
- -span -- In addition to reporting structure in XML format, also report spans, which are used for hyperlinks and much of the formatting.
- -style -- In XML mode, normalize styling information across document types into the following HTML tags: B, I, U, and FONT (with FACE, SIZE, COLOR, and BACKGROUND-COLOR attributes). Mutually exclusive with -span.
- -page range -- For paginated documents, extract text only from the pages in range.
Examples
1. The default operation extracts text as Unicode, from any supported document format:
java tool.doc.ExtractText Annotations.pdf
java -classpath .../Multivalent.jar:.../DVI.jar tool.doc.ExtractText tripman.dvi ls.1
2. Extract links
java tool.doc.ExtractText -links index.html PDFReference16.pdf tripman.dvi
3. Dump document tree
java tool.doc.ExtractText -output doctree -span -links tripman.dvi
4. XML
java tool.doc.ExtractText -output xml tripman.dvi
5. XML with layout and style
java tool.doc.ExtractText -output xml -bbox -style tripman.dvi
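The XML produced by example 5 carries bbox attributes in the width x height @ x, y form described under -bbox. The following Java sketch shows one way such an attribute value might be parsed; the exact separators and whitespace are an assumption here, so adjust the pattern to the output you actually see.

// A minimal sketch for pulling apart a bbox attribute value from the XML output
// of example 5. The literal form of the attribute (separators, whitespace) is an
// assumption; adjust the regular expression to match the real output.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BboxSketch {
    // Matches "width x height @ x, y", tolerating optional whitespace.
    private static final Pattern BBOX = Pattern.compile(
        "(\\d+(?:\\.\\d+)?)\\s*x\\s*(\\d+(?:\\.\\d+)?)\\s*@\\s*(\\d+(?:\\.\\d+)?)\\s*,\\s*(\\d+(?:\\.\\d+)?)");

    public static void main(String[] args) {
        String attribute = "120x14@72,640";   // hypothetical bbox value
        Matcher m = BBOX.matcher(attribute);
        if (m.matches()) {
            double width  = Double.parseDouble(m.group(1));
            double height = Double.parseDouble(m.group(2));
            double x      = Double.parseDouble(m.group(3));
            double y      = Double.parseDouble(m.group(4));
            System.out.printf("w=%.1f h=%.1f at (%.1f, %.1f)%n", width, height, x, y);
        }
    }
}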
Technical Notes
Garbage text
Sometimes text that appears fine when rendered in the browser extracts as apparent garbage.
Here are some issues:
- Text may be drawn not with fonts but with vector shapes or in an image.
Use OCR software to extract this text.
- Some documents may not have any text. Paper scanned into images or PDF can be such a case.
(Scanned paper can have OCR text hidden behind the paper image, and this tool will find that text.)
- Text may not have a Unicode mapping.
PDF Type 3 fonts often do not, and TeX DVI has characters that do not have Unicode equivalents.
- The Unicode encoding may be buggy.
OpenOffice maps different characters to the same Unicode code point, resulting in apparent letter dropping and doubling.
- Some formats, such as PDF, can draw text out of reading order (in any order, in fact).
Text is extracted in the order it is encountered in the source.
Further analysis using the layout coordinates (see -bbox) can help recover reading order; see the sketch below.
PDF in particular is susceptible to the issues above, so before reporting a bug, check the result against a copy-and-paste from Acrobat.
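Where reading order matters, the bounding boxes reported with -bbox give enough information for a rough reconstruction. The following Java sketch is illustrative only: it groups words into lines by their y coordinate and sorts each line left to right. The Word record, the line tolerance, and the sample coordinates are assumptions, and real layouts (columns, tables, rotated text) need more careful analysis.

// Illustrative only: approximate reading order from word bounding boxes by
// grouping words into lines on y and sorting left to right on x.
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ReadingOrderSketch {
    record Word(String text, double x, double y) {}

    static String recover(List<Word> words, double lineTolerance) {
        return words.stream()
            .sorted(Comparator
                // Words whose y values round to the same bucket are treated as one line.
                // Assumes y increases downward; reverse if the origin is at the bottom.
                .comparingDouble((Word w) -> Math.round(w.y() / lineTolerance))
                .thenComparingDouble(Word::x))   // then left to right within a line
            .map(Word::text)
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Words as they might be encountered in the PDF source, out of order.
        List<Word> words = List.of(
            new Word("order.", 160, 100), new Word("Reading", 72, 100),
            new Word("matters", 120, 100), new Word("Second", 72, 114),
            new Word("line.", 130, 114));
        System.out.println(recover(words, 10.0));
        // prints: Reading matters order. Second line.
    }
}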
Performance
Performance is generally excellent thanks to optimizations ranging from buffered I/O to skipping non-text elements such as images.
If using this tool for mass text extraction, start Java with the -server flag (note that this flag is given to the Java VM, not to this tool):
java -server tool.doc.ExtractText [options] files
While extraction from most formats is extremely fast, PDF takes more time because its text is stored in fragments, usually without spaces, and must be analyzed to reassemble words and lines.
Options that request layout and other additional information also require more work and therefore more time.
This tool first collects all text in a given document before returning it all at once;
for PDFs that are complex and have many pages, this can take a minute or more.
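For mass extraction it can be convenient to drive the command line above from a small program. The following Java sketch runs the documented command with ProcessBuilder and streams its output to a file. The jar path and file names are placeholders, and the assumption that the extracted text arrives on standard output is based on the examples above rather than stated explicitly.

// A sketch for batch extraction: run the documented command in a child process
// and copy its standard output to a file. Paths and file names are placeholders.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class BatchExtractSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        List<String> command = List.of(
            "java", "-server", "-classpath", "/path/to/Multivalent.jar",
            "tool.doc.ExtractText", "report1.pdf", "report2.pdf");
        Process p = new ProcessBuilder(command)
            .redirectErrorStream(true)                 // merge stderr into stdout
            .start();
        Path out = Path.of("extracted.txt");
        // Stream the extracted text to a file as it is produced.
        Files.copy(p.getInputStream(), out, StandardCopyOption.REPLACE_EXISTING);
        int exit = p.waitFor();
        System.out.println("exit code " + exit + ", wrote " + out);
    }
}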