ScanSoft XDOC

Using ScanSoft XDOC analysis, manipulate scanned paper as a document rather than as a picture.

Demos

Choose from the Berkeley Digital Library project's Document 620 (backup copy), 1148, 1752, 2598, or 3025. Almost any of the thousands of scanned documents available from the project server is suitable, but at present the query interface unfortunately hides essential information behind a URL query string.

Description

When digitizing the world's paper to take advantage of the easy copying and distribution that computers and networks offer, capturing the page image with a scanner gives the highest-fidelity digital version. Running optical character recognition (OCR) to extract the text and segment out images is useful, but for most documents the recognition introduces errors, and even in the best case some formatting is often lost.

In the Multivalent Browser, you can work with the page image, yet manipulate it in many ways as if it were a document in an editor. We've used ScanSoft's XDOC to geometrically align OCR words with the page image. Behaviors can use whichever representation they need, and then display the results on the high-fidelity image. For example, you can drag out a selection in the image and get the corresponding OCR text when pasting into another application. Searching examines the OCR layer and draws boxes in the image layer. Like all document formats, scanned images can be annotated with hyperlinks, highlights, floating notes, and even blinking text. Some annotations, such as Short Comment and Move Text, reformat the page to open up space between lines for a message, even though the page is an image.
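To make the image/OCR alignment concrete, here is a minimal sketch of how a rectangle dragged out on the image could be mapped to OCR text, and how a search over the OCR layer could yield boxes to draw on the image layer. It is not the browser's actual implementation or the XDOC file format: the `Word` record, the pixel coordinates, and the functions `select_text` and `search_boxes` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR word and its bounding box in page-image pixels (hypothetical layout)."""
    text: str
    x: int
    y: int
    width: int
    height: int

    def intersects(self, left, top, right, bottom):
        """True if this word's box overlaps the given rectangle."""
        return (self.x < right and self.x + self.width > left
                and self.y < bottom and self.y + self.height > top)


def select_text(words, left, top, right, bottom):
    """Return the OCR text of every word touched by a dragged-out rectangle."""
    hits = [w for w in words if w.intersects(left, top, right, bottom)]
    # Naive reading order: top-to-bottom, then left-to-right.
    hits.sort(key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in hits)


def search_boxes(words, query):
    """Search the OCR layer; return (x, y, width, height) boxes to draw on the image."""
    q = query.lower()
    return [(w.x, w.y, w.width, w.height) for w in words if q in w.text.lower()]


# Invented sample data standing in for one analyzed page.
PAGE = [
    Word("Digital", 100, 50, 80, 20),
    Word("Library", 190, 50, 85, 20),
    Word("project", 100, 80, 82, 20),
]

print(select_text(PAGE, 90, 40, 300, 75))
print(search_boxes(PAGE, "library"))
```

The point of the sketch is that both operations read the OCR words but report their results in image coordinates, which is why they can be shown on top of the scanned page.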

You can see the OCR behind the image by choosing the Show OCR lens from the Lens menu. The Magnify lens can be overlapped with Show OCR to produce magnified OCR. You can see the entire page in OCR by choosing Scanned as OCR from the View menu. When looking at the entire page in OCR, the Show Image lens has the opposite effect of Show OCR. Verbose XDOC Info displays information about the page analysis, such as the number of characters, questionable characters, and words; it also reports the version of the XDOC specification, of which we handle 7.2, 9.0, 10.0, 12.0, and 12.5.

The interface treats the collection of page images and XDOC as a single paginated document, so you can page forward and back, and save annotations from all the pages into a single file.

As in all paginated documents, HOME goes to the first page and END to the last; Space or PageDown at the bottom of a page goes to the next page, and PageUp at the top of a page goes to the previous page.
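The paging rules above can be sketched as a single function. This is only an illustration, not the browser's code: the function name `next_page`, the key-name strings, and the `at_top`/`at_bottom` scroll flags are hypothetical, and staying put at the first and last page is an assumption.

```python
def next_page(key, page, page_count, at_top, at_bottom):
    """Return the 1-based page to show after a key press.

    at_top / at_bottom say whether the view is scrolled to the top or
    bottom of the current page (hypothetical flags for this sketch).
    """
    if key == "Home":
        return 1
    if key == "End":
        return page_count
    if key in ("Space", "PageDown") and at_bottom:
        # Assumption: stay on the last page rather than wrap around.
        return min(page + 1, page_count)
    if key == "PageUp" and at_top:
        # Assumption: stay on the first page rather than wrap around.
        return max(page - 1, 1)
    # Otherwise the key only scrolls within the current page.
    return page
```

For example, `next_page("PageDown", 3, 10, at_top=False, at_bottom=True)` moves to page 4, while the same key pressed mid-page leaves the page number unchanged.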

ScanSoft's Developer's Kit 2000 might generate XDOC files. Email the project at www@elib.cs.berkeley.edu for information about conventions for processing XDOC and corresponding scanned images for use in the browser. Also note that various OCR packages, including Adobe Capture and FineReader, produce image-OCR hybrid documents; our PDF media adaptor looks for these and enlivens them with OCR lenses and all the other OCR-related behaviors.

Status

Unfortunately, the document analysis does not report all the "ink" on the page in the XDOC, hence the message warning of possible drop-out when the page is reformatted. We are not aware of widespread use of this format.
