How to Write a Media Adaptor

Parsing

As described in "Behaviors", a media adaptor is a type of behavior whose primary job is to parse a concrete document format and build a document tree. Since media adaptors are behaviors, everything in "How to Write a Behavior" applies. In addition, media adaptors follow some conventions of their own.

Media adaptors treat their data in one of three ways: as a byte stream, as a character stream, or as a random-access file. New media adaptors should subclass, respectively, multivalent.MediaAdaptorByte, multivalent.MediaAdaptorReader, or multivalent.MediaAdaptorFile.

The essential method for media adaptors is parse(INode). In fact, a media adaptor is often little more than a subclass that implements this method. A media adaptor can assume that its input has been established externally, as a java.io.InputStream byte stream, a java.io.Reader character stream, or a java.io.File, the last of which is created when a MediaAdaptorFile first invokes getFile(). The parse(INode) method is then invoked with the node of the document tree to which the content of the document being read should be attached.

parse(INode) is usually a big loop that reads from the document format and creates a runtime representation in multivalent.Nodes. As far as possible, the runtime document tree should be slavishly faithful, preserving all information found in the file format. Structure or hierarchy should be represented as internal nodes; content such as text or images should be represented as leaves. Presentation / appearance should be captured as a stylesheet if possible, or as multivalent.Spans if not. If a span's start and end points can be separated by an arbitrary distance, as in HTML, multivalent.Span's open(Node) and close(Node) can be a convenient way to attach spans to content. Metadata, such as author and dates, should be stored in the closest containing multivalent.Document. Most media adaptors make a "top node" or "document root" of their own, underneath the passed node, and give it a tag/name that's the same as the document format; this is a convenient way for the associated stylesheet to affect the entire document.
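The shape of such a parse loop can be sketched without the Multivalent classes. The example below is only an illustration under stated assumptions: SketchNode is a hypothetical stand-in for a tree node (it is not multivalent.Node or multivalent.INode), the "toy" format of blank-line-separated paragraphs is invented for the example, and the parse method shown here does not reproduce the real parse(INode) signature. It shows a top node named after the format, internal nodes for structure, and leaves for content.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a tree node; NOT multivalent.Node. */
class SketchNode {
    final String name;
    final String text;  // null for internal nodes, content for leaves
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String name, String text, SketchNode parent) {
        this.name = name;
        this.text = text;
        if (parent != null) parent.children.add(this);
    }
}

final class ParseLoopSketch {
    /** Reads the invented "toy" format and attaches the result under parent. */
    static SketchNode parse(Reader in, SketchNode parent) throws IOException {
        // Top node named after the format, so a stylesheet could target the whole document.
        SketchNode top = new SketchNode("toy", null, parent);
        SketchNode para = null;
        BufferedReader reader = new BufferedReader(in);
        for (String line; (line = reader.readLine()) != null; ) {
            if (line.trim().isEmpty()) {  // a blank line ends the current paragraph
                para = null;
                continue;
            }
            if (para == null) para = new SketchNode("para", null, top);  // structure: internal node
            for (String word : line.trim().split("\\s+")) {
                new SketchNode("word", word, para);                      // content: leaf
            }
        }
        return top;
    }

    public static void main(String[] args) throws IOException {
        SketchNode root = new SketchNode("root", null, null);
        SketchNode top = parse(new StringReader("hello world\n\nsecond paragraph\n"), root);
        System.out.println(top.name + " has " + top.children.size() + " paragraphs");
    }
}
```

A real media adaptor would build multivalent.Nodes instead and attach them under the node it was handed; the loop structure is the part this sketch is meant to convey.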

Documents such as HTML that produce a long scroll should be created in their entirety. Paginated documents, such as DVI and PDF, should supply the page count to the prevailing document under the Document.ATTR_PAGECNT attribute, and should produce the single page specified by the Document.ATTR_PAGE attribute.

On encountering an unrecoverable parsing error, usually due to an invalid data format, throw a multivalent.ParseException. java.io.IOExceptions are not parsing errors, and should be reported as I/O errors.
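The distinction between the two kinds of failure can be sketched as follows. This is a minimal illustration, not Multivalent code: SketchParseException is a hypothetical stand-in for multivalent.ParseException (whose constructors are not described here), and the two-byte header with its magic value is invented for the example. Invalid data raises the parse exception, while an IOException from the underlying stream is left to propagate unchanged.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for multivalent.ParseException. */
class SketchParseException extends Exception {
    SketchParseException(String message) {
        super(message);
    }
}

final class HeaderSketch {
    static final int MAGIC = 0xF7;  // invented magic byte for this toy header

    /** Reads a toy header: one magic byte followed by one version byte. */
    static int readVersion(InputStream in) throws IOException, SketchParseException {
        int magic = in.read();  // an IOException here is an I/O error, not a parse error
        if (magic != MAGIC) {
            throw new SketchParseException("bad magic byte: " + magic);
        }
        int version = in.read();
        if (version < 0) {      // read() returns -1 at end of stream
            throw new SketchParseException("truncated header");
        }
        return version;
    }

    public static void main(String[] args) throws IOException {
        try {
            int version = readVersion(new ByteArrayInputStream(new byte[] {(byte) MAGIC, 2}));
            System.out.println("version " + version);
            readVersion(new ByteArrayInputStream(new byte[] {0, 2}));
        } catch (SketchParseException e) {
            System.out.println("parse error: " + e.getMessage());
        }
    }
}
```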

The node passed as a parameter to parse(INode) can be used to obtain the prevailing/enclosing multivalent.Document and multivalent.Browser, via Node.getDocument() and Node.getBrowser(). However, media adaptors can be used outside of a browser environment, for example to supply parsed text for full-text indexing, and so media adaptors should not rely on the node being connected to a larger tree.

It is recommended that media adaptors construct document trees that directly and fully represent the document format. However, it can be expedient to write a quick-and-dirty converter into another document format, such as Perl POD to HTML. In that case, the converter can generate the target format and pass it to MediaAdaptor.parseHelper() to convert that to a document tree.

Distribution

Media adaptors are packaged like other behaviors, in JARs. Media adaptors usually hook into the system in a different way, however. Media adaptors want to be invoked when a document of the right type is encountered. The relationship between a document MIME type and/or file suffix and a media adaptor is established in sys/Preferences.txt. At startup the system will read its own sys/Preferences.txt startup file, then those in all JARs in the same directory in an undefined order, then the one found in a user's home directory.

In sys/Preferences.txt, the mediaadaptor command maps from MIME type and/or suffix to genre. The remap command maps from genre to Java class. For instance, this is DVI.jar's sys/Preferences.txt:

mediaadaptor	dvi	DVI
mediaadaptor	application/x-dvi	DVI

remap	DVI	tex.dvi.DVI
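The two-step lookup described above (suffix or MIME type to genre, then genre to class) can be sketched as below. This is not Multivalent's own preferences loader; PreferencesSketch and its methods are hypothetical names. The sketch assumes whitespace-separated fields, case-insensitive matching of suffixes and MIME types, and that a later definition replaces an earlier one, none of which is stated in the text above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of the lookup; not Multivalent's own preferences loader. */
final class PreferencesSketch {
    private final Map<String, String> genreByType = new HashMap<>();   // e.g. "dvi" -> "DVI"
    private final Map<String, String> classByGenre = new HashMap<>();  // e.g. "DVI" -> "tex.dvi.DVI"

    /** Reads lines of the form: command, key, value, separated by whitespace. */
    void load(String text) {
        for (String line : text.split("\\R")) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 3) continue;  // this sketch skips blank and malformed lines
            if (fields[0].equals("mediaadaptor")) {
                // Assumption: suffixes and MIME types are matched case-insensitively.
                genreByType.put(fields[1].toLowerCase(Locale.ROOT), fields[2]);
            } else if (fields[0].equals("remap")) {
                classByGenre.put(fields[1], fields[2]);
            }
            // Any other command is ignored here.
        }
    }

    /** Returns the class name registered for a suffix or MIME type, or null if there is none. */
    String classFor(String suffixOrMime) {
        String genre = genreByType.get(suffixOrMime.toLowerCase(Locale.ROOT));
        return genre == null ? null : classByGenre.get(genre);
    }

    public static void main(String[] args) {
        PreferencesSketch prefs = new PreferencesSketch();
        prefs.load("mediaadaptor\tdvi\tDVI\n"
                + "mediaadaptor\tapplication/x-dvi\tDVI\n"
                + "\n"
                + "remap\tDVI\ttex.dvi.DVI\n");
        System.out.println(prefs.classFor("dvi"));
        System.out.println(prefs.classFor("application/x-dvi"));
    }
}
```

Calling load() once per file, in the startup order described above, would let a user's file override a JAR's under the replacement assumption made here.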

Most document formats also use a stylesheet. At this time stylesheets that are automatically instantiated are written in CSS. The stylesheet is stored in the sys/stylesheet directory under the same name as the genre, plus the suffix .css. For example, DVI.jar has its stylesheet at sys/stylesheet/DVI.css. Stylesheets, especially those for media adaptors that translate to another format and use parseHelper(), can import other stylesheets using the CSS @import statement. Stylesheets in Multivalent.jar can be retrieved with the systemresource protocol, as in systemresource:/sys/stylesheet/HTML.css.

The hub for the media adaptor is stored in sys/hub under the genre name, as in sys/hub/HTML.hub.
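The naming conventions for the stylesheet and hub can be summarized as simple path construction from the genre name. The helper below is a hypothetical illustration only (GenreResourcesSketch and its methods are not part of Multivalent); it just spells out the patterns given in the text.

```java
/** Hypothetical helper that spells out the per-genre resource naming conventions. */
final class GenreResourcesSketch {
    /** Stylesheet location inside a JAR: sys/stylesheet/<genre>.css */
    static String stylesheetPath(String genre) {
        return "sys/stylesheet/" + genre + ".css";
    }

    /** Hub location inside a JAR: sys/hub/<genre>.hub */
    static String hubPath(String genre) {
        return "sys/hub/" + genre + ".hub";
    }

    /** URL form for a stylesheet shipped in Multivalent.jar, as used with @import. */
    static String systemStylesheetUrl(String genre) {
        return "systemresource:/" + stylesheetPath(genre);
    }

    public static void main(String[] args) {
        System.out.println(stylesheetPath("DVI"));
        System.out.println(hubPath("HTML"));
        System.out.println(systemStylesheetUrl("HTML"));
    }
}
```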

Example

See Plucker for a complete example: source code, stylesheet, hub, JAR packaging.