Compact PDF Specification

Compact PDF is a new format that can give an additional compression of 30 to 60% on many classes of PDF beyond what is possible in PDF 1.5. For instance, the PDF Reference 1.5 shrinks from 12.2MB as distributed by Adobe down to 4.4MB in Compact format. See Compress results for the compression ratios one tool achieves with this format.

PDFs can be re-compressed with a free tool to achieve smaller sizes in Compact format. Compact PDF is presently supported by the Multivalent Browser and the Multivalent PDF Tools. Compact PDF is not directly supported by Acrobat, but that same tool can convert back to standard PDF whenever needed. The reference implementation, which is open source, is provided by the classes multivalent.std.adaptor.pdf.PDFReader and multivalent.std.adaptor.pdf.PDFWriter.

Overview

PDFs are compressed internally, but for various reasons they are often not as small as they could be. Compact PDF introduces three techniques over what is defined in PDF 1.5 to achieve greater compression.
Bulk compression of most objects in same stream
PDF 1.5 introduces object streams which are very effective in compressing hyperlinks, which can be plentiful and were previously uncompressed. However, pages cannot be put into object streams and must be compressed separately. This is unfortunate for the general purpose compression algorithms (LZW and Flate), since they achieve compression by replacing patterns previously seen with short indexes to the pattern, and with less text there is less opportunity for sharing. Compact PDF compresses all or most pages together, into a compact stream.
BZip2
PDF includes LZW and Flate general purpose compression algorithms. In almost all cases Flate compresses better than LZW. Flate is fast and compresses well. However, for plain text, BZip2 often compresses better, and underneath the compression and encryption, PDF pages are text-based command streams. Compact PDF accepts BZip2 compression.
Unencrypted Type 1 fonts
Type 1 fonts are encrypted for historical reasons, although the encryption was broken long ago and in fact Adobe publishes the method. Encrypted data is like random noise to compressors. Compact PDF decrypts Type 1 fonts so that they are amenable to compression. Implementors should also consider converting these fonts to Type 1C.
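The effect of the first two techniques can be explored with a small experiment. The following is a minimal sketch, not part of the reference implementation: the page contents are synthetic sample data, and make_page is a hypothetical helper invented for this illustration. It compresses twenty similar page content streams separately with Flate, then together with Flate and with BZip2.

```python
import bz2
import zlib


def make_page(page_no: int, lines: int = 40) -> bytes:
    """Build a synthetic page content stream (hypothetical sample data)."""
    ops = [
        f"BT /F1 12 Tf 72 {720 - 14 * i} Td (Line {i} of page {page_no}) Tj ET"
        for i in range(lines)
    ]
    return "\n".join(ops).encode("ascii")


pages = [make_page(n) for n in range(1, 21)]

# PDF 1.5 style: each page content stream is Flate-compressed on its own.
separate_flate = sum(len(zlib.compress(p, 9)) for p in pages)

# Compact style: all pages share one stream, so later pages can reuse
# patterns the compressor has already seen in earlier pages.
joined = b"\n".join(pages)
bulk_flate = len(zlib.compress(joined, 9))
bulk_bzip2 = len(bz2.compress(joined, 9))

print(f"separate Flate: {separate_flate} bytes")
print(f"bulk Flate:     {bulk_flate} bytes")
print(f"bulk BZip2:     {bulk_bzip2} bytes")
```

The sizes depend heavily on the input. Real PDFs are far less repetitive than this sample, so the numbers printed here say nothing about the ratios quoted above.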

Compact PDF is in valid PDF syntax, although non-Compact-aware PDF libraries won't know how to extract objects from the Compact stream. Since it is valid PDF, standard PDF features for encryption and incremental updating can be applied. Compact format is at odds with Linearization (aka Fast Web View), since Linearization allows parts of the PDF to be transmitted as needed, whereas Compact puts most of the PDF in a single stream. Implementations could capture most of the benefits of each by putting the first page or few pages in Linearized format and the rest in Compact format.

PDF software generally needs random access to a PDF. Software can convert a PDF in Compact format into one in standard format very quickly and then operate on it just as before. In tests on a 500MHz Pentium III, conversion took about 1 second per megabyte of PDF. Since the large majority of PDFs are less than 2MB in size and machines are several times faster, this preprocessing is hardly noticeable. For larger PDFs, the results can be cached, just as a web browser caches web pages fetched over the relatively slow network.

PDF Syntax

Compact stream
The main source of additional compression over PDF 1.5 is compressing almost all objects together in the same stream. The Compact stream is somewhat similar to PDF 1.5 object streams, but it can contain other streams. Within the stream, objects are written out one after the other as follows: object number, whitespace, object data, whitespace. Embedded streams are introduced with the single character s (an abbreviation of stream) between the dictionary part and the byte data; otherwise the object wrappers obj / endobj / endstream are omitted. Generation numbers are assumed to be 0. Some objects can be left out of the Compact stream, such as metadata, encryption, and incrementally added objects. Within the Compact stream, indirect references are written without the generation number, which is implicitly 0, and with the R immediately following the object number, without intervening whitespace; e.g., 50 0 R is written 50R. Streams within the Compact stream must be given /Length values as immediate integers, not indirect references. The Compact stream can be compressed with Flate or BZip2, but BZip2 usually produces smaller data on PDF objects.
Key      Type      Description
N        integer   (Required) The number of objects in the stream, similar to PDF 1.5 object streams.
Filter   name      Also supports BZip2Decode for BZip2 compression.
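The object layout described above can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: shorten_refs, build_compact_body, and DECODERS are hypothetical names, the reference rewriting is a naive pattern match rather than a real PDF tokenizer, and the bytes that separate the s marker from the stream data (a single newline here) are an assumption, since this section does not spell them out.

```python
import bz2
import re
import zlib

# Matches generation-0 references such as b"50 0 R". Naive: it does not
# skip string literals that happen to contain the same pattern.
_REF = re.compile(rb"\b(\d+)\s+0\s+R\b")


def shorten_refs(obj: bytes) -> bytes:
    """Rewrite b'50 0 R' as b'50R' in the non-stream part of an object."""
    return _REF.sub(rb"\1R", obj)


def build_compact_body(objects: dict) -> bytes:
    """objects maps object number -> (object bytes, stream bytes or None)."""
    out = bytearray()
    for num in sorted(objects):
        body, stream = objects[num]
        # Object number, whitespace, object data; no obj/endobj wrappers.
        out += b"%d " % num + shorten_refs(body)
        if stream is not None:
            # ASSUMPTION: "s", one newline, then exactly /Length raw bytes.
            out += b"s\n" + stream
        out += b"\n"
    return bytes(out)


# Decoders keyed by /Filter name; BZip2Decode is the Compact addition.
DECODERS = {"FlateDecode": zlib.decompress, "BZip2Decode": bz2.decompress}

content = b"BT (Hi) Tj ET"
objects = {
    3: (b"<< /Type /Page /Parent 1 0 R /Contents 4 0 R >>", None),
    # /Length is a direct integer, as required inside the Compact stream.
    4: (b"<< /Length %d >>" % len(content), content),
}

body = build_compact_body(objects)
compressed = bz2.compress(body)
n_value = len(objects)  # value for the /N key of the Compact stream
```

A reader would reverse the steps: look up the decoder for the stream's /Filter, decompress, then walk the body object by object, using each embedded stream's /Length to skip its raw bytes.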
Cross-reference table
The objects in the Compact stream are not represented in the cross-reference table. This drastically reduces the size of the table, often to just object 0, perhaps the /Info dictionary, and of course the Compact stream itself. The cross-reference information can be written as a table or as a cross-reference stream as defined in PDF 1.5.
Trailer
The document trailer contains one additional key, whose value is the compression dictionary itself, written directly rather than as an indirect reference.
Key        Type                                  Description
Compress   dictionary (not indirect reference)   compression dictionary (see below)
Within the compression dictionary (all values are direct):
Key       Type                 Description
Filter    name                 (Optional) Compression method. Known methods: PDF1.5, Compact. Default: PDF1.5.
Version   name                 (Optional) Version of compression method. Default: 1.0.
Root      dictionary           (Optional) If present, overrides /Root in trailer. This feature can be used to show an information page for non-aware viewers.
Compact   indirect reference   (Required if filter is "Compact") Indirect reference to the Compact stream object.
LengthO   integer              (Optional but recommended) Original length of the PDF.
SpecO     name                 (Optional) Original PDF specification. For example, if the original PDF document adhered to PDF 1.3 but compression used object streams from 1.5, then SpecO would hold 1.3.
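A reader might resolve the compression dictionary and its defaults roughly as follows. This is a sketch under assumptions: the trailer is taken to be already parsed into plain Python dictionaries with string keys, indirect references are shown as (number, generation) tuples, and CompressInfo and read_compress_info are hypothetical names. Treating a trailer without /Compress as a standard PDF is also an assumption.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CompressInfo:
    method: str                     # /Filter, default "PDF1.5"
    version: str                    # /Version, default "1.0"
    root: Optional[Any]             # effective document root
    compact_ref: Optional[Any]      # /Compact, the Compact stream object
    original_length: Optional[int]  # /LengthO
    original_spec: Optional[str]    # /SpecO


def read_compress_info(trailer: dict) -> CompressInfo:
    # ASSUMPTION: a missing /Compress key means an ordinary PDF.
    comp = trailer.get("Compress", {})
    method = comp.get("Filter", "PDF1.5")
    if method == "Compact" and "Compact" not in comp:
        raise ValueError("/Compact is required when /Filter is Compact")
    return CompressInfo(
        method=method,
        version=comp.get("Version", "1.0"),
        # /Root in the compression dictionary overrides the trailer's /Root.
        root=comp.get("Root", trailer.get("Root")),
        compact_ref=comp.get("Compact"),
        original_length=comp.get("LengthO"),
        original_spec=comp.get("SpecO"),
    )


trailer = {
    "Root": (1, 0),
    "Compress": {"Filter": "Compact", "Compact": (7, 0), "SpecO": "1.3"},
}
info = read_compress_info(trailer)
print(info.method, info.version, info.compact_ref)
```

Because /Root in the compression dictionary takes precedence, the trailer's own /Root can point at an information page that only non-aware viewers will display.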
Embedded Font Streams
Within the Compact stream, any embedded Type 1 fonts are decrypted in order to improve compression. When reading Type 1 fonts, software should re-encrypt them. The encryption method is described in the Adobe Type 1 Font Format specification, available online.
Key    Type     Description
Type   string   if value is Type1U, font is an unencrypted Type 1
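For orientation, the eexec cipher published in the Adobe Type 1 Font Format can be sketched as below. The function names are hypothetical, and this covers only the eexec layer of the font's encrypted portion; this section does not say whether Compact PDF also removes the separate charstring encryption, so that is left out.

```python
EEXEC_KEY = 55665        # initial key for the eexec-encrypted portion
C1, C2 = 52845, 22719    # cipher constants from the Type 1 specification


def eexec_decrypt(cipher: bytes, key: int = EEXEC_KEY, skip: int = 4) -> bytes:
    """Decrypt and drop the leading bytes that carry no font data."""
    r = key
    plain = bytearray()
    for c in cipher:
        plain.append(c ^ (r >> 8))
        r = ((c + r) * C1 + C2) & 0xFFFF
    return bytes(plain[skip:])


def eexec_encrypt(plain: bytes, key: int = EEXEC_KEY,
                  lead: bytes = b"\x00" * 4) -> bytes:
    """Re-encrypt, prefixing lead bytes (normally random) before the data."""
    r = key
    cipher = bytearray()
    for p in lead + plain:
        c = p ^ (r >> 8)
        cipher.append(c)
        r = ((c + r) * C1 + C2) & 0xFFFF
    return bytes(cipher)


sample = b"dup /Private 8 dict dup begin"
restored = eexec_decrypt(eexec_encrypt(sample))
print(restored == sample)
```

A real writer would choose the four lead bytes according to the constraints in the Type 1 specification rather than always using zeros, which are used here only to keep the example deterministic.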