Compact PDF Specification
Compact PDF is a new format that
can give an additional compression of 30 to 60% on many classes of PDF
beyond what is possible in PDF 1.5.
For instance, the PDF Reference 1.5 shrinks from 12.2MB as distributed by Adobe
down to 4.4MB in Compact format.
See Compress results for the compression
ratios one tool is able to achieve with this format.
PDFs can be re-compressed with a free tool to achieve smaller sizes in Compact format.
Compact PDF is presently supported by
the Multivalent Browser and the Multivalent PDF Tools.
Compact PDF is not directly supported by Acrobat,
but that same tool can convert back to standard PDF whenever needed.
The reference implementation, which is open source, is provided by the
classes multivalent.std.adaptor.pdf.PDFReader and multivalent.std.adaptor.pdf.PDFWriter.
Overview
PDFs are compressed internally, but for various reasons they are often not as small as they could be.
Compact PDF introduces three techniques over what is defined in PDF 1.5 to achieve greater compression.
- Bulk compression of most objects in the same stream
- PDF 1.5 introduces object streams which are very effective in compressing hyperlinks,
which can be plentiful and were previously uncompressed.
However, pages cannot be put into object streams and must be compressed separately.
This is unfortunate for the general purpose compression algorithms (LZW and Flate),
since they achieve compression by replacing patterns previously seen with short indexes to the pattern,
and with less text there is less opportunity for sharing.
Compact PDF compresses all or most pages together, into a compact stream.
- BZip2
- PDF includes LZW and Flate general purpose compression algorithms.
In almost all cases Flate compresses better than LZW.
Flate is fast and compresses well.
However, for plain text, BZip2 often compresses better,
and underneath the compression and encryption, PDF pages are text-based command streams.
Compact PDF accepts BZip2 compression.
- Unencrypted Type 1 fonts
- Type 1 fonts are encrypted for historical reasons, although the encryption has long since been broken
and in fact Adobe publishes the method.
Encrypted data is like random noise to compressors.
Compact PDF decrypts Type 1 fonts so that they are amenable to compression.
Implementors should also consider converting these fonts to Type 1C.
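The effect of the first two techniques can be illustrated with a small experiment. The sketch below builds synthetic, PDF-like page content (the `make_page` helper and its text are made up for illustration, not taken from a real document) and compares compressing each page separately with Flate against compressing all pages together with Flate and with BZip2. The numbers it prints are not a benchmark; they only show why sharing patterns across pages helps.

```python
import bz2
import zlib


def make_page(n: int) -> bytes:
    """Build a synthetic PDF-like content stream for page n (illustrative only)."""
    lines = [b"BT /F1 12 Tf 72 720 Td"]
    for i in range(40):
        lines.append(b"(Line %d of page %d of the sample document) Tj 0 -14 Td" % (i, n))
    lines.append(b"ET")
    return b"\n".join(lines)


pages = [make_page(n) for n in range(50)]
all_pages = b"\n".join(pages)

# PDF 1.5 style: every page content stream is compressed on its own.
separate_flate = sum(len(zlib.compress(p, 9)) for p in pages)
# Compact style: all pages share one compressed stream.
together_flate = len(zlib.compress(all_pages, 9))
together_bzip2 = len(bz2.compress(all_pages, 9))

print("uncompressed:        ", len(all_pages))
print("Flate, page by page: ", separate_flate)
print("Flate, all together: ", together_flate)
print("BZip2, all together: ", together_bzip2)
```

On real documents the gain depends on how much text the pages share, so results will vary from file to file.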
Compact PDF is in valid PDF syntax, although non-Compact-aware PDF libraries won't know how to extract objects
from the Compact stream.
Since it is valid PDF, standard PDF features for encryption and incremental updating can be applied.
Compact format is at odds with Linearization (aka Fast Web View), since Linearization allows parts of the PDF
to be transmitted as needed, whereas Compact puts most of the PDF in a single stream.
Implementations could capture most of the benefits of each by
putting the first page or few pages in Linearized format and the rest in Compact format.
PDF software generally needs random access to a PDF.
Software can convert a PDF in Compact format into one in standard format very quickly
and then operate on it just as before.
In tests on a 500MHz Pentium III, conversion took about 1 second per megabyte of PDF.
Since the large majority of PDFs are less than 2MB in size and machines are several times faster,
this preprocessing is hardly noticeable. For larger PDFs, the results can be cached,
just as a web browser caches web pages fetched over the relatively slow network.
PDF Syntax
- Compact stream
- The main source of additional compression over PDF 1.5 is compressing almost all objects together in the same stream.
It is somewhat similar to PDF 1.5 object streams, but the Compact stream can contain other streams.
Within the stream, objects are written out one after the other as follows: object number, whitespace, object data, whitespace.
Embedded streams are introduced with the single character s (an abbreviation of stream) between
the dictionary part and the byte data;
otherwise the object wrappers obj / endobj / endstream are omitted.
Generation numbers are assumed to be 0.
Objects can be left out of the Compact stream,
such as metadata, encryption, and objects incrementally added.
Within the Compact stream, indirect references are written without the generation number, which is implicitly 0,
and with the R immediately following the object number, without intervening whitespace;
e.g., 50 0 R is written 50R.
Streams within the Compact stream must give /Length values as immediate integers, not indirect references.
The Compact stream can be compressed with Flate or BZip2, but BZip2 usually produces smaller data on PDF objects.
Key | Type | Description
---|---|---
N | integer | (Required) The number of objects in the stream, similar to PDF 1.5 object streams
Filter | name | Also supports BZip2Decode for BZip2 compression
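The object layout and reference shorthand described above can be sketched roughly as follows. This is a minimal illustration rather than the reference implementation: the function names `compact_refs` and `write_compact` are hypothetical, objects are assumed to be already serialized as bytes, embedded streams (the `s` marker) are not handled, and the regular expression does not skip string literals, which a real writer would have to do by tokenizing the object.

```python
import bz2
import re

# Matches an indirect reference with generation 0, such as "50 0 R".
# The trailing \b keeps operators such as "RG" from matching.
_REF = re.compile(rb"(?<![\w.])(\d+)\s+0\s+R\b")


def compact_refs(data: bytes) -> bytes:
    """Rewrite 'n 0 R' as 'nR' (naive: does not skip string literals)."""
    return _REF.sub(rb"\1R", data)


def write_compact(objects: dict) -> tuple:
    """Serialize {object number: object bytes} as a Compact stream body.

    Returns (BZip2-compressed body, N). Each object is written as:
    object number, whitespace, object data, whitespace.
    """
    parts = []
    for num in sorted(objects):
        parts.append(b"%d %s\n" % (num, compact_refs(objects[num])))
    return bz2.compress(b"".join(parts)), len(objects)


objs = {
    1: b"<< /Type /Catalog /Pages 2 0 R >>",
    2: b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    3: b"<< /Type /Page /Parent 2 0 R >>",
}
body, n = write_compact(objs)
print(n)
print(bz2.decompress(body).decode("latin-1"))
```

The returned count would go into the stream dictionary as N, with BZip2Decode as the Filter.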
- Cross-reference table
- The objects in the Compact stream are not represented in the cross-reference stream.
This drastically reduces the size of the table, often to object 0,
perhaps the /Info dictionary, and of course the Compact stream itself.
The cross-reference table can be written as a table or as a cross-reference stream as defined in PDF 1.5.
- Trailer
- The document trailer contains one additional key, whose value is the compression dictionary, given directly rather than as an indirect reference.
Key | Type | Description
---|---|---
Compress | dictionary (not indirect reference) | Compression dictionary (see below)
Within the compression dictionary (all values are direct):
Key | Type | Description
---|---|---
Filter | name | (Optional) Compression method. Known methods: PDF1.5, Compact. Default: PDF1.5
Version | name | (Optional) Version of compression method. Default: 1.0
Root | dictionary | (Optional) If present, overrides /Root in trailer. This feature can be used to show an information page for non-aware viewers.
Compact | indirect reference | (Required if Filter is Compact) Indirect reference to the Compact stream object
LengthO | integer | (Optional but recommended) Original length of the PDF.
SpecO | name | (Optional) Original PDF specification. For example, if the original PDF document adhered to PDF 1.3 but compression used object streams from 1.5, then SpecO would hold 1.3.
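A reader might apply the defaults and the /Root override from the table along these lines. This is only a sketch: the trailer is assumed to have been parsed already into a plain Python dict with string keys and values, `resolve_compress` is a hypothetical helper name, and treating a trailer with no Compress key as an ordinary PDF 1.5 file is an assumption of this example rather than something the specification states.

```python
def resolve_compress(trailer: dict) -> dict:
    """Return the compression dictionary with defaults filled in."""
    compress = dict(trailer.get("Compress", {}))

    # Defaults taken from the table: Filter PDF1.5, Version 1.0.
    compress.setdefault("Filter", "PDF1.5")
    compress.setdefault("Version", "1.0")

    # The Compact key is required when the filter is Compact.
    if compress["Filter"] == "Compact" and "Compact" not in compress:
        raise ValueError("Filter is Compact but the Compact key is missing")

    # A Root in the compression dictionary overrides the trailer's /Root.
    compress["Root"] = compress.get("Root", trailer.get("Root"))
    return compress


trailer = {
    "Root": "info-page catalog",
    "Compress": {"Filter": "Compact", "Compact": "4 0 R", "Root": "real catalog"},
}
print(resolve_compress(trailer))
```

A viewer that does not know about the Compress key would follow the trailer's own /Root and show the information page instead.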
- Embedded Font Streams
- Within the Compact stream, any embedded Type 1 fonts are decrypted in order to improve compression.
When such fonts are read back, they should be re-encrypted.
The encryption method is described in the Adobe Type 1 Font Format specification, available online.
Key | Type | Description
---|---|---
Type | string | If value is Type1U, font is an unencrypted Type 1
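For reference, the eexec cipher published in the Adobe Type 1 Font Format can be sketched as below. The constants (initial key 55665, multiplier 52845, increment 22719) and the four leading bytes come from that specification; the function names are hypothetical. This document does not say exactly which portions of the font are stored decrypted, so the sketch shows only the cipher itself, and it uses zero lead bytes for simplicity where a real encoder would choose them as the Type 1 specification requires.

```python
EEXEC_KEY = 55665       # initial key for the eexec-encrypted portion
C1, C2 = 52845, 22719   # cipher constants from the Type 1 font format


def eexec_decrypt(cipher: bytes, key: int = EEXEC_KEY, skip: int = 4) -> bytes:
    """Decrypt eexec data and drop the leading `skip` bytes."""
    r = key
    out = bytearray()
    for c in cipher:
        out.append(c ^ (r >> 8))
        r = ((c + r) * C1 + C2) & 0xFFFF
    return bytes(out[skip:])


def eexec_encrypt(plain: bytes, key: int = EEXEC_KEY,
                  lead: bytes = b"\x00\x00\x00\x00") -> bytes:
    """Encrypt plaintext, prefixing `lead` bytes (zeros here for simplicity)."""
    r = key
    out = bytearray()
    for p in lead + plain:
        c = p ^ (r >> 8)
        out.append(c)
        r = ((c + r) * C1 + C2) & 0xFFFF
    return bytes(out)


sample = b"dup /Private 8 dict dup begin"
encrypted = eexec_encrypt(sample)
print(eexec_decrypt(encrypted) == sample)
```

A writer would apply the decryption before placing the font in the Compact stream, and a reader would apply the encryption when restoring a standard Type 1 font.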