Extract various objects, such as images and "embedded file streams" (see PDF Reference section 3.10.3), into external files,
and optionally unembed images, fonts, and other large objects into external files.
Also, developers can easily extract arbitrary objects stripped of encryption and compression filters.
A new, valid PDF is produced that is smaller than the original because it points to the external files rather than embedding that data.
Extracted objects are checked against files in the current directory
and if an identical file is found, it is shared, rather than making a new file with the same data.
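The sharing check described above can be sketched as follows. This is only an illustration of the idea, not the tool's actual code: the class `SharedFileStore`, the method `storeShared`, and the rule for picking a fresh file name are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the tool's API. */
class SharedFileStore {
    /**
     * Returns an existing regular file in dir whose bytes equal data,
     * or writes data to a new file in dir and returns that file.
     */
    static Path storeShared(Path dir, String suggestedName, byte[] data) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path candidate : entries) {
                // Cheap size check first; compare bytes only when sizes match.
                if (Files.isRegularFile(candidate)
                        && Files.size(candidate) == data.length
                        && Arrays.equals(Files.readAllBytes(candidate), data)) {
                    return candidate; // identical data already on disk: share it
                }
            }
        }
        // No identical file found. Avoid overwriting a different file that
        // happens to use the suggested name (assumed naming rule).
        Path target = dir.resolve(suggestedName);
        for (int i = 1; Files.exists(target); i++) {
            target = dir.resolve(i + "-" + suggestedName);
        }
        return Files.write(target, data);
    }
}
```

Comparing sizes before contents keeps the common case cheap, since most files in a directory differ in length from the object being extracted.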
- embedded file streams
- obtain files, such as images or the original HTML if the PDF conversion embedded it
- edit the external files (for example, images in an image editor), and then re-Embed the changes
- reduce disk space by sharing common images and fonts among PDFs
java tool.pdf.Extract [options] PDF-file(s)
- Object selection
- -embed --
extract embedded files
- -font --
extract (unsubsetted) fonts
Note that text is sometimes constructed from vector drawings, not as text.
CFF fonts are not common as external files.
- -image --
extract images
Note that graphical shapes can be vector drawings, not images.
Note that images in PDF are often stored to match the PDF coordinate system, with Y coordinates going up,
so they appear upside down in other image viewers and editors.
- -streams --
extract all streams, including fonts, images, videos, and content command streams
- -page range --
extract objects from selected pages
Objects used by multiple pages are extracted only once.
- -min size --
minimum size of object that will be extracted.
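As the note under -image says, an extracted image may appear upside down in ordinary viewers. A minimal sketch of correcting this after extraction follows, assuming the image has already been decoded into a `BufferedImage` by other means. The class `ImageFlip` and the method `flipVertically` are hypothetical names and not part of the tool.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper for illustration; not part of the tool's API. */
class ImageFlip {
    /** Returns a copy of src with its rows in reverse order (top row becomes bottom row). */
    static BufferedImage flipVertically(BufferedImage src) {
        int width = src.getWidth();
        int height = src.getHeight();
        BufferedImage dst = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // Row y of the source lands on row (height - 1 - y) of the result.
                dst.setRGB(x, height - 1 - y, src.getRGB(x, y));
            }
        }
        return dst;
    }
}
```

If the image is edited and later re-embedded, it would presumably need to be flipped back first so that it again matches the orientation the PDF expects.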
- Write new PDF
- -edit --
write a new PDF file that points to extracted objects.
Extracted objects are written in a form that can be edited,
but which cannot be used directly by PDF, so the new PDF is invalid.
For example, image data may need a wrapper for some valid format.
After editing, Embed can follow the pointers in the PDF
and rewrite the extracted, edited objects to build a valid PDF.
- -share --
write a new valid PDF file that points to extracted objects.
Objects are in the form needed for direct use by PDF,
which may not be easily editable.
- -cache directory --
directory in which to place extracted files and objects.
For example, if splitting off fonts only, this directory could be the font library.
Default: current directory.
- -obj range --
just dump objects to stdout, unaffected by restrict / rewrite / cache.
For use by developers.
- -password password --
password if PDF is encrypted