PDF Explained
John Whitington
Copyright 2011 John Whitington
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles.
Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly Media, Inc. PDF Explained, the image of a lesser anteater, and related trade dress are trademarks of O'Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
O'Reilly Media
Preface
The Portable Document Format (PDF) is the world's leading page description language, and the first format equally useful for print and online use.
PDF documents are now almost ubiquitous in the printing industry, in document interchange, and in the online distribution of paginated content. They are, however, widely viewed as opaque and delicate and are poorly understood, even by those of a technical disposition.
This is partly due to a perplexing lack of documentation; the file format reference is freely available, but its size and complexity demand a time investment that most people working with PDF are unlikely to be able to make.
This book aims to be an approachable introduction. It is suitable both for the technically minded and for those who just want to understand a little of the PDF format to give context to their work with tools that produce or process PDF documents.
Who Should Read This Book
We've tried to write a book that serves as a general introduction, with some optional technical interludes, giving you the chance to type in example PDF files and see how they display.
This book is suitable for:
Adobe Acrobat users who want to understand the reasons behind the facilities it provides, rather than just how to use them. For example: encryption options, trim and crop boxes, and page labels.
Power users who want to use command-line software to process PDF documents in batches by merging, splitting, and optimizing them.
Programmers writing code to read, edit, or create PDF files.
Industry professionals in search, electronic publishing, and printing who want to understand how to use PDF's metadata and workflow features to build coherent systems.
Organization of Contents
In this chapter, we give a history of the PDF format and put it into context. We look at the advantages PDF has over similar technologies, introduce specialized kinds of PDF files such as PDF/X and PDF/A, and take a brief tour of the elements that make up a typical PDF document. We conclude by looking at how PDF is used in industry.
We begin in earnest, building a simple PDF file from scratch in a text editor. We show how to process this into a fully valid PDF and open it in a PDF viewer. We explain each component of the file, taking our first look at various parts of the PDF syntax.
In this chapter, we describe the layout and content of a PDF file, and the syntax of the objects from which it is built. We describe how a PDF document is read from a flat file into a structured format and, conversely, written from that structured format to a flat file.
In this chapter, we leave behind the bits and bytes of the PDF file, and consider the logical structure of its objects, describing how pages and their resources are arranged into a document.
We describe how to create vector graphics and raster images in PDF, and how to deal with transparency, color spaces, and patterns. We illustrate with examples, showing the code and the result in a PDF viewer.
In this chapter, we look at the PDF operators for building and showing text strings using different fonts and sizes, and how to build lines and paragraphs. We describe the different types of fonts and encodings in PDF documents, and how they are defined and used. We look at the process of text extraction from a PDF document.
Here, we discuss topics not directly related to the visual appearance of the document, but to ancillary data: bookmarks, metadata, hyperlinks, annotations, and file attachments. For each, we describe how they are defined in PDF and give examples.
We look at how encryption and document permissions work in PDF, and see how to inspect encryption information in Adobe Reader. We describe how programs which process PDF files read, write, and edit encrypted documents.
In this chapter, we show how to use the popular pdftk program for the command-line processing of PDF files, looking at common usage scenarios. We describe what a program such as pdftk has to do internally to achieve certain tasks (for example, merging or splitting documents).
Here, we describe both Adobe and open-source software for viewing, converting, editing, and programming with PDF files. We give sources of further documentation and other resources such as support and discussion forums.