Let's Make a PDF

Sometime in 2000 or 2001 I was charged with generating reports for daily distribution to all of our locations. The text was very dense and included barcodes, and our process at the time couldn't guarantee that the reports would look identical when printed. Why I thought PDF would be a good fit is a mystery to me now, but that's what I decided to use.

I settled on PDF version 1.2 for compatibility reasons (we still had some computers running Windows 3.1) and dove head-first into the spec. If you're reading this, chances are that you've done the same thing -- and have the same feelings about it.

Dealing with text, particularly font metrics and embedded fonts, like our barcode font, was a nightmare. Ultimately, I ended up making a "fake font" for both text and barcodes. The resulting files had no text, just simple lines and filled boxes. Files were a lot bigger than they needed to be, but they worked, and solved all of the problems with the earlier reports.

Today, it's an easy decision to just use one of the countless PDF libraries. Still, it's not always the best decision. PDFs are complicated, and PDF libraries are equally complex. Maybe you can't find a library that's a good fit or you just need something simple and don't want the extra dependency. Whatever your reasons, we'll make a few simple PDFs so that you can go and make your own.

A Very Basic PDF

PDFs, at a high level, are structurally simple. We have a header, some indirect objects, an indirect object index (xref), and a trailer.

Below is (almost) the simplest PDF that I know how to make. It's just a single blank page. We'll examine all the parts in more detail.

Note that line endings here are DOS-style (\r\n). If you're copying this into a text editor, the values in xref won't be correct if it outputs Unix-style (\n) endings. Also note that xref entries need to be exactly 20 bytes long, including the line ending. If you or your editor are outputting Unix-style endings, you should add a space or \r before the \n character of each xref entry.

Download example: blank.pdf

%PDF-1.7
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj

2 0 obj
<< /Type /Pages
/Kids [4 0 R]
/Count 1
/Resources << /ProcSet 3 0 R >>
>>
endobj

3 0 obj
[ /PDF /Text ]
endobj

4 0 obj
<< /Type /Page /Parent 2 0 R
/MediaBox [ 0 0 612 792 ]
>>
endobj

xref
0 5
0000000000 65535 f
0000000010 00000 n
0000000064 00000 n
0000000162 00000 n
0000000197 00000 n

trailer
<< /Size 5 /Root 1 0 R >>
startxref
277
%%EOF

The header is pretty simple. It's just a comment that identifies the file as a PDF of a particular version. We'll use 1.7, though just about everything here will work fine in much older versions.

%PDF-1.7

Following the header is the main body of the PDF. It contains all the indirect objects that comprise the document. Indirect objects start with an id number, followed by a generation number and the keyword 'obj'. Following 'obj' is the object itself, terminated by the keyword 'endobj'.

The first indirect object is the Catalog. This is required for all PDFs as it serves as the root object in the document's object hierarchy. It contains one object, a dictionary, that contains information about the PDF. A dictionary is just a set of key-value pairs, bracketed by << >>. In this case, the dictionary contains two entries, a type and an indirect object reference (id# gen# R) to the root Page Tree Node.

1 0 obj
  <<
    /Type /Catalog
    /Pages 2 0 R
  >>
endobj

The next object serves as the root Page Tree Node. It's the indirect object referenced in the Catalog. It contains only a Pages dictionary. This could not have been included in the Catalog directly, as that entry is required to be an indirect object reference.

The Type, Kids, and Count entries are required for all Page Tree Nodes. Every Page Tree Node except the root is also required to have a Parent entry, which is an indirect object reference to its parent node. This is the root, so the Parent entry is not included. Included instead is a Resources entry, which contains an indirect reference to the Procedure Set (explained below).

Kids contains this node's Child Nodes. This is an array of indirect object references which can refer to either other Page Tree Nodes or Page Objects. Arrays are denoted by square brackets ([ ]).

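Count is the number of Page Objects (leaf nodes) that descend from this node, which is not necessarily the number of entries in Kids. Here the two match because the only child is a single page.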

Resources contains a dictionary of named resources (other PDF objects) that are outside the current content stream, like fonts and images. Here, it contains only an indirect reference to the Procedure Set. Resources named here are inherited by the Page Tree Node's descendants.

2 0 obj
  <<
    /Type /Pages
    /Kids [4 0 R]
    /Count 1
    /Resources << /ProcSet 3 0 R >>
  >>
endobj

Next, we find the Procedure Set referenced earlier. It contains only a simple array. From Page 547 of the 1.7 Spec:

These procedure sets shall be used only when the content stream is printed to a PostScript output device. The names identify PostScript procedure sets that shall be sent to the device to interpret the PDF operators in the content stream.

Valid entries are PDF, Text, ImageB, ImageC, ImageI.

A Resources entry is required for all Page Objects, but it can be inherited from any ancestor. This is why I put a Resources entry in the root Page Tree Node. There's no reason to include a Resources entry on every page.

3 0 obj
  [ /PDF /Text ]
endobj

The following object is an actual Page. It contains only a Page Dictionary. Type, Parent, MediaBox, and Resources entries are required. Both Resources and MediaBox can be inherited.

Parent is an indirect object reference to the Page Object's Parent Node in the Page Tree. In our case, and in most simple cases, this will just be the page tree root.

MediaBox has a 4-element array that specifies a Rectangle in the default User Space Units. That Rectangle defines the size of the physical medium upon which the page is printed or displayed. Basically, it's the size of the page. By default, 1 Unit is 1/72 of an inch, or 72 units per inch. A Letter-size page, then, would be 8.5 × 72 by 11 × 72, or 612 by 792.

As MediaBox is heritable, you can include it in an ancestor node. This is handy when all of your pages are the same size.

The first coordinate pair represents the x, y position of the lower-left corner, and the second pair the upper-right corner. This means that, for a letter-size page, the coordinates of the upper-left corner are x:0, y:792, not x:0, y:0 like you'd expect.

4 0 obj
  <<
    /Type /Page
    /Parent 2 0 R
    /MediaBox [ 0 0 612 792 ]
  >>
endobj
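The unit and origin conventions above can be sketched in a few lines of Python. The helper names (inches_to_units, media_box, from_top_left) are made up for illustration, and the sketch assumes the default user space of 72 units per inch with no transforms applied.

```python
UNITS_PER_INCH = 72  # default user space: 1 unit = 1/72 inch


def inches_to_units(inches):
    """Convert a length in inches to default user space units."""
    return inches * UNITS_PER_INCH


def media_box(width_in, height_in):
    """Return a MediaBox array for a page of the given size in inches."""
    w = inches_to_units(width_in)
    h = inches_to_units(height_in)
    return f"[ 0 0 {w:g} {h:g} ]"


def from_top_left(x, y_from_top, page_height):
    """Map a point measured from the top-left corner to PDF coordinates,
    where the origin is the lower-left corner and y grows upward."""
    return x, page_height - y_from_top


print(media_box(8.5, 11))        # Letter-size page
print(from_top_left(0, 0, 792))  # top-left corner of that page
```

If you think in "distance from the top of the page", a helper like from_top_left keeps the flip in one place instead of scattering subtractions through your code.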

Next is the Cross Reference Table (xref), an index of all the indirect objects in the body of our PDF. This section starts with a line containing nothing but the keyword 'xref'. The next line begins an xref subsection. The first number is the first object number in the table. (Object numbers in the subsection must be contiguous.) The second number is the number of entries in the subsection. It's unlikely that you'll need more than one subsection.

Each entry in the subsection must be on its own line and is exactly 20 bytes wide, including the end-of-line marker. An entry contains three values separated by single spaces. The first is a zero-padded 10-digit offset to the object, measured either from the beginning of the file (what we'll be using) or from the beginning of the decoded stream. The second is a zero-padded 5-digit Generation Number (for us, this will always be zero). The third is an f (meaning free) or an n (meaning in-use). As each entry must be 20 bytes wide, the end-of-line marker needs to be 2 bytes long: either a space followed by \n, or \r\n.
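As a minimal sketch of that layout, here is one way to format an entry in Python. The function name xref_entry is hypothetical, and the sketch always uses \r\n as the 2-byte end-of-line marker.

```python
def xref_entry(offset, generation=0, in_use=True):
    """Format one cross-reference entry as exactly 20 bytes."""
    flag = b"n" if in_use else b"f"
    # 10-digit offset, space, 5-digit generation, space, flag, 2-byte EOL
    entry = b"%010d %05d %s\r\n" % (offset, generation, flag)
    assert len(entry) == 20, "xref entries must be exactly 20 bytes"
    return entry


print(xref_entry(10))                       # an in-use object at offset 10
print(xref_entry(0, 65535, in_use=False))   # the entry for object 0
```

Hard-coding \r\n here means the entries stay 20 bytes wide even if the rest of the file is written with \n line endings.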

Looking at our example, the subsection starts with 0 5. That means the first entry in the table will be object 0 and there will be 5 total entries, one for object numbers 0, 1, 2, 3, and 4 in that order.

The first subsection entry refers to object number 0. Object number 0, which is conspicuously absent in the document body, is reserved for the linked list of 'free' objects. (For free objects, the 10-digit offset is used to refer to the next free object number. As there are no other free objects, and the last free object must point to object 0, it points to itself.) Don't worry about that too much, as it's very likely not going to apply to you. Just remember that the first entry should be 0000000000 65535 f.

The following entry should point to object number 1, the next object in the sequence. It should be located at address 10 from the start of the file. A quick check with a hex editor, or some careful counting, will show that object number 1 starts at address 10. The same is true for the remaining entries.

Be aware that many PDF viewers will try to silently rebuild the xref table if it's incorrect, so you might not see any errors when you test your own PDFs. Be sure to check the offsets you're generating with a PDF validator (or manually with a hex editor) to make sure that you're creating the xref table correctly.

xref
0 5
0000000000 65535 f
0000000010 00000 n
0000000064 00000 n
0000000162 00000 n
0000000197 00000 n
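If you'd rather not count bytes by hand, a small checker can do it. The following is a rough sketch, not a real validator: check_xref is a made-up name, and it assumes a file like the ones here, with one classic xref table containing a single subsection (no incremental updates, no xref streams).

```python
import re


def check_xref(pdf):
    """Return a list of problems found in a classic xref table."""
    m = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", pdf)
    if not m:
        return ["no startxref / %%EOF found"]
    pos = int(m.group(1))
    if pdf[pos:pos + 4] != b"xref":
        return [f"startxref {pos} does not point at 'xref'"]

    # Subsection header: first object number and entry count.
    header = re.match(rb"xref\s+(\d+) (\d+)\s+", pdf[pos:])
    if not header:
        return ["malformed xref subsection header"]
    first, count = int(header.group(1)), int(header.group(2))
    table = pos + header.end()

    problems = []
    for i in range(count):
        number = first + i
        entry = pdf[table + 20 * i: table + 20 * (i + 1)]
        e = re.fullmatch(rb"(\d{10}) (\d{5}) ([nf])( \r| \n|\r\n)", entry)
        if not e:
            problems.append(f"entry {number} is not a valid 20-byte entry")
            continue
        if e.group(3) == b"n":
            offset, gen = int(e.group(1)), int(e.group(2))
            # An in-use entry must point at "<number> <generation> obj".
            if not pdf.startswith(b"%d %d obj" % (number, gen), offset):
                problems.append(f"object {number}: offset {offset} is wrong")
    return problems
```

Reading a file in binary mode and passing the bytes in, for example check_xref(open("blank.pdf", "rb").read()), should give an empty list when every offset lines up.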

Following the Cross Reference Table is a trailer. The trailer lets a PDF reader quickly find the xref table and the object that holds the Catalog Dictionary. The first entry is a dictionary that tells the reader how many entries are in the xref table, and a reference to the Catalog.

This is followed, on its own line, by the keyword 'startxref'. The next line contains only the offset of the xref table from the beginning of the file.

The last line of the file, the very last 5 bytes, must be '%%EOF'. Don't add an end-of-line marker.

trailer
<<
  /Size 5
  /Root 1 0 R
>>
startxref
277
%%EOF
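Putting the pieces together, here is a minimal sketch of a writer that records each object's offset as it goes, so the xref table and startxref never have to be counted by hand. The function build_pdf and its parameters are my own invention for illustration; it assumes every object has generation 0 and that objects are numbered 1, 2, 3, and so on in file order.

```python
def build_pdf(objects, root=1, eol=b"\r\n"):
    """Assemble a PDF from a list of object bodies (bytes).

    objects[0] becomes object 1, objects[1] object 2, and so on.
    Returns the file bytes, the recorded offsets, and the xref position.
    """
    out = bytearray(b"%PDF-1.7" + eol)
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj" % number + eol + body + eol + b"endobj" + eol + eol

    xref_pos = len(out)
    out += b"xref" + eol + b"0 %d" % (len(objects) + 1) + eol
    out += b"0000000000 65535 f\r\n"           # object 0, head of the free list
    for offset in offsets:
        out += b"%010d 00000 n\r\n" % offset   # always exactly 20 bytes
    out += eol + b"trailer" + eol
    out += b"<< /Size %d /Root %d 0 R >>" % (len(objects) + 1, root) + eol
    out += b"startxref" + eol + b"%d" % xref_pos + eol + b"%%EOF"
    return bytes(out), offsets, xref_pos


# The four objects of the blank-page example, with DOS-style line endings.
CRLF = b"\r\n"
blank_objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    CRLF.join([b"<< /Type /Pages", b"/Kids [4 0 R]", b"/Count 1",
               b"/Resources << /ProcSet 3 0 R >>", b">>"]),
    b"[ /PDF /Text ]",
    CRLF.join([b"<< /Type /Page /Parent 2 0 R",
               b"/MediaBox [ 0 0 612 792 ]", b">>"]),
]

pdf, offsets, xref_pos = build_pdf(blank_objects)
print(offsets, xref_pos)  # [10, 64, 162, 197] 277
```

Because the offsets are measured rather than typed in, switching eol to b"\n" still produces a consistent file; only the numbers change.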

Adding Graphics

Vector graphics are easy to add, being composed of simple primitives. We'll create a Content Stream and add a Contents entry to our Page dictionary.

Download example: graphics.pdf

%PDF-1.7
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj

2 0 obj
<< /Type /Pages
/Kids [3 0 R]
/Count 1
/Resources << /ProcSet [ /PDF /Text ] >>
>>
endobj

3 0 obj
<< /Type /Page
/Parent 2 0 R
/MediaBox [ 0 0 612 792 ]
/Contents 4 0 R
>>
endobj

4 0 obj
<< /Length 23 >>
stream
100 512 m
200 412 l
S
endstream
endobj

xref
0 5
0000000000 65535 f
0000000010 00000 n
0000000064 00000 n
0000000171 00000 n
0000000269 00000 n

trailer
<< /Size 5 /Root 1 0 R >>
startxref
350
%%EOF

Looking at indirect object 3, you'll see that we've added a Contents entry to the Page dictionary. In this case, it contains only a single reference to indirect object 4, a Content Stream containing all of the page content. You can also use an array to include multiple content stream objects.

3 0 obj
<< /Type /Page
/Parent 2 0 R
/MediaBox [ 0 0 612 792 ]
/Contents 4 0 R
>>
endobj

Indirect object 4 is a Content Stream. A stream begins with a dictionary, which here contains a single entry, Length. Length is the exact number of bytes of stream content between the 'stream' and 'endstream' keywords, excluding the end-of-line marker that follows 'stream' and the one that precedes 'endstream'.

Graphics are composed using primitive instructions made up of operators and operands. Operands precede their operator. If you've used languages like Forth or PostScript, this will be familiar to you. Operands can be of any type except stream. Dictionaries can be used, but only with certain operators. Indirect object references are prohibited.

In our example, we draw a single line using three instructions. The first two instructions compose a "path" which can later be "painted". The third instruction paints the path.

The first instruction, operator 'm', sets the current point to some position, which begins a new subpath. It takes two operands, an x,y coordinate pair. The second instruction, 'l', adds a line to our path from the current point to the point specified by its two operands. 'S' strokes the path according to the current Graphics State.

A complete list of path and path-painting operators can be found in Tables 59 and 60 of the 1.7 spec on pages 132 and 135. Operators that change the graphics state can be found in Table 57 of the 1.7 spec on page 127.

4 0 obj
<< /Length 23 >>
stream
100 512 m
200 412 l
S
endstream
endobj
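The Length bookkeeping and the m / l / S sequence are easy to generate. Here is a small sketch; stream_object and polyline are hypothetical helper names, and the stream is left uncompressed.

```python
def stream_object(lines, eol=b"\r\n"):
    """Wrap content-stream instructions in a stream object body.

    Length counts only the bytes between the EOL after 'stream'
    and the EOL before 'endstream'.
    """
    data = eol.join(lines)
    return (b"<< /Length %d >>" % len(data) + eol
            + b"stream" + eol + data + eol + b"endstream")


def polyline(points):
    """Path instructions: move to the first point, line to the rest, stroke."""
    (x0, y0), *rest = points
    # %g keeps six significant digits, plenty for page coordinates.
    ops = [b"%g %g m" % (x0, y0)]
    ops += [b"%g %g l" % (x, y) for x, y in rest]
    ops.append(b"S")
    return ops


body = stream_object(polyline([(100, 512), (200, 412)]))
print(body.decode("ascii"))
```

With the two points from the example, the computed Length comes out to 23, matching the listing.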

Adding Text

Text is a lot more complicated than graphics, but it's not too bad if we stick to the Standard 14 fonts. We need to create a font object and name it in the Resources dictionary (either for the page or an ancestor).

Download example: text.pdf

%PDF-1.7
1 0 obj
<< /Type /Catalog
/Pages 2 0 R
>>
endobj

2 0 obj
<< /Type /Pages
/Kids [4 0 R]
/Count 1
/MediaBox [ 0 0 612 792 ]
/Resources << /ProcSet [ /PDF /Text ] /Font << /F1 3 0 R >> >>
>>
endobj

3 0 obj
<< /Type /Font
/Subtype /Type1
/BaseFont /Helvetica
>>
endobj

4 0 obj
<< /Type /Page
/Parent 2 0 R
/Contents 5 0 R
>>
endobj

5 0 obj
<< /Length 49 >>
stream
BT
/F1 24 Tf
100 700 Td
( Hello World ) Tj
ET
endstream
endobj

xref
0 6
0000000000 65535 f
0000000010 00000 n
0000000066 00000 n
0000000222 00000 n
0000000300 00000 n
0000000371 00000 n

trailer
<< /Size 6
/Root 1 0 R
>>
startxref
478
%%EOF

The Font dictionary we added to resources can contain more than one entry. Entries do not need to be sequential.

/Resources
<< /ProcSet [ /PDF /Text ]
/Font << /F1 3 0 R >> 
>>

The font object referenced in our font dictionary is very simple when using the base 14 fonts. These font objects vary only in the name. You can replace Helvetica with any of the following: Times-Roman, Times-Bold, Times-Italic, Times-BoldItalic, Helvetica, Helvetica-Bold, Helvetica-Oblique, Helvetica-BoldOblique, Courier, Courier-Bold, Courier-Oblique, Courier-BoldOblique, ZapfDingbats, Symbol.

3 0 obj
<< /Type /Font
/Subtype /Type1
/BaseFont /Helvetica
>>
endobj

In our content stream, text works similarly to graphics, but uses a different set of operators. Here, we use BT and ET to begin and end a text object, Tf to set the font and size, Td to position the text, and Tj to draw a text string. You can find a complete list in tables 107, 108, and 109 on pages 248-251 of the 1.7 Spec.

stream
BT
/F1 24 Tf
100 700 Td
( Hello World ) Tj
ET
endstream
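One detail the example sidesteps: inside a literal string, a backslash or an unbalanced parenthesis has to be escaped with a backslash. A minimal sketch of generating the instructions above, with made-up helper names (pdf_string, text_block) and the assumption that the text is plain ASCII:

```python
def pdf_string(text):
    """Encode text as a PDF literal string, escaping backslashes
    and parentheses. Assumes ASCII-only text."""
    raw = text.encode("ascii")
    raw = raw.replace(b"\\", b"\\\\")  # escape backslashes first
    raw = raw.replace(b"(", b"\\(").replace(b")", b"\\)")
    return b"(" + raw + b")"


def text_block(font, size, x, y, text):
    """Instructions that show one string using Td for positioning."""
    return [
        b"BT",
        b"/%s %g Tf" % (font.encode("ascii"), size),
        b"%g %g Td" % (x, y),
        pdf_string(text) + b" Tj",
        b"ET",
    ]


lines = text_block("F1", 24, 100, 700, " Hello World ")
print(b"\r\n".join(lines).decode("ascii"))
```

Balanced parentheses are allowed unescaped, but escaping every one is always safe and saves you from tracking nesting.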

Unfortunately, Td doesn't work the way you'd expect. From page 252 of the 1.7 Spec:

Move to the start of the next line, offset from the start of the current line by (tx , ty ). tx and ty shall denote numbers expressed in unscaled text space units.

This makes positioning text on a page simple in one specific case, but much more complicated for things like forms and reports. For that, we'll need to use the Tm command to set the text matrix. You can find a description of text space in section 9.4.4 on page 252 of the 1.7 Spec.

For our purposes, we just need a simple method to position text. You can, and should, use the Td, ', ", and T* operators for blocks of evenly spaced text to reduce the size of the file.

Download example: text2.pdf

5 0 obj
<< /Length 97 >>
stream
BT
/F1 24 Tf
1 0 0 1 100 700 Tm
( Hello World ) Tj
1 0 0 1 150 670 Tm
( Hello World ) Tj
ET
endstream
endobj

Briefly, the text matrix is a 3x3 matrix that defines an affine transform. The six parameters passed to Tm represent the first two columns of the matrix; the last column is implied.

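General form, written as a b c d e f Tm:

$$\begin{bmatrix} a & b & 0 \\ c & d & 0 \\ e & f & 1 \end{bmatrix}$$

Simple positioning at x, y, written as 1 0 0 1 x y Tm:

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ x & y & 1 \end{bmatrix}$$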
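As a sketch of how you might generate Tm instructions, the helper below fills in the six operands for a position plus an optional rotation and uniform scale. The name text_matrix and its parameters are hypothetical; the rotation form (cos, sin, -sin, cos) is the standard counterclockwise rotation for PDF transformation matrices, and goes a little beyond the plain positioning used in the example.

```python
import math


def text_matrix(x, y, angle_deg=0.0, scale=1.0):
    """Return a Tm instruction placing text at (x, y).

    The six operands a b c d e f fill the first two columns of
    [[a, b, 0], [c, d, 0], [e, f, 1]].
    """
    t = math.radians(angle_deg)
    a = scale * math.cos(t)
    b = scale * math.sin(t)
    c = -scale * math.sin(t)
    d = scale * math.cos(t)
    # Round away floating-point noise; adding 0.0 turns -0.0 into 0.0
    # so it is never written out as "-0".
    nums = [round(v, 6) + 0.0 for v in (a, b, c, d, x, y)]
    return b"%g %g %g %g %g %g Tm" % tuple(nums)


print(text_matrix(100, 700))                # plain positioning
print(text_matrix(150, 670, angle_deg=90))  # text running up the page
```

For forms and reports you will mostly call this with just x and y, which yields the 1 0 0 1 x y Tm form.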

Finding Text Width

If you need to do any sort of typesetting, you're going to need a function that accepts a string and returns the width in user space units for a given font size. Functions like that can get complicated quickly. However, a simple function that just sums the widths of each character should be adequate for most uses.

Adobe provides AFM files for the Base 14 Fonts under very permissive terms:

This file and the 14 PostScript(R) AFM files it accompanies may be used, copied, and distributed for any purpose and without charge, with or without modification, provided that all copyright notices are retained; that the AFM files are not distributed without this file; that all modifications to this file or any of the AFM files are prominently noted in the modified file(s); and that this paragraph is not modified. Adobe Systems has no responsibility or obligation to support the use of the AFM files.

You can download the AFM files from Adobe (Core14_AFMs.zip) or from here (Core14_AFMs.zip).

From page 11 of the Adobe Font Metrics File Format Specification Version 4.1:

All measurements in AFM, AMFM, and ACFM files are given in terms of units equal to 1/1000 of the scale factor (point size) of the font being used. To compute actual sizes in a document (in points; with 72 points = 1 inch), these amounts should be multiplied by (scale factor of font) / 1000.

The coordinate systems in which these units exist is defined by convention. For instance, the origin for roman characters is on the baseline, a little to the left of the character, and the x-axis runs along the baseline.
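Here is a rough sketch of the "just sum the widths" approach. The function names are made up, and the sample lines are trimmed to the two fields the parser reads (real AFM character-metric lines carry more, such as a bounding box). The widths shown are what I believe Helvetica.afm lists; load the real file rather than trusting them.

```python
# Trimmed character-metric lines in AFM style: code (C), width (WX), name (N).
SAMPLE_AFM = """\
C 32 ; WX 278 ; N space ;
C 72 ; WX 722 ; N H ;
C 101 ; WX 556 ; N e ;
C 108 ; WX 222 ; N l ;
C 111 ; WX 556 ; N o ;
"""


def parse_afm_widths(afm_text):
    """Map character code -> advance width (in 1/1000 of the font size)."""
    widths = {}
    for line in afm_text.splitlines():
        if not line.startswith("C "):
            continue  # not a character-metric line
        fields = {}
        for part in line.split(";"):
            tokens = part.split()
            if len(tokens) >= 2:
                fields[tokens[0]] = tokens[1]
        code = int(fields["C"])
        if code >= 0 and "WX" in fields:  # code -1 means "not encoded"
            widths[code] = float(fields["WX"])
    return widths


def string_width(text, font_size, widths):
    """Sum the advance widths and scale by font_size / 1000.

    Ignores kerning, character spacing, and word spacing, and raises
    KeyError for characters missing from the table.
    """
    total = sum(widths[ord(ch)] for ch in text)
    return total * font_size / 1000


widths = parse_afm_widths(SAMPLE_AFM)
print(string_width("Hello", 24, widths))
```

With a width in hand, right-aligning text is just a matter of subtracting it from the x position of the right edge before emitting Td or Tm.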




Home - October 2017 - Last modified: October 2017