These days, with everything coding related getting simplified, abstracted, and dummified, there are APIs for everything, and they are just a google search away. But sometimes, one has to delve into the innards of protocols, file structures and the like……
This was such an instance. We had worked on some preexisting code which inserted an image and the text recognized from it via OCR into a searchable PDF. It used the libharu – an excellent open source library. This worked well till the day UTF8 characters needed to be displayed. A patch later, and with some changes in the way utf8 characters were read and written, this was working beautifully. Or so we thought.
While the pdf file got created nicely, and users could actually search for characters, select them, copy and paste them with true reproduction, a problem happened when external programs tried to extract text from the file.
Why were programs able to display the UTF8 text properly, but not able to extract it? After stepping through the code, it slowly became clear that the only way was to dive into the PDF file format.
We found out interesting things in the way and put it down here in the hope that it is helpful for someone else.
The best way to get introduced to the pdf syntax is to create a minimal pdf file and open it in an editor such as Notepad++. This is easily accomplished
In my machine, the size is around 80kb, and it should be similar in other machines also. Based on the version of PDF, the size may change.
Open this in Notepad++. The first thing that you notice is that it is text – mostly J. Also, if you feel that there is some structure behind in the way it is organized, you are right. But before we step into the structure of the pdf, let us take a look at the basic building blocks of the PDF Syntax.
There are different types of objects, here are the important ones.
Numeric objects are written with digits and the period sign. Objects are represented as so 1 0 24. This consists of three tokens separated by space. The first number is the id of the object and this handle is used to refer to the object elsewhere. The second number is called the generation number of the object. This generation number is significant but that will be explained later in the document. The third token is the object itself. In this case it is a string.
String objects are any character enclosed within the ( and ) signs. For example (Hello World). This would be represented as so
2 0 (Hello World)
Since this is the same string that we entered into the word document, it must be somewhere in the document right? However, a search for it yields nothing. There is no need to get frustrated with pdf and stop looking though. The reason for this can be one of two
Names are PDF’s way of representing tokens. Names start with the / sign. Names are used to refer to objects by names, and are used mainly in dictionaries for representing indirect objects. A name can be used to direct to the object created above as in
/String1 2 0 R
The R indicates that this is an indirect object called /String1 which is referring to the object with id 2 and generation number 0.
Dictionaries are hashtables, and are represented as collections of Key Value pairs. Each entry in a dictionary consists of two items a Key and a value. The key as mentioned has to be a name. The value can be a direct object such as String, integer or other Dictionary, or it can be an indirect object referred to as a Name. An example of a dictionary is
/ADirectString (Hello World)
The dictionary is enclosed within << and >> signs. Two tokens taken together makes an entry. This dictionary has 3 entries – First entry points to a string defined elsewhere, whereas the second entry has an integer and the third entry has a string.
Arrays are a set of objects arranged sequentially [ 21 30 HelloWorld ] is an array with three elements 21, 30 and HelloWorld. How can we change the third element to Hello World? Since space is considered a delimiter, wouldn’t it be considered as two characters? As you have inferred, the solution is to represent the string HelloWorld using a Name and then referring to the string using the name in the array.
Finally, the Stream element is a mechanism of representing any object. It consists of multiple lines.
The first line looks like 10 0 obj 10 is the id, 0 is the generation number and obj indicates that this is an object.
This is followed by a line with an optional Filter. A filter is the name of an algorithm which should be applied to the following stream to get the data in the stream. FlateDecode is an example of a filter, it indicates that the stream is actually compressed, and it decompresses the stream that follows.
This is followed by a line with the single stream keyword.
Following this, the bytes of the stream are seen. At the end of the stream, the endstream keyword on a separate line is present, indicating the end of the stream.
The diagram below (taken from Adobe file format reference – http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf) depicts the structure.
The header consists of one line identifying the version of pdf. In our file, it is %PDF-1.5. The % sign at the beginning of a line indicates a comment. The second line is a comment too, to indicate to editors that this is a binary file (with feelings and not to be trifled with!).
Following this, is the body of the file. The body consists of various objects represented using the syntax mentioned above.
The Cross reference table represented by xref key word is like a table of contents. It serves many purposes. First of all, it gives order and structure to the document. A PDF reader first consults the xref table to parse and display the objects in the document. An Xref table consists of many sections. Each section has a header consisting of two integers. The first integer gives the id of the first element of that section, the second integer gives the number of objects in the section.
The xref table of HelloWorld.pdf is given below
0000000008 65535 f
0000000017 00000 n
0000000124 00000 n
0000000180 00000 n
0000000410 00000 n
0000000620 00000 n
0000000788 00000 n
0000001027 00000 n
0000000009 65535 f
0000000010 65535 f
0000000011 65535 f
0000000012 65535 f
0000000013 65535 f
0000000000 65535 f
0000001504 00000 n
0000001709 00000 n
0000079732 00000 n
It has only one section. The header mentions that the first element of this table has an id of 0. This will be followed by 17 elements with consecutive ids.
For each element, there is an entry with two numbers and a character (either n or f). The first number mentions the location of the object. The second number is the generation number of the object. The generation number concept is explained below. The third token designates whether this object is in use (n) or free (f)
Xref tables are also a mechanism for incremental updates. When an object is deleted, it is not actually deleted from the file. Instead, it’s entry in xref table is marked as free. When a new object is created, instead of assigning a new id, an existing id of a free object can be used. In this case, the new object will share the id with the old object, but the generation number will change. The new object will be appended to the file, so that offsets of existing items are not changed.
The advantage of incremental updates is that it can be fast, and involve little data transfer if it is over a slow connection. On the other hand, the pdf file keeps on growing as objects get deleted and new objects get added.
The Trailer is the last section of the file. A conforming pdf reader should start reading the document from the bottom of the file ie the trailer. The trailer is a dictionary which gives information about the file, especially the root element and the offset of the xref table.
How does a conforming PDF reader parse and display the file? For that to happen to happen smoothly, Objects in a PDF file are organized in a known hierarchy called a document structure. This is shown in the figure below.
This figure looks complicated – but not all parts are not equally important.
The Document structure of a pdf document can best be viewed by using a program such as PDFExplorer by O2 Solutions. It is free, so download it, fire it up and open HelloWorld.pdf with it.
It can be seen that the Catalog is the root element of the document, a PDFReader locates it by the /Root element in the trailer dictionary. The document Catalog element has many children, but the Pages (Page tree in the figure above) is the element of interest. The Pages element has Kids element, which is an array of pages. Expand Kids element. Locate the Contents element under it. The Contents element can contain many elements under it. In the case of the HelloWorld document, it consists of a single stream object. Click on this. The PDFExplorer applies the FlateDecode filter on it and displays the resulting decompressed text. This consists of multiple Text operators.
The first one BT indicates that display of text is about to begin.
The second one Tf indicates that the font should be set to /F1 with size of 11. What is /F1? It can be located under Resources child of Kids. This font is embedded into the file. But more of this later.
The second operator Tm refers to the Text matrix operator. This is not very clear, only thing that I understood is that it defines the coordinate system for the coordinates that follow.
This is followed by the TJ operator, which is to show the text string. The arguments to this is an array of text strings. [(H)3(ello)-3( )9(Wo)-6(rld)]. The meaning of the numbers is not clear.
This is followed by the ET operator, which is an instruction to end display of texts.
A Pdf reader locates the root Catalog element from the Trailer, looks up its location from the xref table, finds the Pages child of the Catalog element. It locates the Kids child of Pages, which is an array. It then iterates through the elements of this array. For each of the elements, it locates the Contents element in it, gets the children which could be text or Graphics operators and executes each of them.
Please note that each operation is represented as a postfix expression. The arguments appear first and the operation is the last token in the line.
The extraction algorithm is similar. It iterates over the document structure exactly as for display. But instead of displaying the text, it extracts it and writes it out to the destination file (if there is one).
However, there is one significant difference between Display and extraction. This has to do with Fonts.
A font is used to display characters on screen, and in pdf to also interpret encodings of strings.
A character is displayed using an image called a glyph (A very simplistic way of thinking about it, but it conveys the idea).
A font consists of a set of glyphs. It consists of a map called a CIDToGIDMap which maps character ids to Glyph ids. This is used for displaying fonts. As the pdf reader reads a string, each character is looked up in the CIDToGidMap and the corresponding glyph is displayed.
Conceptually, this is what happens – of course it is much more complicated than that. For example, storing each glyph as an image would take a lot of space, so it is usually stored as a vector. Similarly, font size, spacing, italics, bold etc need to be handled. However, it is easy to ignore those aspects and look at it this way.
A font also consists of a ToUnicodeMap. The ToUnicodeMap maps characters in the input string to UTF8 code points.
The following links give more information.
http://www.joelonsoftware.com/articles/Unicode.html gives an excellent introduction to character sets and encoding.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf explains the ToUnicode syntax.
http://www.4xpdf.com/2010/03/technical-background-to-pdf-font-options/ gives and extremely good overview of how fonts are displayed.