Blog

PDF file formats and CMAPs

These days, with everything coding related getting simplified, abstracted, and dummified, there are APIs for everything, and they are just a google search away. But sometimes, one has to delve into the innards of protocols, file structures and the like……

This was such an instance. We had worked on some preexisting code which inserted an image and the text recognized from it via OCR into a searchable PDF. It used the libharu – an excellent open source library. This worked well till the day UTF8 characters needed to be displayed. A patch later, and with some changes in the way utf8 characters were read and written, this was working beautifully. Or so we thought.

While the pdf file got created nicely, and users could actually search for characters, select them, copy and paste them with true reproduction, a problem happened when external programs tried to extract text from the file.

Why were programs able to display the UTF8 text properly, but not able to extract it? After stepping through the code, it slowly became clear that the only way was to dive into the PDF file format.

We found out interesting things in the way and put it down here in the hope that it is helpful for someone else.

First things First

The best way to get introduced to the pdf syntax is to create a minimal pdf file and open it in an editor such as Notepad++. This is easily accomplished

Fire up Microsoft Word. Create a new document. Type a small string “Hello World” in it. Choose File/Save as…, and in the resulting dialog choose pdf file format. I saved it as Helloworld.pdf.

In my machine, the size is around 80kb, and it should be similar in other machines also. Based on the version of PDF, the size may change.

Open this in Notepad++. The first thing that you notice is that it is text – mostly J. Also, if you feel that there is some structure behind in the way it is organized, you are right. But before we step into the structure of the pdf, let us take a look at the basic building blocks of the PDF Syntax.

There are different types of objects, here are the important ones.

Numeric objects are written with digits and the period sign. Objects are represented as so 1 0 24. This consists of three tokens separated by space. The first number is the id of the object and this handle is used to refer to the object elsewhere. The second number is called the generation number of the object. This generation number is significant but that will be explained later in the document. The third token is the object itself. In this case it is a string.

String objects are any character enclosed within the ( and ) signs. For example (Hello World). This would be represented as so

2 0 (Hello World)

Since this is the same string that we entered into the word document, it must be somewhere in the document right? However, a search for it yields nothing. There is no need to get frustrated with pdf and stop looking though. The reason for this can be one of two

Some PDF writers (in our case Microsoft Word) compress the strings before storing. This is what has happened in our case. There is an easy way to view these strings, which I shall explain later in the doc.
Some PDF writers store strings in Hex format. These strings are enclosed within < and >. For example Hello World would be represented as <48656c6c6f20576f726c64> . In this case, 48 would represent H (48 is hex for 72 which is the ASCII Code for H) and so on.

Names are PDF’s way of representing tokens. Names start with the / sign. Names are used to refer to objects by names, and are used mainly in dictionaries for representing indirect objects. A name can be used to direct to the object created above as in

/String1 2 0 R

The R indicates that this is an indirect object called /String1 which is referring to the object with id 2 and generation number 0.

Dictionaries are hashtables, and are represented as collections of Key Value pairs. Each entry in a dictionary consists of two items a Key and a value. The key as mentioned has to be a name. The value can be a direct object such as String, integer or other Dictionary, or it can be an indirect object referred to as a Name. An example of a dictionary is

<</AnIndirectstring /String1
/AnInt 23
/ADirectString (Hello World)
>>

The dictionary is enclosed within << and >> signs. Two tokens taken together makes an entry. This dictionary has 3 entries – First entry points to a string defined elsewhere, whereas the second entry has an integer and the third entry has a string.

Arrays are a set of objects arranged sequentially [ 21 30 HelloWorld ] is an array with three elements 21, 30 and HelloWorld. How can we change the third element to Hello World? Since space is considered a delimiter, wouldn’t it be considered as two characters? As you have inferred, the solution is to represent the string HelloWorld using a Name and then referring to the string using the name in the array.

Finally, the Stream element is a mechanism of representing any object. It consists of multiple lines.

The first line looks like 10 0 obj 10 is the id, 0 is the generation number and obj indicates that this is an object.

This is followed by a line with an optional Filter. A filter is the name of an algorithm which should be applied to the following stream to get the data in the stream. FlateDecode is an example of a filter, it indicates that the stream is actually compressed, and it decompresses the stream that follows.

This is followed by a line with the single stream keyword.

Following this, the bytes of the stream are seen. At the end of the stream, the endstream keyword on a separate line is present, indicating the end of the stream.

File Structure

The diagram below (taken from Adobe file format reference – http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf) depicts the structure.

The header consists of one line identifying the version of pdf. In our file, it is %PDF-1.5. The % sign at the beginning of a line indicates a comment. The second line is a comment too, to indicate to editors that this is a binary file (with feelings and not to be trifled with!).

Following this, is the body of the file. The body consists of various objects represented using the syntax mentioned above.

The Cross reference table represented by xref key word is like a table of contents. It serves many purposes. First of all, it gives order and structure to the document. A PDF reader first consults the xref table to parse and display the objects in the document. An Xref table consists of many sections. Each section has a header consisting of two integers. The first integer gives the id of the first element of that section, the second integer gives the number of objects in the section.

The xref table of HelloWorld.pdf is given below

xref

0 17

0000000008 65535 f

0000000017 00000 n

0000000124 00000 n

0000000180 00000 n

0000000410 00000 n

0000000620 00000 n

0000000788 00000 n

0000001027 00000 n

0000000009 65535 f

0000000010 65535 f

0000000011 65535 f

0000000012 65535 f

0000000013 65535 f

0000000000 65535 f

0000001504 00000 n

0000001709 00000 n

0000079732 00000 n

It has only one section. The header mentions that the first element of this table has an id of 0. This will be followed by 17 elements with consecutive ids.

For each element, there is an entry with two numbers and a character (either n or f). The first number mentions the location of the object. The second number is the generation number of the object. The generation number concept is explained below. The third token designates whether this object is in use (n) or free (f)

Xref tables are also a mechanism for incremental updates. When an object is deleted, it is not actually deleted from the file. Instead, it’s entry in xref table is marked as free. When a new object is created, instead of assigning a new id, an existing id of a free object can be used. In this case, the new object will share the id with the old object, but the generation number will change. The new object will be appended to the file, so that offsets of existing items are not changed.

The advantage of incremental updates is that it can be fast, and involve little data transfer if it is over a slow connection. On the other hand, the pdf file keeps on growing as objects get deleted and new objects get added.

The Trailer is the last section of the file. A conforming pdf reader should start reading the document from the bottom of the file ie the trailer. The trailer is a dictionary which gives information about the file, especially the root element and the offset of the xref table.

Document structure

How does a conforming PDF reader parse and display the file? For that to happen to happen smoothly, Objects in a PDF file are organized in a known hierarchy called a document structure. This is shown in the figure below.

This figure looks complicated – but not all parts are not equally important.

The Document structure of a pdf document can best be viewed by using a program such as PDFExplorer by O2 Solutions. It is free, so download it, fire it up and open HelloWorld.pdf with it.

It can be seen that the Catalog is the root element of the document, a PDFReader locates it by the /Root element in the trailer dictionary. The document Catalog element has many children, but the Pages (Page tree in the figure above) is the element of interest. The Pages element has Kids element, which is an array of pages. Expand Kids[0] element. Locate the Contents element under it. The Contents element can contain many elements under it. In the case of the HelloWorld document, it consists of a single stream object. Click on this. The PDFExplorer applies the FlateDecode filter on it and displays the resulting decompressed text. This consists of multiple Text operators.

The first one BT indicates that display of text is about to begin.

The second one Tf indicates that the font should be set to /F1 with size of 11. What is /F1? It can be located under Resources child of Kids[0]. This font is embedded into the file. But more of this later.

The second operator Tm refers to the Text matrix operator. This is not very clear, only thing that I understood is that it defines the coordinate system for the coordinates that follow.

This is followed by the TJ operator, which is to show the text string. The arguments to this is an array of text strings. [(H)3(ello)-3( )9(Wo)-6(rld)]. The meaning of the numbers is not clear.

This is followed by the ET operator, which is an instruction to end display of texts.

Display algorithm

A Pdf reader locates the root Catalog element from the Trailer, looks up its location from the xref table, finds the Pages child of the Catalog element. It locates the Kids child of Pages, which is an array. It then iterates through the elements of this array. For each of the elements, it locates the Contents element in it, gets the children which could be text or Graphics operators and executes each of them.

Please note that each operation is represented as a postfix expression. The arguments appear first and the operation is the last token in the line.

Extraction algorithm

The extraction algorithm is similar. It iterates over the document structure exactly as for display. But instead of displaying the text, it extracts it and writes it out to the destination file (if there is one).

However, there is one significant difference between Display and extraction. This has to do with Fonts.

Fonts

A font is used to display characters on screen, and in pdf to also interpret encodings of strings.

A character is displayed using an image called a glyph (A very simplistic way of thinking about it, but it conveys the idea).

A font consists of a set of glyphs. It consists of a map called a CIDToGIDMap which maps character ids to Glyph ids. This is used for displaying fonts. As the pdf reader reads a string, each character is looked up in the CIDToGidMap and the corresponding glyph is displayed.

Conceptually, this is what happens – of course it is much more complicated than that. For example, storing each glyph as an image would take a lot of space, so it is usually stored as a vector. Similarly, font size, spacing, italics, bold etc need to be handled. However, it is easy to ignore those aspects and look at it this way.

A font also consists of a ToUnicodeMap. The ToUnicodeMap maps characters in the input string to UTF8 code points.

The following links give more information.

http://www.joelonsoftware.com/articles/Unicode.html gives an excellent introduction to character sets and encoding.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf explains the ToUnicode syntax.
http://www.4xpdf.com/2010/03/technical-background-to-pdf-font-options/ gives and extremely good overview of how fonts are displayed.

Recently, we were faced with an interesting problem. Our capture system uses master data from host systems for validation. Master data changes infrequently, but change it does, and so it is imported from the host system daily. Master data is supplied in the form of xml files, and massive ones too. An xml file can be more than 300 mb in size. Every time, the whole master data is given to our system. Parsing through this whole data and inserting it into a database is time consuming, so we needed a solution where we could compare the current xml file with the last imported xml, and insert only the differences.

What is the best way to compare two very large XML files? What are the issues involved? Going the DOM way is fraught with risks, because XML trees take around 10 times the size on disk. A 320 MB file would get expanded to 3.2 GB in memory. It would impossible, if not difficult to keep two such files in memory.

We first evaluated some tools to do the task for us. Evaluation was done with two xml files of around 327 MB each. The machine was a VMWare VM with 3 Gigs of RAM and 10 Gigs of swap space. The CPU is an i5 processor running on a DELL Vostro 3700 laptop.

Libxmldiff is a an efficient way of comparing large XML files. The web site promises to compare 100 MB plus files easily. However, with two files of 327 MB each, it gave a segmentation fault.
Xmlcmp.de has an xml toolbox which promises to compare and merge large xml files easily. The web site promises

“The comparison of two 1.400 MB xml-files needs only 6 minutes.
(Configuration: AMD Opteron Processor 244, 2 processors, 7 GB memory, 2 RAID-10-Diskarrays)”
The product fairly lives up to its promise. It takes some time to understand the configuration, but once that is understood, the tool compared our two files in around 6 minutes, which is not bad at all. However, the price is a bit steep (1700 Euros for a single server license), so we decided to explore a bit more.
Since there were not too many other products in the market, we decided to write the code ourselves. Perl was an obvious choice, due to its strong text processing capabilities, as well as the fact that it is already installed in our production systems.
First job before we started was to choose the right XML parser. DOM or SAX? The answer was clear – it had to be SAX. Or we could break the file into pieces and use DOM. Both approaches were tried.
There are many XML parsers in the perl world. Which one was fastest? Some existing studies helped. Here are two benchmarks

http://www.xml.com/pub/a/2004/09/15/pl-perf.html

http://xmlbench.sourceforge.net/results/benchmark/index.html

The first one gives a benchmark of all perl xml implementations. XML:SAX:ExpatXS is the best, and our tests also proved it.

The approach we followed was to

Convert both files to csv
Sort both files according to a known key
Diff the files using the unix command
Convert the resulting file to an XML format.

Of the different steps above, the first one is the most time consuming. So we tried out different approaches to convert an XML file to a CSV file.

Many small sub trees

The first approach was to read through the file and extract one xml element at a time (using perl regular expressions). For each element, construct a DOM tree using the ExpatXS parser. From this DOM tree extract and write out the attributes and sub elements on to a csv file.
Even though this approach looks crude, it is surprisingly effective. A 327 MB file can be converted to a csv file in around 6 minutes and 7 seconds.

STX

STX or streaming transformations in XML (http://stx.sourceforge.net/) is a method for quickly transforming large XML documents using an XSLT like language. There are implementations in Java (Joost), and perl (XML::SAX::STX). The perl version is available at http://search.cpan.org/~pcimprich/XML-STX-0.43/STX.pm. The version is 0.43 and seems to have been last updated in 2004.
The idea here is to write a small XSLT like script, and run it using the STX processor. The script below matches any element of types /bpColl/bp/addrCity, /bpColl/bp/ addrCompanyName and writes it out using the . operator. This script can be run from a perl script which uses XML::STX (which needs to be installed).

<?xml version=”1.0″ encoding=”ISO-8859-1″?>
<stx:transformversion=”1.0″ xmlns:stx=”http://stx.sourceforge.net/2002/ns” xmlns=”http://www.w3.org/1999/xhtml”>
<stx:template match=”/bpColl/bp/addrCity”>
<stx:value-of select=”.”/>
<stx:text>,</stx:text>
</stx:template>
<stx:template match=”/bpColl/bp/addrCompanyName”>
<stx:value-of select=”.”/>
<stx:text>
</stx:text>
</stx:template>
</stx:transform>

This is a very nice, elegant API. However, the performance leaves much to be desired. It takes around 4 hours to convert an XML file to csv file format. The STX implementation uses XML::SAX::ExpatXS, so the speed should be fine, unless STX itself has overheads.

XML::SAX::ExpatXS

It is possible to parse the file using this SAX parser. This is quite fast, it finishes in 6 minutes and 30 seconds. The time goes down to 4 minutes and 52 seconds if we ignore the character nodes that we are not interested in.

XMLGAWK

Xmlgawk is a very interesting utility that uses the awk idiom to parse xml documents. A script similar to an awk script can be written to parse the documents. The script itself is very short, and performance is fairly good. A simple script to convert some nodes to csv format, took around 6 and a half minutes.

PreProcessing xml file before perl parsing.

The input xml file can be preprocessed with grep or sed to remove nodes that are not required. This is very fast. The XML file that we used had a structure that consisted of multiple records, with each record containing multiple fields. The record was represented by an XML element and each field was represented by an xml element within it. If each xml element appears on a separate line, it is easy to remove it using grep with the command
Grep –EV ‘<elementname’ input.xml > cleanedinput.xml
Multiple elements can be removed from the input xml file like this by using the | separator. This is very fast. Running the xmlgawk script or the perl program after this preprocessing is done reduces the processing time by 75%

Some other promising approaches

Some other approaches were considered but were not followed up due to paucity of time. One was to use a Java parser such as Saxon. Piccolo is a very fast Java SAX parser.
http://search.cpan.org/~rbs/XML-Filter-Dispatcher-0.52/lib/XML/Filter/Dispatcher.pm
XML::Filter::Dispatcher and Twig are two other options as mentioned in the article below ( a bit dated though).
http://tomacorp.com/perl/xml/saxvstwig.html
This url has a very good discussion of different techniques for parsing using SAX.
http://www.velocityreviews.com/forums/t167818-perl-sax2-slow.html

Introduction and Background

Do today’s web applications truly exploit the facilities provided by the cloud? Not many do. Even though increasingly, web applications are being deployed on the cloud, they do not exploit the full power of the cloud.
Most of them use traditonal (in premise) apis and infrastructure for development as well as deployment.
Meanwhile, the world has moved on, and exciting new cloud based apis and deployment options are becoming available. Heroku is a cloud deployment platform built on top of Amazon Web services that enforces standard development practices for developing and deploying applications on the cloud. Using cloud based APIs to develop, and then using Heroku for deployment is now becoming the rule.
Experienced programmers who have not used Heroku and Github may require a quick background to the architecture and interplay between these systems. This document attempts to provide that.

Structure of this document

Git background and its differences with other conventional Version Control systems are described first.

A model of Heroku with enough details to operate it is described. The interconnections of Heroku and Git is described. Way to setup Gradeables on Heroku for the first and subsequent times is described Finally, a deployment architecture and set of processes is proposed.

Git Essentials

Git is a distributed version control system (DVCS). It is different from earlier Version control systems in the sense that it can work in a disconnected mode and the entire repository is stored in the client machine. This section takes material from http://git-scm.com/book, with references to corresponding chapters. Where some point is not completely clear, reader can check the corresponding chapter in the book.

Structure of Git and Differences with conventional version control systems

Entire repository is copied

Conventional systems store all the different source code versions and metadata on the server (called the repository). Git stores the entire repository on each client as shown in the diagram below.

Because of this, clients do not checkout code from the server they clone code from the server. When a repository (say Gradeables) is cloned on the local machine

A directory with name of the repository (Gradeables) is created
A hidden subdirectory called .git is created (Gradeables/.git), the entire repository with version differences is stored here
A checkout is performed into the top level directory and the latest version of the source code is available here.

Versions are stored as snapshots and not as deltas

As the following diagrams illustrate, versions are stored as snapshots in Git, not as deltas

Versions in CVS

Versions in Git

As can be seen from the pictures above, whenever a new version is created in a conventional version control system the delta, or difference with previous version is stored. With Git, and entirely new directory structure is created for each version, all unchanged files are stored as links to the previous versions, while for changed files a copy is made.

Checkins are local

In Git all checkins done via git commit are local and do not change the remote server. Since all checkins, checkouts, modifications and commits are local, the central Git server is usually termed a remote

States of git files

A file is in state (untracked) when it is first locally created. After adding the file to the local git repository using the command git add, the file goes to unmodified state. If changes are made to the file, then it goes to modified state. Using the git command git add <filename>, moves it into a special state called staged. On committing the file using the git commit command, the file goes into unmodified state and changes become part of the local repository.

All commits are local. For changing the common remote server, git remote push command is required. To pull latest versions from remote server, git pull command is required.

The standard practice for using git is

Developer first initializes the repository, adds files and pushes to git remote server(usually github.com)
Developers create a clone of the remote repository using the clone command
Developers make changes
When developer has finished with changes, the code needs to be pushed to the remote server. For this
- Developer first commits changes locally
- Pulls latest changes from remote server using git pull (not clone), in case other developers have made changes to the same files. In cases of conflict, developer resolves them.
- Developer pushes the changes to the remote server.

Git Server

Git can be installed as a server, which can be used by multiple developers to collaborate. Github.com is the most popular git server, but Git can be installed as a server on any machine. When directories are pushed to Git server, only the .git directory is stored, the current working copy is not stored.

Git Server Hooks

Git server provides a facility to run server side scripts when a push is initiated or after a push is complete. This is very important in the context of heroku because heroku uses this feature to run deployment scripts.

Working with Git – common commands

Cloning a repository
Making changes
Updating server with changes
Reverting
Branching
Merging

Heroku Essentials

This section contains excerpts from the book “Hacker’s guide to Heroku”, an excellent guide to Heroku. Please refer this book for more detailed information on Heroku.
Heroku is a cloud based deployment system which abstracts best practices for deploying SAAS applications. It runs on top of Amazon Web Services and provides facilities to easily deploy and massively scale applications.

Heroku components

The following diagram illustrates Heroku components.

Users can create accounts in Heroku. Once an account is created, they can create applications on Heroku. Whenever an application is created, a corresponding git repository is automatically created. This repository is created within a Git Server running on the heroku system.
Whenever an application is created, a virtual server running ubuntu (512 MB RAM, 4 CPU Cores) is created along with it. This virtual server is termed a Dyno (short for Dynosaur). Heroku provides the facility to create more Dynos on demand. Heroku takes care of load balancing and directing requests to corresponding dynos.
Whenever an application is deployed to Heroku, Heroku runs a deployment script called the Slug compiler, and creates a binary version of the deployment – this is termed a Slug.
Whenever a new Dyno is created/provisioned, Heroku automatically deploys the slug on the Dyno and makes it ready to run.
Heroku provides a set of infrastructure components – The database component creates instances of database servers, Cloudmail provides the facility to receive emails using specified email addresses and forward them to the application, whereas Memcache provides a cache facility. Heroku provides the facility to attach infrastructure components to applications.

Interacting with Heroku

Users would need to interact with Heroku to

Configure application
Attach infrastructure components to application
Push and pull source code from/to Heroku
Deploy application to Heroku
Configure runtime Dynos
Dial into a runtime Dyno to check health, logs etc

Heroku provides a set of client tools called the heroku toolbelt for interacting with heroku. When the heroku toolbelt is created, the heroku command is available.
The following are the most important commands

heroku login– to login to heroku. A heroku account is required
heroku accounts– to switch between accounts. Plugin heroku:accounts needs to be installed
heroku apps– app related commands. Most important app commands are create and destroy
heroku config– to set environment variables in the Dynos
heroku run bash– this is to open a shell terminal to the virtual server (Dyno). A file system consisting only of the application files are displayed, and many shell commands can be executed here
heroku run console– A rails console is opened and debugging can be done from this console

Heroku Git Interplay

How are applications pushed to the git repository within Heroku? After pushing an app to Heroku, how does it get automagically deployed?
Whenever an application is created in Heroku, it automatically creates a corresponding git repository for it, and the url to that repository is returned.
After the developer finishes development, this application can be pushed to the git repository using a git push heroku command.
Heroku intercepts a git push by using the Githooks mechanism, by hooking up a language specific script called a Buildpack to Git. As soon as push completes, the buildpack is run. It essentially

runs a language specific slug compiler and creats a slug (or binary)
deploys it in all the Dynos (virtual machines) configured for the application
Deployment can be customized by providing a Proc file in the root directory of the application. An example Procfile could contain the line web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb -E $RACK_ENV

Deployment Steps

First Time

The following steps (and commands) need to be executed for the first time. A heroku account needs to be created by visiting heroku.com and the heroku client toolbelt installed.

Clone existing code from github.com
Login to heroku
Create an application
Push code to heroku
Set configuration variables
- Environment – staging or production
- Gmail credentials
Import data into the application database
Change some parameters in configuration files for css (once system is finetuned, this should go away)

System is ready to go!

Subsequent times

Pull code from heroku
Make changes
Push code to heroku
Import data into the application database (if data has changed)

References

Git book – http://git-scm.com/book
Heroku book – http://www.theherokuhackersguide.com/
Article on Heroku push deployment – http://www.jamesward.com/2012/07/18/the-magic-behind-herokus-git-push-deployment
Buildpacks – https://devcenter.heroku.com/articles/buildpacks

PDF file formats and CMAPs

Blog

PDF file formats and CMAPs

First things First

Display algorithm

Extraction algorithm

Fonts

Handling massive XML files with perl and shell scripts

Many small sub trees

STX

XML::SAX::ExpatXS

XMLGAWK

PreProcessing xml file before perl parsing.

Some other promising approaches

Deploying web applications on the cloud with github and heroku

Introduction and Background

Structure of this document

Git Essentials

Structure of Git and Differences with conventional version control systems

Entire repository is copied

Versions are stored as snapshots and not as deltas

Checkins are local

States of git files

Git Server

Git Server Hooks

Working with Git – common commands

Heroku Essentials

Heroku components

Interacting with Heroku

Heroku Git Interplay

Deployment Steps

First Time

Subsequent times

References

Quick Links

SUBLINKS

RESOURCES