Wednesday 18 July 2012

Parsing PDF Files with C# - Part 1

I've been working on some C# code to parse PDF files and check for security restrictions, such as the existence of user and owner passwords and settings such as printing being disabled, and thought I should share some notes.


PDF File Format
The PDF file format is extremely flexible, which is great in some respects but makes writing a parser that supports all of that flexibility a challenge. For example, newline sequences can vary depending on the OS or software used to create the PDF file. Different versions of the PDF specification do things in different ways, and the contents of a file may be encrypted or encoded in several ways, so writing a complete parser is a challenging prospect.


Apart from a couple of elements, PDF files are largely made up of objects. These objects can describe where to find other objects in the file, the text or images that appear on a page, fonts, and which objects appear on which page.


PDF files can have two passwords. The owner password is required to change the document and the user password is required to open it. If a file has restrictions such as No Printing but doesn't prompt you for a password when you open it in Adobe Reader, you will find that the document is still encrypted, using a default user password. You can find more information about this in the specification document.


Using C#
I'll go into more detail about how I approached implementing a PDF parser in C# in a future post, but here are some general thoughts about it.


It should go without saying, but write code that is as generic as possible. Most elements of PDF files are Objects, each of which usually has a Dictionary and a content Stream.
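As a rough illustration of what "generic" could mean here, the sketch below models a single object as a dictionary plus optional stream bytes. The class name, property names and the GetString helper are all made up for this example, and the object number and generation fields are an assumption based on the "12 0 obj" style header that objects carry in the file.

```csharp
using System.Collections.Generic;

// Hypothetical generic container for one object ("12 0 obj ... endobj").
public class PdfObject
{
    public int ObjectNumber { get; }
    public int Generation { get; }

    // Dictionary entries keyed by name (e.g. "Type", "Filter").
    // Values are left untyped in this sketch.
    public Dictionary<string, object> Dictionary { get; } =
        new Dictionary<string, object>();

    // Raw (still encoded) stream bytes; empty when the object has no stream.
    public byte[] StreamData { get; set; } = new byte[0];

    public PdfObject(int objectNumber, int generation)
    {
        ObjectNumber = objectNumber;
        Generation = generation;
    }

    public bool HasStream
    {
        get { return StreamData.Length > 0; }
    }

    // Returns a dictionary value as a string, or the fallback when the
    // key is missing or the value is not a string.
    public string GetString(string key, string fallback)
    {
        object value;
        if (Dictionary.TryGetValue(key, out value) && value is string)
        {
            return (string)value;
        }
        return fallback;
    }
}
```

With something like this, pages, fonts and images can all be handled by the same loading code and only interpreted differently afterwards.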


I ended up using FileStream to read files. I initially used a StreamReader but came across several problems. When reading a PDF file, you need to jump around the file to various points, and StreamReader buffers its reads. This means StreamReader.BaseStream.Position and StreamReader.BaseStream.Seek may refer to one location in the file while the StreamReader read methods return data from a different point in the file.
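One way to keep the position reliable is to read lines a byte at a time from the stream itself. The following is a minimal sketch of that idea, not the parser's actual code: PdfLineReader and TryReadLine are hypothetical names, it assumes a seekable stream such as FileStream, and it decodes as ASCII on the assumption that it is only used for structural lines such as the header. It accepts CR, LF or CRLF as the line ending, since any of them may appear.

```csharp
using System.IO;
using System.Text;

public static class PdfLineReader
{
    // Reads one line from the stream's current position. Returns false at
    // end of file. Afterwards stream.Position is exactly at the start of
    // the next line, so it is safe to Seek and read again.
    public static bool TryReadLine(Stream stream, out string line)
    {
        line = "";
        int b = stream.ReadByte();
        if (b == -1)
        {
            return false;
        }

        var bytes = new MemoryStream();
        while (b != -1 && b != '\r' && b != '\n')
        {
            bytes.WriteByte((byte)b);
            b = stream.ReadByte();
        }

        if (b == '\r')
        {
            // CR may be followed by LF; if it is not, step back one byte.
            int next = stream.ReadByte();
            if (next != -1 && next != '\n')
            {
                stream.Seek(-1, SeekOrigin.Current);
            }
        }

        line = Encoding.ASCII.GetString(bytes.ToArray());
        return true;
    }
}
```

Typical use would be to call Seek on the FileStream with an offset found elsewhere in the file and then call TryReadLine to read what is stored at that offset.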


It is common for an object's stream (the content of the object) to be compressed using the FlateDecode filter. This can be decompressed using the System.IO.Compression.DeflateStream class, although I found that I needed to skip the first two bytes when inflating these streams. The data is stored in zlib format, which starts with a two-byte header that DeflateStream does not expect.
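The two-byte skip can be sketched as follows. FlateHelper and Inflate are hypothetical names, and the sketch assumes the whole stream is already in memory, uses no predictor, and starts with the usual two-byte zlib header. The four-byte checksum that zlib appends after the compressed data is simply left unread.

```csharp
using System.IO;
using System.IO.Compression;

public static class FlateHelper
{
    // Decompresses FlateDecode (zlib-wrapped) data by skipping the zlib
    // header and handing the rest to DeflateStream.
    public static byte[] Inflate(byte[] zlibData)
    {
        const int ZlibHeaderLength = 2;

        using (var input = new MemoryStream(
            zlibData, ZlibHeaderLength, zlibData.Length - ZlibHeaderLength))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

On .NET 6 or later, System.IO.Compression.ZLibStream should be able to read the zlib wrapper directly, which would make the manual skip unnecessary.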


Checking if the file has a user password requires a combination of creating MD5 hashes and encrypting using RC4. .NET provides an MD5 class which works well, and I was able to reuse an RC4 encryption/decryption class I had previously written.
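To show how the two pieces fit together, here is a rough sketch of the check for the simplest case only: revision 2 of the standard security handler (40-bit RC4), following my understanding of the algorithm in the specification. It is not the RC4 class mentioned above. The class, method and parameter names are made up, and the caller is assumed to have already extracted the /O, /U and /P entries from the encryption dictionary and the first element of the file ID. Later revisions add extra hashing rounds and longer keys, which this sketch ignores.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class PdfUserPasswordCheck
{
    // 32-byte padding string defined by the standard security handler.
    public static readonly byte[] Padding =
    {
        0x28, 0xBF, 0x4E, 0x5E, 0x4E, 0x75, 0x8A, 0x41,
        0x64, 0x00, 0x4E, 0x56, 0xFF, 0xFA, 0x01, 0x08,
        0x2E, 0x2E, 0x00, 0xB6, 0xD0, 0x68, 0x3E, 0x80,
        0x2F, 0x0C, 0xA9, 0xFE, 0x64, 0x53, 0x69, 0x7A
    };

    // Plain RC4: key scheduling followed by the keystream XOR.
    public static byte[] Rc4(byte[] key, byte[] data)
    {
        var s = new byte[256];
        for (int i = 0; i < 256; i++) s[i] = (byte)i;
        for (int i = 0, j = 0; i < 256; i++)
        {
            j = (j + s[i] + key[i % key.Length]) & 0xFF;
            byte t = s[i]; s[i] = s[j]; s[j] = t;
        }
        var result = new byte[data.Length];
        for (int n = 0, a = 0, b = 0; n < data.Length; n++)
        {
            a = (a + 1) & 0xFF;
            b = (b + s[a]) & 0xFF;
            byte t = s[a]; s[a] = s[b]; s[b] = t;
            result[n] = (byte)(data[n] ^ s[(s[a] + s[b]) & 0xFF]);
        }
        return result;
    }

    // Revision 2 only: key = first 5 bytes of
    // MD5(padded password + /O + /P as 4 little-endian bytes + file ID).
    public static byte[] ComputeKey(
        byte[] password, byte[] ownerEntry, int permissions, byte[] fileId)
    {
        var padded = new byte[32];
        int n = Math.Min(password.Length, 32);
        Array.Copy(password, padded, n);
        Array.Copy(Padding, 0, padded, n, 32 - n);

        using (var md5 = MD5.Create())
        using (var buffer = new MemoryStream())
        {
            buffer.Write(padded, 0, padded.Length);
            buffer.Write(ownerEntry, 0, ownerEntry.Length);
            for (int shift = 0; shift < 32; shift += 8)
            {
                buffer.WriteByte((byte)((permissions >> shift) & 0xFF));
            }
            buffer.Write(fileId, 0, fileId.Length);

            byte[] hash = md5.ComputeHash(buffer.ToArray());
            var key = new byte[5];
            Array.Copy(hash, key, key.Length);
            return key;
        }
    }

    // True when the empty (default) user password reproduces /U, meaning
    // the file opens without prompting even though it is encrypted.
    public static bool OpensWithEmptyUserPassword(
        byte[] ownerEntry, byte[] userEntry, int permissions, byte[] fileId)
    {
        byte[] key = ComputeKey(new byte[0], ownerEntry, permissions, fileId);
        byte[] expected = Rc4(key, Padding);
        if (userEntry.Length != expected.Length) return false;
        for (int i = 0; i < expected.Length; i++)
        {
            if (userEntry[i] != expected[i]) return false;
        }
        return true;
    }
}
```

If OpensWithEmptyUserPassword returns false for a revision 2 file, a real user password has been set and the reader will prompt for it.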


Resources 
The best resource for information about the PDF file format is the specification document, which is available on the Adobe site.


Adobe also have a very useful forum where you will most likely find that someone has already had the same problem as you.