A look at pdftk
I don’t know how many ways you can create PDF files in Linux. Most applications let you save documents directly to PDF, and you can convert files to PDF quite easily. But manipulating those PDFs is a bit trickier. Various applications let you fiddle with PDFs in one or two ways. But if you’re a command line junkie, an app called pdftk (PDF Toolkit) is practically an all-in-one solution. It’s the closest thing to Adobe Acrobat that I’ve found for Linux.
pdftk’s developer describes it as the PDF equivalent of an “electronic staple remover, hole punch, binder, secret decoder ring, and X-ray glasses.” That’s pretty close to the truth. Pdftk can:
- Join and split PDFs
- Pull single pages from a file
- Encrypt and decrypt PDF files
- Add, update, and export a PDF’s metadata
- Export bookmarks to a text file
- Add or remove attachments
- Fix certain damaged PDFs
- Fill out PDF forms
You can download pdftk either as source code or in packages for various platforms, for example Debian, RPM-based distributions, FreeBSD, or Gentoo. If you’re going to compile pdftk, read this to learn about the program’s dependencies.
As I mentioned earlier, pdftk is a command-line tool. Its options can be complicated, especially for complex operations. You’ll be doing quite a bit of typing, but that shouldn’t put you off using pdftk. When I started working with pdftk, I found myself using only a few of its functions: joining and splitting PDF files, adding metadata, and password-protecting the file.
Combining PDF files
pdftk can combine two or more PDF files, similar to joinPDF (which I discussed here). To do that, open a terminal window and change to the directory containing the PDF files that you want to combine. Then, type the following command:
pdftk file1.pdf file2.pdf cat output newFile.pdf
cat is short for concatenate (join together, in plain English), and output tells pdftk to write the combined PDFs to a new file; in this case, newFile.pdf.
Pdftk doesn’t retain the bookmarks that might have been in one or all of the files you’re combining, but it does keep hyperlinks to both destinations within the PDF and to external files or Web sites. Where some other applications point to the wrong destinations for hyperlinks, the links in PDFs combined using pdftk managed to hit each link target.
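If you combine files often, a small wrapper can catch the easy mistakes before pdftk runs. The sketch below is only an illustration: the function name merge_pdfs and the DRY_RUN switch are my own inventions, not part of pdftk. It relies on one documented rule, which is that the output filename may not be the same as an input filename.

```shell
#!/bin/sh
# merge_pdfs OUTPUT INPUT1 INPUT2 [...]
# Hypothetical helper: join the inputs, in order, into OUTPUT.
merge_pdfs() {
    out=$1
    shift
    if [ "$#" -lt 2 ]; then
        echo "need at least two input files" >&2
        return 1
    fi
    # pdftk will not write its output over one of its inputs.
    for f in "$@"; do
        if [ "$f" = "$out" ]; then
            echo "output must differ from every input: $f" >&2
            return 1
        fi
    done
    # With DRY_RUN=1 the command is printed instead of executed.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo pdftk "$@" cat output "$out"
    else
        pdftk "$@" cat output "$out"
    fi
}
```

Running merge_pdfs newFile.pdf file1.pdf file2.pdf with DRY_RUN=1 set prints the same command shown earlier, which is a cheap way to check the file order before committing to it.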
Splitting files
Splitting PDF files with pdftk can be … interesting. The burst option breaks a PDF into multiple files. How many? How about one file for each page. To use it, type:
pdftk style_guide.pdf burst
With larger documents you wind up with a lot of files with names corresponding to their page numbers, like pg_0001.pdf and pg_0013.pdf. It’s not very intuitive or useful, especially if you want only a few pages.
Of course, pdftk can also remove specific pages from a PDF file. For example, to remove pages 10 to 25 from a PDF file, type the following command:
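The pdftk documentation says burst also accepts an output section containing a printf-style pattern, for example output page_%02d.pdf, if you’d rather choose the names yourself. Since the shell’s own printf understands the same kind of pattern, you can preview the names before bursting anything. The helper below, burst_names, is a hypothetical name of mine and never calls pdftk.

```shell
#!/bin/sh
# burst_names PATTERN COUNT
# Hypothetical helper: print the filenames a printf-style PATTERN
# would produce for pages 1..COUNT. It does not touch any PDF.
burst_names() {
    pattern=$1
    count=$2
    page=1
    while [ "$page" -le "$count" ]; do
        # PATTERN is deliberately used as the printf format string.
        printf "$pattern\n" "$page"
        page=$((page + 1))
    done
}

# Example: preview three names, then burst with the same pattern.
#   burst_names 'page_%02d.pdf' 3
#   pdftk style_guide.pdf burst output page_%02d.pdf
```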
pdftk myDocument.pdf cat 1-9 26-end output removedPages.pdf
The ranges 1-9 and 26-end tell pdftk to copy pages 1 through 9 and page 26 through the last page to the file removedPages.pdf. Pages 10 to 25, which fall between those ranges, are left out.
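Working out the ranges to keep is simple arithmetic, but it’s easy to be off by one. Here is a minimal sketch that does the sums for you. The function name keep_ranges is my own, and it assumes the pages you’re dropping stop short of the last page, because it has no way of knowing how long the document is.

```shell
#!/bin/sh
# keep_ranges FIRST LAST
# Hypothetical helper: print the page ranges to KEEP when pages
# FIRST..LAST are dropped. Assumes LAST is not the final page.
keep_ranges() {
    first=$1
    last=$2
    if [ "$first" -lt 1 ] || [ "$last" -lt "$first" ]; then
        echo "invalid range: $first-$last" >&2
        return 1
    fi
    ranges=""
    # Keep everything before the dropped block, if there is anything.
    if [ "$first" -eq 2 ]; then
        ranges="1"
    elif [ "$first" -gt 2 ]; then
        ranges="1-$((first - 1))"
    fi
    # Keep everything after the dropped block, through the last page.
    ranges="${ranges:+$ranges }$((last + 1))-end"
    echo "$ranges"
}

# Example (the substitution is left unquoted on purpose, so each
# range becomes its own argument):
#   pdftk myDocument.pdf cat $(keep_ranges 10 25) output removedPages.pdf
```

For pages 10 to 25 the function prints 1-9 26-end, the same ranges used in the command above.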
I’ve used this feature quite a bit — mainly to trim pages from work samples that I have posted on my company’s Web site, and to extract articles from back issues of a magazine to which I contribute. The resulting files are small, and the PDFs are clear and easy to read.
Adding attachments to a PDF
To be honest, I miss Adobe Acrobat’s ability to attach files to a PDF. When working with PDFs on Windows, I regularly used this feature to include addenda, surveys, or additional information with a published PDF. Until I found pdftk, I was forced to move my PDF documents to a computer running Windows whenever I needed to attach a file.
Why attach a file to a PDF instead of sending an archive? Mainly convenience. If you move a PDF from one computer to another, and don’t move the archive along with it, you won’t have access to the attachments. And instead of pulling a file from an archive to view it, you just double-click on the attachment’s icon to open the file from your PDF viewer.
Using pdftk, you can easily attach binary and text files to a PDF. You can even specify what page of the PDF you want the attachment to appear on. Just type the following command:
pdftk htmltidy.pdf attach_files command_ref.html to_page 24 output htmltidybook.pdf
Obviously, attach_files is the option to attach files. to_page 24 tells pdftk to attach the file command_ref.html to page 24 of the resulting PDF.
I’ve attached OpenOffice.org Writer documents, tar.gz and zip archives, and text and HTML files to various PDF documents. Apart from a noticeable increase in the size of the PDF file, there were no nasty side effects.
How do you know a PDF contains an attachment? Look for the thumbtack icon in the PDF. This only works in Adobe’s Acrobat Reader, though. Attachments don’t appear in applications like Xpdf, Evince, KPDF, or gv.
Adding metadata and passwords to a PDF
Pdftk has a number of options that you might use infrequently, but that are very useful when you need them. Two of them are update_info and user_pw.
When you create a PDF, it might contain little or no metadata, which is information that describes the PDF. Metadata can come in handy when you or your users need to organize or index a set of PDF files. Using pdftk and a text file, you can change or add metadata to the PDF. pdftk won’t write its output over an input file, so give the result a new name:
pdftk DocBookOverview.pdf update_info data.txt output DocBookOverview_meta.pdf
In this case, the file data.txt contains an InfoKey and InfoValue pair, like this:
InfoKey: Keywords
InfoValue: DocBook,writing,documentation,background
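A data file can hold more than one pair. The sketch below builds one with the shell’s printf. The function name info_pair and the sample values are made up for illustration. Some pdftk releases also write an InfoBegin line before each pair in their dump_data output, so if yours does, compare this layout against that output before relying on it.

```shell
#!/bin/sh
# info_pair KEY VALUE
# Hypothetical helper: print one InfoKey/InfoValue pair in the
# layout shown above.
info_pair() {
    printf 'InfoKey: %s\nInfoValue: %s\n' "$1" "$2"
}

# Example: write three pairs (sample values) to data.txt.
#   {
#       info_pair Title "DocBook Overview"
#       info_pair Subject "Documentation"
#       info_pair Keywords "DocBook,writing,documentation,background"
#   } > data.txt
```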
You can change only the following metadata items with pdftk: title, author, subject, producer, and keywords.
If you’re working with PDFs that contain sensitive information, you may want to make sure that only certain people can view a PDF by applying a password to it with the user_pw option:
pdftk salesreport.pdf output SalesReport.pdf user_pw PROMPT
You will be prompted for a password of up to 32 characters. When someone tries to open the PDF, they will be asked to enter the password.
Conclusion
pdftk is one of the most useful tools for manipulating PDF files that I’ve found for Linux. It’s not the easiest software to work with, but you’ll get the hang of it after a bit of practice.