Difference between two PDF files?

I’m going through thirty gigs of files from various sources trying to sort them. A lot of the original PDF files have been modified by someone adding their name to the first page. That’s easy to spot. But this is only one case of many I expect to encounter.

The two files in question are both 16 pages of schematics. I see no difference other than the file size. Konqueror with KPDF will open both of them and only shows a difference in the modified date. PDFEdit will not open the smaller file.

517.6KB File
899.7KB File

It's doubtful, but is there any way to find out what was modified, such as a table of contents being added?

Is there a command line program that will list the differences between various PDF files?

To compare the text, use GNU diff (or your favourite GUI front end for it) and xpdf’s pdftotext:


$ pdftotext file1.pdf file1.txt
$ pdftotext file2.pdf file2.txt
$ diff file1.txt file2.txt
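
If you have a lot of pairs to check, the text comparison above could be wrapped in a small shell function. This is only a sketch: the function name pdf_text_diff is made up for illustration, and -layout is an optional pdftotext flag added here on the assumption that you want columns kept roughly aligned. Bear in mind that pdftotext only sees real text, so scanned schematics may produce empty output for both files.

```shell
# Sketch: compare the extracted text of two PDFs.
# Exit status: 0 = same text, 1 = text differs, 2 = extraction failed.
# (pdf_text_diff is a hypothetical helper name, not part of any tool.)
pdf_text_diff() {
    ptd_dir=$(mktemp -d) || return 2
    if pdftotext -layout "$1" "$ptd_dir/a.txt" &&
       pdftotext -layout "$2" "$ptd_dir/b.txt"
    then
        ptd_status=0
        # Unified diff is easier to read than the default format.
        diff -u "$ptd_dir/a.txt" "$ptd_dir/b.txt" || ptd_status=$?
    else
        ptd_status=2
    fi
    rm -rf "$ptd_dir"
    return "$ptd_status"
}

# Example use:
#   pdf_text_diff file1.pdf file2.pdf && echo "text is identical"
```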

To compare the metadata (title, author, bookmarks, etc.), use diff and pdftk:


$ pdftk file1.pdf dump_data output file1.txt
$ pdftk file2.pdf dump_data output file2.txt
$ diff file1.txt file2.txt
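
Since a changed modification date will always show up in the metadata diff, it may help to drop the date fields first so that only the more interesting changes (author, title, bookmarks) remain. The following is a rough sketch, assuming dump_data writes each Info entry as an "InfoKey:" line followed by an "InfoValue:" line; pdf_meta is a made-up helper name.

```shell
# Sketch: print a PDF's metadata with the ModDate and CreationDate entries removed.
# Assumes pdftk's "InfoKey: X" / "InfoValue: Y" two-line layout.
# (pdf_meta is a hypothetical helper name.)
pdf_meta() {
    pdftk "$1" dump_data | awk '
        /^InfoKey: (ModDate|CreationDate)$/ { skip = 1; next }  # drop the key line
        skip && /^InfoValue:/               { skip = 0; next }  # and its value line
        { skip = 0; print }
    '
}

# Example use:
#   pdf_meta file1.pdf > file1.meta
#   pdf_meta file2.pdf > file2.meta
#   diff file1.meta file2.meta
```

Any "InfoBegin" marker lines are left in place; they are identical in both files, so they do not affect the diff.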

To do a very low-level comparison (for which you will probably need to know a bit about the PDF file format, which is a bit like PostScript), you can use pdftk to decompress the page streams and then diff the results. The -a option tells GNU diff to treat the files as text even if they still contain binary data such as embedded images:


$ pdftk file1.pdf output file1_uncompressed.pdf uncompress
$ pdftk file2.pdf output file2_uncompressed.pdf uncompress
$ diff -a file1_uncompressed.pdf file2_uncompressed.pdf

Oh, and I just thought of a way to do visual page-by-page comparisons, which works even with pages that have graphics. You will end up with a PDF with the differences highlighted in red. Assume your two PDFs are called foo.pdf and bar.pdf and have N pages each. Then do the following in bash, typing the actual page count in place of N (brace expansion only works with literal numbers):


$ pdftk foo.pdf burst output foo_%d.pdf
$ pdftk bar.pdf burst output bar_%d.pdf
$ for f in {1..N};do compare foo_$f.pdf bar_$f.pdf foobar_$f.pdf;done
$ pdftk foobar_{1..N}.pdf cat output foobar.pdf

These commands use pdftk’s burst operation to split each PDF into one file per page. ImageMagick’s compare tool is then run on each pair of pages to produce a “difference” image, and pdftk’s cat operation stitches the difference images back together into a PDF called foobar.pdf. View it in your favourite PDF viewer and you’ll see the parts that are the same in grey and the parts that differ in red.
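
To avoid typing the page count by hand, the same steps could be wrapped in a function that reads the count from pdftk's dump_data output (the "NumberOfPages:" line). Treat this as a sketch under a few assumptions: both PDFs have the same number of pages, ImageMagick can read PDFs on your system (it relies on Ghostscript for that), and pdf_visual_diff is a made-up name.

```shell
# Sketch: page-by-page visual diff of two PDFs with the same page count.
# Usage: pdf_visual_diff foo.pdf bar.pdf foobar.pdf
# (pdf_visual_diff is a hypothetical helper name.)
pdf_visual_diff() {
    pvd_a=$1; pvd_b=$2; pvd_out=$3
    pvd_dir=$(mktemp -d) || return 2

    # dump_data prints a "NumberOfPages: <n>" line; keep only the number.
    pvd_n=$(pdftk "$pvd_a" dump_data | sed -n 's/^NumberOfPages: *//p')

    # One file per page: a_1.pdf, a_2.pdf, ... and b_1.pdf, b_2.pdf, ...
    # (burst may also leave a doc_data.txt report in the current directory.)
    pdftk "$pvd_a" burst output "$pvd_dir/a_%d.pdf" &&
    pdftk "$pvd_b" burst output "$pvd_dir/b_%d.pdf" ||
        { rm -rf "$pvd_dir"; return 2; }

    set --      # reuse the positional parameters as the list of diff pages
    pvd_i=1
    while [ "$pvd_i" -le "${pvd_n:-0}" ]; do
        # compare exits with 1 when the pages differ; that is not an error here.
        compare "$pvd_dir/a_$pvd_i.pdf" "$pvd_dir/b_$pvd_i.pdf" \
                "$pvd_dir/d_$pvd_i.pdf" || [ $? -eq 1 ] || break
        set -- "$@" "$pvd_dir/d_$pvd_i.pdf"
        pvd_i=$((pvd_i + 1))
    done

    # Only stitch the result together if every page was compared.
    pvd_status=2
    if [ $# -gt 0 ] && [ $# -eq "$pvd_n" ]; then
        pdftk "$@" cat output "$pvd_out" && pvd_status=0
    fi
    rm -rf "$pvd_dir"
    return "$pvd_status"
}
```

The intermediate page files go into a temporary directory, so the working directory is not littered with foo_1.pdf, bar_1.pdf and so on.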

You could print the pages out, place them side by side, and use my method.