How does compressing a .PDF file work?

My wife was having some problems emailing some PDF documents, so I used a website to compress them. How does this compression work, exactly? If there is so much that can be compressed, why doesn’t Adobe just make the file smaller to begin with?

I was taught not to compress pdf’s or jpegs, because they are already compressed when they get created. Jpegs, for example, have a quality setting that controls how much they get compressed. Most software I’ve seen has a slider bar to set quality/compression.

I’m pretty sure jpegs will suffer loss if they get compressed with WinRAR or Zip. I always create uncompressed WinRAR archives.

Do you have Gmail? Google Drive offers some free shareable storage that’s perfect for a few pdf’s or jpegs. I use mine frequently and email the link to my friends. You have to set the read options on each file (giving a specific email address read permissions).

My wife works for a child psychiatrist, and she sends forms to new patients via email (Gmail). Two of the forms are several pages, and some of the recipients complained that their email (whatever it was) was unable to handle the larger documents. That’s why I compressed them.

A PDF can contain multiple things. Two common ones are text and images. You may think that’s obvious, but an image can be an image of text. This takes up much more space than actual text. You can normally check this by trying to select some of the text in Acrobat or Reader. If you can copy the text and paste it into something like Notepad, it is probably text. I say “probably” because there is a type of PDF which contains hidden selectable text on top of an image. This is generally when the PDF was created by OCR software and that mode was selected.
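To put rough numbers on "much more space", here is a back-of-the-envelope sketch in Python. Every figure in it (characters per page, scan resolution, 8-bit grayscale) is an assumption picked for illustration, not a measurement of any real form, and it ignores whatever compression the PDF applies afterwards.

```python
# Rough size comparison: a page stored as text vs. the same page stored as a scan.
# All figures below are illustrative assumptions, not measurements.

CHARS_PER_PAGE = 3000      # assumed: a fairly dense page of text
BYTES_PER_CHAR = 1         # assumed: one byte per character

PAGE_WIDTH_IN = 8.5        # US Letter width in inches
PAGE_HEIGHT_IN = 11        # US Letter height in inches
DPI = 300                  # assumed scan resolution
BYTES_PER_PIXEL = 1        # assumed: 8-bit grayscale, before any compression


def text_page_bytes() -> int:
    """Raw size of one page kept as plain characters."""
    return CHARS_PER_PAGE * BYTES_PER_CHAR


def scanned_page_bytes() -> int:
    """Raw size of one page kept as an uncompressed grayscale bitmap."""
    width_px = int(PAGE_WIDTH_IN * DPI)
    height_px = int(PAGE_HEIGHT_IN * DPI)
    return width_px * height_px * BYTES_PER_PIXEL


if __name__ == "__main__":
    t = text_page_bytes()
    s = scanned_page_bytes()
    print(f"text: {t} bytes, scan: {s} bytes, ratio: {s / t:.0f}x")
```

Image compression narrows that gap a lot in practice, but under these assumptions the scanned page starts out thousands of times larger than the text.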

Another possibility is that [one of] the font(s) in the document is not one of the ones generally assumed to be available everywhere. In that case, the PDF will either include the entire font or convert the text to images when creating the PDF (this is normally an option when you create the PDF).

Some non-Adobe PDF creation utilities may create uncompressed documents by default. They may or may not have the ability to create compressed ones.

If the document doesn’t contain sensitive information, you could post it somewhere and I can download it and look at it for you to see if I can tell what is going on. This is the version before you compress it, of course.

As you said in the OP, Adobe already compresses pdf’s when they are created. I’m not sure how much space is saved by compressing again. I guess if it doesn’t corrupt the pdf, compressing again is ok.

We send a lot of google drive file links to get around this problem. Can’t imagine getting work done without it.

Terry Kennedy makes a good point. What’s included in a pdf can make a very big difference in size. Designing a pdf with the most common fonts, links (instead of images) etc. takes experience. We have a lady at work that creates most of our pdf’s and Word forms.

This is not correct. Zip & RAR use lossless compression; JPG, however, uses lossy compression.

If that’s an issue, just use a service to transfer them. I’ve used MailBigFile in the past.

You upload the file to the site and send an email with a link to download it. It’s erased after a week.

There are many levels of PDF compression. The mildest is making sure that only the information needed in the file is included, e.g. removing embedded fonts that are not used in the document. For more severe compression, PDF objects that are not directly visible in the document, such as alternative images, links, thumbnails, JavaScript actions and form fields, can be removed, preserving print and display quality but compromising interactive features. For the most severe compression, embedded images can be downsampled, irreversibly reducing image quality. See here for the various options of pdf size reduction.

Not disagreeing, but …

The first two of your examples aren’t properly termed “compression”. Yes, they result in a smaller file, but are more properly termed “leaving out optional file elements”, or “creating a simplified pdf, not a full featured pdf”.

Your last step, down-sampling images, is most definitely compression, and of a lossy nature.

Yes. The reason for not compressing JPGs is that, since they’re already compressed, using ZIP compression on them won’t make them any smaller, and may actually make them larger.
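If anyone wants to see this for themselves, here is a small Python sketch using the standard zlib module (the same DEFLATE family that ZIP uses). The word list and the generated filler text are made up for the demo and only stand in for a real document; the point is the round trip and the second pass, not the exact byte counts.

```python
import random
import zlib

# Made-up vocabulary; it only serves to produce repeatable filler text.
WORDS = ("form patient please sign return the and of to doctor office "
         "appointment insurance address phone date name child parent").split()


def make_sample_text(n_words: int = 5000, seed: int = 0) -> bytes:
    """Build repeatable filler text as a stand-in for a real document."""
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(n_words)).encode("ascii")


original = make_sample_text()
once = zlib.compress(original, 9)   # first pass: real savings
twice = zlib.compress(once, 9)      # second pass: very little left to squeeze

# Lossless: decompressing gives back exactly the bytes we started with.
assert zlib.decompress(once) == original

print("original:", len(original), "bytes")
print("compressed once:", len(once), "bytes")
print("compressed twice:", len(twice), "bytes")
```

The first pass should shrink the text considerably, while the second pass should land at about the same size as the first, give or take a few bytes of overhead. Nothing is lost either way, which is the difference from JPEG.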

That is my understanding also. Please correct me if I’m wrong on the next part: ZIP and RAR are lossless because they are intended for files in general, and therefore it is very important that the decompressed file is exactly the same as the original. JPG, however, is intended for images, and therefore the designers of jpg felt entitled to estimate how much can be lost and still look like the original.

Perhaps it’s an obvious tip, but it may not be obvious to some people–Acrobat has a built-in method to reduce file size.

On Acrobat 9 Pro, go to Document>Reduce File Size.

Acrobat X has an option to save as “reduced size PDF”. It reduces the size of most scans I’ve used it with by more than half, but it also reduces the size of documents published directly to PDF.

In the interest of science, I just tried it on a nine page letter and it reduced the document size from 80.2 KB to 51 KB.

How much you can compress something depends on what it is. Text written in a real language usually has a lot of redundancy in it: Some letters, combinations of letters, or words are much more common than others. With a compression program that understands the sorts of patterns that show up in text, you can usually compress text down by a factor of 2 or 3.

Images can be compressed either losslessly or lossily. How much lossless compression you can get in an image depends on the sort of image it is: Something with large expanses all the same color can be compressed a great deal, while something with a whole lot of fine detail can’t be compressed much, if at all. With lossy compression, you can always compress more, but at a cost: The more you compress, the more information you lose. There’s usually some amount of information loss that’s considered acceptable, but where that point is can be a matter of taste. Different programs can also lose different kinds of information, which may or may not be more or less acceptable. Even if you’re using lossy compression, how much compression you can acceptably get will depend on the image, and again, a lot of fine detail will usually mean less compression.
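A rough way to see the "depends on what it is" point, sketched in Python with the standard zlib module. The "images" here are just raw byte arrays I made up: one is a single flat colour, the other is pseudo-random noise standing in for the extreme case of fine detail. Real image formats are more sophisticated, so treat this as an illustration only.

```python
import random
import zlib


def ratio(data: bytes) -> float:
    """Original size divided by zlib-compressed size (bigger = more compressible)."""
    return len(data) / len(zlib.compress(data, 9))


# A short English sample (a few sentences from this thread).
english = (
    b"How much you can compress something depends on what it is. "
    b"Text written in a real language usually has a lot of redundancy in it: "
    b"Some letters, combinations of letters, or words are much more common "
    b"than others. Images can be compressed either losslessly or lossily. "
    b"Something with large expanses all the same color can be compressed a "
    b"great deal, while something with a whole lot of fine detail cannot be "
    b"compressed much, if at all."
)

# Hypothetical 200x200 grayscale "images", one byte per pixel.
flat = bytes([255]) * (200 * 200)                            # one solid colour
rng = random.Random(1)
noisy = bytes(rng.getrandbits(8) for _ in range(200 * 200))  # pure noise

for name, data in [("english", english), ("flat", flat), ("noisy", noisy)]:
    print(f"{name}: ratio {ratio(data):.2f}")
```

The flat image should collapse to almost nothing, the noise should not shrink at all, and the English text should land somewhere in between. A sample this short will probably fall a bit short of the factor of 2 or 3 you get on longer text, since the compressor has less to learn from.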

Sound and video files are similar to images, in this regard (though the exact algorithms used will be different): You can usually tolerate some loss of information, and how much loss and where will vary from one situation to another.

No matter what compression scheme you use, none of them will work well on random information, nor on information that’s already been compressed somehow (in fact, the output of good compression schemes should look almost indistinguishable from random information). No lossless compression scheme can ever exist that shrinks all inputs, and if it shrinks some inputs, then it’s guaranteed to make others larger. With a well-designed compression program, it’ll significantly shrink a small proportion of all possible inputs, and very slightly increase the size of all other inputs. Fortunately, the small proportion that’s shrunk will cover almost all files that people would really be interested in shrinking in the first place.
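The "no scheme shrinks everything" claim is just counting, and it can be sketched in a few lines of Python. The function names are mine, made up for this illustration. The idea: a lossless scheme must map different inputs to different outputs, and there are fewer short bit strings than long ones.

```python
from fractions import Fraction


def strings_of_length(n: int) -> int:
    """Number of distinct bit strings that are exactly n bits long."""
    return 2 ** n


def strings_shorter_than(n: int) -> int:
    """Number of distinct bit strings of length 0, 1, ..., n-1."""
    return sum(2 ** k for k in range(n))


def max_fraction_shrunk_by(n: int, k: int) -> Fraction:
    """Upper bound on the fraction of n-bit inputs that any lossless scheme
    can shrink by at least k bits: each such input needs its own output of
    length n-k or less, and there are only so many of those."""
    targets = strings_shorter_than(n - k + 1)   # outputs of length <= n-k
    return Fraction(targets, strings_of_length(n))


if __name__ == "__main__":
    n = 16
    print(strings_of_length(n), "inputs of", n, "bits")
    print(strings_shorter_than(n), "possible shorter outputs")
    print("at most", max_fraction_shrunk_by(n, 8), "can shrink by a byte or more")
```

There is always one fewer shorter string than there are n-bit inputs, so at least one input cannot shrink. The last function shows why only a small proportion can shrink significantly: fewer than 1 in 128 inputs of a given length can lose even a single byte.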

What happens if you take that 51k document and compress it using say… WinZIP on the highest compression mode? Does it get any smaller?

That’s a good way to tell if you’re actually getting binary compression, or just some sort of housekeeping style compression within the PDF file.

We thought about that, but she doesn’t want to confuse people with multiple emails. By compressing the files, she’s able to email all the forms together in one email.

Also note that sound and video compression schemes are tied very closely to the fact that humans are going to be directly perceiving the result. JPEG compression normally downsamples colour information relative to intensity, knowing that humans are less sensitive to colour changes than they are to brightness changes. Some compression schemes used for transmitting phone calls take advantage of the fact that human speech doesn’t need very high and low frequencies to be transmitted to be understood.
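For the curious, here is a minimal Python sketch of the colour downsampling idea. It uses the usual JFIF colour conversion and a simple 2x2 average (often called 4:2:0 subsampling); the helper names are made up for this example, and a real JPEG encoder does a good deal more after this step.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to (Y, Cb, Cr) using the JFIF formulas.
    Y carries brightness; Cb and Cr carry colour."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def subsample_2x2(plane):
    """Average each 2x2 block of a plane (a list of rows) into one sample.
    Assumes the width and height are both even."""
    out = []
    for row in range(0, len(plane), 2):
        out_row = []
        for col in range(0, len(plane[0]), 2):
            total = (plane[row][col] + plane[row][col + 1]
                     + plane[row + 1][col] + plane[row + 1][col + 1])
            out_row.append(total / 4)
        out.append(out_row)
    return out


def sample_counts(width, height):
    """Samples stored with full-resolution colour vs. with Cb and Cr
    each reduced by 2x2 while Y stays at full resolution."""
    full = 3 * width * height
    subsampled = width * height + 2 * (width // 2) * (height // 2)
    return full, subsampled


if __name__ == "__main__":
    print(rgb_to_ycbcr(255, 0, 0))           # a pure red pixel
    print(subsample_2x2([[0, 4], [8, 12]]))  # one 2x2 block becomes one sample
    print(sample_counts(640, 480))
```

Keeping brightness at full resolution and storing colour at a quarter of the resolution halves the number of samples before any further compression happens, and most people will not notice the difference.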

For information passed between computers for processing, lossy compression doesn’t help so much because computers don’t have these convenient perceptual gaps.

You don’t need to use multiple e-mails. With services like that, you put all of your big files, in all their big file glory, up on the servers, where they stay. The actual e-mail that gets sent is tiny, and just has links to where the files are stored. The person receiving the e-mail just clicks on the links, and downloads them. From their perspective, it’s hardly any different from downloading them from the e-mail, but the e-mail never contains any large file.

Good question. I just tried it and it dropped to 36.8 KB. Zipping the original non-reduced PDF results in a 61.8 KB zip file.
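Turning the sizes reported in this thread into percentages, as a quick Python sketch. The four sizes are the ones quoted above; the helper name is made up.

```python
def percent_saved(before_kb: float, after_kb: float) -> float:
    """Percentage by which a file shrank going from before_kb to after_kb."""
    return 100 * (1 - after_kb / before_kb)


# Sizes in KB as reported in the thread.
ORIGINAL_KB = 80.2
REDUCED_KB = 51.0           # after Acrobat's reduced-size save
REDUCED_ZIPPED_KB = 36.8    # the reduced PDF, zipped
ORIGINAL_ZIPPED_KB = 61.8   # the original PDF, zipped

if __name__ == "__main__":
    print(f"Acrobat reduce:       {percent_saved(ORIGINAL_KB, REDUCED_KB):.1f}%")
    print(f"Zip on reduced PDF:   {percent_saved(REDUCED_KB, REDUCED_ZIPPED_KB):.1f}%")
    print(f"Zip on original PDF:  {percent_saved(ORIGINAL_KB, ORIGINAL_ZIPPED_KB):.1f}%")
    print(f"Reduce, then zip:     {percent_saved(ORIGINAL_KB, REDUCED_ZIPPED_KB):.1f}%")
```

Both versions still shrink by roughly a quarter when zipped, which suggests (on this one small file, so take it with a grain of salt) that at least part of what is inside was not already tightly compressed.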

Just to answer this part of the question, file compression comes with a performance penalty. Uncompressed, a PDF can simply be loaded into memory and opened. Compressed, a PDF has to be loaded into memory, decompressed either to disk or to another location in memory, and then opened from there. It might not be a big performance hit, especially for smaller PDFs and/or modern, powerful computers, but it’s a performance hit nonetheless. With storage space and bandwidth as cheap as they are, a developer would probably choose to optimize end-user performance over file size.