How does compressing a .PDF file work?

These days the cost of in-memory decompression is much less than the cost of transmitting data over a network or reading it off disk. The cost of decompressing is normally more than paid for by reducing the amount of data being read from a slow storage medium.

As for why developers make a text-like format and then compress it instead of just making a smaller format, usually it is convenience. Highly compacted data is harder to work with, and usually isn’t smaller than “simple format + gzip”.
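To make the “simple format + gzip” point concrete, here is a minimal Python sketch. The record layout and field names are invented for illustration; only the standard-library `gzip` and `json` modules are used.

```python
import gzip
import json

# Hypothetical, repetitive text-like data (field names are made up for this example).
records = [{"id": i, "name": f"form-field-{i}", "required": True} for i in range(1000)]

text = json.dumps(records).encode("utf-8")  # the "simple format"
packed = gzip.compress(text)                # simple format + gzip

# Decompressing gives back exactly the bytes that went in.
restored = gzip.decompress(packed)

print(len(text), len(packed), restored == text)
```

Repetitive, human-readable data like this tends to shrink a great deal, which is why a hand-packed binary layout often isn't worth the trouble.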

Of course the REAL answer to all this is that you shouldn’t be sending PDF forms to your clients at all, and instead host the required forms online and have them fill them in by clicking on a link in the email.

  • Kinthalis - web application developer.

:cool:

I’m troubled by a lot of the preceding discussion, as there are a number of somewhat vague and off-tangent answers in the mix. It’s been my experience that most users have no idea what a PDF file is and have absorbed a lot of weird notions from casual use, alternate software and, well, confused discussion.

PDF and AI (Adobe Illustrator native file format) and EPS (the grand-daddy of both, an Encapsulated Postscript file) are almost the same thing in most ways. A true PDF is an evolved and encapsulated Postscript file, which contains vector and font information that Reader and other tools use to reconstruct the document on the fly - just as a Postscript printer takes PS instructions and builds a page to print, Acrobat Reader takes very similar, highly congruent instructions and renders the page image.

This kind of PDF is enormously compact and even a complex form can be stored in 40-50kb, because it’s NOT a bit-image and the vector-based imaging of Postscript and its heirs is inherently very, very compact. It does put the load on the output device, be it a printer or a workstation, because it does take computational load to read those vector instructions and render the page at whatever maximum resolution the device is capable of. (As someone pointed out, though, this computation load is trivial for pretty much all computers newer than 10-12 years old, and not too bad even for old Pentium boxes.)
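A rough back-of-the-envelope sketch of why that is, in Python. It assumes a US Letter page and compares one PDF-style drawing instruction (a rectangle with one-inch margins, in points) against an uncompressed 1-bit bitmap of the same page at 300 dpi. The helper name `raster_bytes` is just for this example.

```python
def raster_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a bitmap covering the page at this resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

# Vector: "x y width height re" defines a rectangle, "S" strokes it (units: points).
vector_border = b"72 72 468 648 re S\n"

# Raster: the same page as a black-and-white bitmap at 300 dpi.
bitmap_size = raster_bytes(8.5, 11, 300, 1)

print(len(vector_border), "bytes of instructions vs", bitmap_size, "bytes of pixels")
```

The vector version also stays the same size no matter what resolution the output device renders at; the bitmap grows with the square of the dpi.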

The confusion comes in when PDF is used as a container for scanned images. In that, it’s only slightly better than Word (which is effing abysmal; nothing could make me put my head down on my desk and cry like some five-page document scanned into a 6MB Word file that would fault, break and corrupt because of gamma-ray reflections from the Moon.) Too many people, I think, see and understand PDF as a variant of JPEG or TIFF, another scanned image/document form.

When you get a PDF that’s basically a scanned image, it is a variant of JPEG or BMP or TIFF… encapsulated in a PDF framework. It’s not Acrobat’s fault that the file is gigantic and slow to open and slow to manipulate and most of the tools will not work on it because it’s not a live document, with vector-rendered layout and text. That is also an entirely secondary use of PDF - like every other document transferred using PDF, it’s probably the best way to get the material to another user or many other users and be sure that they will be able to open, view, print and maybe manipulate it, without having to rely either on a common software platform (like a compatible version of Word, say) or leave it up to the recipient to figure out which tool is best to open, say, a JPEG scan and use it as a document.

The ability to create a compact document that can be opened nearly perfectly by any other user with nothing more than Acrobat Reader is a huge boon. (PDF is also the basic file format for nearly all graphic arts file submission, which eliminates past eras of endless hassle with file compatibilities, font matching, “packaging” file sets for transfer, etc. Nearly ANY program can “print” to PDF; nearly any computer that can run Reader can open and view that file. It’s a fraggin’ miracle from the viewpoint of those of us who used to have to spend hours getting a file to a print shop and ensuring it was received, processed and printing correctly.)

PDFs compress in two ways. If the file is “pure PDF” - vector and font info only - then the various compression and file-reduction steps do things like strip out excess information, reduce font sets to only those characters used, trim images to actual displayed size, etc. This is a good thing if you want a final-final file of minimum size; further up the chain, you want to leave a lot of that “excess” info in place to allow more sophisticated print control, file manipulation, etc.
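The “reduce font sets to only those characters used” step can be pictured with a toy Python sketch. This is only the idea, not how any real PDF tool is implemented: the “font” here is just the set of printable ASCII characters, and the document text is made up.

```python
import string

# Pretend the full embedded font carries a glyph for every printable ASCII character.
full_font = set(string.printable)

# Hypothetical text of a short form.
document_text = "Name:\nDate:\nSignature:"

# Subsetting keeps only the glyphs the document actually uses.
subset = full_font & set(document_text)

print(len(full_font), "glyphs in the full font,", len(subset), "kept after subsetting")
```

It also shows why you would leave the full font in place further up the chain: once the subset is made, editing the text to use a character outside it needs the original font again.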

If the file has raster images in it, or if the whole thing is just a raster image embedded in the PDF framework, compressing it is just like compressing any image file. You can only do it by going to a very high-efficiency lossless compression algorithm, which PDF pretty much uses by default (meaning there’s very little to be gained by using additional compression), or you choose a lossy method and the image quality will degrade proportionally. It’s just like taking a picture from your camera and reducing it from 3-4MB to 140k; it’s going to lose a ton of detail and quality that can never, ever be regained from that copy.
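A small Python sketch of the lossless-versus-lossy distinction, using a made-up row of eight grayscale pixel values. The lossless half uses the standard `zlib` module (PDF's Flate compression is built on the same Deflate method); the “lossy” half is a deliberately crude downsample, not JPEG.

```python
import zlib

# A tiny invented grayscale "image": one row of 8 pixel values (0-255).
pixels = bytes([10, 12, 200, 202, 50, 54, 90, 94])

# Lossless: decompressing returns exactly the bytes that were compressed.
lossless_ok = zlib.decompress(zlib.compress(pixels)) == pixels

# Lossy: halve the resolution by averaging each neighbouring pair of pixels...
half = bytes((pixels[i] + pixels[i + 1]) // 2 for i in range(0, len(pixels), 2))

# ...then scale back up by repeating each value. The original detail is gone.
restored = bytes(v for v in half for _ in range(2))

print(lossless_ok, list(half), restored == pixels)
```

The upscaled copy has the right number of pixels but not the original values, which is the “can never, ever be regained” part.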
TL;DR version: PDF represents two completely different file formats. One is vector-based and both inherently infinite-resolution and very compact; the other is encapsulating a raster image, which is subject to all the problems of tradeoffs between image size and image resolution/quality. They are not all one thing.

True story: I worked for years with a guy who produced a highly-regarded niche journal. Because he had once, long ago, worked briefly in a newspaper print room, I couldn’t tell him anything about publications and graphics. He would laboriously lay out each edition (in WordPerfect, and this was not 2001), then print out the final version and carefully scan the pages into a PDF file, which was often 10MB or more. Now, he had a thousand reasons to keep these files small, easy to exchange, flexible and at least lightly manipulatable… none of which apply to scanned-page PDFs. Absolutely no amount of trying to explain that he could very easily “print to PDF” and produce a vastly superior result - 1/50th the size, to start with - would convince him that “PDF” worked any other, or better, way.

It does.

Well, blame Adobe for not differentiating between the two different PDF formats. I mean, I know the difference, but I’m young and hip. I am aggravated but not really surprised when my boss demands to know why he can’t copy and paste from a particular PDF document, because it’s not really his fault for not knowing.

Well, the whole point of PDF is that it’s truly trans-application, trans-platform, trans- almost everything that would prevent Person B from looking at a document Person A wants them to see without the fluidity and loss of things like HTML. (Or even the loss of fidelity from opening a .DOC file on a different system.)

There was a lot of long-lasting hate for PDF in the webslinging crowd, might still be, because it’s so contrary to the idea of “fluid content you pour into a container of your own damned choosing.” That rather fundamentalist viewpoint is mostly gone, but I had bitter arguments with an HTML purist around 2000 who absolutely fell on the floor and had seizures because I was sending him a document that had its own structure and typography and layout.

(Of course, it was a rather complex reference book, and it was a fairly high privilege for him to be seeing advance pages, but instead of just hitting “print to PDF” from within FrameMaker, I was supposed to elaborately convert them to HTML for his techno-theological convenience.)

Anyway, all the technical guts and gore are there if you choose to look, but PDF is presented to the larger world as a one-size-fits-all-needs tool (which it pretty much is), and expounding on internal differences would be contrary to Adobe’s overall efforts. I do wish more users understood how to directly create a PDF and hadn’t learned only to do so with a scanner app.

One quibble with this: Vector-based imaging is very, very compact for the type of images that it’s well-suited to. If you just want a corporate logo, for instance, yeah, that should probably be a vector image. If, however, you want a recognizable picture of a particular person’s face… well, technically, you could do that with a vector format, but it’d be absolutely huge, far larger than if you used the right tool for the job.

And there aren’t really two different kinds of PDF files, since the same file can contain both. A scientific paper, for instance, will probably have most of the page space filled by formatted text, which is very well-suited to the “pure PDF” format. It is also likely to contain some graphs and charts, which are well-suited for vector graphics. But it may well also contain photographic images, which it would store as some suitable raster format.

It’s a mistake to introduce the concept of a “pure PDF” file as used above, as it is misleading and confusing. PDF is a page-based format, and there can be any number of pages in a document. Each page can have text objects, vector graphic objects, images (using various types of encoding), various types of annotations, and some other things. A PDF with one or more of these elements can be fully compliant with the specification (ISO 32000).

There are a number of things, particularly with forms, that are unnecessary and can cause bloat. It would take some analysis to say for sure if the forms in question can be safely reduced in size. XFA-based PDF forms (created with LiveCycle Designer) are often much larger than the equivalent AcroForm (created with Acrobat), but it’s not always possible to create an equivalent.

Early versions of the PDF standard didn’t have the best compression.

You choose the compression to be compatible with reader version 2,3,4 …
I guess the only answer is “necessity is the mother of invention”… it didn’t have to have good compression in the early days…

The very simple answer to the question is that “compressing” a PDF changes any pictures or graphics from a print resolution to a screen resolution.

Often the same effect can be achieved by using the “Print to PDF” option.

A lot of the time a PDF format is used so that the layout of the document doesn’t change according to local settings on different computers, or so that it is “universally” readable - which used to be a much bigger problem than it is now given incompatibility between the nasty devil sucking fruit and the terrible whore spawned by Gates.

A duh for anyone who knows what they’re doing (or even what they’re talking about). But this conversation, and my long post, are about the teeming millions who have only the vaguest idea why there is more than one “image format.” No, vector images are not good for complex, “organic” images, not in general.

What I meant, more than “images,” was that a PDF document composed of type, graphic elements and maybe that corporate logo can be stored compactly and rendered with maximum device resolution from a “pure” PDF file. No variation of scanning in a document will produce that.

Again, duh - the difference being whether the document contains any raster elements at all. But there’s still a difference between an essentially live PDF/PS document with some embedded raster images, and a static PDF that contains nothing but a rasterized page image.

I can open a PDF in Illustrator and move around live elements - edit text, change graphics, move images, add and subtract material almost as easily as in the original creation app. While I can open a rasterized page image in Photoshop and do some limited manipulation, it’s akin to photo editing, not document editing. That’s the difference.

No.

I often work with mega-resolution photos and other images while creating a printed item, and then use careful PDF export settings to downsample (or “compress”) those images to a consistent, press-compatible resolution, typically 300dpi. Not having to pre-convert those images gives great flexibility to layout changes and final PDF export for a wide range of uses (from home-printer output to maximum-res process color printing)… without maintaining a hugely bloated file that would choke the average home computer. Or commercial RIP (the “computer” that drives a digital press or platemaker).

The effect achieved by “Print to PDF” is a broken facsimile of a real PDF. (The tool has gotten better, but it produces vastly inferior results, often with font changes, graphics munging and crude rendering of things like lines and shading.) Even a good shareware PDF maker is often better. A real PDF is made using Acrobat Pro, in an export step handled somewhat differently from “print to.”

That’s exactly what PDF is for, although the purpose varies across a wide spectrum. PDF preserves font and layout perfectly, in (usually) a compact file size, and can be read and printed on almost any computer platform in common use. At one end, this makes IRS forms easily available to anyone with a cheap laptop; at the other end, it lets pros send demandingly-spec’ed, precise, press-ready material to a printer without ten stages of “processing” in which it can get screwed up.

It’s one of the few technological things I call a godsend, with a straight face.

And it’s not just things like Windows to Mac that cause rendering differences, for other file formats. As an example, both my mom and I use Mac, and we both use OpenOffice. She recently asked me to print up a document for her that she had created on her computer. Even just using slightly different versions of the same OS and word processor, I had to edit her page numbers (which she had put in manually), because they didn’t match up to the pages on my computer. That’s the sort of thing that’s avoided by using PDF.

Yeah, I think a lot of this discussion is overthinking it a bit (though Amateur Barbarian makes particularly informed contributions, IMO). Probably what has happened here is that so-and-so scans some paper forms and saves them as PDF. With default settings for image scanning, the hardware/software is probably capturing those images at 300 or even 600 dpi and something like 2550x3300 “pixels” (the “pixel” equivalent of 8.5x11 inches at 300dpi). That’s going to give you a big-ass file. There’s NO REASON a “several page” form should be so large that Gmail won’t accept it. So probably what this “compression” is doing is down-sampling that to 72 dpi or something more appropriate for a regular document. As it was at first, it probably was a lot like having a few pages each of which had the equivalent of an 18MP or so camera image, which even with .jpg compression can be 7 or 8 MB apiece… that’s how you end up with a 4-page document weighing a massive 35 MB, when it could easily be 500kB or less.
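The arithmetic behind those numbers is easy to check. Here is a minimal Python sketch, assuming an 8.5x11 inch page scanned in 24-bit color (3 bytes per pixel) before any JPEG-style compression is applied; the function name `scan_size` is just for this example.

```python
def scan_size(width_in, height_in, dpi, bytes_per_pixel=3):
    """Return (width_px, height_px, uncompressed_bytes) for one scanned page."""
    w = round(width_in * dpi)
    h = round(height_in * dpi)
    return w, h, w * h * bytes_per_pixel

for dpi in (72, 300, 600):
    w, h, size = scan_size(8.5, 11, dpi)
    print(f"{dpi} dpi: {w}x{h} px, {w * h / 1e6:.1f} MP, {size / 1e6:.1f} MB uncompressed")
```

A letter page comes out at roughly 8.4 megapixels at 300 dpi and roughly 33.7 at 600 dpi, so the "18MP or so" camera comparison sits between the two common scanner settings. Dropping to 72 dpi cuts the pixel count to about half a megapixel per page.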

That’s more likely due to a different printer or fonts being installed, or even simply a different font resolution being selected. When 600 dpi printers came in, I recall our legal department having terrible trouble.

Which, to reiteratively reiterate, is pretty much the point of PDF in everyday usage. It makes documents utterly independent of platform, OS versions, app versions, font sets installed, etc. What the creator “prints” is what every recipient will get.

Having lost a good part of my life f*cking around trying to keep circulating documents in one readable piece across a variety of Word platforms, and get complex pieces to come out the other end of commercial printing without all but doing it by hand, and the like, I grovel and kiss PDF’s tiny little feet. :smiley:

I had forgotten about things like changing printer settings causing Word docs to explode. rubs temples

I hope your web application can guarantee the level of confidentiality that is required here.

Web designers: one hammer for all problems, and those that remain are the user’s. :smiley:

Okay, not a lot different from programmers, but most of us don’t run into (and need to use) the work of millions of amateur and incompetent programmers.

If they’re forms intended to be printed out and filled in by hand, they could be hosted online and printed from the browser.

Just want to jump in here and say that everything Amateur Barbarian is saying is very much correct and I applaud him for the detailed descriptions of what’s going on for use by a lay person. I fear we could spend too much time swapping war stories :slight_smile:

Trust me, I run into similar description problems weekly.

And I also fully concur that PDFs are a god-send. Being able to distribute the same document all over with it looking exactly the same each time is so, so very nice.

  • Celidin
    (Director of Technology for a large ad agency)