Could use some help with fixing PDF files . . .

I know exactly zilch about PDF file formats, and I don’t have access (that I know of) to any apps for generating or updating PDFs. But I have some PDF files that need some fixing. Maybe some Dopers here could answer a few questions?

(1) I have 4 files. They are multi-page scanned images of tax returns for a non-profit organization (local humane society), that we want to publish on our web site. All 4 files are 12 to 16 scanned pages. Yet one of the files is 86MB in size, while the other 3 files are about 1MB in size.

What might explain this difference? Can PDF files exist in both compressed and non-compressed formats? Can I compress that 86MB file somehow to make it more like 1MB like the other 3 files?

(2) Two of the files have ALL their pages scanned upside down. We have Adobe Viewer here, and it is trivially easy to rotate all the pages to be right-side up. Just right-click, and the menu has a “Rotate Pages” choice right there, and it does ALL the pages at once! Problem is, with Adobe Viewer, it is not possible to save the fixed file. I need a way to fix this and save the fix.

(3) I have an Ubuntu Linux machine at home. I’ve never really explored its document preparation apps. Is there a fairly standard Linux app that comes with most Linux systems, that I would be likely to have? I think I’ve got some or all of the Open Office suite. Is there something there I could use?

Check out Foxit PDF reader. It has a few features Adobe doesn’t, including a save option, and it’s super small compared to Adobe as well.

I believe you need Adobe pro to edit and save PDF files… unfortunately it isn’t free.

Just did a search for PDF in the Linux Mint (Ubuntu derivative) Software Manager, and pdfshuffler looks like it would do some of what you need:

Not sure if it can reduce the size of the one file, though.

Also, I tried opening a couple PDFs in Libre Office, but they looked like garbage.

It can depend on a lot of things. You can almost certainly reduce the size of that large file without much if any noticeable change in quality, but you will need to use a program capable of editing PDFs. Acrobat is able to optimize a scanned document in a number of ways and an evaluation version for Windows is available as a free download. None of the features are disabled, so you can test it out to see how much it can reduce the size. It can rotate individual pages and they will stay rotated after the document is saved.

Foxit Reader won’t help with any of your issues, and Adobe Reader 11 is now able to save non-enabled documents after they have been modified, e.g., filled-in forms, comments/markups, etc.

The free version of PDF-XChange Viewer will solve your second problem very easily, but you have to use the “Rotate pages” function to be found in the Documents menu, not the “Rotate view” function on the toolbar. Do it the first way and you can save the rotated pages. I haven’t looked at Foxit for a long time, but I believe the consensus is that PDF-XChange is better, and it certainly has a lot of very useful features that the free Adobe Reader does not. I do not know if it will run under Linux, though.

It does seem very odd that one of your documents should be so much larger than the others if they were all scanned the same way. Could it be that some were scanned with OCR and others just as graphics? Or maybe the big one was scanned in color (even if all the colors pretty much look like black or grey) and the others in monochrome, or perhaps the big one was scanned at an unnecessarily high resolution.

If OCR is the issue, you might be able to fix it with appropriate settings on the PDF-XChange OCR tool (but I am not holding my breath for that one).

Thanks for the suggestions so far, everybody. Not sure when I can try these things, but I expect I’ll get onto it sometime within a week. Sounds like maybe my best bet is to download that evaluation version that Dickerman suggested and try that?

I know at least one of the docs is in color, because I saw one of the pages had a rubber-stamped “Copy” on it, in red. But I didn’t notice which doc that was (or if it was just one of them).

If it turns out that the 86MB doc is in color and the others aren’t, is there an option somewhere to convert a color doc to black-and-white?

There’s an added weirdness too now: Our current treasurer went and re-scanned ALL of the documents from scratch, and those are all right-side-up. But most of them are several scanned pages shorter than the earlier versions (earlier were mostly 16 pages, these are 11 pages) so I’m wondering what pages got left out. AND FURTHERMORE, these docs are all about 10MB, whereas most of the others were about 1MB. Is that caused by differing color or resolution options? These 10MB files are really too big for me to stick on our web site.

there’s a lot of variables in a scanned image size.

resolution will increase the file size by a factor of 4 for every doubling of dpi. 150dpi is a basic minimum for scanning text, 300dpi a general standard for images.

color contains 3-4 times the info of a greyscale file.
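To put rough numbers on the two rules of thumb above, here is a minimal Python sketch of the uncompressed size of one scanned page. The US Letter page size and the bit depths are assumptions for illustration, and real PDFs store compressed images, so actual files are usually much smaller than this.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumed page size: US Letter, 8.5 x 11 inches.
LETTER = (8.5, 11.0)

bw_300 = raw_scan_bytes(*LETTER, dpi=300, bits_per_pixel=1)      # 1-bit black and white
gray_150 = raw_scan_bytes(*LETTER, dpi=150, bits_per_pixel=8)    # 8-bit greyscale
gray_300 = raw_scan_bytes(*LETTER, dpi=300, bits_per_pixel=8)
color_300 = raw_scan_bytes(*LETTER, dpi=300, bits_per_pixel=24)  # 24-bit RGB

for label, size in [("b/w 300dpi", bw_300), ("grey 150dpi", gray_150),
                    ("grey 300dpi", gray_300), ("color 300dpi", color_300)]:
    print(f"{label}: {size / 1_000_000:.1f} MB per page before compression")
```

Under these assumptions, doubling the dpi gives exactly 4 times the pixels, and 24-bit color takes 3 times the bytes of 8-bit greyscale (32-bit CMYK would be 4 times).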

lossy file formats like jpg can get much smaller than non-lossy formats like tif or gif. the tradeoff is image quality if you compress the file too much.

tl, dr: resave your 10MB files as jpg (medium to high quality), maybe 150dpi if just text.

if saving to a pdf, first do the above then combine into a pdf. pdfs can contain scanned images or live text that is editable, scanned image sizes will depend on what you import into the pdf.

so simplest step if you have the documents and a scanner might be to rescan them as 150dpi jpgs, and just post the jpgs on the site or combine them into a pdf.

if you have any graphics programs resave the larger files as 150dpi jpgs.

otherwise to modify pdfs you need Acrobat Pro, in the settings you can specify maximum dpi for the files and Acrobat will compress the files when you save them. you can also rotate files and save them that way.

if none of this makes sense, get someone who does graphics to help.

Thanks, ed anger for the suggestions – but, these are the things I mostly can’t do. I don’t have the source documents, nor a scanner, nor the software. I was just handed the PDF files, and that’s how I got them. Now I’m hoping I can do some improvements, working with the files I’ve got.

Somebody else has the source documents and already re-scanned them, with results that I consider even worse. The files that were about 1MB are now about 10MB (and seem to have a lot fewer pages; I don’t know what got left out). And the person who did that doesn’t seem to be helpful – all she knows is “Well I re-scanned them and that’s what I got.” Can’t argue with that. And I know zilch about the matter too, for all the good that does.

ETA: But the things you’re telling me about colors, grey scales, d.p.i. and resolution, etc., could still be helpful. Maybe I can talk TPTB into re-scanning them (yet again), and tell them what kind of parameters to pay attention to next time.

If you get the evaluation version of Acrobat 11, there is a feature called Preflight (Tools > Print Production > Preflight), which includes a fixup that will convert to grayscale (Profiles > PDF fixups > Convert to grayscale).

JPEG is not the ideal compression method for images of text. Acrobat has a specific feature for optimizing scanned documents: Tools > Document Processing > Optimize Scanned PDF

There are a number of options you can try here. Which one you choose depends a bit on the document you’re dealing with, so give several a try and see how it goes.

If the “documents” are merely images assembled into PDFs (which happens far too often), then the easiest way to edit/fix them is to extract the images and manipulate the images. (Flip, recompress, etc.)
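For anyone curious what "extracting the images" can look like, here is a rough Python sketch that pulls JPEG data out of a PDF's raw bytes. It assumes the scans are stored as plain JPEG streams (the DCTDecode filter) with no second filter layered on top; images stored with other filters would need real decoding, which this does not attempt. The function name is made up for illustration.

```python
import re

# A PDF stream sits between the keywords "stream" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)
JPEG_SOI = b"\xff\xd8"  # every JPEG file starts with these two bytes


def extract_jpeg_streams(pdf_bytes):
    """Return the raw bytes of every stream that looks like a JPEG."""
    images = []
    for match in STREAM_RE.finditer(pdf_bytes):
        body = match.group(1)
        if body.startswith(JPEG_SOI):
            images.append(body)
    return images


def save_jpegs(pdf_path, prefix="page"):
    """Write each JPEG found in pdf_path to prefix-N.jpg and return the count."""
    with open(pdf_path, "rb") as handle:
        images = extract_jpeg_streams(handle.read())
    for number, data in enumerate(images, start=1):
        with open(f"{prefix}-{number}.jpg", "wb") as out:
            out.write(data)
    return len(images)
```

This is a blunt pattern match rather than a real PDF parser, so treat an empty result as "not stored as plain JPEG" rather than "no images".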

There are several programs to extract images from PDFs. I use one from Somepdf.

You could try PrimoPDF. It is a software that allows you to print anything into a pdf file. It also gives you some settings about picture quality and resolution. Once you install it, you can open the large PDF in whichever software you use and “print it” in lower quality.
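A similar "re-print it at lower quality" approach may be possible on the Linux box with Ghostscript, if it is installed (the `gs` command). The Python sketch below only builds and runs the command line; the file names are placeholders, the helper names are made up, and the `/ebook` preset (which downsamples images to roughly 150dpi) is just one starting point to experiment with.

```python
import shutil
import subprocess


def ghostscript_shrink_command(src, dst, preset="/ebook", grayscale=False):
    """Build a Ghostscript command that re-writes src as a smaller PDF at dst."""
    cmd = [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS={preset}",  # /screen, /ebook, /printer or /prepress
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
    ]
    if grayscale:
        # Ask pdfwrite to convert color content to grey.
        cmd += ["-sColorConversionStrategy=Gray", "-dProcessColorModel=/DeviceGray"]
    cmd += [f"-sOutputFile={dst}", src]
    return cmd


def shrink_pdf(src, dst, **options):
    """Run Ghostscript; always write to a new file so the original is kept."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) was not found on PATH")
    subprocess.run(ghostscript_shrink_command(src, dst, **options), check=True)
```

Something like `shrink_pdf("big_scan.pdf", "big_scan_small.pdf", grayscale=True)` would be the call to try, then compare the size and legibility of the result against the original.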

Graphic artist of over 20 years here. PDF or Portable Document Format was designed using a language called Postscript. Postscript uses textual and mathematical formulas (code) to “tell” the screen or your printer (actually created for digital pre-press printing) what to show or do with the PDF. The Postscript method of file creation enables fully scalable images and fonts. They can be re-sized either proportionally: X and Y “stretched” the same amount. Or disproportionately: one side stretched more than the other. They can also be made to print to a printing device’s highest resolution.

The other method of rendering an image is called a bitmap. A bitmap image is a bunch of dots, like the name sounds. The dots can be black and white, gray, or a combination of Red, Green, and Blue (RGB) or Cyan, Magenta, Yellow, and Black (CMYK). RGB is generally used for monitors or projectors and CMYK is used for printing. But unlike Postscript, the dots are just, well, dots and they take up the space they take up and stretching or shrinking them disproportionately makes them look stretched or squashed.

Now again in layman’s terms: A scanned image is a bitmap. The higher the resolution, the more dots and the bigger the file size. Those dots take up visual and virtual space.

A PDF is a Postscript image and it is made of instruction code (Words not dots). Words take up about the same amount of space no matter what the resolution. What folks don’t realize is that there is more than one flavor of PDF. The larger file size comes from MORE words. These words don’t just say stuff like use these colors or this font - they say stuff like here’s a bit that enables you to open this PDF and edit it using other Postscript software like Adobe Illustrator. Or words like - we know you want to use this file on a four color press for offset printing so we are going to add lots of instructions on how to use it there as well along with lots of details about resolution, screen angle, blah blah blah. This is not the kind of file you need. You aren’t going to edit it in Acrobat Pro or Illustrator or Photoshop. You might want to print it on your office printer, not a huge 4 color web press.

You need the smallest file size flavor. The simple “snapshot” of the pages. Unfortunately if you have the huge HiRes print, lots of code, super detailed, more than you’ll ever use version, you need a program like Adobe Acrobat Pro to cut through all of the unneeded words and keep just the ones that say - make me a small manageable file that I can email or pop on my website. No amount of free file shuffling, page adding, and rotating kind of PDF software can reduce the file size of a HiRes PDF to a more manageable file size. That’s the bad news.

The good news is that there are places that have the expensive Pro software and will do file conversions for you at a reasonable price. I don’t know exactly where they are because I have my own $900 copy of Adobe Acrobat Pro for Mac and PC, nanny boo boo. But I’m willing to bet places like Kinko’s, Office Depot, Staples and even some local graphics shops (Google - graphic art or digital printing) can take your files, open them and save them down to a manageable size for a very reasonable price (certainly less than the $900 I got stuck with). They can even save them as GIF or JPG which are the preferred file formats for the web.

But wait there’s more: There are programs like MS Word for Mac or Windows and Open Office for Mac or your Linux system that can save files directly to PDF. You can even set the quality of the PDF file. Tip: The higher the quality, the larger the file size. Go for 60% and see how it looks and what size file it creates. Fudge up or down until you find the right balance. So, if your client created these documents in an MS Office (widely used) program like Word or Excel, you can open them in MS Office or Oracle Open Office and save them as the PDF you want. OR have the client save them as a smaller file size PDF.

Hope this helps and wasn’t too boring. My employees will confirm I can be a sedative sometimes. :)

Thanks again, all you additional people who added comments since yesterday. Especially DavidPeab for your lengthy treatise. Is a PostScript PDF just a lengthy plain-text file (behind the scenes) that I can open with any plain-old plain-text editor and look at? I never even thunk to try that. Other than that, is there a way I can examine an existing file and see its properties, like what level of color scales or what resolution it has?

With my files, I figure they must be bitmap images in whatever format PDF files encode that. They are scanned multi-page IRS tax forms, complete with all the lines and boxes and tiny print that tax forms have, plus pages of attachments with tables of numbers in various formats, and some handwritten signatures here and there, plus the random haze of little black dots scattered all over that you tend to see in scanned images. (ETA: And, all the numbers in the forms and various other bits are handwritten.) So I’m thinking they must be bitmap images. At least one of them has a rubber stamped word “Copy” on it in red.
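One rough way to check that guess without any PDF software is to scan the file's raw bytes for the names a PDF uses to describe its images. The Python sketch below counts compression filter and color space names and lists image widths. It assumes those names appear directly in the file (they can be hidden behind indirect references), and the dpi estimate assumes 8.5 inch wide pages, so treat the output as a hint only. The function names are made up for illustration.

```python
import re
from collections import Counter

# Names of common PDF stream filters and device color spaces.
NAME_RE = re.compile(
    rb"/(DCTDecode|JPXDecode|CCITTFaxDecode|JBIG2Decode|FlateDecode|"
    rb"LZWDecode|RunLengthDecode|DeviceRGB|DeviceGray|DeviceCMYK)\b"
)
WIDTH_RE = re.compile(rb"/Width\s+(\d+)")


def summarize_pdf_names(pdf_bytes):
    """Count how often each filter or color space name occurs."""
    return Counter(m.group(1).decode("ascii") for m in NAME_RE.finditer(pdf_bytes))


def image_widths(pdf_bytes):
    """Pixel widths recorded in image dictionaries."""
    return [int(m.group(1)) for m in WIDTH_RE.finditer(pdf_bytes)]


def estimated_dpi(width_px, page_width_in=8.5):
    """Scan resolution implied by a pixel width, assuming a Letter-width page."""
    return round(width_px / page_width_in)


def describe_pdf(path):
    """Print a quick summary of one file."""
    with open(path, "rb") as handle:
        data = handle.read()
    print(dict(summarize_pdf_names(data)))
    print([estimated_dpi(width) for width in image_widths(data)])
```

Lots of DeviceRGB next to large widths would point toward a high-resolution color scan; CCITTFaxDecode or JBIG2Decode would point toward pure black-and-white pages.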

Somebody suggested to me that I could just make a ZIP file of it to compress it and any PDF viewer would be able to read it directly in that format. So I’m going to try that too for the huge file.

I think I’ll try going with some of the software suggestions mentioned by several people. I have Linux and Winders XP mochines here to play with. I’m hoping I can open the existing files, fiddle with them in various useful ways, and save the resulting files.

Of course, I’ve got this thread bookmarked for future reference. I imagine I’ll get around to doing some of this sometime in the next week or so.

The Postscript (Postscribble is what we call it sometimes) is like any other computer language, like Linux, Java, C++, etc. So, it can’t be read as plain text. Actually it could, but would look like gobbledy gook.

I’m not sure a PDF reader can decompress a ZIP file and read it. When you send the file to someone, their operating system - Windows or OS X - will ask if they want it unzipped and where to store the extracted file(s). Then those could be read by a PDF reader. So, that’s a way to move those large files.
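How much a ZIP would actually shrink the file depends on whether the images inside the PDF are already compressed, which they usually are. A small Python sketch of the general effect, using zlib (the same DEFLATE method ZIP normally uses): the random bytes are only a stand-in for already-compressed image data, which is an assumption for illustration, not a measurement of the real file.

```python
import random
import zlib


def compression_ratio(data):
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, 9)) / len(data)


# Highly repetitive text squeezes down to a tiny fraction of its size.
redundant = b"Form 990, Part I, line 12: total revenue\n" * 5000

# Stand-in for already-compressed image data: bytes with no pattern left
# for the compressor to exploit.
rng = random.Random(0)
already_compressed_like = bytes(rng.getrandbits(8) for _ in range(200_000))

print(f"repetitive text: {compression_ratio(redundant):.3f}")
print(f"compressed-like: {compression_ratio(already_compressed_like):.3f}")
```

If zipping the 86MB file does shrink it a lot, that would itself be a clue that its images were stored with little or no compression.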

Since you have Windows, like someone else here mentioned, you can go to Adobe.com and download a 30 day free trial of Acrobat X Pro. If you open the files in that program you will have the option to “save as” and “Reduced file size” or for more control “Optimized PDF” under the “File” menu. I suggest you save the files with a new name like blah_blah_small.pdf so you don’t accidentally overwrite the originals with a screwed-up version while messing around. BTW, when the 30 days is up you can uninstall the program with no obligation or activate it from within the Acrobat program.

DavidPeab’s posts are unfortunately quite confusing about what is going on really inside a PDF.

First of all, PostScript is more like the spiritual ancestor to PDF. Some of the encoding system was carried over, but a lot was omitted. Especially the full blown interpretive language part has been scaled way, way back. (The hardware requirements for PostScript printers were quite high back in the day.) But even more significantly, a huge amount has been added that is not at all PostScript related. There is really no point in thinking about PostScript in terms of understanding PDFs. Better to just think of the format in its own terms.

Adobe has a page explaining their view on this, but it’s rather technical. Note how they refer to a PDF page not containing a PostScript page so much as the output of a partial rendering of a PostScript page.

If you go way back or already know a lot about PostScript, that can help you understand a small part about PDFs. But if you never heard of PostScript, then there’s no point in learning much about it since most of it will be useless in working with PDFs.

There are several filters for embedding images in PDFs. Depending on the nature of the image, some are better than others. For a high quality natural image, one might use the jpeg-based filter. But, for example, if the run-length encoding one is used instead, that file will be huge. OTOH, run-length encoding is great for scans of simple line drawings.
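To illustrate why the choice of filter matters so much, here is a toy run-length encoder in Python. It is a simplified sketch of the idea, not the exact byte format PDF's run-length filter uses, and the two pixel rows are made-up examples.

```python
from itertools import groupby


def rle_encode(pixels):
    """Collapse a row of pixel values into (run_length, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(pixels)]


def rle_decode(runs):
    """Expand (run_length, value) pairs back into the original row."""
    return [value for count, value in runs for _ in range(count)]


# A row from a simple line drawing: white, a thin black line, white again.
line_row = [255] * 1000 + [0] * 4 + [255] * 1000

# A row where every pixel differs from its neighbor, as in a noisy photo.
noisy_row = [(i % 7) * 30 for i in range(2004)]

print(len(line_row), "pixels ->", len(rle_encode(line_row)), "runs")
print(len(noisy_row), "pixels ->", len(rle_encode(noisy_row)), "runs")
```

The line-drawing row collapses to a handful of runs, while the noisy row needs one run per pixel, so storing it this way costs more than leaving it alone.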

ftg: You are correct and thanks for the link to the Adobe page. After a good night’s sleep I realize my mistake in comparing postscript graphics to bitmap graphics. My long-winded treatise should have compared VECTOR graphics to Bitmaps. Still, probably not too helpful to the original poster’s question.

I have to admit that I first started doing computer graphics at NASA’s Goddard Space Flight Center in MD, using huge tape drives in 1982. Then moved on to the Mac (1st gen pretty useless for real-world applications), then to PCs: 8088 to Pentium. Now Win 7 Pro and OS X Lion on a HUGE Mac Pro. Somewhere in there I moved from knowing lots and doing lots to knowing less and telling folks what to do. We call these people art directors. ;) I get to make more creative decisions which is cool. But I am not on the cutting edge of the technology and need to be reminded of that sometimes.

So, now that PS issues have been clarified, can the original poster still use some of the advice I gave about how to solve their PDF dilemma? Or are we going to hi-jack their thread, like most blogs, and make it into a technical forum about the history and composition of computer graphics?

Start a new thread and I’ll meet you there, because this stuff is really cool to me, but not sure how much folks searching for help with PDF usage will get out of our “help.” That’s the firm guiding hand of the art director part of me coming out. No harm intended.

yes, if he can DL a trial of Acrobat Pro, he can resave the files at different settings and see which ones work for the quality level and size he wants:

DavidPeab, ed anger, and ftg, be my guest to “re-purpose” this thread to talk about technicalities of PDF file formats, PostScript, the Universe, and Everything. I think I’ve got enough answers up-thread to work with for a while, and the tech talk is interesting too.

While the details for PDF and graphics formats may be over my head, I’m generally computer-technically literate. I was a Unix sysadmin from 1984 to 1987 at a hi-tech Silicon Valley company – specifically, a company that built laser printer systems (in the day when laser printers were really newfangled impressive whizbangs!), and I also did technical customer support.

Remember, this was exactly the era when PostScript was invented and started to be used for driving laser printers! When this began to get big, PostScript (and companies using it) was our company’s biggest threat! We responded by developing our own super-duper uber-PostScript-like language. But the marketplace was way ahead of us. More seriously, my company’s apparent intention was to keep our super-PostScript proprietary, and it was going to be so fantabulous that the whole world was going to beat a path to our door to buy our printers!

Didn’t happen. Adobe took the world by storm, and the company had an internal upper-management revolution, whereupon the company basically fell apart and self-destructed.