Why are scanned PDFs so much more compact now?

Every year, I do a presentation to articling students about what it’s like to be a lawyer. I usually give them a series of articles, statutes, cases, etc. to look at, which I scan on our photocopier as PDFs and then send to them by e-mail.

Typically when I scanned in an article that way, the PDF would be over 1 MB in size. But when I was preparing it this year, I scanned in a new article and it came back as about 300 KB. So I tried scanning one of the older articles from previous years, and it too came back as about 300 KB. So I did them all.

Sometimes the new scans were as much as 90% smaller than the older ones.

So what’s happened with scanner technology that’s led to these more compact PDFs?

Probably scanning in black and white vs. greyscale.
Or more aggressive JPEG compression.

You may be comparing oranges to apples. You didn’t say if scans #1 and #2 were done on the same machine, with the same settings. There is a vast difference between the best and worst settings in the PDF world. Think color/BW/grayscale, bit depths (8/16/24/48) and compression ratios.

Not a lot has been done with scanner or compression tech for decades other than to optimize or change default settings.

Even if it was the same machine, it’s likely someone in the office just changed the settings.

Acrobat’s gotten better at compressing PDFs for the web world.

As a slight tangent (hopefully quick), why is so much of the legal world still on paper instead of in electronic databases? Doesn’t that make finding case law, citations, etc. that much harder?

Federal courts (in the US) have migrated to e-filing systems over the last several years. Ninth Circuit filings require paper copies to be sent to the court, but all filing is done electronically. The only time I’ve handled paper copies in district courts is when dealing with pro se litigants, as they don’t usually have accounts to file electronic documents.

As far as legal research goes, all reported cases, and some unreported cases, are available electronically (e.g., Westlaw or Google Scholar). I can’t remember the last time I looked up caselaw in a physical paper reporter (it was probably in law school).

It isn’t. We’ve gone from having a large library with a full-time librarian to having access to online services and a few books around for decoration or because they are old but still useful editions.

So is the legal librarian becoming an endangered species as an occupation outside academia? I used to know one at a large firm decades ago and she made quite a good living (if usually a bit of a boring one, in her humble opinion). Or have the job duties just shifted a la other library-type jobs?

At least at the institution where I work (in the legal department), legal librarians are evolving into “legal knowledge management experts”. They’re still running the library, but in addition to that they run the internal document management and archiving systems, provide advice on how to research articles and case law in databases, etc.

Thanks for the comments, all. Yes, same machine. So someone altered the settings, it sounds like?

What is the difference between grey scale and black and white?

Grey scale records 8 (or more) bits/pixel while B&W is just one bit, for a considerable savings in file size. The advantage of greyscale is it can record shading, where B&W is on/off, so good for text, but not good for photographs.

Without getting into too much technical detail: for a bunch of sample dots taken from your document, grayscale encodes the shade at each spot as a value between black and white on a particular scale. The scale can vary but is often 0-255: 0 = perfectly white, 255 = completely black. So a dot halfway between black and white is given the value 127.

Black and white uses just a single bit of information: Is this spot black? If not, then it’s white. In a black and white document areas can still appear gray to the eye. This is done with stippling.

So, before taking into account any compression, the raw representation of a 0-255 (8-bit) grayscale image is eight times larger than a black and white representation at the same resolution.
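To put numbers on that eight-to-one figure, here is a rough sketch of the arithmetic. The page size (US Letter, 8.5 x 11 inches) and the 300 dpi resolution are assumptions picked for illustration, not anything stated about the scanner in question, and raw_size_bytes is just a made-up helper name.

```python
# Raw (uncompressed) size of one scanned page at two bit depths.
# Assumed for illustration: US Letter page (8.5 x 11 in) scanned at 300 dpi.

def raw_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the uncompressed image size in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return total_bits / 8

gray = raw_size_bytes(8.5, 11, 300, 8)  # 8-bit grayscale
bw = raw_size_bytes(8.5, 11, 300, 1)    # 1-bit black and white

print(f"grayscale:       {gray:,.0f} bytes")
print(f"black and white: {bw:,.0f} bytes")
print(f"ratio:           {gray / bw:.0f}x")
```

Compression changes the final file sizes a lot, so this only shows the starting point each mode hands to the compressor.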

From a bit of a distance they look the same. Zooming in closer, though, black & white is just that: each scanned pixel is either all black or all white (bitmap), which means even large scans can be fairly small in file size. With grayscale each pixel is any number of shades of gray. While it’s not as much information as color pixels, it’s a lot more than black & white.

The advantage of black & white over grayscale is that you can scan at a higher resolution to get really crisp edges on text and line work while still getting nearly photographic reproduction of photos via dithering, just not in color, and all at a fairly small file size. These reprint much better too, especially if the resolution of the scan matches or is a multiple of the resolution of the printer. In grayscale or color scans you tend to get some fuzziness on what should be sharp edges when you print them because those edges usually go black-gray-white across a couple of pixels rather than black-white, so the printer tries to halftone those gray pixels and it just makes it a bit muddier.

The advantage to grayscale is that you can see more information at a lower resolution, it just might not be as crisp. In most photo viewing apps, a black & white image and a grayscale image will look pretty much the same when zoomed out because of interpolation. Zooming in on the grayscale image will just make everything bigger. Zooming in on the black & white image however will at some point “snap” from the interpolated view that renders as grays to the single-pixel black & white view, which tends to look kind of gross and hard to read.
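The zoomed-out effect can be sketched in a few lines. This assumes plain block averaging as a stand-in for whatever interpolation a real viewer uses, and the function names are made up for the example.

```python
# Why a 1-bit image can look gray when zoomed out: the viewer blends
# neighboring pixels. Block averaging is an assumed stand-in for the
# interpolation a real viewer would apply.

def checkerboard(size):
    """1-bit image (0 = black, 1 = white) alternating every pixel."""
    return [[(x + y) % 2 for x in range(size)] for y in range(size)]

def downsample(image, factor):
    """Shrink by averaging each factor x factor block into one 0-255 value."""
    size = len(image)
    out = []
    for by in range(0, size, factor):
        row = []
        for bx in range(0, size, factor):
            block = [image[y][x]
                     for y in range(by, by + factor)
                     for x in range(bx, bx + factor)]
            row.append(255 * sum(block) // len(block))
        out.append(row)
    return out

pattern = checkerboard(8)
zoomed_out = downsample(pattern, 4)
print(zoomed_out)  # [[127, 127], [127, 127]]
```

Every pixel in the source is pure black or pure white, but each averaged block comes out mid-gray. At a factor of 1 (fully zoomed in) the same function returns only 0s and 255s, which is the "snap" described above.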

Here’s a comparison of what they look like zoomed out vs. zoomed in: http://jjakucyk.com/straightdope/color_gray_bw.jpg

Too late to edit, but wanted to note that I got the commonly used representation of black and white backwards. Need more coffee.

Should be 0 = completely black, 255 = perfectly white.

Only if dithering/halftoning isn’t used. If 50% threshold sampling is used, then yes it’s going to look like a Rorschach test or some horrible early 1980s computer image where anything that’s darker than 50% will be black and anything lighter than 50% will be white. A dithered black & white bitmap image does need a higher resolution to look better than grayscale (say 600 dpi vs 200 dpi), but newspapers have been printing photographs since the 1880s with bitmap technology.
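Here is a small sketch of the difference between the two conversions. It uses ordered dithering with a 4x4 Bayer matrix as one example of dithering; that choice is an assumption made for simplicity, and a real scanner may well use a different method such as error diffusion or a halftone screen.

```python
# Converting grayscale (0 = black, 255 = white) to 1-bit two ways.
# Ordered dithering with a 4x4 Bayer matrix is one example of dithering,
# chosen for simplicity; it is not claimed to be what any scanner uses.

BAYER_4X4 = [
    [0, 8, 2, 10],
    [12, 4, 14, 6],
    [3, 11, 1, 9],
    [15, 7, 13, 5],
]

def threshold(image, cutoff=128):
    """Pixels at or above the cutoff become white (1), the rest black (0)."""
    return [[1 if p >= cutoff else 0 for p in row] for row in image]

def ordered_dither(image):
    """Compare each pixel to a position-dependent cutoff from the matrix."""
    out = []
    for y, row in enumerate(image):
        out_row = []
        for x, p in enumerate(row):
            cutoff = (BAYER_4X4[y % 4][x % 4] + 0.5) * 16  # 8, 24, ..., 248
            out_row.append(1 if p >= cutoff else 0)
        out.append(out_row)
    return out

def white_fraction(bits):
    """Share of pixels that are white in a 1-bit image."""
    flat = [b for row in bits for b in row]
    return sum(flat) / len(flat)

# A flat patch of light gray, about three quarters of the way to white.
patch = [[192] * 8 for _ in range(8)]

print(white_fraction(threshold(patch)))       # 1.0: the shading is lost
print(white_fraction(ordered_dither(patch)))  # 0.75: a mix of dots
```

With the plain threshold the light gray patch turns solid white. With dithering, three quarters of the dots are white and one quarter black, so from a distance it still reads as light gray.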

A halftone image is not a photograph (at least, not in my book).

Most likely better compression. PDF used to not support as many different compression types. Lower resolution is also a possibility, but I assume you kept the DPI settings the same.

The only way we could tell you for sure is if you were able to give us both an old scan and a newer one, so we could compare. Are any of your documents things you can share with the public?

Compression can get better. What I mean is that over the years and decades they have introduced many improvements to the compression.

So just seeing that the PDF is compressed doesn’t mean that it is using the best (latest) compression technique.

Perhaps your software OCRed the text and is sending the text of the document (which is small) and not just a bitmap image (which is large.)
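For a sense of scale, here is a back-of-the-envelope comparison. The 3,000 characters per page, the US Letter page size and the 300 dpi resolution are all assumptions for illustration, and the bitmap figure is before any compression.

```python
# Rough size comparison: a page stored as text vs. as a 1-bit bitmap.
# Assumptions: about 3,000 characters on the page, and a US Letter page
# (8.5 x 11 in) scanned at 300 dpi with no compression.

page_text = "x" * 3000                       # stand-in for OCRed text
text_bytes = len(page_text.encode("utf-8"))  # 1 byte per ASCII character

bitmap_bytes = (2550 * 3300) // 8            # 1 bit per pixel at 300 dpi

print(text_bytes, bitmap_bytes, bitmap_bytes // text_bytes)
```

Under those assumptions the raw bitmap is a few hundred times larger than the text, although a compressed 1-bit scan would narrow that gap considerably.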

How many pages and how physically large are the pages?