Distorting PDFs to prevent character recognition

Is there a way to distort a PDF so that optical character recognition will make enough mistakes that using it would be useless while nonetheless keeping the images legible enough for people to read?

The reason I ask is that I have data that I want to present, but I don’t want anyone copying it. I know that that’s impossible, but I would at least like it to be difficult enough to copy that very few people would bother. That is, if the only way to copy the data would be to type it in manually, that would be good enough for me.

What I’m doing now is photocopying the PDF, and then making a photocopy of that photocopy, and so on until it’s sufficiently distorted, but that is a waste of paper and time.

So, is there something out there that can do that, ideally easily, for me?

Let me tell you a little story…

Many years ago, I bought some schematics from a “company” (probably just a guy in his basement) who reverse-engineered them. In order to protect “his” intellectual property, he xeroxed the schematics on very dark blue paper, making them impossible to copy. And, essentially impossible to read. I was so pissed off about this that I spent a great deal of time scanning them and cleaning them up in Photoshop - so that I could use them.

I never considered buying anything from them again.

So, the moral is: if you want to really tick off your customers, make your data difficult to read - but don’t expect that it will prevent them from copying it.

Yeah, that would be going too far. I don’t want to make anything difficult to read, I want to make it difficult to OCR. A copy of 11 other copies isn’t hard to read, as far as I can tell, I just want something similar without having to waste paper. Also, to address your last point, I don’t have any customers, and no one is buying anything from me.
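For what it's worth, the copy-of-a-copy effect can be imitated in software instead of on paper. Below is a toy sketch in plain Python, assuming a page is just a small grid of numbers (0.0 = white paper, 1.0 = black ink). The function name `photocopy_generation` and the parameters `blur_weight`, `noise` and `threshold` are made up for illustration; this is not a model of any real copier, and it says nothing about how a particular OCR engine would cope with the result.

```python
import random

def photocopy_generation(page, rng, blur_weight=0.5, noise=0.2, threshold=0.5):
    """One simulated copy pass over a 2D grid (0.0 = white, 1.0 = black)."""
    h, w = len(page), len(page[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Average the 3x3 neighborhood to mimic optical blur;
            # neighbors that fall outside the page are skipped.
            total, count = 0.0, 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        total += page[ny][nx]
                        count += 1
            blurred = (1 - blur_weight) * page[y][x] + blur_weight * total / count
            # Add random "toner" noise, then snap back to pure black or white.
            value = blurred + rng.uniform(-noise, noise)
            row.append(1.0 if value >= threshold else 0.0)
        out.append(row)
    return out

# Hypothetical test glyph: a crude letter T.
GLYPH = [
    ".........",
    ".#######.",
    "....#....",
    "....#....",
    "....#....",
    "....#....",
    ".........",
]
original = [[1.0 if c == "#" else 0.0 for c in line] for line in GLYPH]

rng = random.Random(42)  # fixed seed so the run is repeatable
page = original
for _ in range(11):      # "a copy of 11 other copies"
    page = photocopy_generation(page, rng)

for row in page:
    print("".join("#" if v else "." for v in row))
```

With these settings thin strokes erode a little on each pass. Lowering `noise` or the number of passes keeps more of the glyph; where the line between "annoying for OCR" and "annoying for people" falls would have to be found by trial on real pages.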

Just put a 25% - 40% screened background behind the text - that should hose any OCR software…

There is absolutely no way to publish information and not have it stolen in proportion to its value. None.

Your PDF documents should not cost more than they are worth to any new user. Price them accordingly, and you won’t see much piracy.

You know how hard it is to decipher a modern CAPTCHA. Evidently that’s how much you need to distort a document before it’s difficult for OCR to read.

I’d try some other way to obfuscate the information. E.g. if they are numerical data, just show a graph without axis labels.

If it’s hard for a computer to read, it’ll be hard for humans to read, too. The only difference is that humans are better at compensating for that. But it’ll still be really annoying at best.

Years back, computer games used a key that had to be input to get the game to run. The key varied for each start of the game. Some developers printed the keys in purple or dark red on red paper. No copier at the time could handle it.

My example is a guy who regularly published a little book of “secret” radio frequencies for the ham community. He had it printed in pale blue ink - dropout blue - which at the time was functionally invisible to all copying methods because blueline board and pale blue guidelines were used for layout. You could retype the content, but it was dozens of pages of frequency data, very dense and detail-critical.

In line with my above comment, he priced the book at a point where it wasn’t worth it to pirate, about $5, IIRC. If it had been $20, he would have sold one tenth as many copies, a net loss. But given shelling out $5 to a community member or wasting time trying to photograph or copy out pages, it was an easy choice.

To be clear: Neither of these things would work especially well against OCR. OCR does not work like a photocopier.

Essentially, OCR works by finding the edges of letters to determine shapes, and matching shapes to a pre-existing database that tells the software what letters and numbers look like. I can’t really see how using colors alone could possibly defeat that: The threshold for the software not being able to see the edges because the contrast between the letters and the background is too low is well past the point of a human no longer being able to read the text because the contrast is too low. The only way to defeat OCR is to add noise, either in the form of random lines and dots on the page, weird-shaped letters, or both. This is what a CAPTCHA does, and reading multiple pages’ worth of CAPTCHA is nobody’s idea of a fun afternoon.
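The contrast argument can be sketched with a few lines of Python. Many OCR pipelines start by binarizing the page, and a standard way to pick the cut-off is Otsu's method, implemented below from scratch. Everything else here is an assumption for illustration: `synthetic_page` invents pixel values for dark ink on a flat 40% gray tint (as a tint would look in an on-screen PDF, not a coarse halftone dot pattern), and real OCR engines use more elaborate preprocessing than this.

```python
import random

def otsu_threshold(pixels):
    """Gray level (0-255) that best splits pixels into two classes (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * n for i, n in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]            # pixels at or below t
        if w0 == 0:
            continue
        w1 = total - w0          # pixels above t
        if w1 == 0:
            break
        sum0 += t * hist[t]
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2   # between-class variance (unnormalized)
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def synthetic_page(screen=0.40, n_ink=500, n_paper=4500, seed=0):
    """Made-up data: dark ink on a flat gray tint, with slight scanner jitter."""
    rng = random.Random(seed)
    tint = round(255 * (1 - screen))     # 40% screen -> gray level 153
    ink = [rng.randint(10, 30) for _ in range(n_ink)]
    paper = [rng.randint(tint - 10, tint + 10) for _ in range(n_paper)]
    return ink, paper

ink, paper = synthetic_page()
t = otsu_threshold(ink + paper)
# A flat tint is removed cleanly: all ink lands at or below the cut, all paper above it.
print("clean split:", max(ink) <= t < min(paper))

# Speckle noise survives the cut: dark dots on the paper end up on the ink side.
rng = random.Random(1)
speckled = [rng.randint(10, 30) if rng.random() < 0.02 else p for p in paper]
t2 = otsu_threshold(ink + speckled)
false_ink = sum(1 for p in speckled if p <= t2)
print("paper pixels mistaken for ink:", false_ink)
```

In this toy setup the gray tint disappears after thresholding while the speckles are kept as if they were ink, which is the difference between lowering contrast and adding noise.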

However:

This is the solution, really. Make piracy a bad choice for simple economic/annoyance reasons and you’ll reduce it as far as it can be reduced.

But surely you could make a legible copy using a camera? Color film is designed to mimic the color response of human eyes, it should distinguish colors just as well.

You can just disable OCR from your PDF when creating it…

However, that doesn’t stop someone from printing your PDF, scanning it, and OCR-ing the scan.

But it’s the best you can do short of ticking people off with almost-unreadable/blurry text.

Unless I’m missing something, that seems to prevent your own copy of Acrobat from performing OCR prior to conversion to Word or Excel. It doesn’t prevent other people’s software from doing it, with or without conversion.

And it certainly won’t prevent people using non-Acrobat software (FineReader, etc.) to perform the OCR.

What if you go in the opposite direction? Add lots of random characters and words in the negative space. Not anything that’s going to get in the way of reading the page, but if someone tries to scan it in, the OCR software will pick up all the other stuff as well.

Good point… Okay… I tried this new thing called google and came up with this

That still only works with Acrobat Reader. Other programs usually ignore those… suggestions.

Well, that’s the thing… It’s like a security system or lock on your door. It’ll stop a lot of people, but someone determined to get into your home will still get in. You have no way of stopping someone, but the majority of people get the free Acrobat Reader and that will stop all of those folks.

If someone is determined enough, they’ll just type it out themselves. Your only choice in this matter, then, is to give your report orally, ban recording devices and then hit them with the flashy thing from Men in Black when you finish… but then what is the point of the report in the first place if people are not allowed to remember/use/regurgitate the knowledge delivered?

If you still want the document to be truly legible to the human eye, and not annoying, use a sans-serif font with a lower hook on the lowercase L. Recent versions of Acrobat Pro still have a problem with this, so it is at least a hassle to fix the OCR’d results. But you can only keep honest people honest. It costs next to nothing to outsource document transcription to third-world countries, so if someone really wants to steal your work, they’re going to do so, and any sly attempts to stop them will fail.

You’re better off establishing your copyright on the work and taking it to court in the event that someone violates your rights. Of course, that only works when it is convenient to sue and you can find the violator.

I think I saw paper like this - Anti-Copy Paper Technology | HGT - in a movie or on TV. It would prevent scanning or copying, but I suppose they could still photograph the original, and then extract the data from the picture.

Just google security paper or anti copy paper for more info.

Yes, indeed. Given my age, and some health problems, it’s becoming harder for me to read many things, and I’m beginning to strongly resent webpages & other things that are hard to read.

That was a common trick used by credit card & collections companies back then. Terms & Conditions, etc, on the back side were printed in a color called non-xeroxing blue – a shade of blue that was nearly invisible to copiers.

But a 10¢ sheet of yellow transparency defeated that. You placed the yellow sheet on top of the blue-printed document and copied it. The yellow made the blue ink stand out from the light blue background enough to get a good copy. It was once fairly common to see such a yellow sheet around the copy machine.