Hidden source web pages

I know I have visited some websites where I was not allowed to copy-paste the page text into a word processor. I also looked at the source of the webpage and there was very little there; it seemed like no HTML at all. How do they do that?

Also, is it possible to not allow a page to be printed?

Thanks gurus.

I think what they might do is make a JPEG of their page, and then publish the page that way. I’m trying to find a good example; all I have come up with so far is this:

http://www.azlyrics.com/

Notice the “Warning - Spyware Notice”, you will see that you can’t copy the text from that part of the page.

I’m not seeing it.

Well, the problem is they are rotating which JPEG they display on that web page. But I guess my point is, if the text is “in” a JPEG you can’t cut and paste it from your browser to a word processor.

The reason I am looking into this is for a legal application. If someone can cut and paste the text into a word processor, and then maliciously change the text before they print it out, well you can see the problem. If we make the text a jpeg, all they will be able to do is print the screen, I think.

There are a few ways you can do this with JavaScript, etc.

One way that was around the traps about a year ago was making the source actually have about 4 pages of blank lines and then show your source.

Have a look; I think you can find it at javafile.com

Ah, you’re talking about that ad spot up on the right side there. Some of that seems to be static images (either JPG or GIF) and some of it is Macromedia Flash. Well, you’re right about the fact that you can’t just copy and paste JPEGified text, but it would still be possible to use an image editor like PSP or Photoshop and do pretty much whatever you want with it, as long as you can either match the font, or cut and paste characters within the image itself. It’s really difficult to put content on the internet that’s not falsifiable, to some extent or other.

If someone is going to go through the trouble of altering legal text and then print it, they’ll have no problem editing an image in something like Photoshop and printing the edited image.

I don’t have enough information to know if this would work or not, but one possibility might be to use PGP to sign any text that you don’t want altered, and include the resultant hash at the end of the text. That way, if someone tries to alter the text without access to the key used to sign it, the hash won’t match the contents, and the text can be shown to not be original.

That should say:

I don’t have enough information about your specific case to know if this would work for you or not…

Is it possible to “lock” a jpeg? That is, make an image that can’t be altered by image processing tools?

No. If the user can see an image, they can copy it and edit it.

Now, there is the possibility of embedding a digital watermark that would be destroyed if the image were altered, but if you’re just imagizing text, it’d be easier to just digitally sign the text to achieve the same effect.

OK, now I’m really going to show my lack of understanding of Java applets.

If you create a Java applet and make reference to that applet (stored on your server) from the web page loaded on the client’s PC, doesn’t that cause a “munched” file that used to be Java code to be loaded down and executed on the client? So, what I am getting at here is, maybe I could embed the text into a Java applet, then reference that applet in the web page, thereby essentially “hiding” the text from the user? Also, the Java applet would create “image” text rather than plain old ASCII characters.

It all depends on what you’re trying to achieve. Do you want the user to be unable to make any sort of copy of the text? If so, then you’re pretty much SOL. If the user can see it, they can copy it, one way or another. Even if it means re-typing the text into a text editor by hand.

Do you want the user to be unable to copy, alter, and redistribute the text? Still SOL. See above, and add “If they can copy it, they can modify it”.

Do you want the user to be unable to copy, alter, and redistribute the text, and attempt to pass off their altered copy as being legitimate? THAT, you can do. PGP encryption also gives you the ability to digitally sign text using your private key. Such a signature can be verified as legit by anyone with your public key, but, lacking access to the private key, it cannot be altered without causing the signature to become invalid, thus marking it as altered.

But, since I’m still not sure exactly what you want to prevent, I don’t know if that’s of any use to you or not.
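For what it’s worth, here is a rough sketch of the sign-and-verify idea in Python (3.8 or later). It is not PGP: it is a toy RSA-style signature over a SHA-256 digest, using a tiny fixed key built from two well-known Mersenne primes, and every name in it (sign, verify, the sample sentence) is made up for illustration. A real setup would use an actual PGP tool with properly generated keys. The point is only that a changed text no longer matches a signature made with the private key.

```python
import hashlib

# Toy key, for illustration only. Real keys are randomly generated and
# far larger; these two primes are public knowledge, so this is insecure.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                            # public modulus
E = 65537                            # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (would be kept secret)


def digest(text: str) -> int:
    """Hash the text and reduce it into the range of the toy modulus."""
    raw = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(raw, "big") % N


def sign(text: str) -> int:
    """Producing this value requires the private exponent D."""
    return pow(digest(text), D, N)


def verify(text: str, signature: int) -> bool:
    """Checking needs only the public pair (N, E)."""
    return pow(signature, E, N) == digest(text)


original = "The tenant shall pay 500 dollars per month."
sig = sign(original)
tampered = original.replace("500", "5")

print(verify(original, sig))   # signature matches the untouched text
print(verify(tampered, sig))   # altered text fails the check
```

Note that this does nothing to stop copying or editing. It only lets anyone holding the public key show that an edited copy is not the original.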

OK, let’s just say I don’t want them to be able to cut-and-paste the text from the website, or by examining the source of the web page. Will the Java applet idea work? Or is there an easier/better way? Just as an example, think about a webpage with 1,000,000 words on it.

This sounds difficult enough that I’d advise going back to the underlying reasons you don’t want them to cut and paste. Clearly you want the text to be visible, or there’d be no point publishing it. If that’s the case, then I believe there is no foolproof way in which you can protect yourself from a sufficiently determined copier. No technology that I’m aware of is going to stop someone from taking screenshots of your site and running it through some sort of character recognition software. This actually works pretty well these days, and is built into pretty common things like Adobe Acrobat.

For what it’s worth, though, it sounds like your Java applet idea would probably work, if only superficially (it would be no more secure than multiple images, for the reason above). I suspect anyone sufficiently determined would be able to extract the text without too much effort though, much as you can open up normal binaries in a hex editor and spot bits of plain text lying around. This page seems to think that Java bytecode is particularly vulnerable to decompilation attacks, and strongly advises against assuming that bytecode is secure from examination. While I’ve no personal experience with decompiling bytecode, long strings of text are precisely the sort of thing that will probably be very easy to spot for anyone with the right tools.
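To give a feel for how easy it is to spot plain text inside a binary, here is a small Python sketch that pulls runs of printable ASCII out of a blob of bytes, roughly what the Unix strings tool does. The blob below is a made-up stand-in, not a real class file, and printable_runs is just a name chosen for this example.

```python
import re


def printable_runs(blob: bytes, min_len: int = 6) -> list:
    """Return every run of at least min_len printable ASCII bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode("ascii") + rb",}")
    return [match.decode("ascii") for match in pattern.findall(blob)]


# Hypothetical stand-in for a compiled file: some binary bytes, a sentence
# that was embedded as a string literal, then more binary bytes.
blob = (
    b"\xca\xfe\xba\xbe\x00\x01"
    + b"This agreement is binding on all parties."
    + b"\x00\x07\x00\x02Main\x00"
)

for run in printable_runs(blob):
    print(run)
```

Long sentences stand out immediately, while short identifiers fall below the length cutoff. An obfuscator that encodes string literals would defeat this particular trick, but the text still has to be decoded somewhere before it can be shown.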

Adobe’s PDF document format has some built-in digital rights facilities like you mention, but I’ve not had any practical experience of these. I’ve just had a look, and you can save certified documents, modify user permissions and suchlike. It certainly lets you stop people printing a document or cutting and pasting stuff, but this is only if you save it with certificate or password protection (needless to say, the former will be much more secure than the latter). I’d recommend giving it a look; it’d certainly be a lot less work-intensive than writing Java applets all over the place, and the conversion tools Acrobat comes with mean it’d be relatively easy to get the documents into PDF from whatever they’re in at the moment.

You could build the Java applet to display the text in an image format or you could display it in a text window but disable any selection options on that window. A Java applet is bytecode, which is compiled code but not compiled all the way to low-level code like a C-compiler would do (aside: this is how they achieve platform independence, distribute apps as bytecode and let the native JVM that runs the program finish the conversion). Bytecode can be decompiled back to source code pretty easily but there are obfuscators you can run that will make that process more difficult. Using a Java applet would work, but not better than using a static image and maybe worse depending on how well a decompiler handles your situation.

You’re still left with a situation that doesn’t really give you what you want. Even if you produce your text in an image format (using static images or Java applet) I can still get the image. My OCR app could extract the text probably faster than I could download the images, and if I stumbled on a website that was doing this, I would extract the text and email it to the owner just to prove the point. Unless you make the text almost illegible using low contrast or annoying backgrounds, OCR is going to work. If you alter it enough to mess with OCR, most of your audience won’t be able to read it.

Worst case, what keeps someone from just retyping the content? I wouldn’t do it for 1M words, but if it was valuable enough to be worth all this effort and I couldn’t make OCR work, I’d farm out the typing job to a room full of low-wage monkeys.

What exactly is the point of this anyway? If you put something on the Internet so people can read it, they will be able to obtain the source one way or another. Yes, they will be able to modify it, but why do you care? If they make malicious modifications, it’s not like they can post it back to your site (unless you’re very insecure or running some user-posted site like a wiki, in which case all of this is moot anyway). If they host it somewhere else, how does that reflect on you? It’s not that much different than if they were to make something up from whole cloth and post it somewhere.

At some point you have to put your users ahead of your attackers. Posting 1M words in image format will raise the bar on who can copy the source, but it’s also going to make it that much more onerous to the people who have legitimate cause to read it. Longer download, no search, etc. If it were me, I’d post it as either raw text/HTML or PDF formats, depending on how important the layout was to me, and digitally sign it to verify the source.

People who go to a lot of trouble to make their web page text unselectable and uncopyable irritate the starch out of me.

Give it up. If I want it beyond the casual effort of a passing fancy, I’ll take a 300 dpi screen shot and then throw OmniPage Pro at it and there’s not a damn thing you can do about it.

Agreed. That said, a common ruse to prevent cut and paste, rather than turning the whole content into an image, is to use CSS layers to place a large empty transparent .gif over the entire page. Visitors can see the page content through the “shield”, but can’t select anything but the invisible .gif. More manageable than trying to make images of your content.
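Keep in mind that the shield only blocks mouse selection. The text is still sitting in the HTML that gets sent to the browser. As a rough sketch, assuming a page laid out like the made-up one below (the markup, the blank.gif name and the TextExtractor class are all invented for this example), a few lines of Python with the standard html.parser module pull the text straight out:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text nodes of a page, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


# Hypothetical page: real content plus a transparent image stretched over it.
page = """
<html><body>
  <div id="content"><p>Confidential terms of the agreement.</p></div>
  <img src="blank.gif" style="position:absolute; top:0; left:0; width:100%; height:100%;">
</body></html>
"""

extractor = TextExtractor()
extractor.feed(page)
extractor.close()
print(extractor.chunks)
```

The overlay never enters into it, because the parser works on the source rather than on what the mouse can reach.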

Hiding the HTML source is something people seem to ask for a lot. If you poke around, you find that the common suggestions are an applet, like we’ve been discussing, or a JavaScript-based decoder that reads an encrypted form from your site and writes the entire document on the fly. There are several examples of the latter out there.
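The catch with the decoder approach is that whatever the browser needs in order to decode the page has to be shipped to the browser too. Here is a Python sketch of that point. It stands in for the JavaScript version and uses a simple XOR-plus-base64 scheme that I picked for illustration; real scripts of this kind vary, and the key and function names here are hypothetical.

```python
import base64

# Hypothetical key. It has to be delivered with the page, otherwise the
# visitor's browser could not decode the content either.
KEY = b"k3y"


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR every byte with the repeating key; applying it twice undoes it."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def encode_for_page(text: str) -> str:
    """What the site owner would embed in the page instead of plain HTML."""
    return base64.b64encode(xor_bytes(text.encode("utf-8"), KEY)).decode("ascii")


def decode_like_browser(payload: str) -> str:
    """The same steps the page's own script must run, so anyone can run them."""
    return xor_bytes(base64.b64decode(payload), KEY).decode("utf-8")


secret = "Text the site owner hoped to hide."
payload = encode_for_page(secret)

print(payload)                       # what View Source would show
print(decode_like_browser(payload))  # what a curious visitor recovers
```

So this keeps the text out of a casual View Source, but anyone who reads the decoding script can reproduce it.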

That’s what I would like to know. Everything that you could do to make it harder for someone to copy the text can also potentially make it harder for people to view the text in the first place.

For instance, storing the text as an image will make it impossible for blind users to have their text-to-speech software read it to them. And if you use a Java applet, that will mean that people who don’t have Java installed won’t be able to see the text. So you can prevent people from casually copying the text using copy-and-paste with those methods, but at the cost of possibly preventing some users from seeing the text at all. And that still won’t prevent someone who is very determined to make a copy from doing so.

And I really don’t see why it matters if users are able to copy text off a website in the first place. I mean, what’s the point? Users can see the text, so why shouldn’t they be able to copy it? And is it really worth all the trouble of implementing either of the above methods when they won’t actually stop someone who is willing to take the extra time to OCR or hand-type the text from doing so?

Giving content to people you don’t trust is a tricky problem. Most of the time you will find that you greatly inconvenience the people that you want to see the content but don’t greatly inconvenience the people that want to do something with the content that you don’t like.

What a lot of people don’t fully realize is that the internet and computers function by copying data from one place to the next until it appears on your screen or your speakers. So it is a really difficult problem to try and stop the copying at some arbitrary point.

Great stuff guys and gals, thanks.

I agree, no matter what you do the text can be reproduced; I would just like to make it as difficult as possible. Let’s say we are displaying the most sensitive and personal information we can find about YOU on this website. Now you wouldn’t want that kind of information to be easily cut and pasted, would you?