This question was kind of asked in this thread, but the answer I’m curious about wasn’t brought up.
Google automatically caches the pages it indexes (unless specifically told not to) and makes those copies available to the public.
The material they copy is, for the most part, copyrightable. So, is Google taking a great legal risk by using this content without permission, or is there some loophole I’m not aware of?
No loophole, and Google is taking a risk. But no one’s called them on it. I suspect if it came to a court case, Google would just remove the cached versions of the pages they are being sued for. In addition, the copyright holders probably aren’t upset enough to go to court.
http://www.archive.org also has the same issues. However, I believe they allow you to opt out.
I believe the cache SHOULD be illegal, but the issue is slippery.
What’s the fundamental difference between displaying results of current pages and those of the past? I’m hard pressed to come up with one. In both cases, the site owner either wants the pages displayed by Google or doesn’t.
It would also take a pretty far-fetched example for the damages from displaying old pages of a site that has since opted out to justify the costs of such a suit.
Also, bear in mind that any time you view a web page, your browser downloads and stores a copy in its cache folder (if you have the cache feature turned on). A heavily-trafficked website will have its pages copied hundreds or even thousands of times a day.
If there is no fundamental difference, then not only the cache should be illegal but all linking should be illegal. Court cases have already established that it is not illegal to link to other web pages without permission.
Yes, but you are not redistributing this content. Google is offering a public service and as such, is more susceptible to accusations of copyright infringement.
There have been cases when Google has been asked to remove pages from cache. Scientology has done this after closing down a site that had some of their material and then finding that site cached on Google.
Dogface: I am unaware of any significant ruling in which linking against a site’s wishes was found illegal. Every one I have heard of has involved deep linking and the like. Cite, please?
All web admins of any reasonable competency know that there are web crawlers out there and set their access permissions accordingly if desired.
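To illustrate those access permissions (the /private/ path here is just a made-up example): a robots.txt file at the site root tells well-behaved crawlers what to skip, and Google also honors a per-page noarchive directive that keeps a page indexed but suppresses the cached copy:

```
# robots.txt at the site root — /private/ is a hypothetical path
User-agent: *          # applies to all crawlers
Disallow: /private/    # don't crawl anything under /private/

# Or, in a single page's <head>, allow indexing but not caching:
#   <meta name="robots" content="noarchive">
```

So a webmaster who never sets either of these has, in practice, left the door open to being crawled and cached.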
The primary reason is presumably that the web site is down, overloaded, has used up its monthly bandwidth (I see that on some specialty sites), etc.
But the really nice thing about it is the ability to go back in time. Google fares better in this regard than the Internet Archive since it has crawled more pages, but the Archive can sometimes give you several different past versions.
To make pages available if a server is unreachable (down or busy).
A side benefit is that the cached version highlights your search terms.
Also, to expand on Q.E.D.'s point on browser caches: if a server is busy, your browser will serve up its cached copy, so the potential to get an old version still exists. Depending on the browser’s configuration, the window of time might be smaller or larger than Google’s cache.
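To sketch what controls that window of time (the header values below are just illustrative): a server can attach standard HTTP headers telling browsers how long a stored copy stays fresh before it should be fetched again:

```
HTTP/1.1 200 OK
Content-Type: text/html
Cache-Control: max-age=3600
Expires: Thu, 01 Jan 2004 00:00:00 GMT
```

Here max-age=3600 marks the copy fresh for an hour; after that the browser is supposed to check back with the server before reusing its cached version. Google’s cache, by contrast, presumably only updates when its crawler next revisits the page, which can be much less often.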
Wow! That’s a shitload of data to store on the chance that a few servers are down. It surprises me that they do this. Is it really an advantage for Google to offer this service?
Come to think of it, how in the heck does Google make money anyway? They don’t have any ads that I’m aware of.
According to their newsletter from September 2002, Google has over 10,000 servers. I suspect that number has grown quite a bit since then.
Incidentally, I’ve used their cache pages more than once to find information I needed when the actual page was nothing but a 404 by then. The term highlighting thing is incredibly handy as well.