How many bytes is the internet?

To phrase this question more specifically: what is the best estimate of how many bytes everything that is potentially accessible takes up? By “everything” I mean every pixel of every image, every pixel of every frame of every video, every word of code, including images used as buttons, backgrounds and anything else that makes up a page. By “potentially accessible” I mean it sits on a server, in a folder that can be reached over the internet.
And as a bonus question - what is the best estimate of the percentage of all that which is actually useful to more than 50% of the people who use the internet?

Answering that question is impossible, and no one even ventures more than a rough figure. There are numerous issues, including duplicate information and the “deep web” - content that only turns up on certain search engines, or by signing up to particular sites and using their internal search. The SDMB is one example that doesn’t show up much on popular search engines, but there are countless others. The answer is somewhere in the exabyte range, but no one knows the real figure.

A bunch, and a vanishingly small fraction is of general interest.

http://www.wisegeek.com/how-big-is-the-internet.htm has an estimate: 5 billion gigabytes, or 5 exabytes.

Of course, as you said, it’s really more of a guess.

And, yes, I found that using Google, like most of my answers that have citations.

42

The SDMB alone is a good example of how difficult the question is.

There have been over 11 million posts according to the URL of this one, [noparse]http://boards.straightdope.com/sdmb/newreply.php?do=newreply&noquote=1&p=11266041[/noparse]. Presumably each one lives in a database, along with poster name, and status, and time, and location, and join date, and post count, and anything else that pops up when a post is posted.

Yet this thread is a separate entity that is more than merely the sum of the posts that the database collects and orders. Its appearance is unique to every viewer, even when they view the same number of posts. There are different skins, and different modes of display. Guests see ads, and ads are targeted so that different viewers see different ads. Even members probably see different Google ads. Or are those gone? Then there are the little rectangles next to the date indicator that glow blue if you visited the page earlier and orange if they are new since your last visit, or something like that. And of course, each thread starts with Welcome [your name here], making it unique even if every single other aspect were miraculously the same.

Now what do you count of all that? Do only the posts in the database count? The other stuff in the database? The ads in their separate database? Is that the Internet? Or is the display of all the pieces compiled into unique and always changing views the real internet? And what happens when you click on View Source? How much extra does that add, and does it count as part of the database or as part of the display or as a third thing separate from the others?

Remember that the Dope is a really simple, basic little site that can be stuffed onto one or two servers. Google has hundreds of thousands of servers, most of which copy stuff from other servers and then keep those copies to make copies on top of copies, all of them unique and changing and adding and subtracting stuff. How do you tabulate and catalog that?

And any number you produce today is utterly obsolete tomorrow. You have to run as fast as you can just to keep up. Yep, it’s Alice in Wonderland, not reality. It is a Wonderland. You just can’t stay there to do a census.

Hal Varian studied this in 2003:

I suspect “170 terabytes, in 2003 at least” is the answer to the OP.

http://liblearn.osu.edu/tutor/rightstuff.html

If we figure that the amount of information doubles every three years (cite) then we get a rough estimate of 680 terabytes.
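Just to spell out the arithmetic (taking the 170 terabyte figure and the three-year doubling at face value - the six-year gap to the present is my assumption):

[code]
# Back-of-envelope doubling: 170 TB in 2003, doubling every 3 years
base_tb = 170              # 2003 surface-web estimate quoted above
years_elapsed = 6          # roughly 2003 to now (assumed)
doubling_period = 3
print(base_tb * 2 ** (years_elapsed // doubling_period))   # 680 TB
[/code]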

Original work: How Much Information?

That can’t be even close to right - it’s off by several orders of magnitude. You can buy terabyte hard drives at big box stores for not much money, which would mean you could mirror the entire web for less than the cost of a college education. Ask Google or any other search engine about that sometime. It doesn’t come close to making sense even in 2003 terms.
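To put a rough number on that sanity check (the drive price here is just a ballpark assumption, not a quoted figure):

[code]
# Cost of mirroring ~680 TB at an assumed ~$100 per 1 TB drive
web_size_tb = 680
price_per_tb = 100                  # assumed big-box price, USD
print(web_size_tb * price_per_tb)   # 68000 -> about $68,000
[/code]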

It’s the amount of information on public pages that was estimated to be 172 terabytes.

And those figures are as obsolete as a buggy whip manufacturer. YouTube didn’t even exist back then. The amount of information added to the Web increases exponentially.

Here are some figures from 2006, still a generation old and obsolete in Internet terms, that dwarf yours:
The Expanding Digital Universe

Since YouTube and Hulu and streaming of all sorts have grown faster than anyone imagined in 2006, those numbers are undoubtedly ridiculously tiny.

Not the entire internet but a small part of it:
I work for the Danish Royal Library, where we harvest the “Danish part of the internet” 4 times a year. This consists of all active .dk domains (around 700,000) and a little under 50,000 relevant webpages from other domains. I believe that we got about 17 TB of data in the last harvest - and we do check for redundancy in some way. In the period from 2004 to 2008 we have accumulated around 71 TB of data.
I know that there are some data types that are a problem for us, and that we don’t go very deep (10 “layers” IIRC - still, 90% of Danish web pages contain less than 10 MB of data). I suspect the “real” amount of data is higher.

I am not sure that Danish domains accurately reflect global tendencies - I suspect we generate more data per domain than the global average (due to the very high prevalence of digital photo and video cameras here, for instance). Someone can try to do the math - it might give us an estimate.

The Royal Library uses a modified form of archive.org’s harvesting software and I’ve tried to find out how much data they get out of it - the Internet Archive is basically doing the same thing as us on a global scale. It should be on their web page somewhere, but I can’t find it. I don’t think it’s in the FAQ and their server stats are down right now.
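Taking a very rough stab at the math suggested above (the global domain count is purely an assumption for illustration; the per-domain figure comes from the harvest described above):

[code]
# Scale the Danish harvest up to the whole web
dk_domains = 700_000
dk_harvest_tb = 17
global_domains = 180_000_000     # assumed ballpark, not a real count
print(dk_harvest_tb / dk_domains * global_domains)   # ~4,371 TB
[/code]

Even with generous assumptions that only gets into the low petabytes - and it only counts what a crawler going ten layers deep can see.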

Ah, those were the days, when the Internet was only 42 bytes. I remember they used to publish hex dumps of it in magazines, which you could type into your Apple II or TRS-80, if you had the patience.

Of course about 20 bytes of that was porn, so you had to do it when Mom wasn’t around.

The reciprocal of whatever number you arrive at for the first part of your question.

:smiley:

Mr. Owl: Let’s find out. ONE… TWO… THREE…
Three.

Exapno Mapcase’s cite seems to be talking about something different from my cite. I believe mine was referring to the amount of information available to a single user, while EM’s is about how much is being transferred.

In other words, if you and I both watched a YouTube video, my cite would only count the video once, while EM’s would count it twice, since it was copied to two different places.

Where do I get the idea that there’s a difference? My cite is from the CEO of Google, and it references the amount of information indexed by the search engine. It would only make sense to cite that in reference to the amount of information available, not the amount being transferred.

The Planets Project is holding a conference at the Royal Library in Copenhagen right now. A colleague of mine attended the opening yesterday, where he was presented with the ballpark figure (no correction for redundancy, for instance) of 700 exabytes (700 billion gigabytes) as the total amount of information in electronic storage today. Not sure how much of it is available on the internet, though.

Ahem! scowl

I have to agree. I am aware of a single, relatively small BitTorrent tracker that currently exceeds that number in available seeded torrents - not upload or download rates, but actual file sizes. The larger public sites certainly track many times that number. If you include shared data from non-centralized servers, the amount of available data increases dramatically.

Nothing quite gets me going like binary porn. All those 1s going into 0s, and sometimes a string of 0s all on top of each other playing naked twister… mmm.

The way I’d go about this question (if I didn’t care about redundancy) would be to dig up sales figures for hard drives sold in the past five years or so, assume that they’re on average about half full, and then assume that all but a negligible fraction of them are connected to the Internet. Everything on any computer that ever connects to the Internet is in principle accessible, though you’d have to jump through a lot of hoops for most of it.
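Something like this, with every figure a made-up placeholder just to show the shape of the calculation:

[code]
# Hard-drive approach: drives sold x avg capacity x fill rate x online share
drives_sold = 1_500_000_000    # assumed drives shipped over ~5 years
avg_capacity_tb = 0.5          # assumed average capacity per drive
fill_fraction = 0.5            # assume drives are about half full
online_fraction = 0.95         # assume nearly all are net-connected
total_tb = drives_sold * avg_capacity_tb * fill_fraction * online_fraction
print(total_tb / 1_000_000)    # ~356 exabytes with these guesses
[/code]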