How many bytes is the internet?

To phrase this question more specifically: what is the best estimate of how many bytes everything that is potentially accessible takes up? By “everything” I mean every pixel of every image, every pixel of every frame of every video, every word of code, including images used as buttons, backgrounds and anything else that makes up a page. By “potentially accessible” I mean it sits on a server, in a folder that can be reached over the internet.
And as a bonus question - what is the best estimate of the percentage of all that which is actually useful to more than 50% of the people who use the internet?

Answering that question is impossible, and no one even ventures more than a rough figure. There are numerous issues, including duplicate information and the “deep web” - content that only turns up on certain search engines, or by signing up to particular sites and using their internal search. The SDMB is one example that doesn’t show up much on popular search engines, but there are countless others. The answer is somewhere in the exabyte range, but no one knows the real figure.

A bunch, and a vanishingly small fraction is of general interest.

http://www.wisegeek.com/how-big-is-the-internet.htm has an estimate: 5 billion gigabytes, or 5 exabytes.

Of course, as you said, it’s really more of a guess.

And, yes, I found that using Google, like most of my answers that have citations.

42

The SDMB alone is a good example of how difficult the question is.

There have been over 11 million posts according to the URL of this one, [noparse]http://boards.straightdope.com/sdmb/newreply.php?do=newreply&noquote=1&p=11266041[/noparse]. Presumably each one lives in a database, along with poster name, and status, and time, and location, and join date, and post count, and anything else that pops up when a post is posted.

Yet this thread is a separate entity that is more than merely the sum of the posts that the database collects and orders. Its appearance is unique to every viewer, even when they view the same number of posts. There are different skins, and different modes of display. Guests see ads, and ads are targeted so that different viewers see different ads. Even members probably see different Google ads. Or are those gone? Then there are the little rectangles next to the date indicator that glow blue if you visited the page earlier and orange if they are new since your last visit, or something like that. And of course, each thread starts with Welcome [your name here], making it unique even if every single other aspect were miraculously the same.

Now what do you count of all that? Do only the posts in the database count? The other stuff in the database? The ads in their separate database? Is that the Internet? Or is the display of all the pieces compiled into unique and always changing views the real internet? And what happens when you click on View Source? How much extra does that add, and does it count as part of the database or as part of the display or as a third thing separate from the others?

Remember that the Dope is a really simple, basic little site that can be stuffed onto one or two servers. Google has hundreds of thousands of servers, most of which copy stuff from other servers and then keep those copies to make copies on top of copies, all of them unique and changing and adding and subtracting stuff. How do you tabulate and catalog that?

And any number you produce today is utterly obsolete tomorrow. You have to run as fast as you can just to keep up. Yep, it’s Alice in Wonderland, not reality. It is a Wonderland. You just can’t stay there to do a census.

Hal Varian studied this in 2003:

I suspect “170 terabytes, in 2003 at least” is the answer to the OP.

http://liblearn.osu.edu/tutor/rightstuff.html

If we figure that the amount of information doubles every three years (cite) then we get a rough estimate of 680 terabytes.
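Just to spell out the arithmetic (taking the 170 terabyte figure and the three-year doubling at face value - the six-year gap to the present is my assumption):

[code]
# Back-of-envelope doubling: 170 TB in 2003, doubling every 3 years
base_tb = 170              # 2003 surface-web estimate quoted above
years_elapsed = 6          # roughly 2003 to now (assumed)
doubling_period = 3
print(base_tb * 2 ** (years_elapsed // doubling_period))   # 680 TB
[/code]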

Original work: How Much Information?

That can’t be even close to right - it’s off by several orders of magnitude. You can buy terabyte hard drives at big box stores for not much money, which would mean you could mirror the entire web for less than the cost of a college education. Ask Google or any other search engine about that sometime. It doesn’t come close to making sense even in 2003 terms.
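To put a rough number on that sanity check (the drive price here is just a ballpark assumption, not a quoted figure):

[code]
# Cost of mirroring ~680 TB at an assumed ~$100 per 1 TB drive
web_size_tb = 680
price_per_tb = 100                  # assumed big-box price, USD
print(web_size_tb * price_per_tb)   # 68000 -> about $68,000
[/code]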

It’s the amount of information on public pages that was estimated to be 172 terabytes.

And those figures are as obsolete as a buggy whip manufacturer. YouTube didn’t even exist back then. The amount of information added to the Web increases exponentially.

Here are some figures from 2006, still a generation old and obsolete in Internet terms, that dwarf yours:
The Expanding Digital Universe

Since YouTube and Hulu and streaming of all sorts have grown faster than anyone imagined in 2006, those numbers are undoubtedly ridiculously tiny.

Not the entire internet but a small part of it:
I work for the Danish Royal Library, where we harvest the “Danish part of the internet” 4 times a year. This consists of all active .dk domains (around 700,000) and a little under 50,000 relevant webpages from other domains. I believe that we got about 17 TB of data in the last harvest - and we do check for redundancy in some way. In the period from 2004 to 2008 we have accumulated around 71 TB of data.
I know that there are some data types that are a problem for us, and that we don’t go very deep (10 “layers” IIRC - still, 90% of Danish web pages contain less than 10 MB of data). I suspect the “real” amount of data is higher.

I am not sure that Danish domains accurately reflect global tendencies - I suspect we generate more data per domain than the global average (due to the very high prevalence of digital photo and video cameras here, for instance). Someone can try to do the math - it might give us an estimate.

The Royal Library uses a modified form of archive.org’s harvesting software and I’ve tried to find out how much data they get out of it - the Internet Archive is basically doing the same thing as us on a global scale. It should be on their web page somewhere, but I can’t find it. I don’t think it’s in the FAQ and their server stats are down right now.
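Taking a very rough stab at the math suggested above (the global domain count is purely an assumption for illustration; the per-domain figure comes from the harvest described above):

[code]
# Scale the Danish harvest up to the whole web
dk_domains = 700_000
dk_harvest_tb = 17
global_domains = 180_000_000     # assumed ballpark, not a real count
print(dk_harvest_tb / dk_domains * global_domains)   # ~4,371 TB
[/code]

Even with generous assumptions that only gets into the low petabytes - and it only counts what a crawler going ten layers deep can see.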

Ah, those were the days, when the Internet was only 42 bytes. I remember they used to publish hex dumps of it in magazines, which you could type into your Apple II or TRS-80, if you had the patience.

Of course about 20 bytes of that was porn, so you had to do it when Mom wasn’t around.

The reciprocal of whatever number you arrive at for the first part of your question.

:smiley:

Mr. Owl: Let’s find out. ONE… TWO… THREE…
Three.

Exapno Mapcase’s cite seems to be talking about something different from my cite. I believe mine was referring to the amount of information available to a single user, while EM’s is about how much is being transferred.

In other words, if you and I both watched a YouTube video, my cite would only count the video once, while EM’s would count it twice, since it was copied to two different places.

Where do I get the idea that there’s a difference? My cite is from the CEO of Google, and it references the amount of information indexed by the search engine. It would only make sense to cite that in reference to the amount of information available, not the amount being transferred.

The Planets Project is holding a conference at the Royal Library in Copenhagen right now. A colleague of mine attended the opening yesterday, where he was presented with the ballpark figure (no correction for redundancy, for instance) of 700 exabytes (700 billion gigabytes) as the total amount of information in electronic storage today. Not sure how much of it is available on the internet, though.

Ahem! scowl

I have to agree. I am aware of a single, relatively small BitTorrent tracker that currently exceeds that number in available seeded torrents - not upload or download rates, but actual file sizes. The larger public sites certainly track many times that number. If you include shared data from non-centralized servers, the amount of available data increases dramatically.

Nothing quite gets me going like binary porn. All those 1s going into 0s, and sometimes a string of 0s all on top of each other playing naked twister… mmm.

The way I’d go about this question (if I didn’t care about redundancy) would be to dig up sales figures for hard drives sold in the past five years or so, assume that they’re on average about half full, and then assume that all but a negligible fraction of them are connected to the Internet. Everything on any computer that ever connects to the Internet is in principle accessible, though you’d have to jump through a lot of hoops for most of it.
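Something like this, with every figure a made-up placeholder just to show the shape of the calculation:

[code]
# Hard-drive approach: drives sold x avg capacity x fill rate x online share
drives_sold = 1_500_000_000    # assumed drives shipped over ~5 years
avg_capacity_tb = 0.5          # assumed average capacity per drive
fill_fraction = 0.5            # assume drives are about half full
online_fraction = 0.95         # assume nearly all are net-connected
total_tb = drives_sold * avg_capacity_tb * fill_fraction * online_fraction
print(total_tb / 1_000_000)    # ~356 exabytes with these guesses
[/code]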