Large Hard Drives/Silly Compression/Fast Internet

Okay, we have fast and large hard drives (100 gigs will be standard in the not-too-distant future; you can buy them easily already), so we have no lack of space on our computers. We are basically swimming in disk space. We also have DVDs, which can move a lot of data from the store onto a hard drive really quickly and easily.

What we don’t have is fast internet. Even the fastest connection on a T3 line could be a lot faster, and no matter what, phone-line modems and clogged cable modems are going to be around for years to come.

So is there any reason now not to use insanely large compression? Is there some flaw in that idea I am not thinking of?

I know that a compression algorithm gets larger and larger the smaller its output is supposed to be. So why not build “Windows 2010” with a compression algorithm that contains every single web page that could be written (HTML is a finite language of text), each linked to a number? (Or maybe ten lines per number would make more sense… define a web page as a series of little segments.)

That way you could send a giant web page as a few bytes. (The numbers would get long, I guess, but you could be clever about that: define the number sent as the turns taken down a binary tree to find it, or something.)

So this whole page right here would be something like

11024901209
12421412000
74445634321

or something similar… That way you barely have to send any data over the internet, but you end up with everything EXACTLY as it was sent.

And since everything on a computer can be represented as a bunch of bytes, you could send movies the same way: just chop one up into “ultra compressed” numbers, send those short numbers to another computer, and have it decompress them (make it something the internet protocol does automatically).

Sure, storing so many permutations would take a few gigs’ worth of space. Let’s say people were willing to give up 10 gigs to be able to download things from the internet 100 times faster… just figure out a compression algorithm that takes up those 10 gigs but uses them in a way that actually compresses the data down to a tiny, tiny size.

(Heck, run a normal WinZip-style compression on the resulting addresses before they’re sent and you’ve got even less to send.)

Is this insane? I am a CS major, and this sounds solid… but it seems like it wouldn’t work for some reason. It sounds too… silly…

Actually, check out mod_gzip for Apache. This is pretty much already going on.
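
For anyone curious, here’s a minimal sketch in Python (my own illustration; the URL is hypothetical and this is just the client side of the exchange, not mod_gzip itself) of the negotiation a browser and a gzip-capable server already do: the client advertises that it accepts gzip, and decompresses whatever comes back.

    import gzip
    import urllib.request

    # Hypothetical URL; any server with mod_gzip-style compression will do.
    req = urllib.request.Request(
        "http://example.com/",
        headers={"Accept-Encoding": "gzip"},   # advertise that we can handle gzip
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        # The server only compresses if it chose to honor the header.
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    print(len(body), "bytes after decompression")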

Well, because insanely high compression does not exist, most likely because it cannot be done.

Your example of:
11024901209
12421412000
74445634321
has only enough room for 10^33 different pages (33 digits in total). That’s a lot, but certainly not enough. Let’s say a page is English text of 500 words. There are roughly 40,000 words in common English use, and obviously many of the pages you’d get by randomly choosing words are gibberish, so let’s say you have about 10 plausible choices for each word. That still gives 10^500 different possible pages that you want to squeeze into only 10^33 slots.
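
A quick back-of-the-envelope check of those figures in Python (my own sketch, using the same rough assumptions):

    # 33 decimal digits of index versus 500 words with ~10 plausible choices each.
    index_slots = 10 ** 33
    possible_pages = 10 ** 500
    shortfall = possible_pages // index_slots
    print("pages left over per slot: a", len(str(shortfall)), "digit number")  # a 468-digit number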

It won’t work. The combinations and permutations possible in just HTML are enormous. You would also have to generate the content of all of the pages. Pages can also be very long. Just to illustrate the size of the problem, a sentence of fifty characters can have something like 26^50 possible combinations of characters (assuming just uppercase letters and no punctuation). Granted, a lot of them won’t be any good, but the problem changes when you get to binary data.
A single byte has 256 possible values, and any combination of one byte with another can be valid. A fifty-byte sequence therefore has 256^50 possible combinations. For your compression scheme, you would have to list them all and store them somewhere. Then you could send an index into the list over the internet. The problem is that it takes just as many bytes to specify a position in that list as the original sequence occupies. You’ve gained nothing.
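
To make those magnitudes concrete, a short Python sketch (my own illustration, same assumptions as above):

    # 26 uppercase letters per position in a 50-character sentence,
    # 256 possible values per byte in a 50-byte binary sequence.
    sentences = 26 ** 50
    byte_sequences = 256 ** 50
    print(f"{sentences:.2e} possible 50-letter sentences")
    print(f"{byte_sequences:.2e} possible 50-byte sequences")
    # Naming one entry out of 256**50 takes 50 bytes, exactly what the raw data took.
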
A final word:
I don’t think you are cut out to be a CS major. Your grasp of mathematics would appear to be a little… weak, as is your understanding of compression and the internet. You should also check into a remedial English course. You would embarrass yourself in professional-level correspondence, and computers aren’t tolerant of typos.

Damn, Gaudere strikes again.

Oh, by the way, the following URL documents which web browsers are capable of accepting this compressed content:
http://www.schroepl.net/projekte/mod_gzip/browser.htm
This indicates that pretty much all modern web browsers can accept it. Note that both of the big two web servers on the market implement technology like this… IIS has something similar to Apache’s mod_gzip.

Gzipped data, yes. “Insanely large compression,” no.

Here’s where you’re being silly: even if you had an index file that contained the text of every possible web page, referenced by a well-known index, you would be nowhere. Transferring text is not by any stretch of the imagination a whole lot of work: 1K of data carries 1024 characters, and even most of the slowest connections can still transfer at 3K/second.

The problem comes in transferring pictures and movies and such, which are already compressed to the point where gzipping them doesn’t really reduce their size at all.

Now, if you wanted to do some sort of “super-compression” by index file on binary files, you would need the index file to store every combination of the 256 possible values for each byte. So let’s start with two-byte entries stored in the index file.
0,0
0,1
0,2
…
0,255
1,0
…
255,255
It would have 256^2 = 65,536 entries, taking up 131,072 bytes on your hard drive at two bytes per entry. Each of these 65,536 entries would require a unique index to access it correctly, which means that you’d have to download enough bytes to distinguish 65,536 different entries. That’s 2 bytes. So you’d download your 2 bytes, look up that index, and get 2 bytes out of the table, for a grand compression ratio of 0%. It doesn’t get any better if you increase the entry size: for every n bytes you store per entry in your index file, you’ll have 256^n entries, which require n bytes to be transferred to pick one out.
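
Here’s that argument as a quick Python sketch (my own illustration), assuming a table that holds every possible n-byte value:

    import math

    # The index needed to pick one entry out of 256**n slots is itself n bytes long.
    for n in (1, 2, 4, 8):
        entries = 256 ** n
        index_bytes = math.ceil(math.log2(entries) / 8)
        print(f"{n}-byte entries: {entries:,} slots, index needs {index_bytes} bytes")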

Things are a bit better with text, as there are only about 95 printable ASCII characters, but there are much easier ways to get 50%+ compression ratios on text files.

I won’t go so far as to say that you should reconsider computer science as a major, but you certainly need some practice roughing out your grand schemes for saving the universe before presenting them for public comment.

-lv

To understand why this won’t work, you have to understand that compression, contrary to popular belief, is not a magical process that turns big things into smaller things. It turns some big things into smaller things, and it turns some big things into even bigger things. Compression algorithms are chosen based on the type of data that’s being compressed, so that the data you’re most likely to feed into it will get smaller. For example, you could design a nice compression algorithm that will work well with English text, but will make random gibberish get bigger. Programs like gzip have the exact same behavior, but they’re smart enough to skip the compression if the size reduction isn’t enough to offset the overhead.
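
A quick illustration of that point in Python (my own sketch), using zlib, the library behind gzip: repetitive English-like text shrinks a lot, while random bytes come out slightly larger because of the format’s fixed overhead.

    import os
    import zlib

    text = b"the quick brown fox jumps over the lazy dog " * 100
    noise = os.urandom(len(text))  # incompressible gibberish of the same length

    print("text:  ", len(text), "->", len(zlib.compress(text)))
    print("random:", len(noise), "->", len(zlib.compress(noise)))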

The end result is that it’s impossible to design an algorithm which can represent every possible web page in less space than it would take to store the original page.

You might want to look at the comp.compression FAQ, which covers things like this. No link, but Google’s your friend.

In addition to all the compression fallacies in the OP, note that many, many web pages are dynamically generated. Take this very web page you are looking at now. Even if it had a “code,” how would your computer know which version of the page to load? The one after the OP? After the first response? Just before my post? Just after my post? And how would your computer be psychic enough to know when and what I was going to be posting to this thread at some point in the future?

All in all, a Really Bad Idea from the get-go.

(BTW, 100 gigs is a trivial amount of space if you want to store video.)

Just to nitpick, HTML isn’t finite. Granted, there are only so many characters involved, but there’s no limit on the length of a file. A 2K page and a 500K page are equally possible.

It’s worth mentioning that, by a simple counting argument, the theoretical best case for a lossless compression algorithm is that, among files of a certain length, it shrinks every file except one, which it leaves unchanged: there are 2^n files of n bits but only 2^n - 1 bit strings that are strictly shorter. No real compression algorithm comes even close to this.
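
A tiny Python check of that counting argument (my own sketch):

    # There are 2**n bit strings of length exactly n, but only 2**n - 1 strings
    # that are strictly shorter, so at least one n-bit file cannot get smaller.
    n = 16
    inputs = 2 ** n
    shorter_outputs = sum(2 ** k for k in range(n))  # lengths 0 through n-1
    print(inputs, "possible inputs vs", shorter_outputs, "strictly shorter outputs")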