Question about the inner workings of Google

Google, founded six years ago by two Stanford University students, has software that explores the internet and has scanned and stored about 4.3 billion web pages, which, if printed and stacked, would reach upwards of 300 miles.

My question is this: why would such a volume of information about the world we live in, in all its immense detail… not to mention the entire universe as we know it… be scanned into the world’s computers? I would have thought that much of that information would never be uploaded onto the computers that Google searches.

Yet no matter what I want to look up, with rare exceptions, Google finds the information. What is the driving need that leads human beings to upload so much information into computers for Google to scan and store?

I would think that there is a vast volume of information about many things that has not yet been uploaded… Having read this back, I can see I’m not being very articulate, but hopefully I’ll get some non-wisecrack answers.

Computers are the way today’s society stores information. What would have been written down and put in a binder 50 years ago is now typed up and saved in a computer file. Putting it on the internet is a pretty small step from there.

Do a search on Google, for whatever you like. Look at the first site you find. Why is it there? Why does it publish the information it publishes? Do the same with the next site, and so on. You’ll find that it’s easy to think of realistic reasons.

Quite possible.

The Internet went through three main phases in its history, and Google has ended up spidering through material from all three of them to a greater or lesser extent.

First, the Internet was a research tool and an ARPA project. It was born in an attempt to create a highly decentralized communications network that could survive the decapitation of the military and civilian authority structure during a nuclear war. It filtered into the academic world and to some extent beyond, where it allowed enthusiasts and various technical people to exchange information of all kinds. Usenet and Bitnet flourished during this time, and Google’s Usenet archives (inherited from the older DejaNews corporation) go back to this period.

Second, the Internet was a huge fad. The World Wide Web made it attractive to businesses and other non-technical groups, and people felt obligated to do something with it. Google crawled the pages left behind from this initial surge of interest. Indeed, it was the exponential growth during the early years of the Web that forced people to develop search algorithms as advanced as the ones Google uses. The relatively orderly and noncommercial academic world simply doesn’t need them. (‘Noncommercial’ in the sense that very few people pre-Web would think of deliberately fooling a search engine. Some of Google’s complexity, and a good part of its effectiveness, comes from countering the sophisticated, organized attempts to scam it.)

Now, the Internet is a recognized way of doing business and interacting socially. Information is published online as part of a business plan (beyond ‘The Internet is a huge money pot!’ ;)), or by people who have an interest in something and want to share it with as many people as possible. The Internet is undergoing a less exuberant but more sustainable growth in this more rational time, and Google is still indexing it.

There is a whole lot of stuff that Google won’t find, or if it does, it’s just comments on someone’s web page rather than any real reliable source. The number of results you receive depends largely on how you phrase the query.

If you’re wondering how it’s done, which is what I get from the title of the thread but not necessarily your OP, there are quite a few different components, which can very, very basically be broken up into the following (a rough sketch follows below):

The crawler, which scans pages and then follows any links on each page to other pages, and so on.

The query handling, which takes your input and parses it to figure out what it is you want returned.

The ranking, which orders the results (whether a page shows up higher or lower in the results for your query). Much of this logic is based on how many other sites link to or reference a source, rather than on how well it specifically matches your particular query (in the case of multiple nearly identical matches).

There is of course far more to it, such as the indexing, as well as the pieces that tell you whether a result is a PDF, let you see a cached version, and so on; not to mention all the logic in the data repositories and all that.
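To make those pieces a bit more concrete, here is a minimal toy sketch in Python. Everything in it (the hard-coded "fake web" of three pages, the simple inverted index, and the inbound-link count used for ranking) is made up for illustration; it is nowhere near what Google actually runs, but it shows roughly how a crawler, an index, query parsing, and a link-based ranking fit together:

import re
from collections import defaultdict

# Fake web so the sketch runs with no network access: url -> (text, outgoing links).
FAKE_WEB = {
    "page-a": ("google crawls the web and indexes pages", ["page-b", "page-c"]),
    "page-b": ("an inverted index maps words to pages", ["page-c"]),
    "page-c": ("ranking favours pages with many inbound links", ["page-a"]),
}

def crawl(start_url):
    # Follow links from a start page, collecting every reachable page.
    seen, frontier = set(), [start_url]
    while frontier:
        url = frontier.pop()
        if url in seen or url not in FAKE_WEB:
            continue
        seen.add(url)
        frontier.extend(FAKE_WEB[url][1])   # follow the links on this page, etc.
    return seen

def build_index(urls):
    # Inverted index (word -> pages containing it) plus inbound-link counts.
    index, inbound = defaultdict(set), defaultdict(int)
    for url in urls:
        text, links = FAKE_WEB[url]
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(url)
        for target in links:
            inbound[target] += 1            # crude stand-in for link-based ranking
    return index, inbound

def search(query, index, inbound):
    # Parse the query into words, intersect the postings, rank by inbound links.
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches, key=lambda url: inbound[url], reverse=True)

if __name__ == "__main__":
    pages = crawl("page-a")
    index, inbound = build_index(pages)
    print(search("pages links", index, inbound))    # -> ['page-c']

Running it prints ['page-c'], the one fake page containing both query words; if several pages matched, the one with the most inbound links would come first. Google’s real ranking (PageRank and a great deal else) is vastly more elaborate, but the division of labour is roughly this.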

On the hardware end, they use a lot of clusters. A lot. Basically, it’s known as throwing hardware at the problem. I think I can say that the programmers don’t have to be too concerned with things like processor usage, memory usage and all that, unlike many programmers in the web world (processor and memory are far higher concerns when programming for high-volume sites than, for example, when programming desktop apps - usually ;)). Instead, Google has made the commitment to use however much hardware is necessary, which is not feasible for many other companies. They have tens of thousands of rack boxes, all running Linux. They don’t rent space in datacenters; they have their own dedicated datacenters.

You may be interested in a 1945 Atlantic Monthly article by Vannevar Bush called “As We May Think.” Essentially it’s about how scientific research will move forward faster as information becomes more easily available.