According to an Internet research class I took last semester, over seventy percent of the web is classified (by whom, I don’t know) as Deep. This was given as one of the reasons we had to learn all of their search techniques and not just rely on Google. What could a person skilled in web research find (from free pages) that someone who used Google as their only tool could not?
Also, while we are on the subject, how did Google become “The Thing” anyway? Years ago I used AltaVista, Lycos, and others; then it seemed like everyone just started using Google. Was this because it was somehow revolutionary, or was it just clever marketing?
There’s a lot of stuff on the web that isn’t really hit by Google, and there are specialized search engines for that info. Boardreader is one example of a search engine that scans popular message boards.
“The Deep Web” commonly refers to pages which are not indexed by traditional search engines. Many of these pages are “dynamic” (like the SDMB), meaning that instead of existing at a fixed address, they are generated by the server each time you tell it to.
The Deep Web is that portion of the World Wide Web that isn’t registered or indexed by search engines. Some of it is inaccessible to all but registered users with passwords, some of it is accessible, but you’d need to know the URL and type it in yourself. Some of these pages will be accessible from an index page, but others aren’t listed, and you’d need to type the URL down to the last slash. I have pages on my site that I can only find with my FTP client, which lists the entire contents of my web account.
If you’re reading this thread, you’re accessing the Deep Web. The deep web is the stuff that’s stored in databases and accessed by sending database query parameters through URLs. Look at the URL for this thread. See the “?t=296669” at the end? That’s used by the server to construct a database query for the posts in the thread. Click on a post number from this thread and look at the URL; the stuff after the “?” is used to find that specific post in the database.
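The mechanism described above can be sketched in a few lines of Python: the server pulls the parameters out of the query string (the part after the “?”) and uses them to look up rows in its database. The URL below is just an illustrative example of the kind of thread URL being described.

```python
from urllib.parse import urlparse, parse_qs

# A thread URL of the kind described above (illustrative example)
url = "http://boards.straightdope.com/sdmb/showthread.php?t=296669"

# The server would parse the query string to get the thread ID,
# then use it in a database query like "... WHERE threadid = 296669"
params = parse_qs(urlparse(url).query)
print(params["t"][0])
```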
Search engines crawl the Web by following links. On a database-driven page, pages that may never be linked to can be created dynamically using query parameters in the URL. This is especially the case for search-driven pages, in which there are few static links. Thus a search engine like Google will never find them just by following links. (However, Google does index some of the Deep Web. As much as 30% by some estimates.)
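A toy crawler over an in-memory “site” makes the point concrete: the crawler only ever visits pages reachable by following links, so a dynamically generated page that nothing links to is simply never discovered. The page names here are invented for the demo.

```python
from collections import deque

# Toy site: each page maps to the links it contains. A dynamic page
# that nothing links to (e.g. a search-result URL) never appears in
# any page's link list, so a crawler can never reach it.
site = {
    "/index.html": ["/about.html", "/forum?t=1"],
    "/about.html": ["/index.html"],
    "/forum?t=1": [],
    "/forum?t=99": [],  # exists on the server, but no page links to it
}

def crawl(start):
    """Breadth-first crawl by following links, like a search spider."""
    seen, queue = set(), deque([start])
    while queue:
        page = queue.popleft()
        if page in seen:
            continue
        seen.add(page)
        queue.extend(site.get(page, []))
    return seen

print(crawl("/index.html"))  # "/forum?t=99" is never visited
```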
As for why Google became popular: it’s fast, has a huge index, and has a user interface uncluttered by ads.
Google became so popular so quickly because they had a good search engine which yielded useful results and they spent a lot of time and effort making the site user friendly.
When Google started, broadband was not as widespread as it is now. Google limited graphics to their logo and kept the whole page very, very small, which meant it loaded quicker for people using dial-up. This gave them an instant advantage over Yahoo! or Infoseek or whatever other portal with a dense, cluttered home page.
Google just made it simple. The homepage wasn’t overwhelming, search results seemed a little more on topic, and it was fast.
And Google’s first big advantage was its cache. If a link no longer worked, or went to a newer page, you could still obtain the information by going to Google’s cache of the page. And your search words would be helpfully highlighted.
This is a huge advantage of Google that no one else has matched. I’m always amazed - or disgusted, depending on the article - when it’s not mentioned.
And over the years Google has always been ahead of everyone else in adding new features that the others have to play catch-up on, usually very poorly.
They had a better algorithm. They ranked pages based on how many pages linked to that page, which no one else was doing. This made it more likely that the page you wanted was actually at the top of the search results. Ergo, people (like myself) switched from those other search engines to Google, because it was better.
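That link-based ranking idea (PageRank) can be sketched with a short power iteration. This is a minimal illustration, not Google’s actual algorithm or code: each page spreads its score over the pages it links to, damped by a constant, so heavily linked-to pages rise to the top.

```python
# Minimal PageRank-style sketch (illustrative, not Google's real code):
# a page's score is distributed over the pages it links to.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its rank over everyone
                for q in pages:
                    new[q] += damping * rank[p] / len(pages)
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# "a" is linked to by both "b" and "c", so it should rank highest
links = {"a": ["b"], "b": ["a"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # "a", the most linked-to page
```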
The critical thing Google had going for it from my perspective was that the other search engines immediately soiled their product by surreptitiously ranking results based on how much the advertiser had paid. That told me they were all untrustworthy.
Google made a point, and still does, that no amount of money will affect the results of the search. Yes, they have paid ads alongside the search results, but I’m confident the search itself is giving me what I want, not what some advertiser wants to show me.
And that difference, in my mind, was 90% of why Google killed off the other search engines in 1998–2000.
Revolutionary - not marketing at all. What I remember from first using Google was that it was freakishly fast. When you submitted a query to AltaVista, there was always a slight delay before the results. With Google, the results came back immediately. The difference was so pronounced that people just switched over to Google immediately without looking back. After that, all of the other advantages mentioned in this thread were icing on the cake (smart page rankings, low-noise results, cached pages, etc.).
From what I have read, Google first focused on building an amazing distributed filesystem based on commodity hardware and then added the search engine application on top of it. This strategy has allowed them to do exactly what you said - roll out new applications and features faster and cheaper than their competitors. The cached pages, the large free Gmail accounts, and the digitization of university libraries are examples of apps that they can roll out but that give the competition a lot of problems.
INACG but I’ve searched for images on Google. According to what carterba posted all those images would be part of the “Deep Web”. [sup]The Deep Web is where all those porn pics are kept! Deep stuff.[/sup]
Google can search dynamically-generated Web pages. Googlebots, Yahoo Slurp bots, and other search spiders crawl through and index my vBulletin-based forum all the time. However, the SDMB server is configured to keep search spiders out, because they’ll bog down an already-hammered server. (This applies only to legitimate bots, which will look for a certain file on the server to see if access is denied. E-mail grabbing spiders from spambots will ignore the file and blaze forward regardless.)
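That “certain file” is robots.txt, and Python’s standard library can show how a polite spider consults it before fetching anything. The rules below are hypothetical, just the kind of thing a hammered forum server might serve:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt like one the SDMB might serve to keep
# well-behaved spiders away from its dynamic forum pages.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /sdmb/",
])

# A legitimate bot checks before crawling; a spambot just ignores this.
print(rp.can_fetch("Googlebot",
                   "http://boards.straightdope.com/sdmb/showthread.php?t=296669"))  # False
print(rp.can_fetch("Googlebot",
                   "http://boards.straightdope.com/index.html"))  # True
```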
A search engine can’t get to a site that requires submitting a form. This also includes any page that can only be loaded by clicking a form button; a form button is different from a graphic button wrapped in an <a href="(some URL here)"> link.
An example of the “deep Web” would include the US census site, where there are gigabytes of data available for browsing … but behind a form-based interface. Search engines can follow links, but they can’t fill out forms or click on their “Submit” buttons. The information is available to the public, but it’s not backlinked through an HTML A HREF tag; thus, Google can’t find it.
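A crawler’s-eye view of a page shows the difference: a link extractor happily collects <a href> targets but has no way to fill out and submit the <form>, so everything behind the form stays invisible. The page markup here is invented for the demo.

```python
from html.parser import HTMLParser

# What a crawler sees: it can collect <a href> targets, but it has no
# way to fill in and submit the <form>, so the data behind it is hidden.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

page = """
<a href="/help.html">Help</a>
<form action="/census/query" method="post">
  <input name="state"><input type="submit" value="Submit">
</form>
"""
p = LinkExtractor()
p.feed(page)
print(p.links)  # only '/help.html'; the form's data pages are invisible
```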
Mailing list archives are another example. Many are viewable by subscribers only, and thus blocked from spiders. Sites requiring CAPTCHA confirmation for entry – verifying a word or number in a distorted image that can only be read by humans – can’t be crawled.
Newspaper archives are usually considered “deep web.” Google may have links to articles, but that’s because the site was crawled when the article was still online. After so many days, an article goes into the archives, where Google can’t reach it. Eventually, the Google cache expires, and the article disappears completely; to see it, you have to visit the newspaper site and pay a hefty fee.
One other thing about Google is that it does not get hooked as easily on just keywords without context.
Some websites of less than savoury repute list lots of known popular search words that have nothing at all to do with the website; they are just on the page so that search engines can latch on to them.
The result is that you get zillions of pages, often with click-through links to other sites that you are not at all interested in.
Other search engines also tend to bring up huge numbers of indexes to other websites and advertising, such as Kelkoo and Dealtime; Google seems to have reduced this dramatically compared to its rivals.
If you are looking for, say, reviews of a particular graphics card, getting loads and loads of hits on index pages of other sites that are trying to sell that card (nearly always at relatively high prices, too) gets pretty irritating. Google seems to filter out most of them, while other search engines seem to be mostly about pushing merchandise dealers rather than trying to provide information.