Google search question

Why is it that sometimes links pulled up on a Google search do not contain the text that is presented with the link? And other times they do?

For example, the link below was one of the things pulled up when I googled “neurotheology”:

Andrew Newberg
www.andrewnewberg.com/
Leading neurotheology researcher who conducted famous neuroimaging studies on monks; collaborator with Dr. Eugene d’Aquili; author of “Mystical Mind” and …

If you go to that page however you will find that the quoted text is not there.

In contrast, right next to that link is another:

Neurotheology: This Is Your Brain On Religion : NPR
www.npr.org/.../neurotheology-where-religion-and-science-collide
NPR
Dec 15, 2010 - Newberg tells NPR’s Neal Conan that neurotheology applies science and the scientific method to spirituality through brain imaging studies. “[We] evaluate what’s happening in people’s brains when they are in a deep spiritual practice like meditation or prayer,” Newberg says.

That page does contain the quoted text.

What gives?

The text quoted on the search result seems to be from here as well as several other pages that link to the andrewnewberg.com site. I speculate that the google doc server, in addition to containing the text of the document itself, can also return text from other highly-ranked pages that link to the document.

In fact, that’s the core of what made Google so much more useful than all the other search engines, back in the day when there were others: They searched by what linked to pages, not just by the pages themselves.

I found this video talking about snippets. One thing it mentions is that if the site is down when it is indexed, or indexing is forbidden by the robots.txt file, they default to using the description from the Open Directory Project, which also has that text pointing to that site.

Note that it is not just searching by the linked text, but ranking sites by incoming links (weighted by the rank of the pages the incoming links are on). They’re basically taking the principal eigenvector (the one with the largest eigenvalue) of the link adjacency matrix and using its components as a proxy for importance, which does remarkably well as a ranking scheme.
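To make the eigenvector idea a bit more concrete, here is a minimal sketch of the published PageRank approach using power iteration. It is only an illustration, not Google's actual production ranking: the page names are made up, and the damping factor of 0.85 is the conventional textbook value rather than anything stated in this thread.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate the principal eigenvector of a link graph by power iteration.

    links maps each page name to the list of pages it links to.
    damping=0.85 is an assumed, conventional value.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web. "spam" links out, but nothing links to it.
links = {
    "hub": ["popular", "niche"],
    "popular": ["hub"],
    "niche": ["popular"],
    "spam": ["popular"],
}

ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page:8s} {score:.3f}")
```

In this toy graph the page with no incoming links ends up with only the baseline share, while pages linked from well-ranked pages rise to the top.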

I had a similar experience. I googled the name of a friend, and the first dozen or so hits were of a beauty pageant, but neither her first nor her last name could be found in the text from any of the hits. She insists that she was not a contestant in that or any other beauty pageant, nor ever associated with one, although the linked pageant occurred in the obscure country of her residence…

I’m sorry, but I still don’t get it.

And this…

…as much as I greatly appreciate your response and willingness to help, doesn’t help.

I need an analogy. Such as, if I go to a library and ask the librarian to find me some books about neurotheology, she’s going to analyze all the books that mention neurotheology, then pick the most frequently referenced books among those, and recommend those most frequently referenced books, including with a given recommendation not necessarily a quote from the recommended book, but a quote from one of the other books that referenced it?

Is that it?

If so, then pardon me for saying so, but that seems a bit Irish. (As they reference such malarky in Australia, where I’m not from).

Google regularly updates its algorithm and keeps a cached copy of every website before it is listed in the results. The snippet may come from an old cached copy of that particular website, so the text is no longer found on the site (since the owner may have changed or modified their site).

On the other hand, Google automatically takes snippets from the website to highlight in the search results, so it may be an auto-generated snippet.

More often than not, Google corrects this kind of pitfall with each update!

Yes, Google’s algorithms occasionally give you a site that isn’t what you’re looking for. But far more often, they give you sites that are what you’re looking for, even where other search engines wouldn’t realize it. Or even, sometimes, where you yourself didn’t realize it. It works. That’s why they rose above the dozens of other search engines that used to be around, and why they’re, for practical purposes, the only one still active.

Google: the search engine that thinks it knows better than you do what the fuck you’re looking for. Even when you put everything in quotation marks and click “verbatim” and use Advanced Search with “must contain all of these words” etc.

They’ve gotten so much worse about it over the last ~5 years that I’ve got a serious Alta Vista jones.

You’d rather return to the days of web pages consisting of nothing but hundreds of thousands of random words, designed for no other purpose than to be search engine targets?

There are two aspects to the problem: finding out which page is good to return, and how to present a representative snippet of that page to you in the search results. To do the former, they do basically what you suggest – pages are ranked by the number of pages that link to them, with the enhancement that the count is weighted by the ranks of the referring pages (which are determined by the ranks of the pages that refer to them, &c.)

The weighting by referrer ranks is what makes the system robust against Chronos’s “random word” pages. Such pages would have very low rank, and if no other pages link to them or only other low-rank pages link to them, they have no means to acquire a higher rank.

As for the snippetting, the video I linked pretty much says that if they can’t crawl the page, or are requested not to, they use text from the Open Directory Project, which seems to be what happened in your case.
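The fallback described above can be sketched roughly like this. It is a guess at the general shape of the logic, not how Google actually implements it: the function names, the `ExampleBot` user agent, the example URL, and the directory text are all hypothetical. The only real library call is Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser


def crawl_allowed(robots_txt, url, user_agent="ExampleBot"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def choose_snippet(url, robots_txt, page_text, directory_descriptions, query):
    """Pick snippet text for a search result.

    page_text is None when the site was down at crawl time.
    directory_descriptions is a hypothetical stand-in for a human-edited
    directory, mapping URLs to short descriptions.
    """
    if page_text is not None and crawl_allowed(robots_txt, url):
        pos = page_text.lower().find(query.lower())
        if pos != -1:
            # Show some context around the first occurrence of the query.
            start = max(0, pos - 40)
            return page_text[start:pos + len(query) + 40].strip()
        return page_text[:80].strip()
    # Page unavailable or crawling forbidden: fall back to the directory text,
    # which may not appear anywhere on the page itself.
    return directory_descriptions.get(url, "")


directory = {"http://www.example.com/": "Leading researcher in the field"}
blocked = "User-agent: *\nDisallow: /"

print(choose_snippet("http://www.example.com/", blocked,
                     "Welcome to my home page", directory, "researcher"))
```

With crawling blocked, the snippet comes from the directory entry, which is why the quoted text might be missing from the page it points to.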

I don’t see the reason for the opposition, either to the snippetting process, or the Irish.

I don’t recall ever having had that problem.
My bookmarked search page was Alta Vista Advanced Search and I would input boolean search terms (almost never just a single word or phrase):

(last AND (hitchhiker or hitchhiking)) and (holliday or holiday or halliday) and (mystery or fiction or book)

… and I got results that invariably contained exactly what I’d searched for.
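For anyone who never used that kind of search, here is a minimal sketch of strict boolean matching in Python. It is not Alta Vista's actual implementation. The nested-tuple query format and the two sample documents are invented purely for illustration.

```python
import re


def words(text):
    """Lower-case a document and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(query, doc_words):
    """Evaluate a boolean query against a set of words.

    A query is either a single word, or a tuple ("and" | "or", sub, sub, ...).
    """
    if isinstance(query, str):
        return query in doc_words
    op, *parts = query
    results = (matches(part, doc_words) for part in parts)
    return all(results) if op == "and" else any(results)


# The query from the post above, written as nested tuples.
QUERY = (
    "and",
    ("and", "last", ("or", "hitchhiker", "hitchhiking")),
    ("or", "holliday", "holiday", "halliday"),
    ("or", "mystery", "fiction", "book"),
)

# Invented sample documents.
docs = [
    "The Last Hitchhiker, a mystery by an author named Halliday",
    "Holiday hitchhiking tips for students",
]

hits = [d for d in docs if matches(QUERY, words(d))]
print(hits)
```

A document either satisfies every clause or it is not returned at all, so an empty result list is a real answer rather than a best guess.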

This explains things in a way I can understand. Thanks.

As for the Irish, they seem a bit Irish. And, but and or they make a good cup of coffee.

I certainly do. I remember page after page of straight-up spam with no relevance whatsoever, just because search engines looked at the text of the page (even hidden text) and took it as gospel, not taking links into account.

It used to be common to have a load of key words in the same colour as the background so they’d be invisible to the human reader (unless he or she highlighted the text of the site or looked at the page source) but picked up by search engine web-crawlers.

For a while, it was even creepier than that: Search engines were indexing pages based on meta tags, which didn’t appear at all in the body of the actual page but were completely hidden in the page’s HTML source code.
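As a rough illustration of why trusting meta tags was a problem, here is a small sketch using Python's standard `html.parser` that lists meta keywords that never show up in the visible body text. The sample page and the `KeywordAudit` class are made up for this example. It does not catch the same-colour-as-background trick mentioned above, since that text is still part of the body.

```python
from html.parser import HTMLParser


class KeywordAudit(HTMLParser):
    """Collect meta keywords and body words from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta_keywords = []
        self.visible_words = []
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            # Keywords are conventionally a comma-separated list.
            self.meta_keywords += [
                k.strip().lower() for k in content.split(",") if k.strip()
            ]
        elif tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        # Only text inside <body> counts as part of the page content here.
        if self._in_body:
            self.visible_words += data.lower().split()


# Invented sample page: one keyword is honest, one is bait.
PAGE = """<html><head>
<meta name="keywords" content="celebrity gossip, model trains">
<title>My trains</title></head>
<body><h1>Model trains</h1><p>Photos of my model trains layout.</p></body></html>"""

audit = KeywordAudit()
audit.feed(PAGE)
audit.close()

body_text = " ".join(audit.visible_words)
unsupported = [k for k in audit.meta_keywords if k not in body_text]
print(unsupported)
```

An engine that indexed the keywords alone would happily return this page for a search that has nothing to do with its content.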

This page describes it rather well:

The early Internet was primarily academic. It was a research project, not related to nuclear survivability but to simple reliability and cost-savings over older types of network. The early Web was also primarily academic, with a few large corporations using it for admittedly commercial but straightforward purposes. Of course, once the massive growth phase kicked in, the dishonest assholes moved in and arbitraged the living Hell out of absolutely everything in sight. They were winning (or at least succeeding in being noticed, if not succeeding in getting paid*) until Google came along and suddenly you could get search results without a page or three of keyword spam before the first useful result.

*(One of the great and terrible things about Web publishing is how cheap it is. In the print world, if your idea doesn’t turn a profit within a relatively short timeframe, you’re gone, or at least relegated to the mimeograph-and-photocopy world. On the Web, things are cheap enough that you can make a go of a loser of an idea for a lot longer, and this includes spam projects that, frankly, aren’t very successful. Therefore, just because a spam page exists, doesn’t mean it’s making anyone very much money.)

Oh, yeah, meta tags. I once saw a serious “guide to building your own webpage” in a magazine, that advised that you should make sure to put the words “pamela anderson” in your meta tags, because apparently that was the most-searched string at the time. Never mind that that’s probably not what your webpage was about, and that anyone who searched for that wasn’t going to bother to stick around to see what you did have.

I did a search on Google using two technical code-terms. (Not at all real words.)

Of the first 10 pages listed, 8 didn’t have the second of the terms. Another had it but in a link to something else, and not even part of the main body of the page.

One. One page of 10 actually had both terms. And it turned out the info on that page was flat out wrong.

This is very, very typical.

They’ve gone over the edge in terms of being “helpful”. I don’t want extra help. I want exactly what I’m searching for.

Google doesn’t care about “power users” and such. They are catering to the lowest common denominator.

Re: Altavista. They even used to have simple wildcards. E.g., “encyclo*”. That is very helpful.
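For what it's worth, that kind of trailing wildcard is easy to sketch with Python's standard `fnmatch` module. This only shows the matching idea on a made-up word list. It says nothing about how Altavista actually indexed wildcards.

```python
from fnmatch import fnmatchcase


def wildcard_hits(pattern, vocabulary):
    """Return the words in vocabulary that match a shell-style wildcard pattern."""
    pattern = pattern.lower()
    return sorted(w for w in vocabulary if fnmatchcase(w.lower(), pattern))


# Invented vocabulary of indexed words.
vocabulary = {"encyclopedia", "encyclopaedia", "encyclopedic", "encyclical", "cyclone"}

hits = wildcard_hits("encyclo*", vocabulary)
print(hits)
```

One pattern covers both spellings of "encyclopedia" plus related forms, without the engine guessing at what else you might have meant.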

OK. Are you sure that the pages you’re looking for are actually out there? Because a lot of times, given searches like that, Altavista would just not return any hits. Google prefers, instead of giving you no hits, to at least give you hits that look like they might be of interest.

If there are no hits, I want to be told that. That can be very helpful to know.

E.g., in my search, if I had been told “0 hits”, I would have had my answer and stopped right then.

There’s lots of examples in Computer Science where “no information” can be shown to be useful information.