What determines whether a given web page will be included in a Net search?

For example, if someone was sick enough to search for the phrase “KarlGauss” using Google they wouldn’t detect any of my posts at the SDMB.

[hijack]

Karl,

I think you’ve misattributed your sig-quote. According to the Oxford Dictionary of Quotations, 3rd ed., it was John Wilkes to Lord Sandwich:

Oxford cites Charles Chenevix-Trench, Portrait of a Patriot (1962), but qualifies it with a reference to H. Brougham, Statesmen of George III, 3rd series (1843). Since I don’t have that, I don’t know who Brougham attributed it to.

Whoever said it, it’s a witty phrase.

[/hijack]

Many sites have a “robots.txt” file which tells crawlers and robots which pages not to include in their indexes.

Otherwise, it just depends on whether or not the page in question has been submitted to (or found by) the search engine.
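To make the robots.txt idea concrete, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. The robots.txt contents, the `ExampleBot` name, and the example.com URLs are all made up for illustration; a real site's file will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# A well-behaved crawler would ask this question before fetching each page.
print(allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/index.html"))
print(allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/private/notes.html"))
```

Note that this is purely voluntary on the crawler's side: the file only tells a polite robot what to skip, it doesn't actually block anything.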

Maybe. Several net sources (if they can be trusted) mention that the quote has been variously attributed to Wilkes/Sandwich but assert that the Disraeli/Gladstone one may be primary. Regardless, thanks and now to find a new sig line!

The robots.txt standard is a bit old (1994), but is still largely in force. There is also meta tag syntax you can place in your web pages. See:

http://info.webcrawler.com/mak/projects/robots/robots.html
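For the meta tag route, a rough sketch of how a crawler might read a page's robots meta tag, using Python's built-in `html.parser`. The sample page and the class and function names are invented for the example, and real crawlers may interpret the directives differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute names arrive lowercased; values keep their original case.
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    """Return the set of robots directives found in the page's meta tags."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page asking robots not to index it or follow its links.
PAGE = '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW"></head><body>Hi</body></html>'
print(sorted(robots_directives(PAGE)))
```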

Most effort being put into standards for web crawlers is concerned with making the crawler be a good citizen and not hog bandwidth by being inefficient in its operation.

These days, any well-known commercial search engine will have plenty of people registering their sites with it, and many people are very concerned with how to get their site to turn up with great frequency or at the head of the list in a given search engine. There are meta tags you can use to define keywords for indexing and so on, and a lot of information out there concerning how to optimize for Google, Lycos, AltaVista, etc. (some of it better than others, obviously). With this sort of competition going on, a site which does not register itself with a search engine will turn up on a very hit-and-miss basis indeed.

You might ask in ATMB about whether the SDMB registers with any search engines, and bring up the issue of whether it should.

Concerning web bots in general, my personal hope is that emerging XML-based standards will bring some sanity to the situation, and allow the construction of intelligent agents for the mass market. Trouble is, you have to get the commercial marketplace to buy into them, and we are currently in the “pissing contest between competing standards” stage on a lot of this stuff.

‘For example, if someone was sick enough to search for the phrase “KarlGauss” using Google they wouldn’t detect any of my posts at the SDMB.’

Maybe. As soon as someone submits the board to the search engines again, they are going to show up.

I think the reason SDMB posts wouldn’t show in a search engine is because they’re not really a web page, at least not a page that can be indexed.

If you look at the address bar, you’ll notice that the URL is www . . . /showthread.php?[and a whole buncha stuff]. As I understand it, the .php file is a script that resides on the SDMB server, and the stuff after the “?” is a set of variables that tell the script what information to retrieve. The “page” is a bunch of data residing on the server that the script locates and formats for presentation, not a discrete HTML web page that can be read.
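To illustrate the script-plus-variables structure of such a URL, here is a small Python sketch that splits one apart with `urllib.parse`. The hostname and the parameter names (`threadid`, `pagenumber`) are hypothetical stand-ins, not the board's actual address or variables.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL: the host and the parameter names are invented for illustration.
url = "http://boards.example.com/showthread.php?threadid=12345&pagenumber=2"

parts = urlsplit(url)
script = parts.path                 # the script the server runs
variables = parse_qs(parts.query)   # the variables handed to that script

print(script)
print(variables)
```

Everything before the “?” names the script; everything after it is a list of name=value pairs that the script reads to decide what to pull out of its database.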

Think of a web page as a movie theater marquee: all of the information on each showtime is put up and may be read by anyone. A script-driven page is like a theater that doesn’t put information up, and makes you ask the ticket agent: you specify the information you’re looking for (“What movies are playing at 7:00?” or “What are the showtimes for ‘Teen Sex Comedy III’?”) and the agent (the script) processes that question and returns the answer to only your particular question. Now think of the search engine robot as a person compiling a list of movie showtimes: it can drive by the first theater and just copy the information off, but the marquee at the second theater is blank, and it is unable to properly address the agent to determine showtimes at the second theater.

The precise reasons why search robots are unable to access scripts are better left to tech people to explain. But I don’t know of any search engines that index information obtainable only via script. That’s why lots of online newspaper stories are not indexed. Papers post their articles via scripts, since it’s easier to just store the content in a database and retrieve it via script than it is to compose a separate HTML page for each article.

Ok, techies, how’d I do?

If it’s a valid URL, and if there’s a link to it anywhere, then a search engine can find it, regardless of the nature of the information. I have, in fact, heard of folks finding things on here showing up on search engines. The fact that your name doesn’t show up here on search engines means that either, A, the spiders haven’t been here since you signed up, or B, that the Chicago Reader webmasters have recently put in a robots.txt blocking the message board.
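A toy Python sketch of that point: a crawler just follows links, and a URL with a “?” in it is a link like any other. The in-memory `SITE` dictionary stands in for real HTTP fetches, and all the URLs in it are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


# Hypothetical in-memory "web": URL -> HTML the server would return.
SITE = {
    "http://example.com/": '<a href="forum.php?f=1">Forum</a> <a href="about.html">About</a>',
    "http://example.com/forum.php?f=1": '<a href="showthread.php?threadid=7">A thread</a>',
    "http://example.com/about.html": "No links here.",
    "http://example.com/showthread.php?threadid=7": "Post text.",
}


def crawl(start):
    """Breadth-first walk over SITE, returning URLs in the order visited."""
    seen, queue = [], [start]
    while queue:
        url = queue.pop(0)
        if url in seen or url not in SITE:
            continue
        seen.append(url)
        parser = LinkParser(url)
        parser.feed(SITE[url])
        queue.extend(parser.links)
    return seen


for visited in crawl("http://example.com/"):
    print(visited)
```

The crawler never needs to know whether a page came from a static file or a script; it only sees the URL and whatever HTML comes back.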

Nurlman, what you are getting at is that some pages contain dynamic content. As Chronos said, the search engine reaching the URL doesn’t have any knowledge of how the content got built by the server. Look through the results on Google, for instance, and you will find various forms of CGI scripts and server plugins. And the fact that the initial link is HTML doesn’t necessarily mean it’s “static” - the page may redirect or contain other URLs which are dynamic, or there may be some daemon process refreshing the pages on the server side, which essentially makes them dynamic. The web has many weird and wonderful ways of connecting the plumbing between your browser and the content.

A slight expansion on the above - search engines are not typically constructed to be able to reach pages which are the results of form submissions - in theory a crawler could find the inputs on a page containing a <FORM>, fill them in, and submit it. Since the engine probably lacks knowledge of what should go in the fields, it would be sort of a pointless exercise.
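As a sketch of the first half of that (finding the inputs), here is how a crawler could list the fields of a form with Python's `html.parser`. The sample form and its field names are made up. What the sketch can't do is the hard part: know what values would be sensible to type in.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Record each <form> on a page with its action, method, and named fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attr.get("action", ""),
                "method": (attr.get("method") or "get").lower(),
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            name = attr.get("name")
            if name:  # unnamed controls (e.g. a bare submit button) send no data
                self._current["fields"].append(name)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


# Hypothetical search form, invented for this example.
PAGE = (
    '<FORM ACTION="search.php" METHOD="POST">'
    '<INPUT TYPE="text" NAME="query">'
    '<INPUT TYPE="submit" VALUE="Go">'
    "</FORM>"
)

parser = FormFieldParser()
parser.feed(PAGE)
print(parser.forms)
```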

Of course, the result is that search engines can be inaccurate when they happen to have scanned and indexed a dynamically generated (or regularly updated) page. This is why you get “hits” on news articles that have nothing to do with your topic - the search engine indexed an article that is no longer current.

A better way of using your movie analogy might be that the search engine comes along, dutifully notes that there’s a theatre here playing “The Muppet Movie”, not realizing that by the time somebody interested in children’s fare pulls the address out, it might be playing “Howard Stern’s Private Parts”.

A better way to have things work for search engines and other more sophisticated agents is to have the pages be able to describe what kind of stuff they are really providing, what its currency is, the kind of parameters that ought to apply in forming the requests, and so on. This is what adoption of XML promises, but the details are still getting worked out, so for now we have to put up with search engines that scrape the content. BTW, that is a description you hear widely employed for obtaining information from pages intended normally for display - “HTML scraping”.
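For anyone curious what “HTML scraping” looks like in its simplest form, here is a minimal Python sketch that throws away the markup and keeps the visible text. The sample page is invented, and this is just one simple way to do it, not how any particular search engine works.

```python
from html.parser import HTMLParser


class TextScraper(HTMLParser):
    """Keep the visible text of a page, ignoring tags, scripts, and styles."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside <script>/<style> and not just whitespace.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def scrape_text(html: str) -> str:
    """Return the page's visible text as one space-separated string."""
    scraper = TextScraper()
    scraper.feed(html)
    scraper.close()
    return " ".join(scraper.chunks)


# Hypothetical page meant for display, not for machines.
PAGE = (
    "<html><head><title>Showtimes</title><style>b{color:red}</style></head>"
    "<body><h1>Now playing</h1><p>The Muppet Movie, <b>7:00</b></p></body></html>"
)
print(scrape_text(PAGE))
```

The scraper gets the words but has no idea which ones are a movie title and which are a showtime, which is the gap that self-describing pages are supposed to close.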

I’ve actually written an HTML scraping system I intended to market commercially at one point - sort of a construction kit for 'bots.

Visit ask.com and ask a question; often the SDMB shows up.