Why isn't the SDMB indexed on Google?

I remember reading them there at Google in the early days. I even suggested once they use the free search your-own-board feature they have at Google to make searching this board faster. As you can see, it wasn’t approved :slight_smile:

Umm…sure, as long as the link actually exists somewhere. Google is not going to and can not query the database to determine what threadviews exist.

I think Arnold will agree the answers are, respectively, no, no, no, and yes. :slight_smile:

‘can not’? Why not?

The other one was able to do it, the archive site, so why couldn’t, and isn’t, Google doing it too?

True, there should be no difference between a user with his browser clicking on a link and a robot following it. The pages are created on the fly but are available when requested. (if the hamsters keep up with it)

The threads are listed on the thread list pages. (Duh! :)) If a crawler knows the address of only one thread page, it could read the entire forum just by ‘clicking’ on “Last Thread” and “Next Thread”. Perhaps there is one here at this very moment! [I want a :**paranoid: smiley]

While looking over my shoulder I found out you also can exclude robots by adding a meta-tag to your pages. The only tag on the dope pages says ‘Microsoft-compatible’ :rolleyes:
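For anyone who wants to check a page for that meta-tag themselves, here is a minimal sketch using Python's standard `html.parser`. The sample markup and the `robots_directives` helper are made up for illustration; this is not how any particular crawler does it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise to "".
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                part = part.strip().lower()
                if part:
                    self.directives.add(part)


def robots_directives(html):
    """Hypothetical helper: return the set of robots directives in a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Illustrative page head, not copied from the real board.
SAMPLE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(robots_directives(SAMPLE))
```

A well-behaved robot that sees `noindex` leaves the page out of its index, and one that sees `nofollow` does not follow the links on it.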

Now we know it, I’d say adding one line to a text file would be definitely worth the effort to relieve the strain on the server. But don’t lock out search engines completely from the dope, because it was an Altavista hit on one of Cecil’s columns that brought me here. :slight_smile:
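To show what "one line in a text file" might look like, here is a rough sketch that feeds a hypothetical robots.txt to Python's standard `urllib.robotparser`. The two `Disallow` paths are my guess at the reply and edit pages, and `ExampleBot` is a made-up crawler name; the board's real layout may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep robots away from the edit/reply pages
# but leave the thread pages open to search engines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /sdmb/editpost.php
Disallow: /sdmb/newreply.php
"""

BASE = "http://boards.straightdope.com"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """True if a rule-abiding robot may fetch BASE + path."""
    return parser.can_fetch(agent, BASE + path)


print(allowed("/sdmb/showthread.php?threadid=1234"))
print(allowed("/sdmb/editpost.php?action=editpost"))
```

The rules match by path prefix, so one `Disallow` line covers every edit link on the board. It only helps with robots that choose to honour robots.txt.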

> “Discussed before? - Yes”
I didn’t search the board for “Google search” because the number of returned results would have reached critical mass and I was afraid of causing a nuclear meltdown. :smiley:

I checked some links at that archive site, and it looks like they were retrieving the thread pages from right here, but replacing the images with archived versions, where available. On a thread to which I had posted, I found that the archived page showed my current, to the minute post count, for instance. The icons at the bottom of each post were mostly the old icons, but “buddy” was also there, in the new form (since there wasn’t an old buddy icon).

A thought, by the way: Is it possible that one of the Web archiving sites might have caught a few snapshots of the SDMB during the Time of Uncounted Tears? I wouldn’t expect to be able to restore them to their proper place in the SDMB database (the mere thought of the work involved gives me a headache), and I wouldn’t expect everything we lost to be available, but it’d be nice if some of it were still available somewhere.

It’s definitely strange, to say the least. As I mentioned above, with my browser it seems to give me half and half. That is, I get some of the old board- content- but some of the new stuff- buddy and whatnot.

What’s stranger still is that it sometimes reverts to threads current here, now. That shouldn’t happen.

For example, when I clicked on an old pit thread the other day it sent me to a new one started in, on, ::I’m getting confused:: this board. Now I know I was browsing in the archive site but it led me to a new thread posted only a day or two ago here.

Weird.

But what I mentioned, the really weird part, is that once I went back, only minutes ago, it showed the old thread, as it once was.

If you ask me, it seems to be a browser problem. It’s as if the browser gets confused on what you’re after. But trying it in Opera and even trying it with no other windows open and the cache and everything else cleared, it still gets the same result- screwy responses.

I’d be curious if someone who knew HTTP could explain what might be going on.

All right, just to follow up my last post and try and provide an example of how screwy this is.

Go back to that Napster discussion I linked to. Click it and see what happens.

In my case, the first time I clicked it, I got the old style Vb and all of the posts. The second time I clicked it, I got a “Page Not Archived” reply generated by the archive.org site. The third time, I get the current Vb style, but the old posts.

Weird.

If you ask me, it’s gotta be a confused browser issue.

From a cost-benefit analysis, no, there is no practical way. Theoretical way yes; practical, no.

Another option is people’s browser caches, but it’s still much too hard. It would require a long night of sitting down and opening the MySQL console and writing some scripts to insert just a few threads back in - I’ve done similar things, and damn, it’s just not worth it.

I’m going to defer to Anthracite on this one because I don’t know exactly how this would work.

Google doesn’t have to know what it’s doing. If it crawls starting from sdmb.straightdope.com/sdmb without limiting its depth, it seems highly likely (possibly even guaranteed) that every thread can be encountered simply by following simple links.

For example, within three clicks of the main page, the tree fans out into links to every thread ever posted to by a moderator, and three more clicks from that, you can get to every thread posted to by anyone who’s ever posted in the same thread as a moderator. I’d call that pretty good penetration. :slight_smile: With a little more effort, you could probably find ways to get to every single post without ever having to fill out a search form.
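The "just follow the links" idea can be sketched as a breadth-first walk. The link graph below is a toy I made up (page names and all), standing in for whatever the real board links to; a real crawler would fetch pages and parse the links out of them.

```python
from collections import deque


def crawl(start, links):
    """Breadth-first walk over a link graph.

    Returns {page: number of clicks needed to reach it from start}.
    Pages that no chain of links leads to are simply absent.
    """
    clicks = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in clicks:  # visit each page only once
                clicks[target] = clicks[page] + 1
                queue.append(target)
    return clicks


# Toy graph: index -> forum list -> threads -> a moderator profile -> more threads.
LINKS = {
    "index": ["forum"],
    "forum": ["thread1", "thread2"],
    "thread1": ["profile_mod"],
    "profile_mod": ["thread3"],
    "thread2": [],
    "thread3": [],
    "orphan": [],  # nothing links here, so a crawler never finds it
}

reached = crawl("index", LINKS)
print(reached)
```

The only pages a crawler misses this way are the ones nothing links to, which is why a depth limit matters more than whether the pages are generated on the fly.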

You make a good point, galt, I admit. But Google does know whether a URL is dynamic or not - does Google index dynamic URLs? I mean, that “&” in the URL is a dead giveaway after all. :confused:

Well, I would hope google isn’t going to refuse to index a page just because it’s got a ‘&’ in the URL, but it could conceivably consider two pages the same if they only differ in the arguments (the part after the “?”). So in other words, it may consider showthread.php?threadid=1234 and showthread.php?threadid=2345 the same page, which would make indexing a site like SDMB impossible.

On the other hand, they must do some sort of pruning of the pages they look at on the basis of the arguments, since otherwise, it would be easy to get stuck on a site indefinitely (i.e. I could have a page, blah.php which links to blah.php?foo=1, and that page dynamically creates a link to blah.php?foo=2 and so on… If google considered these all unique pages, it’d be indexing for a while).

Sites which basically have a small set of pages which show different content based on URL arguments are very common. I wonder what google does about indexing these sites in general.
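To make the two extremes concrete, here is a small sketch with Python's `urllib.parse`. It is purely an illustration of the trade-off, not a claim about what google actually does; `example.com`, the `highlight` parameter and the `keep` whitelist are assumptions of mine.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_query(url):
    """Treat everything after the '?' as noise: all threads collapse into one page."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def canonical(url, keep=("threadid",)):
    """Keep only the whitelisted arguments, sorted, so equivalent URLs compare equal."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in keep)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


a = "http://example.com/showthread.php?threadid=1234"
b = "http://example.com/showthread.php?threadid=2345&highlight=google"
c = "http://example.com/showthread.php?highlight=dope&threadid=1234"

print(strip_query(a) == strip_query(b))  # both threads look like the same page
print(canonical(a) == canonical(b))      # different threads stay distinct
print(canonical(a) == canonical(c))      # same thread, different decoration
```

Dropping all arguments makes a board impossible to index, while keeping all of them invites the blah.php?foo=N trap; a crawler presumably needs something in between, such as a whitelist like this or a cap on pages per site.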

[tangent]
It would be easy enough to hack a web server to handle arguments differently in order to fake something like google out: Define a directory (i.e. /showthread/) with the peculiar behavior of dynamically mapping the rest of the path to url arguments, so that a request to http://myserver/showthread/threadid=1234/foo=bar gives you the same results as http://myserver/showthread.php?threadid=1234&foo=bar. It changes the allowable character set you can use in arguments ever so slightly (you can’t use slashes), but that problem could be mitigated using quoting or some other scheme.
[/tangent]
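A rough sketch of the mapping from the tangent above, written as a plain Python function rather than an actual web server hack. The `path_to_query` name and the `.php` default are assumptions; slashes inside a value are assumed to arrive percent-encoded as %2F, which is one possible quoting scheme.

```python
from urllib.parse import urlsplit


def path_to_query(url, script_ext=".php"):
    """Map /script/key=value/key=value onto /script.php?key=value&key=value.

    The path is split on literal '/' only, so a value that needs a slash
    must carry it percent-encoded (%2F); it is passed through unchanged.
    """
    parts = urlsplit(url)
    segments = [seg for seg in parts.path.split("/") if seg]
    if not segments:
        return url  # nothing to map
    script, args = segments[0], segments[1:]
    rewritten = f"{parts.scheme}://{parts.netloc}/{script}{script_ext}"
    if args:
        rewritten += "?" + "&".join(args)
    return rewritten


print(path_to_query("http://myserver/showthread/threadid=1234/foo=bar"))
```

To a crawler the path form looks like an ordinary static page hierarchy, which is the whole point of the trick.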

http://www.google.com/webmasters/2.html

Found nothing there about what criteria the limiting is based on. :frowning:

Anthracite, in item 2: “Google does offer a custom site search service for a fee.”
So the answers are, respectively, “could be, yes - for a fee, naaaah, and yes”. :smiley:

Further inspection with my limited javascript knowledge of the code added by archive.org reveals it simply adds to every link and image a prefix like “Wayback Machine”. With javascript turned on, your browser should send all requests to them. But occasionally I see “connecting to boards.straight…” in the status line, so I may add the nitpick that the replacing is done the other way round: They give you what they have and redirect your browser to the original site to get the remainder (in the hope it’s still active and hasn’t changed the files, but they can’t know).

That, and the fact that only the browser knows how it handles requests to a site mingled with redirections to files or images that may or may not already be in the cache mixed with files or images from earlier redirections that may or may not come from the original site or from the archive or from your normal dope surfing or … OK, I’ll stop here. :smiley: CnoteChris is right. Weird. But they have to explain. :slight_smile:
I threw some queries at archive.org:
http://web.archive.org//straightdope.com (1391 results)
http://web.archive.org//boards.straightdope.com/ (1144)
http://web.archive.org//boards.straightdope.com/sdmb/showthread (1228)
http://web.archive.org//boards.straightdope.com/sdmb/editpost.php?action (1003)
http://web.archive.org//boards.straightdope.com/sdmb/newreply (1061)
I suspect they stop the search after about 1000 results, because the numbers don’t add up.

Their FAQ says they crawl the entire web once ‘every few months or so’. Guess that not every search engine does its own indexing. But I wonder how much traffic those robots add up to on a board of our size, if they are allowed to freely move around. ‘Limiting’ Google or not.

Our exhausted but patient hamsters told archive.org alone, over and over again, at least 1000 times each, that it isn’t logged in and can’t edit that post or quote from it. Poor creatures. Somebody do something! I really begin to like those little animals. :slight_smile:

The stuff from the temp board is up there, or at least some of it. When I searched my own username, I found two links to one thread that I posted to twice, along with my user profile over there. (Also a bunch of links to gastrointestinal labs, complete with fun pictures of intestines!)

Looks like some good research has been done here. It has been informative, and apparently some things are different than they used to be, or I didn’t get them right in the first place.