Someone please install search engine software on the SDMB!

I noticed that searches take an enormous amount of time to complete, something like 30 seconds to 1 minute. My guess is that each time I send a search request to the SDMB, the server actually looks through all the old messages. This seems like a waste of resources.

Why don’t we have some open source search engine software installed that will let us search almost instantaneously? The search engine software would create an index of all the keywords that appear in all of the SDMB and lower the amount of CPU time needed for a search, albeit at the sacrifice of some hard drive space. However, hard drive space is cheap. Time is money.
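To make the keyword index idea concrete, here is a minimal sketch of an inverted index in Python. It is not how vB or any particular search engine does it; the sample posts and the function names (`tokenize`, `build_index`, `search`) are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(posts):
    """Map each keyword to the set of post ids that contain it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in tokenize(text):
            index[word].add(post_id)
    return index

def search(index, query):
    """Return the ids of posts that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample posts, invented for this example.
posts = {
    1: "The Fourth Amendment covers search and seizure.",
    2: "Has anyone here been prescribed oxycontin?",
    3: "Search is slow on the board today.",
}
index = build_index(posts)
print(search(index, "search"))
print(search(index, "search seizure"))
```

The index is built once (and updated as posts arrive), so a query only touches the entries for its own keywords instead of rereading every message. The cost is the extra storage for the index itself.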

I have no clue regarding how the SDMB conducts searches, but if Chasing Dreams is accurate in his assumptions, then I’m totally for his recommendation.

I just did a search for oxycontin through all the forums (I wanted something that I knew had been mentioned but doesn’t get mentioned very often so I wouldn’t get massive numbers of hits) and it reported that it took 0.11 seconds; even counting from when I hit enter to when the screen finished refreshing it wasn’t more than a couple of seconds. Clearly there is a search engine in there; I suspect the gerbils/hamsters/capybaras running it just stumble once in a while.

To find 750+ instances of “Amendment” took 20 seconds. But to find the same number of instances of “seizure” took only 3 seconds. I suspect that the time it takes is a function of what else the server may be doing at the moment, not a result of a lack of appropriate search functions.

Why even consider slow software when there’s a dedicated hardware solution?

Messages are retrieved using SQL, and indexing is built-in, with pointers updated when a new message is saved.

I will admit the search function doesn’t seem to be very efficient, and I suspect lengthy searches may be what holds up the SDMB when it halts for up to several minutes at a time, but any replacement would have to integrate well with the vBulletin software. Let’s hope the vB dudes are working on it, as the problem can only get worse with larger databases.

Can you give an example of what kind of open source search engine you are talking about, Chasing Dreams? I imagine that those search engines are built to index a series of static text web pages. But the SDMB pages are built dynamically each time by retrieving data from a relational database.

From that site:

:eek:

Yeah, but isn’t it worth it when the fight is against Ignorance itself? :wink:

Against ignorance, the gods themselves contend in vain.

What they said.

It’s inherent in the vB structure.

I too hope the vB engineers have something else in mind for us, something better.

Of course I don’t speak for the Reader but I doubt that they would look pleasantly on a 30K+ solution.

I worked on the staff of a board that experimented with key-word indexing. It was a disaster. (Some details available on request.)

I don’t know of specific open source software that will work with vB, but I have tried Seekafile and like it very much. It is built for Windows and you can modify it to search your desktop hard drive or servers. You can add iFilters to it so that it can search different file types.

I bet something like Seekafile can be modified to work with vB, but I don’t know enough about vB to say for sure.

Why would vB need to dynamically generate a thread each time using a relational database? It seems very inefficient. A thread basically contains a long list of static text, save for the user name and details for each post.

Well, those usernames and details are one reason. There are others. For instance, we’re allowed to edit our own posts, here, but not those of others. If a thread were a static webpage, the board software would have to keep track of which parts of that page different users are authorized to edit. Or suppose that a troll is banned, and the moderators want to delete all of its posts. Right now, they can do that at the touch of a button. Again, the software would have to keep track of which part of each thread is due to which person. Or, the mods can merge two threads, which results in the posts being interspersed according to their timestamps. Now, the software has to keep track not only of who posted each part of a thread, but when. You could include a whole bunch of metadata in your static pages to keep track of this, but pretty soon, your static webpages plus metadata end up being, themselves, a database. And if you’re going to be using a database, you might as well use one designed to be a database, not one cobbled together from the end webpages.
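As a rough illustration of why this is easy with a real database, here is a sketch using Python's built-in `sqlite3` module. The `posts` table, its columns, the usernames, and the timestamps are all invented for the example; vB's actual schema is not shown anywhere in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE posts (
        id        INTEGER PRIMARY KEY,
        thread_id INTEGER NOT NULL,
        author    TEXT    NOT NULL,
        posted_at TEXT    NOT NULL,  -- ISO timestamps sort correctly as text
        body      TEXT    NOT NULL
    )
""")
# Hypothetical rows, made up for illustration.
conn.executemany(
    "INSERT INTO posts (thread_id, author, posted_at, body) VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "2005-01-01T10:00", "Why is search slow?"),
        (2, "bob",   "2005-01-01T10:05", "Search takes forever."),
        (1, "troll", "2005-01-01T10:10", "Spam spam spam."),
        (2, "carol", "2005-01-01T10:15", "It was fast for me."),
    ],
)

def delete_posts_by(conn, author):
    """Remove every post by one user, whichever thread it is in."""
    conn.execute("DELETE FROM posts WHERE author = ?", (author,))

def merge_threads(conn, source_id, target_id):
    """Move all posts from the source thread into the target thread."""
    conn.execute(
        "UPDATE posts SET thread_id = ? WHERE thread_id = ?",
        (target_id, source_id),
    )

def render_thread(conn, thread_id):
    """Build the thread on request, ordered by timestamp."""
    rows = conn.execute(
        "SELECT author, body FROM posts WHERE thread_id = ? ORDER BY posted_at",
        (thread_id,),
    )
    return rows.fetchall()

delete_posts_by(conn, "troll")
merge_threads(conn, source_id=2, target_id=1)
thread = render_thread(conn, 1)
for author, body in thread:
    print(author, "-", body)
```

Banning a user and merging two threads are each a single statement here, and the merged thread comes out interleaved by timestamp simply because the page is generated from the rows each time it is requested.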

This is a very cool quote. Can I ask where you got it from?

I think QtheM originated that one. He’s crafty that way.

“Against stupidity the gods themselves contend in vain.” — Friedrich Schiller, Jungfrau von Orleans

You know, I thought the same thing until I thought to do a Google search. It came up with different results so I decided to ask.

Thanks, **Liberal**.

Are you familiar with database storage? B-Tree indexing? Binary searching? I do not have first-hand knowledge of the specifics of vB internals, so someone correct me if I am wrong, but I am familiar with database and indexing concepts in general, and have worked on projects similar to message board software. I think all text messages are stored in a single file, somewhat chronologically, by appending new posts to the end of the file. Each record (message) is tagged with necessary “key” data such as poster number, date/time, forum, thread number, and post serial number. The entire database is indexed on the fly and pointers to the records are stored in a much smaller database.

Now when a thread is requested, the engine reaches first into the pointer file, then into the main database and pulls out only the relevant records according to the pointer list. (Sequential thread posts are located at seemingly random spots in the main file.) B-Tree indexing is very efficient at this as a pointer system and can extract a few hundred records from multi-gigabyte files very quickly. Basically, it skips over large blocks of data that are not in the index and therefore, not part of the thread.
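Here is a toy version of that layout in Python, to show the mechanics. It is only a sketch under my own assumptions: the "main file" is an in-memory buffer, the "pointer file" is a sorted list searched with `bisect` rather than a real B-Tree, and the `MessageStore` class and its record fields are hypothetical, not anything taken from vB.

```python
import bisect
import io
import json

class MessageStore:
    """Append-only message file plus a small sorted pointer list."""

    def __init__(self):
        self.data = io.BytesIO()  # stands in for the large message file
        self.pointers = []        # sorted (thread_id, serial, offset) tuples

    def append(self, thread_id, serial, author, body):
        """Write a record at the end of the file and index its offset."""
        self.data.seek(0, io.SEEK_END)
        offset = self.data.tell()
        record = {"thread": thread_id, "serial": serial,
                  "author": author, "body": body}
        self.data.write(json.dumps(record).encode("utf-8") + b"\n")
        bisect.insort(self.pointers, (thread_id, serial, offset))

    def read_thread(self, thread_id):
        """Fetch only the records for one thread, in post order."""
        # (thread_id,) sorts before every (thread_id, serial, offset) entry.
        start = bisect.bisect_left(self.pointers, (thread_id,))
        posts = []
        for tid, _serial, offset in self.pointers[start:]:
            if tid != thread_id:
                break
            self.data.seek(offset)
            posts.append(json.loads(self.data.readline()))
        return posts

store = MessageStore()
store.append(7, 1, "alice", "First post in thread 7")
store.append(9, 1, "bob", "An unrelated thread")
store.append(7, 2, "carol", "Reply in thread 7")
for post in store.read_thread(7):
    print(post["author"], "-", post["body"])
```

The two posts of thread 7 sit on either side of an unrelated record in the data buffer, yet `read_thread` never looks at that record: it jumps straight to the offsets listed in the pointer list.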

An interesting thing about B-Tree indexing, due to the binary search technique, is that the maximum number of lookups needed in the pointer file grows much more slowly than the total number of records stored. Double the number of records and you add only one more lookup to the pointer retrieval routine.
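A quick way to check the "double the records, add one step" claim is to count probes in a plain binary search. Note that this sketch uses a sorted Python list, not an actual B-Tree (whose nodes hold many keys each, so it needs even fewer levels); the function names are my own.

```python
def binary_search_steps(sorted_keys, target):
    """Return (found, number of probes) for a classic binary search."""
    lo, hi = 0, len(sorted_keys) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] == target:
            return True, steps
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, steps

def worst_case_steps(n):
    """Largest probe count over every key in a sorted list of n keys."""
    keys = list(range(n))
    return max(binary_search_steps(keys, k)[1] for k in keys)

for n in (1000, 2000, 4000, 8000):
    print(n, "records ->", worst_case_steps(n), "probes at most")
```

Each doubling of `n` raises the worst case by exactly one probe, because the worst case for `n` sorted keys is `floor(log2(n)) + 1`.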

Where it falls down is searching for data that is NOT indexed, which may require searching ALL records. Then if the record count doubles, the search time does, too.
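For contrast, here is what the un-indexed case looks like as a sketch: every record has to be examined, so the work grows in step with the record count. The records and the `scan_for_word` helper are invented for the example.

```python
def scan_for_word(records, word):
    """Check every record for the word; return (matches, records examined)."""
    examined = 0
    matches = []
    for position, text in enumerate(records):
        examined += 1
        if word in text.lower().split():
            matches.append(position)
    return matches, examined

# Hypothetical records, made up for illustration.
records = ["post number %d about hamsters" % i for i in range(1000)]

_, cost_1x = scan_for_word(records, "capybara")
_, cost_2x = scan_for_word(records * 2, "capybara")
print(cost_1x, cost_2x)
```

Doubling the list doubles the number of records examined, whether or not anything matches.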

related references:

:smack: