Someone please install search engine software on the SDMB!

Chasing_Dreams · April 1, 2007, 6:12am

I noticed that searches take an enormous amount of time to complete, like 30 seconds to 1 minute. My guess is that each time I send a search request to the SDMB, the server actually looks through all the old messages. This seems like a waste of resources.

Why don’t we have some open source search engine software installed that will let us search almost instantaneously? The search engine software would create an index of all the keywords that appear in all of the SDMB and lower the amount of CPU time needed for a search, albeit at the sacrifice of some hard drive space. However, hard drive space is cheap. Time is money.
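To make the idea concrete, here is a minimal sketch of a keyword index in Python. It is only an illustration of the general technique, not how vBulletin or any particular search package works; the function names and the sample posts are made up.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")

def build_index(posts):
    """Map each lowercase keyword to the set of post ids that contain it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in WORD.findall(text.lower()):
            index[word].add(post_id)
    return index

def search(index, query):
    """Return the ids of posts that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample data; the index is built once, up front.
posts = {
    1: "The Fourth Amendment covers search and seizure.",
    2: "Oxycontin was mentioned in an older thread.",
    3: "Another seizure question about the amendment.",
}
index = build_index(posts)
hits = search(index, "amendment seizure")
```

The index costs extra storage, but a search becomes a few dictionary lookups instead of a pass over every message.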

Autolycus · April 1, 2007, 7:14am

I have no clue regarding how the SDMB conducts searches, but if Chasing Dreams is accurate in his assumptions, then I’m totally for his recommendation.

Canadjun · April 1, 2007, 12:51pm

I just did a search for oxycontin through all the forums (I wanted something that I knew had been mentioned but doesn’t get mentioned very often so I wouldn’t get massive numbers of hits) and it reported that it took 0.11 seconds; even counting from when I hit enter to when the screen finished refreshing it wasn’t more than a couple of seconds. Clearly there is a search engine in there; I suspect the gerbils/hamsters/capybaras running it just stumble once in a while.

DSYoungEsq · April 1, 2007, 1:27pm

To find 750+ instances of “Amendment” took 20 seconds. But to find the same number of instances of “seizure” took only 3 seconds. I suspect that the time it takes is a function of what else it may be doing at the moment, not a result of a lack of appropriate search functions.

Squink · April 1, 2007, 2:10pm

Why even consider slow software when there’s a dedicated hardware solution?

Musicat · April 1, 2007, 2:13pm

Messages are retrieved using SQL, and indexing is built-in, with pointers updated when a new message is saved.

I will admit the search function doesn’t seem to be very efficient, and I suspect lengthy searches may be what holds up the SDMB when it halts for up to minutes at a time, but replacing that would have to integrate well with vBulletin software. Let’s hope the vB dudes are working on it, as the problem can only get worse with larger databases.

Arnold_Winkelried · April 1, 2007, 2:19pm

Can you give an example of what kind of open source search engine you are talking about, Chasing Dreams? I imagine that those search engines are built to index a series of static text web pages. But the SDMB pages are built dynamically each time by retrieving data from a relational database.

samclem · April 1, 2007, 3:53pm

From that site:

:eek:

Squink · April 1, 2007, 3:58pm

Yeah, but isn’t it worth it when the fight is against Ignorance itself?

Qadgop_the_Mercotan · April 1, 2007, 4:11pm

Against ignorance, the gods themselves contend in vain.

TubaDiva · April 1, 2007, 4:35pm

What they said.

It’s inherent in the vB structure.

I too hope the vB engineers have something else in mind for us, something better.

Of course I don’t speak for the Reader but I doubt that they would look pleasantly on a 30K+ solution.

Liberal · April 1, 2007, 5:24pm

I worked on the staff of a board that experimented with key-word indexing. It was a disaster. (Some details available on request.)

Chasing_Dreams · April 1, 2007, 6:19pm

I don’t know of specific open source software that will work with vB, but I have tried Seekafile and like it very much. It is built for Windows and you can modify it to search your desktop hard drive or servers. You can add iFilters to it so that it can search different file types.

I bet something like Seekafile can be modified to work with vB, but I don’t know enough about vB to say for sure.

Why would vB need to dynamically generate a thread each time using a relational database? It seems very inefficient. A thread basically contains a long list of static text, save for the user name and details for each post.

Chronos · April 1, 2007, 7:03pm

Well, those usernames and details are one reason. There are others. For instance, we’re allowed to edit our own posts, here, but not those of others. If a thread were a static webpage, the board software would have to keep track of which parts of that page different users are authorized to edit. Or suppose that a troll is banned, and the moderators want to delete all of its posts. Right now, they can do that at the touch of a button. Again, the software would have to keep track of which part of each thread is due to which person. Or, the mods can merge two threads, which results in the posts being interspersed according to their timestamps. Now, the software has to keep track not only of who posted each part of a thread, but when. You could include a whole bunch of metadata in your static pages to keep track of this, but pretty soon, your static webpages plus metadata end up being, themselves, a database. And if you’re going to be using a database, you might as well use one designed to be a database, not one cobbled together from the end webpages.
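For what it's worth, here is a rough sketch of why those operations are cheap when posts live in a database. It uses Python's built-in sqlite3 module with an invented `posts` table; the table, columns, and usernames are assumptions for illustration and are not vBulletin's actual schema.

```python
import sqlite3

def make_board():
    """Create an in-memory database with a hypothetical posts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts ("
        "id INTEGER PRIMARY KEY, thread_id INTEGER, "
        "author TEXT, posted_at TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts (thread_id, author, posted_at, body) "
        "VALUES (?, ?, ?, ?)",
        [
            (1, "alice", "2007-04-01 06:12", "Search is slow."),
            (2, "bob", "2007-04-01 07:14", "Agreed."),
            (1, "troll", "2007-04-01 08:00", "Spam."),
            (2, "alice", "2007-04-01 09:30", "Any ideas?"),
        ],
    )
    return conn

def ban_user(conn, author):
    """Delete every post by one author, across all threads."""
    conn.execute("DELETE FROM posts WHERE author = ?", (author,))

def merge_threads(conn, source, target):
    """Move all posts from one thread into another."""
    conn.execute(
        "UPDATE posts SET thread_id = ? WHERE thread_id = ?", (target, source)
    )

def read_thread(conn, thread_id):
    """Build the thread on request, ordered by timestamp."""
    return conn.execute(
        "SELECT author, body FROM posts WHERE thread_id = ? ORDER BY posted_at",
        (thread_id,),
    ).fetchall()

conn = make_board()
ban_user(conn, "troll")
merge_threads(conn, source=2, target=1)
thread = read_thread(conn, 1)
```

Each moderator action is a single statement, and the merged thread comes out interleaved by timestamp because the ordering is applied when the page is built.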

Lakai · April 1, 2007, 8:27pm

This is a very cool quote. Can I ask where you got it from?

samclem · April 1, 2007, 8:46pm

I think QtheM originated that one. He’s crafty that way.

Liberal · April 1, 2007, 9:02pm

“Against stupidity the gods themselves contend in vain.” — Friedrich Schiller, Jungfrau von Orleans

Lakai · April 1, 2007, 9:22pm

You know, I thought the same thing until I thought to do a google search. It came up with different results so I decided to ask.

Thanks, **Liberal**.

Musicat · April 1, 2007, 9:45pm

Are you familiar with database storage? B-Tree indexing? Binary searching? I do not have first-hand knowledge of the specifics of vB internals, so someone correct me if I am wrong, but I am familiar with database and indexing concepts in general, and have worked on projects similar to message board software. I think all text messages are stored in a single file, somewhat chronologically, by appending new posts to the end of the file. Each record (message) is tagged with necessary “key” data such as poster number, date/time, forum, thread number, and post serial number. The entire database is indexed on the fly and pointers to the records are stored in a much smaller database.

Now when a thread is requested, the engine reaches first into the pointer file, then into the main database and pulls out only the relevant records according to the pointer list. (Sequential thread posts are located at seemingly random spots in the main file.) B-Tree indexing is very efficient at this as a pointer system and can extract a few hundred records from multi-gigabyte files very quickly. Basically, it skips over large blocks of data that are not in the index and therefore, not part of the thread.
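A toy version of that layout, as a sketch only: an append-only byte buffer stands in for the main file, and a small table of (offset, length) pointers stands in for the index. The class and method names are hypothetical, and a plain Python dict is used for the pointer table where a real database would use a B-Tree.

```python
import io
from collections import defaultdict

class MessageStore:
    """Append-only message file plus a much smaller pointer table."""

    def __init__(self):
        self.data = io.BytesIO()            # stands in for the large main file
        self.pointers = defaultdict(list)   # thread id -> [(offset, length), ...]

    def append(self, thread_id, text):
        """Write the post at the end of the file and record where it landed."""
        raw = text.encode("utf-8")
        offset = self.data.seek(0, io.SEEK_END)
        self.data.write(raw)
        self.pointers[thread_id].append((offset, len(raw)))

    def read_thread(self, thread_id):
        """Jump straight to each post of one thread, skipping everything else."""
        posts = []
        for offset, length in self.pointers.get(thread_id, []):
            self.data.seek(offset)
            posts.append(self.data.read(length).decode("utf-8"))
        return posts

store = MessageStore()
store.append(1, "Searches take forever.")
store.append(2, "Unrelated post in another thread.")
store.append(1, "Maybe the server is busy.")
thread_one = store.read_thread(1)
```

Posts from thread 1 sit on either side of the thread 2 post in the buffer, yet reading thread 1 never touches the bytes in between.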

An interesting thing about B-Tree indexing, due to the binary search technique, is that the maximum number of searches needed in the pointer file grows much more slowly than the total number of records stored. Double the number of records and you only add one more search to the pointer retrieval routine.

Where it falls down is searching for data that is NOT indexed, which may require searching ALL records. Then if the record count doubles, the search time does, too.
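The difference is easy to count. The sketch below uses a plain binary search over a sorted list as a stand-in for an index lookup (a real B-Tree has wider nodes, but the growth is logarithmic in the same way) and a straight scan as a stand-in for an unindexed search. The function names are made up, and the target is chosen to be absent so both functions hit their worst case.

```python
def binary_search_steps(sorted_keys, target):
    """Count the probes a binary search makes before it finds or gives up."""
    lo, hi = 0, len(sorted_keys) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] == target:
            break
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps

def linear_scan_steps(records, target):
    """Count the records examined when nothing is indexed."""
    steps = 0
    for record in records:
        steps += 1
        if record == target:
            break
    return steps

small = list(range(1024))
large = list(range(2048))   # twice as many records

# The target is larger than every key, so neither search finds it.
indexed_small = binary_search_steps(small, 5000)   # 11 probes
indexed_large = binary_search_steps(large, 5000)   # 12 probes
scan_small = linear_scan_steps(small, 5000)        # 1024 records read
scan_large = linear_scan_steps(large, 5000)        # 2048 records read
```

Doubling the records adds one probe to the indexed lookup and doubles the work of the full scan.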

related references:

samclem · April 1, 2007, 9:56pm

:smack:
