I was trying to find an email in my inbox (I’m very slack about labelling/sorting emails, so they mostly just stay in the inbox).
I enter the keyword and hit Search. On the first page of results, it says: “1-20 of about 83”.
I click to the next page of results. “21-40 of about 127”. What, you found another 44 results you missed first time round?
Next page: “41-60 of about 123”. Where did those other 4 results go, Google? And what’s this “about” business? You’re a computer, I thought you dealt in absolutes!
Click onward: “61-80 of about 118”. Now you’re just screwing with me.
And again: “81-100 of about 113”. :dubious:
So the next page of results will be the last one, right?
It’s more efficient to find the necessary number of results for pagination and estimate how many are left than to find all the results and count them. This will always be true when you’re searching for a relatively tiny slice of a large amount of data. You may have millions of emails in your account, but only a couple hundred that contain the word you’re looking for.
Let’s assume Gmail has something like a B-Tree index of every word in every email you’ve ever received, presumably sorted by date. (I’m sure their actual indexing strategies are significantly more finely tuned.) If you search for “foobar,” then it can traverse the index until it finds 20 emails with the word “foobar” and return that data immediately. It can then use a heuristic based on what fraction of the total index it had to search to find those emails to estimate how many others are probably left. When you paginate, it continues searching the index and it can refine its estimate of how many may be left.
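To make that concrete, here is a toy sketch of the idea. It is only my guess at the general shape and not how Gmail actually works: the "index" is just a Python list of word sets, newest email first, and search_page and estimate_total are names I made up. The estimate simply scales the hits found so far by the fraction of the index that had to be scanned to find them.

```python
def search_page(index, keyword, start, page_size=20):
    """Scan the index from position `start` until one page of hits is full.

    `index` is a hypothetical list of word sets, newest email first.
    Returns (positions of the hits, position where the scan stopped).
    """
    hits = []
    pos = start
    while pos < len(index) and len(hits) < page_size:
        if keyword in index[pos]:
            hits.append(pos)
        pos += 1
    return hits, pos


def estimate_total(found_so_far, scanned, total_size):
    """Extrapolate the hit rate seen so far over the whole index.

    Assumes hits are spread evenly, which is exactly the weak point.
    """
    if scanned == 0:
        return 0
    return round(found_so_far * total_size / scanned)


# Toy mailbox: 1000 emails, every 10th one mentions "foobar" (100 real hits).
index = [{"foobar", "hello"} if i % 10 == 0 else {"hello"} for i in range(1000)]

hits, scanned = search_page(index, "foobar", start=0)
guess = estimate_total(len(hits), scanned, len(index))
print(f"1-{len(hits)} of about {guess}")
```

With perfectly even spacing the guess lands close to the real count of 100. The search stopped after looking at less than a fifth of the index, which is the whole point.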
OK… but its estimates are pretty wildly off in that case. I kept clicking through the results, for that same keyword that gave, supposedly, somewhere between 80 and 130 results. I’m up to number 600 now and I’ve only gone as far back as July 2012!
I give pretty much the same answer to everyone who has an issue with Gmail: it’s free.
When I search Gmail for a keyword, I get results back almost instantly because basically as soon as it finds a page full of results, it gives them to me. It’s not through searching yet, so of course it doesn’t know what the final tally will be. It’s only going to do as much work as I require of it by continuing to page down through the results.
When I search my Outlook for a keyword, I wait for 5-10 minutes because it goes looking for every single instance of that keyword before showing me anything at all.
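The difference between the two behaviours can be sketched with a Python generator. This is purely an illustration with a made-up mailbox; I have no idea what either product really does internally. The lazy version stops as soon as one page is full, and the eager version walks everything before it can show anything.

```python
from itertools import islice


def matching_emails(emails, keyword, counter):
    """Yield matching emails one at a time, counting how many were examined."""
    for email in emails:
        counter["examined"] += 1
        if keyword in email:
            yield email


# Made-up mailbox: 100,000 emails, one in every 50 mentions "foobar".
emails = ["foobar report" if i % 50 == 0 else "lunch?" for i in range(100_000)]

# Lazy: pull only the first page of 20 hits, then stop.
lazy_work = {"examined": 0}
first_page = list(islice(matching_emails(emails, "foobar", lazy_work), 20))

# Eager: collect every hit before showing anything.
eager_work = {"examined": 0}
all_hits = list(matching_emails(emails, "foobar", eager_work))

print(lazy_work["examined"], "emails examined for the first page")
print(eager_work["examined"], "emails examined for the full count")
```

The lazy search has no exact total to report because it never looked at most of the mailbox, so an estimate is the best it can offer.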
Well, that’s the thing about heuristics. They work well enough, until they don’t.
Let’s imagine that you had a lengthy conversation about Foobar consisting of hundreds of emails three years ago. Gmail might find the 20 most recent results, which would be far more sparse, and make an estimate about the frequency of “foobar” in the rest of the index. But then you go a few pages in and all of a sudden there are hundreds of foobars. The estimate was off because it assumed a uniform distribution when the actual distribution is a lot more lumpy.
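Here is a small simulation of that lumpy case, again assuming the simple "scale by fraction scanned" heuristic (my assumption, not anything Google has documented). The made-up mailbox has sparse recent mentions followed by one old burst of 200 emails, and page_estimates is a hypothetical helper that records the guess after each page of 20 hits.

```python
def page_estimates(has_keyword, page_size=20):
    """Return the estimated total shown after each full page of hits.

    `has_keyword` is a list of booleans, newest email first.
    """
    estimates = []
    found = 0
    for scanned, hit in enumerate(has_keyword, start=1):
        if hit:
            found += 1
            if found % page_size == 0:
                # Scale hits so far by the fraction of the mailbox scanned.
                estimates.append(round(found * len(has_keyword) / scanned))
    return estimates


# 10,000 emails, newest first:
#   the newest 2,000 mention "foobar" once every 100 emails (20 hits),
#   then a burst of 200 consecutive "foobar" emails,
#   then 7,800 older emails with no mention at all.
mailbox = [(i + 1) % 100 == 0 for i in range(2000)] + [True] * 200 + [False] * 7800

estimates = page_estimates(mailbox)
true_total = sum(mailbox)
print("real total:", true_total)
print("estimate after each page:", estimates)
```

The first page guesses about 100, the second page nearly doubles that, and by the end of the burst the guess has overshot the real total of 220 several times over. The number never settles because the hit rate it is extrapolating keeps changing.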
Okay, I get the idea. And I have to admit that at least they’re honest by saying, “21-40 of about 127”.
BUT – by the time I’ve finally gotten to the third or fourth page, hasn’t it had enough time to look at all my emails? I mean, this is the same Google that searches the entire World Wide Web in 0.29 seconds, right?
Sure, but why should they? They are serving a huge number of users simultaneously, and there’s no point in doing work to collect data that the user doesn’t need right away. Remember, Gmail doesn’t know how many pages you’re going to look through. 99.9% of the time nobody looks beyond the first page. It is vastly more efficient to fulfill the request and stop, rather than continuing to search in the background for information that probably will not be requested.
No, they search an index of the web. And they do the same pagination shenanigans on public search as they do in Gmail. Try searching for some weird phrase like chartreuse hippopotamus. Google thinks there are three pages of results, but by the time you get to the second page it turns out there are only two.
And 25 minutes after you post, this thread is the eighth result for “chartreuse hippopotamus” on the google.co.uk index. I’m enormously impressed by that.
And now it’s number 2. Give it a few more minutes and I reckon it’ll bump “Peotry in Motion” (sic) off the top spot, especially if I mention “chartreuse hippopotamus” a few more chartreuse hippopotamus times.
Note that, in the case of search results, they omit certain listings because of duplicate content, and group others together. So when they’re displaying page 1, they might know there are 30 more items for pages 2+, but they won’t check which of those are grouped or omitted until you actually visit page 2. So even if they knew the exact number of total items found, they’re still guessing the number of search results that will be displayed.
Something similar might be happening in Gmail too, with thread grouping.
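A rough illustration of why grouping makes the displayed count a guess even when the raw count is known. The (message, thread) pairs are invented and rows_to_display is a hypothetical helper; this is not Gmail's or Google's actual grouping logic.

```python
def rows_to_display(raw_hits):
    """Collapse raw message hits into one row per thread, in first-seen order."""
    seen = set()
    rows = []
    for message_id, thread_id in raw_hits:
        if thread_id not in seen:
            seen.add(thread_id)
            rows.append(thread_id)
    return rows


# Invented data: 30 matching messages that belong to only 10 threads.
raw_hits = [(message_id, message_id // 3) for message_id in range(30)]

rows = rows_to_display(raw_hits)
print(len(raw_hits), "matching messages,", len(rows), "rows on screen")
```

Until the grouping step has actually run over a batch of hits, the only number available is the raw one.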
Funny, because it’s always proved satisfactory to my patrons. Invariably, when I remind them they aren’t paying anything for this service they’re griping about*, they sort of smile and nod, and say, “Yeah, I guess you’re right.”
I wonder if it will help that there is some other content. I used to watch people drinking Green Chartreuse, it was quite easy to set it on fire. Sadly it was mostly hippopotamus free.
It requires an explanation of map/reduce which can get pretty complex. I’ll try to summarize in points:
Your email database isn’t stored on a single server, but is stored across dozens or hundreds of servers. I’ll call them “storage”.
When you make a search request, the server that handles the request (I’ll call it “searcher”) relays it to those “storage” servers, then waits for them to come back with results.
As the results come back, “searcher” aggregates them in preparation to send back to the user.
For performance reasons, results that don’t come back within a certain time cutoff are ignored. At that cutoff, the results are sent back to the client.
“Storage” server results that come in later are basically ignored, but next time you run the same search they are practically guaranteed to come back because they’re “fresh”. (In general, the more recently a piece of data has been touched, the easier it is to find.)
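The points above can be sketched as a scatter-gather with a deadline. This is a toy using Python threads, with fake "storage" servers that are just functions with an artificial delay. The names make_storage and scatter_gather and the timings are all made up, and none of it reflects Google's real infrastructure.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def make_storage(postings, delay_s):
    """Build a fake "storage" server: keyword -> email ids, after a delay."""
    def storage(keyword):
        time.sleep(delay_s)
        return postings.get(keyword, [])
    return storage


def scatter_gather(storages, keyword, deadline_s):
    """Ask every storage server at once and keep whatever is back by the deadline.

    Returns (sorted email ids, number of servers that missed the cutoff).
    """
    pool = ThreadPoolExecutor(max_workers=len(storages))
    futures = [pool.submit(storage, keyword) for storage in storages]
    done, not_done = wait(futures, timeout=deadline_s)
    pool.shutdown(wait=False)  # do not block on the stragglers
    hits = []
    for future in done:
        hits.extend(future.result())
    return sorted(hits), len(not_done)


storages = [
    make_storage({"foobar": [1, 2]}, delay_s=0.0),
    make_storage({"foobar": [3]}, delay_s=0.0),
    make_storage({"foobar": [4, 5, 6]}, delay_s=0.5),  # the slow one
]

hits, late = scatter_gather(storages, "foobar", deadline_s=0.2)
print("returned to user:", hits, "| servers ignored:", late)
```

The slow server's three results never make it into the response, so any count built from that response is an undercount.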
MapReduce is used for a lot of stuff but it’s not relevant to a relatively simple index search. In fact, it would be a phenomenally inefficient way to do real-time search, because there is a required blocking step for aggregation. I think you may be confusing MR with some of the distributed filesystem strategies that Google uses for high-availability data. GFS is the data store used for search indexes; I presume that Google uses it (perhaps in concert with BigTable) for Gmail.
Are they retarded? Because if I were to ask you a factual question about a service and you smugly told me that the answer was, “It’s free,” I’d want to know what was wrong with you and whether you had misheard my question. The OP’s question (which is not a “gripe”) has an actual answer, which is being discussed by reasonably intelligent people in this thread. That answer is not, “Haw haw, it’s free.”
Even if it were a gripe, I still haven’t figured out where this attitude that you can’t complain about free stuff comes from. It’s GIFTS you don’t complain about, not free stuff. And Gmail/Google search are definitely NOT gifts.
If it were free stuff, then you’d not be able to complain when someone insults you. I mean, they didn’t charge you for that insult, now did they?