Yes, I gathered the data by hand. It may be a trivial task for a spider, but since I’ve never used or written one before (never needed to), the thought didn’t cross my mind. I will say, though, that collecting basic thread information can be made somewhat bearable by manipulating the display parameters to show more than 50 threads per page, and saving the HTML from each page for (semi)automated parsing later.
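For anyone who wants to try the "(semi)automated parsing" step, here is a minimal sketch using only Python's standard library. The page layout is an assumption on my part: it supposes each thread sits in its own `<tr>` with the title, reply count and view count in `<td>` cells, which may not match the board's real HTML. The names `RowCollector` and `parse_thread_rows` are made up for illustration, and the sample markup is invented.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags (such as a link) still lands in the cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows built from <th> stay empty; skip them
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_thread_rows(html_text):
    """Return one list of cell strings per data row in a saved page."""
    parser = RowCollector()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Invented sample markup, standing in for one saved page.
sample = """
<table>
<tr><th>Thread</th><th>Replies</th><th>Views</th></tr>
<tr><td><a href="t1">First thread</a></td><td>12</td><td>340</td></tr>
<tr><td><a href="t2">Second thread</a></td><td>3</td><td>75</td></tr>
</table>
"""
rows = parse_thread_rows(sample)
```

Each row comes back as a list of strings, so the counts would still need converting with `int()` before going into a spreadsheet.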
Out of the 69,117 threads active last year, 1,154 had earlier start dates. Out of these, about 110 were started in the year 2000. That’s a pretty small percentage (about 1.7% of the total), but it’s evidently enough to pull the correlations significantly downward.
Lifespan, as I’ve used it here, is simply the number of days between a thread’s OP and its last post. Discarding outliers doesn’t really change the correlations that much, though. Roughly 90% of all threads “live” for one week or less, and the correlation between lifespan and either views or replies in this range is only around 0.55 to 0.6.
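To make the lifespan definition concrete, here is a rough sketch of the calculation in Python. It is not how the spreadsheet does it: the Pearson formula is written out by hand, the function names are my own, and the five threads are made-up rows rather than data from ThreadLife.xls.

```python
from datetime import date
from math import sqrt


def lifespan_days(first_post, last_post):
    """Days between a thread's OP and its last post."""
    return (last_post - first_post).days


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up rows: (OP date, last-post date, replies). Not real board data.
threads = [
    (date(2002, 1, 1), date(2002, 1, 1), 2),
    (date(2002, 1, 3), date(2002, 1, 5), 9),
    (date(2002, 2, 10), date(2002, 2, 14), 15),
    (date(2002, 3, 1), date(2002, 3, 8), 40),
    (date(2001, 12, 20), date(2002, 6, 1), 300),  # started the year before
]

lifespans = [lifespan_days(op, last) for op, last, _ in threads]
replies = [r for _, _, r in threads]

# Correlation over everything, then over threads that lived a week or less.
r_all = pearson(lifespans, replies)
short = [(days, r) for days, r in zip(lifespans, replies) if days <= 7]
r_week = pearson([days for days, _ in short], [r for _, r in short])
```

With only five invented rows the resulting numbers mean nothing; the point is just the shape of the calculation and the one-week cutoff.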
As you wish. All this work is based on the same Excel file I’ve linked to above, ThreadLife.xls (3.5 MB .ZIP file).
Incidentally, one reason I had to throw away outliers for my original scatter plot was that the most-viewed thread (the LotR one, of course) has about 10x as many hits as the second-place thread in the sample, and plotting it would just push every other data point into a tiny corner. But then I realized that this is exactly what a log-log chart is good for, so here it is, with all outliers intact (it’s still sampled down from the 69k+ original threads, though, because Excel can only handle up to 32,000 data pairs on an x-y chart). The general shape tapers to a point, and I haven’t thought about whether this has any meaning or not.
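For the curious, the two steps behind that chart (sampling down, then switching to log scales) can be sketched in Python like so. This is an illustration only, not what Excel does internally: `downsample` and `to_log10` are names I made up, the sampling is plain random sampling, which may differ from how I actually thinned the data, and the pairs are invented.

```python
import math
import random

MAX_POINTS = 32000  # the x-y chart limit mentioned above


def downsample(pairs, limit=MAX_POINTS, seed=0):
    """Return at most `limit` pairs, chosen at random without replacement."""
    if len(pairs) <= limit:
        return list(pairs)
    return random.Random(seed).sample(pairs, limit)


def to_log10(pairs):
    """Convert (x, y) pairs to (log10 x, log10 y).

    Pairs with a zero or negative value have no logarithm, so they are
    skipped; a thread with zero replies cannot appear on a log axis.
    """
    return [(math.log10(x), math.log10(y)) for x, y in pairs if x > 0 and y > 0]


# Invented (views, replies) pairs; the last one has 10x the views of the
# one before it, like the top thread in the sample.
pairs = [(10, 1), (100, 5), (1000, 20), (5000, 80), (50000, 900)]
log_pairs = to_log10(downsample(pairs))

# On the log scale that 10x gap is a single unit.
gap = log_pairs[4][0] - log_pairs[3][0]
```

That is why the outlier stops squashing everything else: a factor of 10 costs one step along the axis no matter how large the numbers are.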
Hm, have we moved away from Chairman Pow’s original question?