How many current active posters are there? An answer here.

Yes. Excellent analogy and why I brought it up. There are a couple of other threads with similar characteristics, e.g., Scylla’s “blimp attack” thread (or whatever it was). These threads are linked to from the outside and will completely skew the results.

You might be able to identify “migratory threads” by computing the average number of posts per thread poster and comparing it with the average number of posts of the total population of posters in all threads. “Migratory threads” should show a much lower average posts/poster than threads populated mostly by “resident members” of the ecosystem. Alternatively, you could just look at the ratio between the number of views and the number of posts. (I recall reading some data here on the average number of views per post in each forum.) “Migratory threads” should have a much higher ratio of views to posts than the forum average.
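As a rough illustration of the views-to-posts idea, here is a minimal Python sketch. The thread titles, the view and post counts, and the cutoff factor of 3 are all invented for the example; none of it is real board data. It also uses the median thread ratio as the forum baseline rather than the plain average, a substitution made for this sketch so that a single heavily linked thread does not drag the baseline up with it.

```python
from statistics import median

# Hypothetical (title, views, posts) rows; not real board data.
threads = [
    ("Resident chat A", 1200, 60),
    ("Resident chat B", 900, 45),
    ("Resident chat C", 1500, 80),
    ("Linked from outside", 50000, 120),
]

def find_migratory(rows, factor=3.0):
    """Return titles whose views/posts ratio exceeds `factor` times the
    median ratio across all threads. `factor` is an assumed cutoff."""
    ratios = {title: views / posts for title, views, posts in rows}
    baseline = median(ratios.values())
    return [title for title, ratio in ratios.items() if ratio > factor * baseline]

print(find_migratory(threads))
```

With these made-up numbers only the externally linked thread is flagged; a real cutoff would have to be tuned against actual forum statistics.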

By the way, it’s fascinating to see these real-life techniques adapted and applied here. This is great stuff!

[Hunchback of ND]I AM NOT AN ANIMAL!!![/HofND]

[sup]I’m a pretty, pretty flower.[/sup]

Hmmm. In the “The SDMB is switching to paid subscriptions” thread I note the following quote:

I wonder if they were using actual SQL queries to determine these numbers, or if they were just guessing.

Nonetheless, I will not be daunted.

I am going to proceed with my population research and capture a second sample. It’ll provide a “scientific” baseline count of the SDMB posters. It might be interesting to do it again in a year or so to see if the Pay-To-Post rule will have any effect.

There’s a difference between “posted in the last 30 days” and “posted in one or more currently active threads.” Hence the discrepancy. Which brings up the question of what exactly constitutes a current active poster.

IMHO that would be the latter definition, so your research does indeed have value.

When? - I need to re-set my Outlook Reminder :wink:

Good point. It’ll be interesting to see what this comes out to be.

(laugh) For what it’s worth, my current plan is to take the second sample early morning of March 23.

From my first sample I’m creating a histogram of post counts. That will also be interesting because it’ll provide an answer to “what percentage of posters have a post count greater than x?”

Great hijack! We will now see how many “real” Dopers there are when the Dopers have to pay, particularly starting 3/23/05, $7.50 per year for an annual subscription to SDMB. Nobody, in their right mind, would pay $14.95 per year. Your 2,000+ is way off base … no more than 400 on 3/23/05!

Hijack? SwingWing, what the heck are you talking about? Unless of course you’re referring to your own hijack of this thread.

I know that you have a beef with the new subscription policy. I really don’t care. This thread was started long before the new policy was announced. The purpose was to find out how many active Dopers there are. That is still its purpose. The 2000+ count was an actual count of people who had posted in threads that were on the front page of each forum on March 11. A second sample will indicate the total active population via the Peterson method.

I grant you that it will be interesting to see what these numbers look like a year from now.

I. Just to recap this effort…
A claim was made that there were only about 400 people who were posting on the SDMB. I thought this claim was ludicrously low, so I set out to see if I could determine what the real number of current regularly active Dopers was. A regularly active Doper population would be those Dopers not metaphorically dead, hibernating or migrated to somewhere else.

After I began this effort, Ed Zotti stated that there were 7000 Dopers who have posted in the last 30 days. I don’t know if this was determined via a database query or is simply an educated guess.

Regardless, I had decided to take a scientific sample of Dopers. The methodology was to capture the Dopers who participated in threads that were on the front page of each forum at the time of the sample-taking. The sampling proceeded as follows:
[ul]
[li]capture a subset of the whole population, count them, mark them and release[/li]
[li]two weeks later, capture a second subset of the population, and count them[/li]
[li]determine how many of the first sample also appeared in the second sample[/li]
[li]using the Peterson Mark-Recapture method, calculate how many active Dopers there are in the population[/li]
[/ul]
As a secondary benefit, I was able to determine how many Dopers had “big antlers” by examining the post-counts of the first sample.

II. The results, which I hereby christen as Algernon’s Reckoning…

(First of all, I’d like to thank don’t ask and Colibri for their help regarding population estimating methodologies.)

The formula for the Peterson Mark-Recapture method of population estimating is:
N = CM/R

Where:
M = The number of individuals (individual user-names) observed in the first sample
C = The total number of individuals (individual user-names) observed in the second sample.
R = The number of individuals in the second sample that are the same as those in the first sample.

N = total population size

The first sample (M) of Dopers was captured on March 11, 2004. This sample contained 2139 unique Dopers. (Note: This differs from the sample count provided earlier in this thread. The earlier count contained an error. (i.e. – I screwed up).)

A second sample (C) of Dopers was captured on March 23, 2004. As with the first sample, anyone who had posted in any of the threads on the front page of each forum was caught. An exception was made regarding “sticky” threads. Those sticky threads that were part of the first population sample were not used in this second sample. This second sample contained 2240 unique Dopers.

The first and second samples were compared and it was determined that there were 1259 individuals in the first sample that also were captured in the second sample. This is the recaptured group (R).

Therefore, N = (2139 * 2240) / 1259.

Using this method of estimating total population, the number of regularly active Dopers is reckoned to be 3806. This is significantly different than Ed Zotti’s stated population of 7000. Granted, according to Ed, his number represents the number of Dopers who had posted within the last 30 days. My reckoning takes two samples and projects a total population. I would’ve thought the two methods would have generated more similar results. One possible flaw in the sampling methodology I used is that perhaps two weeks is not a long enough interval between samples. One hypothesis could be that many Dopers post in streaks with long periods of absence, hence they might not get captured and counted as part of a regularly active population.
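For anyone who wants to check the arithmetic, the calculation can be sketched in a few lines of Python. The function names are made up for this sketch, and the set-based helper simply assumes each sample is available as a collection of user names.

```python
def petersen_estimate(marked, caught, recaptured):
    """N = C * M / R, the simple mark-recapture estimate."""
    if recaptured == 0:
        raise ValueError("no recaptures: the estimate is undefined")
    return caught * marked / recaptured

def estimate_from_names(first_sample, second_sample):
    """Same estimate, computed from two collections of user names."""
    first, second = set(first_sample), set(second_sample)
    return petersen_estimate(len(first), len(second), len(first & second))

# Figures reported above: M = 2139, C = 2240, R = 1259.
n = petersen_estimate(marked=2139, caught=2240, recaptured=1259)
print(round(n))  # 3806
```

Working from sets of names means duplicates within a sample are collapsed automatically, which matches the "unique Dopers" counts used above.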

It is important to note that this represents the pre-subscription regularly active Doper population.

III. Environmental changes are going to affect the population…
Moving to a subscription mode at the SDMB effectively changes the environment. One could think about this as a significant and relatively permanent climate change. Akin to an Ice Age perhaps. Only those Dopers who are able and willing to expend the additional energy to survive in this new environment will remain. The others will either die or wander off to new “hunting grounds”. The harsher environment may also limit the growth of the population by reducing the birth rate and influx of new Dopers.

It is too early to tell what effect this environmental change will have on the ongoing viable Doper population. I hereby invoke the First Law of Frisbee Throwing – namely, “say nothing more predictive than ‘Watch this’.”

IV. Histogram of Doper post-counts of those caught in the first sample…
Using the following percentages as a proxy for the entire calculated regularly active Doper population, the number of “over 1000 post” Dopers calculates to 1311. (3806 Dopers times 34.46%)

(Incidentally, no hamsters were harmed in the gathering of these post-count statistics. I deliberately did this research either very late at night or very early in the morning. If I observed any indication of a slowness in response time, I stopped the queries and waited until the next opportune time.)




Post count      Dopers   % of sample
0000-0999         1402     65.54%
1000-1999          306     14.31%
2000-2999          182      8.51%
3000-3999           89      4.16%
4000-4999           51      2.38%
5000-5999           26      1.22%
6000-6999           29      1.36%
7000-7999           17      0.79%
8000-8999            9      0.42%
9000-9999           12      0.56%
10000-14999         14      0.65%
> 15,000             2      0.09%

Total >= 1000      737     34.46%
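A table like this one can be built from a plain list of post counts. The sketch below is only an illustration: the helper names are invented, and it uses uniform 1000-wide bins throughout, whereas the table above lumps the top end into wider ranges. The last lines redo the projection of the "over 1000 post" share onto the estimated population.

```python
from collections import Counter

def bin_post_counts(post_counts, width=1000):
    """Count posters per `width`-wide bin, keyed by the bin's lower edge."""
    return Counter((count // width) * width for count in post_counts)

def share_at_or_above(post_counts, threshold):
    """Fraction of the sample with at least `threshold` posts."""
    return sum(1 for c in post_counts if c >= threshold) / len(post_counts)

# Projection from the post: 737 of the 2139 sampled Dopers had 1000+ posts,
# applied to the estimated population of 3806.
projected = 3806 * 737 / 2139
print(round(projected))  # 1311
```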


A breakdown of the under 1000 group…



Post count      Dopers   % of sample
000-099            447     20.90%
100-199            212      9.91%
200-299            173      8.09%
300-399            129      6.03%
400-499            105      4.91%
500-599             95      4.44%
600-699             71      3.32%
700-799             71      3.32%
800-899             51      2.38%
900-999             48      2.24%


Thank you for doing the math for us. :smiley:

Hey; thanks for doing the work, Algernon. As far as your number of 3800 being different from Ed’s 7000, that doesn’t seem to be a big deal to me (speaking as a non-population-counting non-biologist). First of all, 3800 is less than a factor of two away from 7000, so it’s pretty close in that sense. Second, I had the impression that Ed’s 7000 was something of a SWAG.

Finally, I note that your histogram is heavily skewed (as one would expect) to the low end: fully 20% of the sample has <100 posts. Extrapolating, you would think that the number of members captured in the sample with <10 posts (or even, say, <3) would still be non-negligible. (Would it be hard to calculate that, by the way? Since you’ve already done all the other data manipulation…) Capturing these members with low post counts in your sample ought to be much harder than capturing high post-count members, whereas they’re all included in Ed’s number.

Here’s a thought (and I’m not sure if this is good science or not, so bear with me): what if you calculate the population of “Dopers with <100 posts” using the same methodology? You’ve already got the data. What does it give you?

And yes, my curiosity is boundless as long as someone else is doing the work.

Oh; perhaps I should explain my thought process a little bit.

Suppose you have two subspecies: Doperous doperous ubiquitous and Doperous doperous hensteethius. Suppose the former is easy to spot, and the latter is difficult to spot.

Suppose, furthermore, that you took a population count on Day 1, and located 800 D. d. ubiquitous and 200 D. d. hensteethius. You took a second count on Day 2, and again located 800 D. d. ubiquitous and 200 D. d. hensteethius. In this second sample, all 800 D. d. ubiquitous were exactly the same individuals as you saw on Day 1 (they’re easy to spot, after all), whereas only 20 of the D. d. hensteethius were the same individuals.

Now, if you lumped both subspecies together, M = 1000, C = 1000, and R = 820, so total estimated population is 1219.

If you count the subspecies separately, for D. d. ubiquitous, M = 800, C = 800, and R = 800, so the total population is 800. For D. d. hensteethius, M = 200, C = 200, and R = 20, so the total population = 2000. Final species count is 2800, much larger than 1219!
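The two-subspecies example is easy to reproduce. This is just a sketch using the made-up counts from the example, with a helper name chosen for the illustration.

```python
def petersen(m, c, r):
    """Simple mark-recapture estimate N = C * M / R."""
    return c * m / r

# Lumping both subspecies into one sample.
lumped = petersen(1000, 1000, 820)

# Estimating each subspecies on its own, then adding the results.
ubiquitous = petersen(800, 800, 800)
hensteethius = petersen(200, 200, 20)
stratified = ubiquitous + hensteethius

print(int(lumped), int(stratified))  # 1219 2800
```

The gap between the two numbers comes entirely from the hard-to-spot group: its low recapture rate gets swamped when the samples are pooled.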

I suggest that your sample isn’t really quite random (a point which Colibri makes above), but rather catches high post count posters at a greater rate than lower post count posters. If this is true, I furthermore suggest that your final number almost has to be a lower bound on the total population. And…perhaps a better estimation can be made by independently calculating the populations of, say, posters in each quartile of post count.

Note that the population that Algernon is studying is not “dopers”, but “active dopers”. And I don’t think it would be meaningful to make a count of “low post count active dopers”. Below about 100, there’s probably not any way to know which are the ones who will stick around, and which are transients who just haven’t left yet.

My pleasure!

Heh, heh. I have nothing to add to this, but your creative wit deserves another appearance in print.

For what it’s worth, here’s the histogram for <100 post Dopers. Remember that these counts come from the first sample only. Percentages are of the entire sample.



Post count   Dopers   % of sample
00-09           103     4.82%
10-19            55     2.57%
20-29            48     2.24%
30-39            35     1.64%
40-49            32     1.50%
50-59            40     1.87%
60-69            33     1.54%
70-79            32     1.50%
80-89            32     1.50%
90-99            37     1.73%


Your example is intriguing. Alas, I do not have the necessary data to calculate the D. d. ubiquitous and D. d. hensteethius populations separately using the Peterson method. That would require having the post counts for the second population sample. Of all the work needed to do this population estimation, obtaining post counts is by far the most time-consuming.

Makes sense. As Chronos points out, my calculation is a good indicator of regular posters, but not of any other broader definition of Dopers. I took care to repeat the phrase “regularly active” in my summary of the results.

A fair point. In my musing above, I was thinking along the lines of “what are possible reasons that Algernon’s calculated number is lower than Ed’s estimation of 7000?” rather than “what is the true count of active dopers?” In any case, it seems plausible to me that Ed’s number is an overestimate of active dopers, since it does include transients, while Algernon’s may be an underestimate, since more prolific posters skew the results somewhat (depending on the true retention rate of the less active dopers).

Wouldn’t the calculation be affected by the small number of posters who make their 100th post in the time between samples?

I assume you mean that with (presumably) faster boards, we won’t have to look for other things to do when the boards are down anymore.

Right?

:smiley:

Whatever the cutoff point used to define the two populations, there will be some individuals who would move from one to the other during the time span between samples. The number though would be very small. For the most part, the Dopers with less than 100 posts are not particularly prolific. As I was gathering post-count information, these individuals on average were posting perhaps once a week. In addition, if I were to do this (which I’m not inclined to) I’d use the post count from the first sample for any individual recaptured in the second sample.

Huh? You mean there are other things to do? :stuck_out_tongue:

Paying for a membership has somewhat brought me out of hibernation. Eventually I made my way to ATMB and what do I find?

A fantastic thread. I enjoyed it very much. I even learned the name for a population-capture technique I have used a number of times in the past. And it’s interesting to see that the 10% estimate, as of pre-subscription, was still holding up pretty well.

(Note to some: the definition of “active member” is crucial. IIRC, I originally proposed two measures. One involved the number of posts in the past month. The other involved posting rates, on the grounds that “dead” members would see that gradually decline.)

But of course we now live in interesting times. The entire membership model is currently being shaken to its core. So let’s consider what may be appropriate for the future.

Ignoring the question of active members for now, let’s consider the pure growth of the board. Basically, we have two factors to consider in the board’s future growth. One is the birth rate and the other is the death rate. Combine these and we get:

dP/dt = b(t,P) - d(t,P)

where b and d are functions that depend on the time t and the population at time t.

Here’s a common model: assume that birthrates and deathrates, and therefore the per capita growth rate, remain constant over time. This gives you the exponential growth model:

dP/dt = k.P

This is Thomas Malthus’ much-maligned approach. It sucks for models of competition for resources but may just about be appropriate in the short to medium term for this board.

Solving for P is pretty straightforward: P = exp(kt). (We can bring in a boundary condition for time t=0 to solve for the integrating constant, but let’s ignore that as it doesn’t affect what we’re doing for now). We can identify when we passed a given size by looking at member x. So we passed 10k on 7/10/2000, we passed 20k on 13/3/2002 and we passed 30k on 17/1/2003.

If our model is correct then P(t)/P(s) = exp(k[t-s]). So, using months as our base time unit, 20,000/10,000 = exp(17.k), which makes k= 0.0408.

If this is correct, we should be able to test for 20k to 30k, which is a 50% growth. Exp(0.0408 x 10) = 1.50, which is 50% growth! So apparently the growth rate was spot on for this period too!

We passed 40,000 on 3/10/2003, a 33% increase since 17/1/2003. exp(0.0408 x 8.5) = 1.41, which is also nearly spot on. (More accurately, we should probably use the entire time period and state that k = ln(4)/36 = 0.0385. In this case, and using more accurate times, we get 94%, 48% and 39% expected growths compared to the 100%, 50% and 33% observed.)
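The growth-rate fits can be checked with a short Python sketch. It takes the month counts quoted above (17, 10, 8.5 and 36) as given rather than recomputing them from the dates, and the variable names are invented for the illustration.

```python
import math

# Rate fitted to the first doubling: 10k -> 20k over 17 months.
k_first = math.log(20000 / 10000) / 17

# Rate fitted to the whole period: 10k -> 40k over 36 months.
k_full = math.log(40000 / 10000) / 36

def growth(k, months):
    """Growth factor exp(k * months) under the exponential model."""
    return math.exp(k * months)

print(round(k_first, 4), round(k_full, 4))
print(round(growth(0.0408, 10), 2), round(growth(0.0408, 8.5), 2))
```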

I suggest that we take time zero to be the 10k-point for now. The full equation should really be

P = P[sub]0[/sub].exp(0.0385.t)

So we take P[sub]0[/sub] to be 10,000 and t is measured from 7/10/2000. The nice thing about this is that we can simply rescale by 1E-4 and ignore the P[sub]0[/sub] completely.

At the moment we are at time 40.25. The model predicts 47,000 members. We have 45,000. It is 4.4% too high. I’d say that’s pretty good. Of course, it also means that we should probably rescale before making any predictions to reflect most recent knowledge.
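Here is the same check as a small sketch, assuming P0 = 10,000 and k = 0.0385 as stated. Without rounding, the prediction lands a little above 47,000, so the overshoot comes out slightly larger than the rounded figure quoted above.

```python
import math

def members(t, p0=10_000, k=0.0385):
    """Membership predicted by P = P0 * exp(k * t), with t in months."""
    return p0 * math.exp(k * t)

predicted = members(40.25)
overshoot = predicted / 45_000 - 1  # relative to the 45,000 actual members

print(round(predicted), f"{overshoot:.1%}")
```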

Now how about active members? Well why do members stop posting? One possibility is the competition for natural resources, in this case bandwidth and, of course, attention. In this case we might suppose a logistic model. But I suspect that a more likely case is boredom. In this case, I propose that the dormant membership also increases according to the exponential model, with constant d. The active membership will then be given by:

{exp(0.0385.t) - exp(d.t)}

and it remains to find “d”.

We could go ass-backward and assume what we want to find. i.e. assume that 10% are active at any given time. In this case:

{exp(0.0385.t) - exp(d.t)}/exp(0.0385.t) = 0.1 for all t.

so 1 - exp((d - 0.0385)t) = 0.1
(d - 0.0385)t = ln(0.9) ( = -0.105)

… but the solution for d will be dependent on t, which is not appropriate for an exponential model.
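To see the dependence on t concretely, one can solve the last equation for d at a few different times. This is only a sketch: the function name is invented, and the sample times are arbitrary.

```python
import math

def d_for_constant_share(t, k=0.0385, active_share=0.1):
    """Value of d that solves 1 - exp((d - k) * t) = active_share at time t."""
    return k + math.log(1 - active_share) / t

# If a fixed 10% share were compatible with the model, these would all agree.
for t in (10, 20, 40):
    print(t, round(d_for_constant_share(t), 4))
```

The required d creeps up toward 0.0385 as t grows, so no single constant works for every t.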

So if the exponential models are appropriate for membership and dormancy, there can’t be a static proportion of active members. One of these two assumptions is broken. I’d be more inclined to believe the exponential growth – as demonstrated, it is pretty good for membership growth and there is no reason to assume otherwise for dormancy. It’s just one of those little quirks of nature that it tends to work pretty well. So back to the drawing board.

Instead, let’s suppose that at the time of kabbes’ hypothesis (5/11/2001, or t = 13), there really were 10% of posters active. As it happens, that was pretty much exactly the date Algernon joined. He was poster number 18,799. Call it a nice round 18,800. So 1,880 were active.

And as of mid-March 2004 (t = 40.25, say), there are 3,806 active of 45,000 members precisely.

So we have: at beginning November 2001, 16,920 dormant members and at mid March 2004, 41,194 dormant members. As before, d = ln(4.1194/1.6920)/27.25 = 0.0395.

How many were active at time t = 0? Who knows. The current trend, however, is that it would have been in excess of 10%, say 15% at a guess. 85% dormant is 8,500. So dormant at time t is:

8,500 x exp(0.0392.t)

This means that the number of active members at time t is:

10,000 x {45/47 x exp(0.0385.t) - 0.85 x exp(0.0395.t)}

(Rescaling by 45/47 to reflect today’s membership)

Note that at time t = 40.25, this gives 3,824 active, which is pretty much what we wanted, indicating that the 15% estimate is pretty good.

An interesting feature of this is that it will peak. Differentiating and setting to zero, this would have been at time t = 93.35, or approximately 4.5 years hence. After that it would decline. But 4.5 years is a lifetime in internet terms. It’s more likely to be an inadequacy of the model at longer periods than a true reflection of reality.

Anyway – I can end with something to test. If subscription had not happened, I would have anticipated pretty much 71,500 members by exactly one year hence. Of these, I would expect that 4,624 would have been active (6.5% of the membership). So let’s wait and see what subscription does to it and we can discuss the effects explicitly.
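The peak time and the one-year forecast can be reproduced with the sketch below. It takes the constants exactly as they appear in the active-membership formula (0.0385, 0.0395, the 45/47 rescaling and the 0.85 dormant share); the function names are invented for the illustration.

```python
import math

G, D = 0.0385, 0.0395   # membership and dormancy rates, as written in the formula
SCALE = 45 / 47         # rescaling to today's membership
DORMANT0 = 0.85         # guessed dormant share at t = 0

def members(t):
    """Total membership at month t."""
    return 10_000 * SCALE * math.exp(G * t)

def dormant(t):
    """Dormant membership at month t."""
    return 10_000 * DORMANT0 * math.exp(D * t)

def active(t):
    """Active membership: total minus dormant."""
    return members(t) - dormant(t)

def peak_time():
    """Month where d(active)/dt = 0, from
    SCALE * G * exp(G t) = DORMANT0 * D * exp(D t)."""
    return math.log(SCALE * G / (DORMANT0 * D)) / (D - G)

one_year_on = 40.25 + 12
print(round(peak_time(), 1), round(members(one_year_on)), round(active(one_year_on)))
```

Because the two rates differ only in the third decimal place, the peak is very sensitive to d; a small change there moves it by years, which is another reason not to lean on the long-range behaviour.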

pan

PS: Ed’s estimate of 7,000 active instead, giving 38,000 dormant, intuitively doesn’t feel right – I would expect the proportion dormant to be growing, not shrinking, with age as the board becomes more and more impersonal. But we can certainly use Ed’s figures: d = ln(3.8/1.6920)/27.25 = 0.0297. However, in order to get this equation:

10,000 x {45/47 x exp(0.0385.t) - k x exp(0.0297.t)}

correct for time t = 40.25, we have to assume that k > 1, meaning that there were more dormant members than total members at time zero. This is clearly ridiculous. Therefore either the exponential growth model is inaccurate, despite all the above evidence to the contrary, or Ed’s figure is way optimistic. I’m inclined to believe the latter. FWIW, Algernon’s figure seems to me to fit about right.
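The k > 1 result can be checked directly. The sketch solves the equation above for k at t = 40.25 with 7,000 active members, and recomputes d from the dormant counts quoted in the PS; the variable names are made up for the illustration.

```python
import math

# Dormancy rate implied by Ed's figure: 16,920 dormant growing to 38,000
# over 27.25 months.
d_ed = math.log(3.8 / 1.6920) / 27.25

# Solve 10,000 * (45/47 * exp(0.0385 t) - k * exp(0.0297 t)) = 7,000 for k.
t = 40.25
k = (45 / 47 * math.exp(0.0385 * t) - 0.7) / math.exp(0.0297 * t)

print(round(d_ed, 4), round(k, 2))
```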

Welcome back kabbes!! It’s nice to see you again. We miss your everyday presence.

Thank you for your additional mathematical analysis. It’s fascinating to see this question attacked again from a completely different perspective. I followed your math just fine. But it is certain that I would not have been able to come up with the analysis on my own.

I appreciate the effort you put into this. Especially since it ends with:

It made my day.