My theory on forum threads

I suspect that the number of posts in any given thread will follow a power law distribution, rather than a normal (or other) distribution, as posts attract posts, both through increased interest and through folk replying to people who have replied to them.

And for similar reasons the number of posts will be logarithmically related to the number of views.

Sound likely to you? Has anyone done any studies on this?
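
In symbols, the two guesses in the opening post might read as below. This is only one possible reading: the names N (posts in a thread), V (views of that thread), and the constants α, a and b are placeholders that nobody in the thread defined.

```latex
% Guess 1: thread sizes are power-law distributed (alpha > 0 is an assumed exponent)
P(N = n) \propto n^{-\alpha}

% Guess 2: posts grow with the logarithm of views (a, b are assumed constants)
N \approx a + b \ln V
```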

While it is something I’ve noticed, I think many of the sex- or genital-related threads would throw any number-based theory out of whack. If anything, views outnumber posts 100 to 1 in those threads (and this usually gets commented on).

That’s not a theory – it’s a hypothesis speculating on observation.

I have an opinion that most people who start something with “I have a theory” don’t have a clue what a theory is, and don’t care. However, I have not put together an adequate data set, formulated it in a falsifiable way, nor submitted it for peer review.
So it ain’t a damn theory.

It’s testable with what’s in the forums right now. Go for it.

Is it really necessary to be so pedantic about this?

When I hear people say “theory” in a casual way, I’m fully aware that they often mean “hypothesis”. That’s the way normal conversation works. Word usage is imprecise. I realize that and I never stop to correct them. I’m guessing “theory” is simply an easier word to use… it’s 2 syllables vs the 4 syllables of “hypothesis.” Also, the word “theory” probably comes to mind more easily because it gets more exposure… “the theory of evolution” … a TV show called “Big Bang Theory” (I have no idea if that TV show has anything to do with any science theories.)

When people refer to a tomato as a “vegetable”, do you always stop and correct them to say it’s actually a “fruit”? I suspect you do not. Can you cut people some slack on this one?

This paper might be the kind of thing that you’re looking for

http://www2008.org/papers/pdf/p645-gomezA.pdf

“We perform a statistical analysis of user’s reaction time to a new discussion thread in online debates on the popular news site Slashdot. First, we show with Kolmogorov-Smirnov tests that a mixture of two log-normal distributions combined with the circadian rhythm of the community is able to explain with surprising accuracy the reaction time of comments within a discussion thread. Second, this characterization allows to predict intermediate and long-term user behavior with acceptable precision. The prediction method is based on activity-prototypes, which consist of a mixture of two log-normal distributions, and represent the average activity in a particular region of the circadian cycle.”

Googling “discussion thread length” statistics came up with several other promising-looking papers that my connection is too slow for me to access and evaluate at the moment.

In general, almost everything on the internet is a power law distribution.

Many things are both a fruit and a vegetable, so I would not correct someone referring to, say, a cucumber as either. I would, however, correct someone who called a carrot a fruit.

It depends upon what netdrama is currently unfolding.

Philosophy of science doesn’t get interesting until you get past Bacon and on to Popper :rolleyes:

Meanwhile in the real world I will use the same imprecise language as everyone else.

A prune isn’t really a vegetable … a cabbage is a vegetable.

I daresay another interesting factor to look at is participation: is participation affected by topic? Some threads tend to get many new posters, each chiming in with their quirky response (say, the Lord of the Rings thread), while many others devolve into particular issues being hashed out between a small number of posters. That is, three or four posters keeping a thread alive to argue over their particular point. Is thread readability or community-wide interest reflected in this? Can that be gleaned from thread views?

Also, how does TLDR factor in? For a variety of reasons, some long threads probably lose participants despite initial interest.

It also depends on the type of thread it is. Some threads by nature draw a lot of posting. For example, threads in IMHO with questions like, “A poll: Do you put forks in your dishwasher prongs up or down? I put them down.” Then dozens of people will weigh in, amazingly to me, because I’m pretty sure no one really cares how dozens, or sometimes hundreds, of anonymous posters put their forks in. But so many people are just dying to tell the world how they load their dishwasher.

Or another kind of thread that will have a disproportionate number of posts is the tit-for-tat polemics, usually political. Two or four posters will get into an endless back-and-forth, which devolves into basically: “Are too!” “Am not!” “Are too!” “Am not!” “You said such-and-such.” “No I didn’t. But YOU said such-and-such!” Usually anything interesting regarding the subject gets said by the first one or two pages, but there are people who will go on for over five pages doing this, and it often gets nowhere.

To actually answer this question, I made a quick and dirty regression of reply counts for the top 200 threads in GQ (ranked by number of replies) and got an R^2 of 0.95.

So yes, the SDMB does appear to follow a power law.

How I did it:

Open up Excel
Data->From Web
Entered in this URL: http://boards.straightdope.com/sdmb/forumdisplay.php?s=&f=3&page=1&pp=200&sort=replycount&order=desc&daysprune=-1
Selected the main table
Data->Sort->Replies Reverse Sort Order->Largest to Smallest
Insert->Line Graph
Add Trendline->Power Law->Display R^2
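
For anyone without Excel, here is a rough Python sketch of the same idea: rank the threads by reply count, then fit a straight line to log(replies) against log(rank). I believe this matches what Excel's power trendline does, but treat that as an assumption. The function name `fit_power_law` and the reply counts are made up for illustration; this is not the actual SDMB data.

```python
import math

def fit_power_law(values):
    """Fit values[i] ~ a * (i + 1) ** b by least squares on log-log axes.

    `values` must be positive and sorted largest to smallest.
    Returns (a, b, r_squared), with r_squared measured in log space.
    """
    xs = [math.log(rank) for rank in range(1, len(values) + 1)]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return math.exp(intercept), slope, 1 - ss_res / ss_tot

# Made-up reply counts for 200 threads (NOT real SDMB data).
replies = [1000 / rank ** 0.8 for rank in range(1, 201)]
a, b, r2 = fit_power_law(replies)
print(f"replies ~ {a:.1f} * rank^{b:.2f}, R^2 = {r2:.4f}")
```

Threads with zero replies would have to be dropped first, since the logarithm of zero is undefined.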

The two threads with the largest number of replies look like outliers to me. How does the model change if you exclude them?

FWIW, with the data I’ve managed to get, the correlation between views and posts is very weak (R^2 = 0.2) and there are some very notable outliers in the sample including:

What is kopimism? (4000+ views, 1 reply)

and another thread I won’t name (so as to not corrupt the data) which has 102 replies and 123 views, which means 83% of people who read it were compelled to reply. The average was 55 views per reply, or a 2% response rate.
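
The arithmetic behind those two percentages, as a small sketch. The helper name `response_rate` is made up, and note that it counts replies per view, which only approximates "share of readers who replied" since one person can view a thread several times.

```python
def response_rate(replies, views):
    """Replies per view, as a fraction between 0 and 1."""
    return replies / views

# The unnamed outlier thread: 102 replies, 123 views.
outlier = response_rate(102, 123)

# The sample average: 55 views per reply, i.e. 1 reply per 55 views.
typical = response_rate(1, 55)

print(f"outlier: {outlier:.0%}, typical: {typical:.0%}")
```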

Removing the first: R^2 = 0.9827
Removing both: R^2 = 0.9869
Removing the top 3: R^2 = 0.9894

From then on, R^2 decreases as you remove more.
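
The outlier check above could be repeated in code roughly like this: drop the k largest threads, re-rank the rest from 1, and refit on log-log axes. The data here is invented (two runaway threads sitting on top of an otherwise clean power-law tail) and `log_log_r2` is a made-up helper, so the numbers it prints say nothing about the real board.

```python
import math

def log_log_r2(values):
    """R^2 of a straight-line fit to log(value) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(values) + 1)]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # For a one-variable linear fit, R^2 equals the squared correlation.
    return sxy * sxy / (sxx * syy)

# Made-up data, sorted largest to smallest (NOT real SDMB data).
replies = [50000.0, 20000.0] + [1000 / rank ** 0.8 for rank in range(1, 199)]

# Drop the k largest threads, re-rank what is left from 1, and refit.
r2_by_dropped = {k: log_log_r2(replies[k:]) for k in range(4)}
for k, r2 in r2_by_dropped.items():
    print(f"top {k} removed: R^2 = {r2:.4f}")
```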

Then there are the people who send a post, and then say, “So-and-so ‘beat’ me to the punch!” They mean that another poster said the same thing before they posted, and they are angry for some reason.

Can somebody tell me why this matters? Who cares if somebody posted the same information before you? Why do you have to be the first? Or rather, how does it reduce the significance of what you are trying to say?