I’m reading a research study and it’s got some whizz-bang stats in it that I don’t understand. The context is that the new playground encourages more physical activity. Levels of physical activity were checked at regular intervals and coded in three ways: firstly on a 1-5 scale based on how active the child was, secondly as sedentary or non-sedentary (0 or 1), and thirdly as moderate to vigorous physical activity (MVPA) (1) or not (0). Then there is this table I can’t make head or tail of.

Hi Jebbilene, would you mind sharing a link to the original study? Usually the authors are considerate enough to explain their methods and comment on the tabulated results.

Without the original study, I would start by imagining what the underlying data frame might look like. The wording of the research question suggests that the explanatory variable is the condition of the play area (undeveloped or newly developed), while the response variable is the child’s level of physical activity.

The N=6596 in the title could represent the number of distinct children (if it’s a matched pairs study), or the number of distinct observation times (where the same child might have activity level measured in two different rows of the data frame). The linear and logistic regressions that they report in the table are hard to square with a matched pairs study, so let’s suppose the data frame for the first regression looks like this:

| Obs | Playground Condition | Physical Activity |
|---|---|---|
| 1 | old | 2 |
| 2 | old | 3 |
| 3 | old | 1 |
| … | … | … |
| 6595 | new | 4 |
| 6596 | new | 5 |

Because the response variable is numerical, they can do ordinary least squares (linear) regression, coding the playground condition as old=0, new=1. The parameters of interest would be the slope of the line, the y-intercept, and the coefficient of determination (R-Sq). These numbers are basically what you find in the first column of the table.
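To see what slope, intercept, and R-Sq mean when the explanatory variable is just a 0/1 code, here is a minimal sketch in plain Python. The eight observations are made up for illustration and are not values from the study; `ols_fit` is a hypothetical helper name.

```python
# Made-up illustration data, NOT values from the study.
# x: playground condition coded old=0, new=1
# y: physical activity on the 1-5 scale
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [2, 3, 1, 2, 4, 5, 3, 4]

def ols_fit(x, y):
    """Return (intercept, slope, r_squared) for simple least-squares regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-Sq = 1 - (unexplained variation) / (total variation)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot

intercept, slope, r2 = ols_fit(x, y)
print(intercept, slope, r2)
```

With a 0/1 predictor, the intercept comes out as the average activity in the old playgrounds and the slope as the difference between the new and old averages, which is why those two numbers are the ones worth reporting.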

As for the second column in the table, now the data frame might be formatted like so:

| Obs | Playground Condition | Non-Sedentary Physical Activity? |
|---|---|---|
| 1 | old | 0 |
| 2 | old | 0 |
| 3 | old | 1 |
| … | … | … |
| 6595 | new | 1 |
| 6596 | new | 1 |

Because the response variable is now binary (y=0 or y=1), linear regression is not recommended. Instead they employ a logistic transformation, something like p=e^z/(1+e^z), where p is the probability that y=1, and model z as a linear function of the explanatory variable x (playground condition). Again the parameters reported are essentially slope, intercept, and R-sq (for a logistic model this would be a pseudo-R-sq rather than the OLS one).
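When the only explanatory variable is a 0/1 code, the logistic fit can be worked out by hand, which makes the recipe easier to follow. The sketch below uses made-up data (not the study's) and a hypothetical helper `fit_binary_predictor`; it relies on the fact that, with a single binary predictor, the maximum-likelihood fit simply reproduces the observed proportion in each group.

```python
import math

def logit(p):
    """Log-odds of a probability p (0 < p < 1)."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """The logistic transformation e^z / (1 + e^z)."""
    return math.exp(z) / (1 + math.exp(z))

def fit_binary_predictor(x, y):
    """Closed-form logistic fit for a 0/1 predictor x and a 0/1 response y."""
    ys_old = [yi for xi, yi in zip(x, y) if xi == 0]
    ys_new = [yi for xi, yi in zip(x, y) if xi == 1]
    p_old = sum(ys_old) / len(ys_old)   # share non-sedentary, old playground
    p_new = sum(ys_new) / len(ys_new)   # share non-sedentary, new playground
    intercept = logit(p_old)
    slope = logit(p_new) - intercept
    return intercept, slope

# Made-up illustration data: 4 of 10 non-sedentary (old), 6 of 10 (new).
x = [0] * 10 + [1] * 10
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

intercept, slope = fit_binary_predictor(x, y)
print(intercept, slope, math.exp(slope))
```

Here `math.exp(slope)` is the odds ratio between the two conditions, and `inverse_logit(intercept + slope)` recovers the proportion observed in the new playgrounds.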

The third column of the table would be obtained by a similar recipe as the second, except the data frame uses Moderate to Vigorous Physical Activity as the response variable, rather than Non-Sedentary Physical Activity.

Hi biqu, thanks very much for the response. You are correct about the variables and that N=6596 is the number of observations, rather than the number of children. So I think your guess as to how the data frame may have looked is probably pretty accurate.

The parameters of interest would be the slope of the line, the y-intercept, and the coefficient of determination (R-Sq).

Could you tell me why those things would be of interest, and what they might tell us? I’m way out of my depth, but I thought R-Sq was supposed to be high to indicate correlation, and it seems really low. And doesn’t negative mean inverse correlation, so the girls were actually less active in the new playgrounds?

Thanks for the link, Jebbilene. After downloading the PDF I found the paragraph you’re referring to:

The base model controlling only for gender was processed for each of the three PA outcome variables. On average, girls were less physically active than boys and less likely to be classified as nonsedentary ... Children observed after outdoor renovations were 22% more likely to be engaged in nonsedentary activity. So, independent of gender, children were more likely to be engaged in nonsedentary activity in renovated OLEs (Table 2).

I’m guessing that by “base model” they mean a multiple-regression equation like Y=a+b_{1}x_{1}+b_{2}x_{2}+…+b_{k}x_{k}, where Y is the response variable (physical activity, either coded on the 1–5 scale or transformed logistically) and the x’s are the explanatory variables (including gender and the specific playground features built during the renovation). Table 2 reports two sets of numbers in each cell: unstandardized (and standardized) effects. Standardized effects are usually rescaled into standard-deviation units so that variables measured on different scales can be compared, which might be why both are shown.

The 1.22 that appears in the second row seems to be what they’re interpreting when they say “22% more likely to be engaged in nonsedentary activity” after renovations. That interpretation suggests that the second row reports the effect of renovation (one of the b’s in the equation above), and that for the logistic columns the number shown is most plausibly an odds ratio, e^{b_{1}}, rather than the raw slope. If x_{1} is the explanatory variable and it changes from 0 to 1 when a playground gets renovated, then e^{b_{1}}=1.22 (that is, b_{1}=\ln 1.22\approx 0.20) means the odds of nonsedentary activity are multiplied by 1.22, a 22% increase in the odds (due to the way the logistic transformation is defined).

The Y-intercept of the model is perhaps not meaningful enough to warrant an appearance in Table 2. Now that I’m reading the original study, it appears they’re using row 1 to provide another coefficient of the base model (say b_{2}). Then if x_{2}=1 represents that the observed child is female, the model predicts a lower physical activity than the case x_{2}=0. (If the 0.643 comes from one of the logistic columns, it is best read as an odds ratio: about 36% lower odds, because 1-0.643 = 0.357.)
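To make the arithmetic behind the 1.22 and the 0.643 concrete, here is a small sketch. It assumes both numbers are odds ratios from a logistic column, which this thread never confirms, and the 50% baseline is purely hypothetical. The function names are mine, not the study's.

```python
import math

# Values quoted in this thread from Table 2.
# ASSUMPTION: both are odds ratios, i.e. e^b for a logistic coefficient b.
OR_RENOVATED = 1.22
OR_FEMALE = 0.643

def coefficient_from_odds_ratio(odds_ratio):
    """Logistic coefficient b such that e^b equals the odds ratio."""
    return math.log(odds_ratio)

def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds implied by an odds ratio."""
    return (odds_ratio - 1) * 100

def apply_odds_ratio(p_baseline, odds_ratio):
    """Probability after multiplying the baseline odds by the odds ratio."""
    odds = p_baseline / (1 - p_baseline) * odds_ratio
    return odds / (1 + odds)

print(round(coefficient_from_odds_ratio(OR_RENOVATED), 3))
print(round(percent_change_in_odds(OR_RENOVATED), 1))
print(round(percent_change_in_odds(OR_FEMALE), 1))
# Hypothetical baseline: half of the observations non-sedentary before renovation.
print(round(apply_odds_ratio(0.5, OR_RENOVATED), 3))
```

Under these assumptions, a 22% increase in the odds moves a 50% baseline to roughly 55%, so "22% more likely" is a statement about odds rather than about the probability itself.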

As for R^{2}, you’re right that the numbers appear low. With multivariate regression you can think of R^{2} as telling you how much explanatory power can be attributed to the variables on which you regress. There appears to be a whole lot of other variation in physical activity, which cannot be accounted for by gender and playground renovation.
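A low R^{2} does not by itself mean there is no effect. The sketch below uses made-up 1-5 scores (not the study's data) where the new-playground group averages higher, yet most of the variation is still between individual observations. `r_squared_two_groups` is a hypothetical helper.

```python
def r_squared_two_groups(old, new):
    """R-squared from regressing the response on a 0/1 group indicator."""
    all_y = old + new
    grand_mean = sum(all_y) / len(all_y)
    mean_old = sum(old) / len(old)
    mean_new = sum(new) / len(new)
    # Total variation around the overall mean
    ss_tot = sum((y - grand_mean) ** 2 for y in all_y)
    # Variation left over after using each group's own mean as the prediction
    ss_res = (sum((y - mean_old) ** 2 for y in old)
              + sum((y - mean_new) ** 2 for y in new))
    return 1 - ss_res / ss_tot

# Made-up illustration data on the 1-5 activity scale.
old = [1, 1, 2, 3, 4, 5]
new = [1, 2, 3, 4, 5, 5]

mean_difference = sum(new) / len(new) - sum(old) / len(old)
r2 = r_squared_two_groups(old, new)
print(mean_difference, r2)
```

In this toy example the group averages differ by about two thirds of a point, but R^{2} is under 0.05, because activity varies far more from one observation to the next than it does between the two playground conditions.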