Inference and regression

Materials for class on Thursday, November 29, 2018



No slides for today, since there’s no formal class.

Review of hypothesis testing and inference

ModernDive chapter 11 isn’t finished yet, but the principles it covers are relatively straightforward. When testing hypotheses with simulation, you follow this pattern (which you should probably have memorized by now):

We care about statistical significance because of sampling error. If you measure something about a sample (like the proportion of blue M&Ms in a bag), you’re going to be off a little (or a lot if you have a small sample). If you see some number (e.g., you find a small bag of M&Ms with 0 blue M&Ms in it), you want to know if that’s an anomaly or if it’s something that could just happen by chance. If you had a fun-sized bag of 20 candies with 0 blue M&Ms in it, you could chalk that up to chance. If you had a giant Costco-sized bag of 2000 candies with 0 blue M&Ms in it, you’d be concerned. Such a finding would be statistically significant.
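To get a feel for how bag size changes things, here’s a rough sketch of the chance of finding 0 blue M&Ms in each bag. The 10% share of blue candies is a made-up number chosen only for illustration (it is not the real proportion), and the sketch assumes each candy is independently blue with that probability.

```r
# Hypothetical assumption: 10% of all M&Ms are blue (not a real figure)
prop_blue <- 0.10

# Probability of getting 0 blue candies purely by chance
p_fun_sized <- dbinom(0, size = 20, prob = prop_blue)
p_costco <- dbinom(0, size = 2000, prob = prop_blue)

p_fun_sized  # about 0.12, so not that unusual
p_costco     # effectively 0
```

Under this assumption, an empty fun-sized bag happens in roughly 1 out of every 8 bags, while an empty Costco-sized bag essentially never happens by chance.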

The same principle applies to calculating differences between groups. This is especially important in randomized controlled trials. Imagine a treatment group of 500 people is given some sort of new cold medicine and a control group of 500 people is given a placebo. All 1000 people somehow got the cold at the same time, and all 1000 people took their medicine (either the actual medicine or the placebo). Consider these two scenarios:

  1. Those in the control group were sick for an average of 7 days, while those in the treatment group were sick for an average of 6.875 days. The difference between the two groups is 0.125 days (or 3 hours). The 95% confidence interval for this difference ranges from -0.5 days to 1.25 days.
  2. Those in the control group were sick for an average of 7 days, while those in the treatment group were sick for an average of 4.5 days. The difference between the two groups is 2.5 days. The 95% confidence interval for this difference ranges from 1 day to 3.5 days.

There is a measurable difference between the average sickness duration of the two groups in both of these scenarios, but only the second scenario shows a substantial and significant difference.

In the first situation, the confidence interval includes zero, which means the true unmeasurable difference between the two groups could potentially be zero. It could be negative. It could also be positive. We’re not sure. The probability of seeing a difference of 0.125 days in a world where there’s actually no difference is probably really high—that’s a typical number.

In the second situation, the confidence interval runs from 1 day to 3.5 days and excludes zero, so we’re very confident that the true difference between the two groups is not zero. The probability that we’d see a difference of 2.5 days in a world where there’s actually no difference is really, really low. This is like finding no blue M&Ms in a big bag.

In the first situation, we cannot say that the new cold medicine has a statistically significant effect on the duration of sickness—we don’t have enough evidence. In the second situation, we can.
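The "does the confidence interval include zero?" check is simple enough to write as code. This is only a sketch that retypes the numbers from the two scenarios above; the column names are invented for this example.

```r
library(tibble)
library(dplyr)

# Differences and 95% confidence intervals from the two scenarios above
scenarios <- tribble(
  ~scenario, ~difference, ~lower_ci, ~upper_ci,
  1,         0.125,       -0.5,      1.25,
  2,         2.5,         1,         3.5
) %>%
  # A difference is significant when its interval does not contain zero
  mutate(ci_includes_zero = lower_ci <= 0 & upper_ci >= 0,
         significant = !ci_includes_zero)

scenarios
```

Only the second scenario ends up with `significant` set to `TRUE`.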

Regression coefficients and inference

Just like averages, medians, proportions, differences in averages, and so on, regression coefficients are sample statistics. They are estimates of some unmeasurable population parameter that we can only calculate by taking samples from the population. Because of this, they also have standard errors and confidence intervals associated with them. They’re only estimates—they could be higher, or they could be lower.

With linear regression, what we generally care about is whether or not a coefficient is zero. If a coefficient could possibly be zero, it would mean that for every one-unit increase in X, Y might not actually change. If there’s very little chance that a coefficient could be zero, we can be pretty sure that for every one-unit increase in X, Y really truly responds and changes.

Here’s the process for testing hypotheses about coefficients:

Notice how steps 2 and 3 are missing here. That’s because the infer package does not have a way to simulate a world where a coefficient is zero—there’s no coefficient option in specify() or anything like that. You can essentially do these steps in your head, though.
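If you want to see what those missing steps would look like, you can simulate a null world by hand for a model with a single explanatory variable. The sketch below is one possible approach, not something infer does for you. It uses R’s built-in mtcars data (not data from this class), shuffles the explanatory variable so it has no relationship with the outcome, and refits the model 1,000 times.

```r
set.seed(1234)

# The observed slope in the real data
observed_slope <- coef(lm(mpg ~ wt, data = mtcars))[["wt"]]

# The missing steps: invent null worlds by shuffling wt, which breaks
# any link between wt and mpg, and record the slope in each one
null_slopes <- replicate(1000, {
  shuffled <- mtcars
  shuffled$wt <- sample(shuffled$wt)
  coef(lm(mpg ~ wt, data = shuffled))[["wt"]]
})

# Share of null-world slopes at least as far from zero as the observed one
p_value <- mean(abs(null_slopes) >= abs(observed_slope))
p_value
```

Shuffling like this only works cleanly when there’s one explanatory variable. With several explanatory variables, the p_value and confidence interval columns from the regression table do the same job.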

Complete example

Here’s a full example using property tax data from Problem Set 3.

First we load the libraries and data we’ll need. We’ll divide the median home value variable by 100 so it’s a little easier to interpret (so we can say “for every $100 increase in median home value…” instead of “for every $1 increase in median home value…”).

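library(tidyverse)   # read_csv(), mutate(), and %>%
library(moderndive)  # get_regression_table() and get_regression_summaries()
taxes <- read_csv("data/property_taxes_2016.csv") %>% 
  mutate(median_home_value = median_home_value / 100)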

Next we build a regression model that predicts/explains per-household taxes based on median home value in each county, the proportion of households with kids in each county, and the state each county is in:

tax_model <- lm(tax_per_housing_unit ~ 
                  median_home_value + prop_houses_with_kids + state,
                data = taxes)
tax_model %>% get_regression_table()
term                   estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept                -412.5      118.1     -3.493    0.001    -645.8    -179.2
median_home_value         0.405      0.018      21.99        0     0.369     0.442
prop_houses_with_kids     14.09      2.853      4.941        0     8.459     19.73
stateCalifornia           123.3      88.22      1.397    0.164    -50.98     297.5
stateIdaho                9.526      82.74      0.115    0.908    -153.9       173
stateNevada               102.5      98.25      1.043    0.299    -91.63     296.5
stateUtah                -213.2      91.21     -2.337    0.021    -393.3    -33.03
tax_model %>% get_regression_summaries()
r_squared  adj_r_squared    mse   rmse  sigma  statistic  p_value  df
    0.845          0.839  73100  270.4  276.4      141.3        0   7

Up to this point in the semester, you’ve only looked at the estimate column. Note that there are columns named p_value (the probability that you’d see a coefficient at least this far from zero in a world where the true coefficient is actually zero), lower_ci (the lower bound of the 95% confidence interval), and upper_ci (the upper bound of the 95% confidence interval). A p_value shown as 0 has been rounded, so read it as p < 0.001.
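Here’s a quick sketch of how you might flag which coefficients are statistically significant. To keep it self-contained, the numbers are copied by hand from the regression table above rather than pulled from tax_model. The 0.05 cutoff is the conventional threshold and is an assumption of this example, as are the new column names.

```r
library(tibble)
library(dplyr)

# Selected columns copied by hand from the regression table above
tax_coefs <- tribble(
  ~term,                   ~estimate, ~p_value, ~lower_ci, ~upper_ci,
  "median_home_value",     0.405,     0,        0.369,     0.442,
  "prop_houses_with_kids", 14.09,     0,        8.459,     19.73,
  "stateCalifornia",       123.3,     0.164,    -50.98,    297.5,
  "stateIdaho",            9.526,     0.908,    -153.9,    173,
  "stateNevada",           102.5,     0.299,    -91.63,    296.5,
  "stateUtah",             -213.2,    0.021,    -393.3,    -33.03
)

# Two ways of asking the same question: is the p-value below 0.05,
# and does the 95% confidence interval include zero?
flagged <- tax_coefs %>%
  mutate(significant = p_value < 0.05,
         ci_includes_zero = lower_ci < 0 & upper_ci > 0)

flagged
```

The two new columns always disagree with each other: every coefficient with a p-value below 0.05 has a confidence interval that excludes zero, and vice versa.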

Here’s how we interpret all these results, now with hypothesis testing built in:

That’s a lot of really verbose writing. In real life, you’d probably only really be concerned about one or two of the coefficients and only include the others to control for other effects (like, in general, you don’t care about the exact nature of state effects, but they’re useful for picking up some of the variation in your outcome variable).

Here’s how you could write up this entire model with all coefficients in a complete, concise paragraph:

We used ordinary least squares (OLS) regression to explain the variation in per-household property tax rates in five Western states, based on the median home value and the proportion of households with kids in each county. We also controlled for state effects. With these three explanatory variables, our model explains nearly 84% of the variation in tax rates in these states. Home values and proportions of households with kids both have a statistically significant association with tax rates. Controlling for all other variables in our model, on average, each $100 increase in median home values is associated with a $0.40 increase in property taxes (p < 0.001), while every 1 percentage point increase in the proportion of households with kids is associated with a $14 increase in taxes (p < 0.001). With the exception of Utah, where property taxes are $213 lower than those in Arizona, on average (p = 0.021), none of the individual state coefficients are statistically significant.

Your turn!

Now you get to interpret the coefficients for two regression models. You’ve already learned how to do the bulk of the interpretation: for every one unit increase in X there’s an associated β change in Y. Now you just need to incorporate hypothesis testing and determine if those effects are statistically significant.
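If the "one unit increase in X" interpretation still feels abstract, you can check it with predictions. This sketch uses R’s built-in mtcars data (not the data for this exercise) and two hypothetical cars that are identical except for one unit of wt.

```r
model <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical cars that differ only by one unit of wt
two_cars <- data.frame(wt = c(3, 4), hp = c(110, 110))
predictions <- predict(model, newdata = two_cars)

# The gap between the two predictions is the wt coefficient
diff(predictions)
coef(model)[["wt"]]
```

Holding hp constant, the predicted mpg changes by exactly the wt coefficient, which is what "controlling for all other variables" means in practice.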

Here are your basic templates:

Open a new R Markdown file or R script and run the following code on your computer. Note that read_csv() points to a URL on the internet. I did this so you don’t need to make a new RStudio project, download data, and put the data in that folder. This simplifies things. In real life, though, you’d want to store these files locally on your computer. Imagine loading important data from some website that changes its URLs a year from now (or shuts down and stops existing!): your code would break when it tried to load data from dead URLs.
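One way to protect yourself from dead URLs is to download the file once and read the local copy from then on. The helper below is a hypothetical sketch (read_csv_cached() is not part of readr or any other package), and the URL and file path in the commented-out usage line are placeholders.

```r
library(readr)

# Hypothetical helper: download a CSV once, then reuse the local copy
read_csv_cached <- function(url, local_path) {
  if (!file.exists(local_path)) {
    # Create the destination folder if needed, then save the file
    dir.create(dirname(local_path), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile = local_path, mode = "wb")
  }
  read_csv(local_path)
}

# Placeholder URL and path; swap in the real ones
# my_data <- read_csv_cached("https://example.com/my_data.csv", "data/my_data.csv")
```

After the first run, the code keeps working even if the website disappears, because it never touches the URL again.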

Interpret each of the coefficients in these two models. Go through each and determine if each is statistically significant or not. Try to write up a concise paragraph like the one above for each of the models.

The answers are at this page here. Don’t look at them until you’ve tried as a team.

Good luck!

World happiness

We’ve worked with this world happiness data before. As a reminder, here’s what these different variables measure:


happiness <- read_csv("")

# The base case for region is "East Asia & Pacific"
# The base case for income is "High income"
model_happiness <- lm(happiness_score ~ life_expectancy + 
                        access_to_electricity + region + income, 
                      data = happiness)

model_happiness %>% get_regression_table()
model_happiness %>% get_regression_summaries()

Brexit results

We looked at this data during class 8. Here’s what these different variables measure:

results_brexit <- read_csv("")

model_brexit <- lm(leave_share ~ con_2015 + lab_2015 + ukip_2015 +
                     degree + age_18to24 + born_in_uk + unemployed + male, 
                   data = results_brexit)

model_brexit %>% get_regression_table()
model_brexit %>% get_regression_summaries()