Wrapping up and going beyond the basics

Materials for class on Thursday, December 13, 2018

Slides
Logistic regression
Algorithmic complexity and mystery
Sharing graphics, code, and analysis

Slides

Download the slides from today’s lecture:

Logistic regression

If you have an outcome variable that is binary (yes/no, agree/disagree, Obama/Romney, etc.), you cannot use standard linear regression. Instead, you need to use something called logistic regression. The math and mechanics of this type of regression are beyond the scope of this class, but you know enough about regression to be able to use it and interpret it.

To run logistic regression in R, we use glm() instead of lm(), but the structure is the same. The only difference is that we have to specify that this is logistic regression with the family option:

glm(y ~ x1 + x2 + x3, 
    family = binomial(link = "logit"),
    data = whatever)
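As a rough illustration of how similar the two calls are, here is a small sketch that fits both kinds of model on R's built-in mtcars data. The dataset and the variables am, wt, and hp are stand-ins chosen only so the code runs on its own; they are not part of the GSS example in this lesson.

```r
# mtcars ships with R; am is a binary column (0 = automatic, 1 = manual).
# These variables are illustrative stand-ins, not the GSS data.

# Linear regression with lm(). This is not appropriate for a binary outcome
# and is shown only to compare the syntax.
model_linear <- lm(am ~ wt + hp, data = mtcars)

# Logistic regression: same formula and data, plus the family option
model_logistic <- glm(am ~ wt + hp,
                      family = binomial(link = "logit"),
                      data = mtcars)

# The coefficients are on the log odds scale
coef(model_logistic)
```

The formula and data arguments are identical in both calls; only the function name and the family option change.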

The coefficients that come from a logistic regression model behave a little differently than those from linear regression. Here’s an example using GSS data from your past problem sets.

First, we load libraries and clean the data:

library(tidyverse)
library(broom)

gss_raw <- read_csv("https://statsf18.classes.andrewheiss.com/data/gss2016.csv", 
                na = c("", "Don't know",
                       "No answer", "Not applicable"),
                guess_max = 2867) %>%
  select(marital, childs, educ, sex, race, born, income, pres12, polviews, pray)

gss <- gss_raw %>% 
  # Put political views in order, from most conservative to most liberal
  mutate(polviews_ordered = factor(polviews, 
                                   levels = c("Extrmly conservative", "Conservative", 
                                              "Slghtly conservative", "Moderate",
                                              "Slightly liberal", "Liberal",
                                              "Extremely liberal"),
                                   ordered = TRUE)) %>% 
  # Look for the letters "onservative". I omit the C because sometimes it's
  # uppercase (Conservative) and sometimes it's lowercase (Slghtly conservative)
  mutate(conservative = ifelse(str_detect(polviews, "onservative"),
                               "Conservative", "Not conservative")) %>% 
  mutate(marital2 = case_when(
    marital == "Married" ~ "Married",
    TRUE ~ "Not married"
  )) %>% 
  # Get rid of the respondents who didn't vote for Obama or Romney
  mutate(pres12 = ifelse(!pres12 %in% c("Obama", "Romney"), NA, pres12)) %>% 
  mutate(pres12 = factor(pres12)) %>% 
  # case_when is like a fancy version of an if statement and it lets us collapse
  # the different levels of pray into two
  mutate(pray2 = case_when(
    pray == "Several times a day" ~ "At least once a week",
    pray == "Once a day" ~ "At least once a week",
    pray == "Several times a week" ~ "At least once a week",
    pray == "Once a week" ~ "At least once a week",
    pray == "Lt once a week" ~ "Less than once a week",
    pray == "Never" ~ "Less than once a week",
    pray == "Don't know" ~ NA_character_,
    pray == "No answer" ~ NA_character_,
    pray == "Not applicable" ~ NA_character_
  )) %>% 
  mutate(pray2 = factor(pray2, 
                        levels = c("Less than once a week", "At least once a week"))) %>% 
  # Make this numeric
  mutate(childs = as.numeric(childs))

We’ll use this data to find the factors that made people more likely to vote for Mitt Romney in 2012. Our outcome variable (or Y) is pres12, which has two values: Obama and Romney. Because it’s not numeric, and because there are two possible categories, we’ll use logistic regression. We’ll explain 2012 presidential votes based on the number of children people have, marital status, conservativeness, and frequency of prayer:

model_logit <- glm(pres12 ~ childs + marital + conservative + pray2, 
                   family = binomial(link = "logit"), data = gss)

The get_regression_table() function from the ModernDive package only works on linear models made with lm(), but we can get the same thing if we use a function named tidy() from library(broom). Technically, get_regression_table() is really just tidy() behind the scenes; the ModernDive authors just renamed it to make it easier for you.

The coefficients that come from this model are uninterpretable initially. The numbers are in a scale called “log odds”, which is a side effect of the math used to run the logistic regression:

model_logit %>% tidy()

## # A tibble: 8 x 5
##   term                         estimate std.error statistic  p.value
##   <chr>                           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                    0.828     0.234      3.54  3.93e- 4
## 2 childs                        -0.102     0.0462    -2.22  2.66e- 2
## 3 maritalMarried                 0.411     0.175      2.35  1.86e- 2
## 4 maritalNever married          -0.693     0.225     -3.08  2.08e- 3
## 5 maritalSeparated              -0.477     0.425     -1.12  2.62e- 1
## 6 maritalWidowed                 0.0705    0.246      0.287 7.74e- 1
## 7 conservativeNot conservative  -2.59      0.131    -19.8   1.35e-87
## 8 pray2At least once a week      0.219     0.162      1.35  1.77e- 1

To transform these numbers into something interpretable, we have to unlog them. This will create something called an “odds ratio”, which shows the change in the odds of the Y outcome happening.

The coefficients were logged using a natural log (\(\ln\) or \(\log_e\)), so to unlog them, we need to take \(e\) and raise it to the power of each coefficient. If a log odds coefficient is 0.411, like the married coefficient, we can unlog it by calculating \(e^{0.411}\), which is 1.508. You can use the exp() function in R to do this: exp(0.411) is 1.508.
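A minimal sketch of the unlogging step, using three of the log odds estimates from the table above typed in by hand (the vector names are just labels made up for this illustration):

```r
# Log odds copied by hand from the regression table
log_odds <- c(childs = -0.102, married = 0.411, not_conservative = -2.59)

# exp() raises e to the power of each value, turning log odds into odds ratios
odds_ratios <- exp(log_odds)
round(odds_ratios, 3)

# log() is the natural log in R, so it reverses exp() and returns the log odds
log(odds_ratios)
```

Because exp() works on a whole vector at once, there is no need to call it separately for each coefficient.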

Running exp() on every coefficient, though, is tedious. Fortunately, the tidy() function has an option that will do that for you. We can also get confidence intervals by setting the conf.int option:

model_logit %>% tidy(exponentiate = TRUE, conf.int = TRUE)

## # A tibble: 8 x 7
##   term             estimate std.error statistic  p.value conf.low conf.high
##   <chr>               <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)        2.29      0.234      3.54  3.93e- 4   1.45      3.62  
## 2 childs             0.903     0.0462    -2.22  2.66e- 2   0.824     0.988 
## 3 maritalMarried     1.51      0.175      2.35  1.86e- 2   1.07      2.13  
## 4 maritalNever ma…   0.500     0.225     -3.08  2.08e- 3   0.321     0.776 
## 5 maritalSeparated   0.620     0.425     -1.12  2.62e- 1   0.263     1.40  
## 6 maritalWidowed     1.07      0.246      0.287 7.74e- 1   0.662     1.74  
## 7 conservativeNot…   0.0750    0.131    -19.8   1.35e-87   0.0579    0.0966
## 8 pray2At least o…   1.24      0.162      1.35  1.77e- 1   0.908     1.71
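For a sense of what the exponentiate and conf.int options are doing, here is a hedged base R sketch that builds a similar table by hand. It uses a stand-in model on the built-in mtcars data rather than the GSS model, and it uses confint.default(), which computes simple Wald intervals. That is an assumption made for this sketch; the intervals that tidy() reports for a glm may be calculated differently, so the numbers would not necessarily match.

```r
# Stand-in logistic regression on built-in data (not the GSS model)
model <- glm(am ~ wt + hp,
             family = binomial(link = "logit"),
             data = mtcars)

# Log odds estimates next to Wald 95% confidence intervals
log_odds_table <- cbind(estimate = coef(model), confint.default(model))

# Exponentiate every column to move from log odds to odds ratios
odds_ratio_table <- exp(log_odds_table)
odds_ratio_table
```

Exponentiating the ends of a confidence interval keeps the interval valid on the odds ratio scale, which is why the check for significance becomes "does the interval contain 1?" instead of "does it contain 0?".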

These are the actual coefficients we can interpret. The final trick is that these numbers are all based around 1. Values higher than 1 mean that the explanatory variables are associated with a greater probability of Romney votes; values less than 1 mean that the explanatory variables are associated with a lower probability of Romney votes.

Here’s how to interpret a bunch of these coefficients:

The coefficient for married people is 1.51. This means that compared to divorced people (the base case for marital) people who are married are 51% more likely to have voted for Romney in 2012. That coefficient is statistically significant, and the confidence interval does not include 1.
The coefficient for children is 0.9. This means that for every additional child someone has, the probability of voting for Romney in 2012 decreases 10% (since 1 − 0.9 is 0.1). This is also statistically significant; the confidence interval does not include 1.
The coefficient for conservative is 0.075. This means that people who aren’t conservative are 93% less likely to have voted for Romney in 2012 (since 1 − 0.075 = 0.925). This is statistically significant.
The coefficient for prayer is 1.24, which means that praying at least once a week is associated with a 24% greater chance of voting for Romney in 2012. This is not statistically significant, though: the p-value is 0.177, and the confidence interval contains 1.

The key is that everything is centered around 1. If the odds ratio is something like 0.4, it means that the Y outcome is 60% less likely (since 1 − 0.4 = 0.6). If the odds ratio is something like 1.8, it means that the Y outcome is 80% more likely (since 1.8 is 0.8 greater than 1). If the odds ratio is something big like 7.3, it means that the Y outcome is 7.3 times more likely.
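The arithmetic for reading odds ratios as percent changes can be sketched with a tiny helper. The function name percent_change is made up for this illustration and is not from any package. Strictly speaking, the percentage describes the change in the odds of the outcome.

```r
# Hypothetical helper: convert an odds ratio to a percent change around 1
percent_change <- function(odds_ratio) {
  (odds_ratio - 1) * 100
}

# The two generic examples from the text, plus the married and childs
# odds ratios from the table
example_ratios <- c(0.4, 1.8, 1.51, 0.903)
round(percent_change(example_ratios), 1)
```

Negative results mean the outcome is less likely and positive results mean it is more likely. For large odds ratios like 7.3, it is usually clearer to report the ratio itself ("7.3 times") than a percentage.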

Algorithmic complexity and mystery

Logistic regression is the foundation for tons of what happens on the internet. Companies want to predict if you’ll click on an ad, donate to a political campaign, buy a product, etc., and each of these outcomes is binary (click/don’t click, donate/don’t donate, buy/don’t buy). Simple algorithms use logistic regression to find the factors that influence the decision to click, donate, or buy.

More complex algorithms go beyond logistic regression and use complicated math to make these predictions. While these algorithms generally make arguably better predictions, they sacrifice interpretability for accuracy. (Though not always! This paper finds that basic logistic regression can often perform just as well as complicated black box models.) This, in turn, can lead to all sorts of ethical issues and cause bias to get baked into these models.

We didn’t have time to watch this in class, but this is a fantastic short video explaining this process:

Sharing graphics, code, and analysis

Publishing at RPubs
GitHub and Gists

Entire projects:
