Example questions for Exam 2

Short answer

What is the difference between \(R^2\) and adjusted \(R^2\)?

How do we know what the best fit of a regression line is?

What does it mean to account for other variables in a regression model?

What does it mean when we say that the coefficients for categorical variables are not slopes?

Correlation

Coefficients

Interpret the following correlation coefficients:

0.64

0.11

-0.92

-0.01

0.39

Scatterplots

Guess a correlation coefficient for each of these plots:

Plot 1

Plot 2

Plot 3

Plot 4

Multiple regression

There will be two questions structured similarly to this one, but with different emphases (e.g., only interpreting coefficients or only interpreting \(R^2\)).

You come across a data set with information about every Nintendo Wii Mario Kart game listing on eBay in October 2009. There are 141 rows, each representing one eBay posting. The columns record the following variables:

price: The final price of the game
duration: The number of days the game was listed on eBay
used: A categorical variable indicating if the game was new or used (base case = “new”)
num_bids: The number of bids made during the auction
seller_rating: The number of ratings the seller has on eBay
has_photo: A categorical variable indicating if the seller included a stock photograph of the game (base case = “no”)
num_wheels: The number of Wii wheel controllers included with the game

Variable identification

We are interested in predicting the final price of a Mario Kart game sold on eBay, based on a host of factors. We run the following multiple regression model:

\[ \begin{align} \text{Model 1: } \widehat{\text{price}} &= \beta_0 + \beta_1 \text{duration} + \beta_2 \text{num_bids} + \beta_3 \text{used} + \\ &\beta_4 \text{seller_rating} + \beta_5 \text{has_photo} + \beta_6 \text{num_wheels} \end{align} \]

What is/are the outcome (or dependent) variable(s)?

What is/are the explanatory (or independent) variable(s)?

Interpreting output

Here is the output from this regression model:

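# Packages assumed here: tidyverse supplies %>%, moderndive supplies get_regression_table()
library(tidyverse)
library(moderndive)

# Regress final price on all six explanatory variables
model1 <- lm(price ~ duration + num_bids + used + 
               seller_rating + has_photo + num_wheels,
             data = mario_kart)
model1 %>% get_regression_table()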

## # A tibble: 7 x 7
##   term          estimate std_error statistic p_value lower_ci upper_ci
##   <chr>            <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept       40.8       2.04     20.0     0       36.8     44.9  
## 2 duration         0.035     0.184     0.192   0.848   -0.328    0.399
## 3 num_bids        -0.066     0.071    -0.932   0.353   -0.207    0.074
## 4 usedused        -4.62      1.02     -4.54    0       -6.63    -2.61 
## 5 seller_rating    0         0         3.64    0        0        0    
## 6 has_photoyes     0.968     1.01      0.957   0.34    -1.03     2.97 
## 7 num_wheels       7.72      0.553    13.9     0        6.62     8.81

Interpret the following coefficients (remember the template!):

duration

num_bids

used

num_wheels

Interpreting fit

The following code shows a summary of the model diagnostics:

model1 %>% get_regression_summaries()

## # A tibble: 1 x 8
##   r_squared adj_r_squared   mse  rmse sigma statistic p_value    df
##       <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl>
## 1     0.749         0.737  20.7  4.55  4.67      66.5       0     7
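
As a quick check on how these columns relate to one another, the sketch below reproduces `rmse` and `sigma` by hand from the rounded output. It assumes that `mse` is the mean squared residual, that `rmse` is its square root, and that `sigma` is the residual standard error with n - p - 1 degrees of freedom, where n = 141 listings and p = 6 coefficients besides the intercept. The names `rmse_check` and `sigma_check` are made up for this illustration.

```r
# Values copied from the output above (already rounded)
mse <- 20.7
n <- 141   # rows in the data set (assumes no rows were dropped for missing values)
p <- 6     # coefficients other than the intercept

# Square root of the mean squared residual
rmse_check <- sqrt(mse)

# Rescale the mean squared residual by the residual degrees of freedom
sigma_check <- sqrt(mse * n / (n - p - 1))

round(c(rmse = rmse_check, sigma = sigma_check), 2)
```

Under those assumptions, both values agree with the table to two decimal places.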

How much variation in final price does this model explain?

Comparing models

After running this first model, you run two simpler models: one that predicts price based only on whether the game is used, and one that predicts price based on whether the game is used, how long it was listed, and how many wheel controllers are included:

\[ \begin{align} \text{Model 2: }\widehat{\text{price}} &= \beta_0 + \beta_1 \text{used} \end{align} \]

\[ \begin{align} \text{Model 3: }\widehat{\text{price}} &= \beta_0 + \beta_1 \text{used} + \beta_2 \text{duration} + \beta_3 \text{num_wheels} \end{align} \]

This table provides the \(R^2\) and adjusted \(R^2\) values for the three models.

model	formula	r.squared	adj.r.squared
Model 1	price ~ duration + num_bids + used + seller_rating + has_photo + num_wheels	0.7487	0.7375
Model 2	price ~ used	0.3506	0.3459
Model 3	price ~ used + duration + num_wheels	0.7169	0.7107
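
The adjusted \(R^2\) column can be recomputed from the \(R^2\) column with the usual formula \(1 - (1 - R^2)\frac{n - 1}{n - p - 1}\). The sketch below assumes n = 141 rows, with no listings dropped for missing values, and takes p as the number of coefficients besides the intercept (6, 1, and 3 for the three models). The helper `adj_r_squared()` and the `adj_check` column are hypothetical names used only for illustration.

```r
# Adjusted R^2 penalizes R^2 for the number of explanatory terms (p)
adj_r_squared <- function(r_squared, n, p) {
  1 - (1 - r_squared) * (n - 1) / (n - p - 1)
}

n <- 141  # number of eBay listings

# R^2 values copied from the table above; p counts non-intercept coefficients
models <- data.frame(
  model = c("Model 1", "Model 2", "Model 3"),
  r_squared = c(0.7487, 0.3506, 0.7169),
  p = c(6, 1, 3)
)

models$adj_check <- adj_r_squared(models$r_squared, n, models$p)
models
```

Because the inputs are already rounded to four decimal places, the recomputed values may differ from the table in the last digit.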

Which of these models explains the most variation in price? How much variation does it explain?