# Example questions for Exam 2

## Short answer

- What is the difference between \(R^2\) and adjusted \(R^2\)?

- How do we know what the best fit of a regression line is?

- What does it mean to account for other variables in a regression model?

- What does it mean when we say that the coefficients for categorical variables are not slopes?

## Correlation

### Coefficients

Interpret the following correlation coefficients:

- 0.64

- 0.11

- -0.92

- -0.01

- 0.39

### Scatterplots

Guess a correlation coefficient for each of these plots:

- Plot 1

- Plot 2

- Plot 3

- Plot 4

## Multiple regression

**There will be two questions structured similarly to this one, but with different emphases (e.g., only interpreting coefficients or only interpreting \(R^2\)).**

You come across a data set with information about every Nintendo Wii Mario Kart game listing on eBay in October 2009. There are 141 rows, each representing one eBay posting. The columns record the following variables:

`price`

: The final price of the game

`duration`

: The number of days the game was listed on eBay

`used`

: A categorical variable indicating if the game was new or used (base case = “new”)

`num_bids`

: The number of bids made during the auction

`seller_rating`

: The number of ratings the seller has on eBay

`has_photo`

: A categorical variable indicating if the seller included a stock photograph of the game (base case = “no”)

`num_wheels`

: The number of Wii wheel controllers included with the game

### Variable identification

We are interested in predicting the final price of a Mario Kart game sold on eBay, based on a host of factors. We run the following multiple regression model:

\[ \begin{align} \text{Model 1: } \widehat{\text{price}} &= \beta_0 + \beta_1 \text{duration} + \beta_2 \text{num_bids} + \beta_3 \text{used} + \\ &\beta_4 \text{seller_rating} + \beta_5 \text{has_photo} + \beta_6 \text{num_wheels} \end{align} \]

- What is/are the outcome (or dependent) variable(s)?

- What is/are the explanatory (or independent) variable(s)?

### Interpreting output

Here is the output from this regression model:

```
library(tidyverse)   # provides the %>% pipe
library(moderndive)  # provides get_regression_table()

model1 <- lm(price ~ duration + num_bids + used +
               seller_rating + has_photo + num_wheels,
             data = mario_kart)
model1 %>% get_regression_table()
```

```
## # A tibble: 7 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 40.8 2.04 20.0 0 36.8 44.9
## 2 duration 0.035 0.184 0.192 0.848 -0.328 0.399
## 3 num_bids -0.066 0.071 -0.932 0.353 -0.207 0.074
## 4 usedused -4.62 1.02 -4.54 0 -6.63 -2.61
## 5 seller_rating 0 0 3.64 0 0 0
## 6 has_photoyes 0.968 1.01 0.957 0.34 -1.03 2.97
## 7 num_wheels 7.72 0.553 13.9 0 6.62 8.81
```
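
As a reading aid, the `estimate` column can be substituted into the model to write out the fitted equation. This only restates the table. It assumes that `used` and `has_photo` enter the model as 0/1 indicators (1 for “used” and for “yes”), which is consistent with the `usedused` and `has_photoyes` term names and the stated base cases. The `seller_rating` estimate prints as 0 because of rounding; its nonzero test statistic shows that the underlying value is small but not exactly zero.

\[ \begin{align} \widehat{\text{price}} &\approx 40.8 + 0.035 \, \text{duration} - 0.066 \, \text{num_bids} - 4.62 \, \text{used} + \\ &0 \cdot \text{seller_rating} + 0.968 \, \text{has_photo} + 7.72 \, \text{num_wheels} \end{align} \]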

Interpret the following coefficients (remember the template!):

`duration`

`num_bids`

`used`

`num_wheels`

### Interpreting fit

The following code shows a summary of the model diagnostics:

`model1 %>% get_regression_summaries()`

```
## # A tibble: 1 x 8
## r_squared adj_r_squared mse rmse sigma statistic p_value df
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.749 0.737 20.7 4.55 4.67 66.5 0 7
```

- How much variation in final price does this model explain?

### Comparing models

After running this first model, you run two simpler models: one that predicts price based only on whether the game is used, and one that predicts price based on whether the game is used, how long it’s posted, and how many wheel controllers are included:

\[ \begin{align} \text{Model 2: }\widehat{\text{price}} &= \beta_0 + \beta_1 \text{used} \end{align} \]

\[ \begin{align} \text{Model 3: }\widehat{\text{price}} &= \beta_0 + \beta_1 \text{used} + \beta_2 \text{duration} + \beta_3 \text{num_wheels} \end{align} \]

This table provides the \(R^2\) and adjusted \(R^2\) values for the three models.

| model | formula | r.squared | adj.r.squared |
|---|---|---|---|
| Model 1 | price ~ duration + num_bids + used + seller_rating + has_photo + num_wheels | 0.7487 | 0.7375 |
| Model 2 | price ~ used | 0.3506 | 0.3459 |
| Model 3 | price ~ used + duration + num_wheels | 0.7169 | 0.7107 |
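
The adjusted values in the table can be checked by hand. The sketch below assumes the usual definition of adjusted \(R^2\), with \(n = 141\) listings used in every fit and \(p\) counting the explanatory terms in the model (\(p = 1\) for Model 2, \(p = 3\) for Model 3); the symbols \(n\) and \(p\) are introduced here for illustration.

\[ \begin{align} R^2_{\text{adj}} &= 1 - (1 - R^2) \frac{n - 1}{n - p - 1} \\ \text{Model 2: } R^2_{\text{adj}} &= 1 - (1 - 0.3506) \frac{140}{139} \approx 0.3459 \\ \text{Model 3: } R^2_{\text{adj}} &= 1 - (1 - 0.7169) \frac{140}{137} \approx 0.7107 \end{align} \]

The same calculation for Model 1 (\(p = 6\)) gives about 0.7374 when started from the rounded \(R^2\) of 0.7487, so it agrees with the table up to rounding in the last digit.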

- Which of these models explains the most variation in price? How much variation does it explain?
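
For practice, a comparison table like the one above could be built along the following lines. This is a minimal sketch in base R, not the code used for this page: the `toy` data frame and every value in it are invented for illustration (it is not the eBay data), and it uses `summary()` on each `lm()` fit instead of the `moderndive` helpers shown earlier.

```r
# Toy data invented purely for illustration (NOT the eBay Mario Kart data).
set.seed(1)
toy <- data.frame(
  used       = factor(rep(c("new", "used"), each = 10)),
  duration   = rep(c(1, 3, 5, 7, 10), times = 4),
  num_wheels = rep(c(0, 1, 2, 1), times = 5)
)
# Hypothetical price: lower for used games, higher with more wheels, plus noise.
toy$price <- 40 - 5 * (toy$used == "used") + 8 * toy$num_wheels +
  rnorm(nrow(toy), sd = 3)

# A simpler model and a larger model that contains it.
model2 <- lm(price ~ used, data = toy)
model3 <- lm(price ~ used + duration + num_wheels, data = toy)

# Collect R^2 and adjusted R^2 from each fit into one table.
comparison <- data.frame(
  model         = c("Model 2", "Model 3"),
  r.squared     = c(summary(model2)$r.squared,
                    summary(model3)$r.squared),
  adj.r.squared = c(summary(model2)$adj.r.squared,
                    summary(model3)$adj.r.squared)
)
print(comparison)
```

Because the smaller model is nested inside the larger one, \(R^2\) cannot decrease when the extra variables are added. Adjusted \(R^2\) can decrease, which is why the table reports both columns.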