class: center, middle, inverse, title-slide

# Quiz 1 Review
## SDS 192: Introduction to Data Science
### Shiya Cao, Statistical & Data Sciences, Smith College
### Fall 2024
---

* Quiz 1

---

* MP 1

---

* Q1 We can use the following code to create a scatterplot. If we want to change the color of the points to use the Set2 palette from colorbrewer2.org, which of the following functions would you use?

.pull-left[
```r
library(tidyverse)
library(RColorBrewer)

# Load data
library(palmerpenguins)

ggplot(data = penguins,
       aes(x = bill_length_mm,
           y = flipper_length_mm,
           size = body_mass_g,
           color = species)) +
  geom_point() +
  facet_wrap(~island) +
  theme(legend.position = "right")
```
]

.pull-right[
* `scale_color_manual()`
* `scale_fill_manual()`
* `scale_fill_brewer()`
* **`scale_color_brewer()`**
]

---

* Q2 In class, we talked about two ways to deal with scatterplots that suffer from overplotting. Which of the following answers does NOT address overplotting?

.pull-left[
* Set the size of the points to less than 1
* Set the alpha to less than 1
* Add jitter to the plot via `geom_jitter()`
* **Add another color to distinguish different points**
]

---

* Q3 Consider the following data graphic created by this chunk of code:

.pull-left[
```r
library(tidyverse)

ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point(aes(color = factor(am))) +
  geom_smooth(se = FALSE)
```

* Currently, the points are colored either peach or teal based on whether they represent a car that has an automatic or manual transmission. There is one blue line that illustrates the trend for all cars.
]

.pull-right[
* How can the code be changed to show two lines, one peach-colored line illustrating the trend for automatic transmission cars, and a separate teal line illustrating the trend for manual transmission cars, while keeping the points colored as they are? Two approaches could make this work. Select **two** answers.
* Hint: I highly encourage you to test your answer by copying and pasting the code into RStudio and trying out each answer option.
]

---

.pull-left[
* **Copy the mapping of `am` to the color aesthetic to the `geom_smooth()` function.**
* Remove the call to `aes()` inside `geom_point()`.
* Move the mapping of `am` to the color aesthetic to the `geom_smooth()` function.
* Remove the call to `aes()` inside `ggplot()`.
* **Move the mapping of `am` to the color aesthetic to the `ggplot()` function.**
]

---

* Q4 The following chunk creates a bar graph of median monthly rent among states, as captured by the 2017 American Community Survey.
* Only those states with median monthly rents above $1200 are shown.

.pull-left[
```r
library(tidyverse)

us_income <- us_rent_income |>
  filter(variable == "rent", estimate > 1200)

ggplot(us_income, aes(x = NAME, y = estimate)) +
  geom_col() +
  scale_x_discrete(NULL) +
  scale_y_continuous("Average Monthly Rent")
```
]

.pull-right[
* How could you make the values on the vertical axis appear as dollar amounts (e.g., `$1,000` instead of `1000`)?
]

---

.pull-left[
* Set the `labels` argument of `scale_y_continuous()` to `scales::comma`.
* Set the `breaks` argument of `scale_y_continuous()` to `scales::comma`.
* **Set the `labels` argument of `scale_y_continuous()` to `scales::dollar`.**
* Set the `breaks` argument of `scale_y_continuous()` to `scales::percent`.
* Set the `breaks` argument of `scale_y_continuous()` to `scales::dollar`.
* Set the `labels` argument of `scale_y_continuous()` to `scales::percent`.
]

---

* Q5 The following chunk creates a bar graph of median yearly income among states, as captured by the 2017 American Community Survey.

.pull-left[
```r
library(tidyverse)

us_income <- us_rent_income |>
  filter(variable == "income")

ggplot(us_income, aes(x = NAME, y = estimate)) +
  geom_col()
```
]

.pull-right[
* Each bar represents one state (or territory).
* The state names appear on the $x$-axis, but they are impossible to read because they are overlapping.
* Which of the following alterations would likely improve the readability of the plot?
]

---

.pull-left[
* Remove `NAME` from the $x$-axis, because it just takes up space and it's pretty obvious what the $x$ variable is.
* Change the barplot to a line graph.
* **Use `coord_flip()` to switch the $x$ and $y$ axes, because that would create more horizontal space for the state names.**
* Set the `fill` aesthetic to `NAME`.
]

---

* Q6 Consider the following data graphic from the *Economist* article, "[Temporary economic downturns have long-lasting consequences](https://www.economist.com/graphic-detail/2018/10/19/temporary-economic-downturns-have-long-lasting-consequences)":

.pull-left[
* Notice that the labels on the $x$-axis are given in odd numbers of years since graduation (i.e., 1, 3, 5, 7, ...). Suppose that you wanted to change these to even numbers (i.e., 2, 4, 6, 8, ...). What function would you use to change the labels on the $x$-axis?
]

.pull-right[
* `geom_text()`, because the labels are text.
* **`scale_x_continuous()`, because "Years since graduation" is a continuous variable.**
* `labs()`, because the numbers are labels.
* `scale_color_discrete()`, because "Years since graduation" is a discrete variable.
]

---

* Q7 In the following plot, what does the height of the bar represent?

.pull-left[
```r
library(tidyverse)
library(RColorBrewer)

# Load data
library(nycflights13)

ggplot(data = flights, aes(x = carrier, fill = origin)) +
  geom_bar(position = position_dodge(preserve = "single")) +
  scale_fill_manual(values = c("#e69cb2", "#e09db6", "#cc6882")) +
  theme_minimal()
```
]

.pull-right[
* **The number of flights for each carrier in each origin airport**
* The total number of flights in each origin airport
* The total number of origin airports
* The total number of flights for each carrier
* The number of carriers in each origin airport
]

---

* Q8 The asking prices for a sample of 15 books currently being sold are listed below.
For convenience, the data have been sorted: 12, 14, 15, 20, 20, 30, 30, 30, 30, 40, 40, 40, 40, 41, 71
* The boxplot drawn for this dataset is not entirely correct. What is the issue with the boxplot?

.pull-left[
]

.pull-right[
* 71 is not an outlier.
* **The upper whisker is incorrect.**
* The middle quartile should not be 30.
* 12 is an outlier.
]

---

* Q9 Assume we collected data on how many heart disease patients currently live in each of the six New England states. Now we want to create a data graphic to represent the data collected (the number of heart disease patients in each of the six states in New England). Which of the following functions would best generate the graphic?

.pull-left[
* `geom_line()`
* **`geom_col()`**
* `geom_histogram()`
* `geom_bar()`
* `geom_point()`
]
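
---

* Q8 follow-up: The whisker question in Q8 can be checked numerically. The chunk below is a minimal sketch, not part of the original quiz. It assumes the boxplot follows the usual 1.5 * IQR rule, which is the default in base R's `boxplot.stats()`. The variable names (`prices`, `upper_fence`, `upper_whisker`, and so on) are illustrative only.

```r
# Asking prices from Q8, already sorted
prices <- c(12, 14, 15, 20, 20, 30, 30, 30, 30, 40, 40, 40, 40, 41, 71)

# Quartiles and interquartile range
q1 <- quantile(prices, 0.25, names = FALSE)
q3 <- quantile(prices, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences under the 1.5 * IQR rule (assumed convention)
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
lower_whisker <- min(prices[prices >= lower_fence])
upper_whisker <- max(prices[prices <= upper_fence])

# Observations outside the fences are drawn as individual points
outliers <- prices[prices < lower_fence | prices > upper_fence]

# Cross-check against base R's own boxplot computation
stats <- boxplot.stats(prices)
```

* Under that assumption, the upper fence is 70, so 71 is drawn as an outlier and the upper whisker should end at 41. The lower fence is below 12, so 12 is not an outlier.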