class: center, middle, inverse, title-slide # Quiz2 Review ## SDS 192: Introduction to Data Science ###
Shiya Cao
Statistical & Data Sciences
, Smith College
###
Fall 2024
--- * Quiz 2  --- * Q1 Which of the following **two** lines of code would **not** return for me a data frame with all and only columns in the `texas_school_meals` dataset involving milk distribution and milk reimbursement? .pull-left[ ```r library(tidyverse) library(lubridate) #Load data texas_school_meals <- read.csv("https://data.texas.gov/resource/c9s4-hrvh.csv?$limit=70000") |> mutate(claimdate = ymd_hms(claimdate)) |> mutate(claimdate = factor(claimdate)) ``` ] .pull-right[ ```r * **texas_school_meals |> select(!starts_with("breakfast") & !starts_with("lunch"))** * texas_school_meals |> select(38:43, 47) * texas_school_meals |> select(milkdays:milkservedpaid, milkreimbursement) * **texas_school_meals |> filter(str_detect(milkreimbursement, "milk"))** * texas_school_meals |> select(starts_with("milk")) ``` ] --- * Q2 Which of the following lines of code would return a data frame of the distinct names of school nutrition program sites that served more than 4000 total lunches in either April, May, or June 2020? .pull-left[ ```r library(tidyverse) library(lubridate) #Load data texas_school_meals <- read.csv("https://data.texas.gov/resource/c9s4-hrvh.csv?$limit=70000") |> mutate(claimdate = ymd_hms(claimdate)) |> mutate(claimdate = factor(claimdate)) ``` ] .pull-right[ ```r * **texas_school_meals |> filter(lunchtotal > 4000 & claimdate %in% c("2020-04-01", "2020-05-01", "2020-06-01")) |> select(sitename) |> distinct()** * texas_school_meals |> filter(lunchtotal > 4000 | claimdate %in% c("2020-04-01", "2020-05-01", "2020-06-01")) |> select(sitename) |> distinct() * texas_school_meals |> select(lunchtotal > 4000 & claimdate %in% c("2020-04-01", "2020-05-01", "2020-06-01")) |> filter(sitename) |> distinct() * texas_school_meals |> filter(lunchtotal > 4000 & claimdate == c("2020-04-01", "2020-05-01", "2020-06-01")) |> select(sitename) |> distinct() * texas_school_meals |> select(lunchtotal > 4000 | claimdate %in% c("2020-04-01", "2020-05-01", "2020-06-01")) |> filter(sitename) |> distinct() ``` ] --- * Q3 In which months in the 2019-2020 school year was the total free lunches served across all Texas school nutrition program sites less than 1,000,000? Copy the code below into RStudio and write some data wrangling code to determine this. * Hint: You need to determine the total free lunches served each month. .pull-left[ ```r library(tidyverse) library(lubridate) #Load data texas_school_meals <- read.csv("https://data.texas.gov/resource/c9s4-hrvh.csv?$limit=70000") |> mutate(claimdate = ymd_hms(claimdate)) |> mutate(claimdate = factor(claimdate)) ``` ] .pull-right[ * March 2020, April 2020, and May 2020 * July 2019, August 2019, and September 2019 * **July 2019, April 2020, May 2020, and June 2020** * March 2020, April 2020, May 2020, and June 2020 ] --- * Q4 Let's say I wanted to create the following plot, visualizing the total PM 2.5 and PM 10 emissions in NYS Title V facilities each year, faceted by emission type. How would I have to pivot the data to produce this plot? .pull-left[ <img src="img./TIDYING_DATA_q4.png" width="300" /> ] .pull-right[ ```r library(tidyverse) #Load data nys_facility_emissions <- read.csv("https://data.ny.gov/resource/4ry5-tfin.csv?$limit=4000") ``` ] --- .pull-left[ ```r * pivot_longer(cols = year, names_to = "emission_type", values_to = "emission_amount") * pivot_wider(names_from = year, values_from = contains("p")) * pivot_wider(names_from = year, values_from = starts_with("pm")) * pivot_longer(cols = contains("p"), names_to = "emission_type", values_to = "emission_amount") * **pivot_longer(cols = starts_with("pm"), names_to = "emission_type", values_to = "emission_amount")** ``` ] --- * Q5 Let's say I wanted to create the following data frame documenting the voc emissions for each NYS Title V facility each year. How would I have to pivot the data to produce this table? .pull-left[ <img src="img./TIDYING_DATA_q5_1.png" width="300" /> ] .pull-right[ ```r library(tidyverse) #Load data nys_facility_emissions <- read.csv("https://data.ny.gov/resource/4ry5-tfin.csv?$limit=4000") %>% select(year, dec_id, facility_name, voc) ``` ] --- .pull-left[ ```r * pivot_longer(cols = year:voc, names_to = "year", values_to = "total_voc") * pivot_wider(names_from = voc, values_from = year) * pivot_longer(cols = contains("v"), names_to = "year", values_to = "total_voc") * pivot_longer(cols = voc, names_to = "year", values_to = "total_voc") * **pivot_wider(names_from = year, values_from = voc)** ``` ] --- * Q6 Which *two* statements make the following code an example of tidy data? .pull-left[ ```r library(tidyverse) #Load data nys_facility_permits <- read.csv("https://data.ny.gov/resource/4n3a-en4b.csv") |> head(4) ``` ] .pull-right[ * Every column is a single value. * **Every row is an observation.** * Every row is a variable describing the observation. * **Every column is a variable describing the observation.** * Every column is an observation. ] --- * Q7 Let's say I wanted to measure the correlation between Covid deaths in a given Connecticut town by June 22, 2022 to the number of fully vaccinated individuals in that town by June 22, 2022. First, I aggregate the vaccination data to calculate totals of fully vaccinated individuals in each town. Then, I find [this dataset](https://data.ct.gov/Health-and-Human-Services/COVID-19-Tests-Cases-and-Deaths-By-Town-/28fr-iqnx) on Connecticut's open data portal. Which type of join would I use to only join the towns present in both datasets? --- .pull-left[ ```r library(tidyverse) #Load data ct_covid_total_vax_june_22_22 <- read.csv("https://data.ct.gov/resource/gngw-ukpw.csv?$limit=40000") |> filter(dateupdated == "2022-06-22T00:00:00.000") |> group_by(town) |> summarize(total_population = sum(population, na.rm = TRUE), total_fully_vaccinated = sum(fully_vaccinated, na.rm = TRUE)) ct_covid_cases_june_22_22 <- read.csv("https://data.ct.gov/resource/28fr-iqnx.csv?$limit=100000") |> filter(lastupdatedate == "2022-06-22T00:00:00.000") ``` ] .pull-right[ ```r * ct_covid_total_vax_june_22_22 |> left_join(ct_covid_cases_june_22_22, by = c("town" = "town")) * ct_covid_total_vax_june_22_22 |> full_join(ct_covid_cases_june_22_22, by = c("town" = "town")) * ct_covid_total_vax_june_22_22 |> anti_join(ct_covid_cases_june_22_22, by = c("town" = "town")) * ct_covid_total_vax_june_22_22 |> right_join(ct_covid_cases_june_22_22, by = c("town" = "town")) * **ct_covid_total_vax_june_22_22 |> inner_join(ct_covid_cases_june_22_22, by = c("town" = "town"))** ``` ] --- * Q8 Let's say I wanted to check whether there were any towns in `ct_covid_total_vax_june_22_22` that weren't represented in `ct_covid_cases_june_22_22`. What kind of join would I use to identify those towns? .pull-left[ ```r * **ct_covid_total_vax_june_22_22 |> anti_join(ct_covid_cases_june_22_22, by = c("town" = "town"))** * ct_covid_total_vax_june_22_22 |> inner_join(ct_covid_cases_june_22_22, by = c("town" = "town")) * ct_covid_total_vax_june_22_22 |> left_join(ct_covid_cases_june_22_22, by = c("town" = "town")) * ct_covid_total_vax_june_22_22 |> full_join(ct_covid_cases_june_22_22, by = c("town" = "town")) * ct_covid_total_vax_june_22_22 |> right_join(ct_covid_cases_june_22_22, by = c("town" = "town")) ``` ] --- * Q9 For which of the following **two** join types does the order in which you call the tables **don't** matter? .pull-left[ * right_join() * **full_join()** * left_join() * **inner_join()** * anti_join() ] --- .pull-left[ * Q10 Consider the following function definition, and subsequent call of that function. ```r library(tidyverse) ``` ```r summarize_species <- function(data, pattern = "Human") { data |> filter(species == pattern) |> summarize( num_people = n(), avg_height = mean(height, na.rm = TRUE) ) } ``` ] .pull-right[ ```r summarize_species("Wookiee", starwars) ``` * How could we modify the function *call* to make this work? Select all that apply. ] --- .pull-left[ * **Switch the order of the arguments** * **Specify `data = starwars`** * **Specify `pattern = "Wookiee"`** * Add a pipe: `"Wookiee" |> summarize_species(starwars)` * None of these suggestions would work ] --- .pull-left[ * Q11 Consider the following function definition, and subsequent call of that function. ```r library(tidyverse) library(babynames) ``` ```r linegraph <- function(data, x_var, y_var, color_var, facet_var) { data |> ggplot(aes(x = x_var, y = y_var, color = color_var)) + geom_line() + facet_wrap(vars(facet_var)) } ``` ```r babynames_riley_casey <- babynames |> filter(name == "Riley" | name == "Casey") ``` ] --- .pull-left[ ```r linegraph(babynames_riley_casey, year, prop, sex, name) + labs(x = "Year", y = "Proportion", color = "Sex", title = "Comparison of Casey and Riley") ``` It throws errors "Plot is missing `facet_var`. Layer 1 is missing `facet_var`". Which of the following causes the problem? ] .pull-right[ * Should use `fill` instead of `color`. * **Setting `x_var`, `y_var`, `color_var`, and `facet_var` in the call is problematic, because R thinks these are variables in the data frame instead of passing real variable names in the function.** * Naming `x_var`, `y_var`, `color_var`, and `facet_var` is problematic, because R requires a distinguishable name such as `babynames_year`. * You need to add a `return()` statement at the end of the function. ] --- .pull-left[ * Q12 The `df` data frame contains four columns. ```r library(tidyverse) ``` ```r df <- data.frame( name = c("obs1", "obs2", "obs3", "obs1"), a = c(2, 3, 4, 5), b = c(4, 7, 2, 1), c = c(4, 9, 3, 2) ) ``` * The following function summarize the selected columns in `df` and returns a `tibble`. ] .pull-right[ ```r sum_x_in_df <- function(x) { sum(x, na.rm = TRUE) } ``` * Suppose that you want to retrieve the following three columns stored in the `my_cols` vector. ```r my_cols <- df |> select(a:c) ``` * Which of the following calls will **return a `tibble`**? ] --- .pull-left[ * `map(my_cols, sum_x_in_df)` * `map_2(my_cols, sum_x_in_df)` * **`map_dfc(my_cols, sum_x_in_df)`** ]