class: center, middle, inverse, title-slide

# Quiz3 Review
## SDS 192: Introduction to Data Science
### Shiya Cao, Statistical & Data Sciences, Smith College
### Fall 2024

---

* Q1 Let's say I intend to map PM2.5 emissions from NYS Title V facilities per county in 2019. I download the county shapefiles and store them in my working directory. The file's current projected CRS is NAD83 / UTM zone 18N (EPSG:26918). Which *two* of the following steps would I have to take to import the data and ensure it is ready for plotting in leaflet?

.pull-left[

* **Convert the CRS to 4326 with `st_transform()`**
* Import `ny.shp` with `read_csv()`
* Convert the CRS to 4269 with `st_transform()`
* **Import `ny.shp` with `read_sf()`**

]

---

* Q2 The coordinates in the `nys_facility_emissions` data frame are encoded in WGS 84 (EPSG:4326). Which of the following lines of code would add a correct geometry column to the data frame?

.pull-left[

```r
library(tidyverse)
library(sf)
nys_facility_emissions <- read_csv("https://data.ny.gov/resource/4ry5-tfin.csv?$limit=4000") |>
  extract(location, c('lon', 'lat'), '\\((.*) (.*)\\)', convert = TRUE) |>
  filter(!is.na(lon))
```

]

.pull-right[

```r
* nys_facility_emissions |> st_as_sf(coords = c("lat", "lon"), crs = 4326)
* nys_facility_emissions |> st_transform(crs = 4326)
* nys_facility_emissions |> st_as_sf(coords = c("longitude", "latitude"), crs = 4326)
* **nys_facility_emissions |> st_as_sf(coords = c("lon", "lat"), crs = 4326)**
* nys_facility_emissions |> st_as_sf(coords = c("latitude", "longitude"), crs = 4326)
```

]

---

* Q3 The following chunk isolates the path of tropical storms that hit during 1992.

.pull-left[

```r
library(tidyverse)
library(sf)
storms92 <- storms |>
  mutate(timestamp = lubridate::parse_date_time(paste(year, month, day, hour), "%Y %m %d %H")) |>
  filter(year == 1992) |>
  st_as_sf(coords = c("lat", "long")) |>
  st_set_crs(4326)
```

]

.pull-right[

* What is wrong with the code?

]

---

.pull-right[

* The data in `storms` is not accurate.
* **In the `coords` argument, `lat` and `long` are in the wrong order.**
* The OpenStreetMap data is out-of-date.
* The [CRS](https://en.wikipedia.org/wiki/Spatial_reference_system) is wrong -- it shouldn't be 4326.

]

---

.pull-left[

* Q4 Using the same dataset `storms92` as the previous question, the following chunk attempts to map the name of each of these tropical storms to color, but throws an error.

```r
library(leaflet)
pal <- colorNumeric(palette = "Set2", domain = storms92$name)
leaflet(data = storms92) |>
  addTiles() |>
  addCircles(color = ~pal(name)) |>
  addLegend(pal = pal, values = ~name, opacity = 0.8)
```

* How can you fix the problem?

]

.pull-right[

* It should be `color = pal(name)`.
* It should be `color = ~name`.
* It should be `fill = ~pal(name)`.
* It should be `aes(color = name)`.
* **It should be `colorFactor` instead of `colorNumeric`.**

]

---

.pull-left[

* Q5 Now we filter `storms92` and save the result as `andrew` to isolate the path of [Hurricane Andrew](https://en.wikipedia.org/wiki/Hurricane_Andrew), which decimated the Bahamas, Florida, and the Gulf Coast in 1992.

```r
andrew <- storms92 |>
  filter(name == "Andrew")
```

]

.pull-right[

* The following chunk attempts to map the path of Hurricane Andrew over time to color, but throws an error.

```r
library(leaflet)
pal <- colorNumeric(palette = "Greens", domain = andrew$timestamp)
leaflet() |>
  addTiles() |>
  addCircles(data = andrew, color = pal(timestamp))
```

* How can you fix the problem?

]

---

.pull-left[

* **It should be `color = ~pal(timestamp)`.**
* It should be `color = ~timestamp`.
* It should be `fill = ~pal(timestamp)`.
* It should be `aes(color = timestamp)`.
* It should be `colorFactor` instead of `colorNumeric`.

]

---

.pull-left[

* Q6 Using `leaflet()`, which of the following statements are true? Select all that apply.
* **Maps can color a categorical variable.**
* **Maps can color a numeric variable.**
* **Maps can categorize numeric data into bins using different colors.**
* **Maps can apply color to polygons.**
* **Maps can apply color to circles.**

]

---

* Q7 Setting the size of the points to a number between 0 and 1 and adding jitter to the plot via `geom_jitter()` are two ways we could address overplotting.

* Is the above statement true or false?

.pull-left[

* True
* False

]

---

* Q8 Consider the following data graphic created by this chunk of code:

.pull-left[

```r
library(tidyverse)
ggplot(mtcars, aes(x = disp, y = mpg), color = am) +
  geom_point()
```

* Why aren't the points colored?

]

.pull-right[

* **Because the mapping of `color` to the `am` variable occurs outside of the `aes()` function.**
* Because there is no variable in `mtcars` called `am`.
* Because you need `fill = am` instead of `color = am`.
* Because you need to put the `color = am` specification inside `geom_point()`.
* Because you need quotation marks around `am`.

]

---

* Q9 When setting aesthetics to variables, which of the following statements is FALSE?

.pull-left[

* **The `fill` aesthetic is advised to be set to a *continuous* variable**
* The `size` aesthetic can be set to a *continuous* variable
* The `color` aesthetic can be set to a *categorical* variable

]

---

* Q10 In which student groups in the 2020-2021 school year was the total student count for that group less than 100,000? Copy the code below into RStudio and write some data wrangling code to determine this. Select all that apply.

* Hint: You need to remove any missing values and return the summary value for only the non-missing values.

.pull-left[

```r
library(tidyverse)
# Load data
ct_school_attendance <- read.csv("https://data.ct.gov/resource/t4hx-jd4c.csv?$limit=3000") |>
  filter(reportingdistrictname != "Connecticut")
```

]

.pull-right[

* All other races
* Black or African American
* English Learners
* Reduced Price Meal Eligible
* Students Experiencing Homelessness
* Students With Disabilities

]

---

* Q11 Which student group in the 2020-2021 school year had the lowest average student count across Connecticut school districts? Copy the code below into RStudio and write some data wrangling code to determine this.

* Hint: You need to remove any missing values and return the summary value for only the non-missing values.

.pull-left[

```r
library(tidyverse)
# Load data
ct_school_attendance <- read.csv("https://data.ct.gov/resource/t4hx-jd4c.csv?$limit=3000") |>
  filter(reportingdistrictname != "Connecticut")
```

]

.pull-right[

* All other races
* Black or African American
* English Learners
* Reduced Price Meal Eligible
* **Students Experiencing Homelessness**
* Students With Disabilities

]

---

.pull-left[

* Q12 Consider the following code:

```r
library(tidyverse)
mtcars |>
  group_by(cyl) |>
  summarize(avg_mpg = mean(mpg)) |>
  filter(am == 1)
```

* What is wrong with the code?

]

.pull-right[

* You need to specify that `na.rm = TRUE`.
* It should say `am = 1`.
* **There is no variable called `am` in the result of `summarize()`, because `am` does not appear in the calls to either `group_by()` or `summarize()`.**
* There is no variable called `am` in `mtcars`.

]

---

* Q13 Let's say I wanted to create the following plot, visualizing the distribution of vaccination percentages by November 30, 2022 for each age group across all towns in Connecticut, faceted by the vaccination stage. In order to produce this plot, I need to use a `pivot_*` function to pivot the data. After using the `pivot_*` function, how many rows are in the resulting table? Copy the code below into RStudio and write some data wrangling code to determine this.

.pull-left[

<img src="img./TIDYING_DATA_q7_1.png" width="300" />

]

---

.pull-left[

```r
library(tidyverse)
# Load data
ct_covid_vax_nov_30_22 <- read.csv("https://data.ct.gov/resource/gngw-ukpw.csv?$limit=40000") |>
  filter(dateupdated == "2022-11-30T00:00:00.000") |>
  mutate(age_group = factor(age_group, levels = c("5-11", "12-17", "18-24", "25-44", "45-64", "65+")))
```

]

.pull-right[

* 1190 rows
* **3570 rows**
* 4760 rows
* 8000 rows

]

---

* Q14 Let's say I wanted to create the following data frame documenting the number of individuals fully vaccinated by November 30, 2022 in each age group in Connecticut towns. In order to produce this table, I need to use a `pivot_*` function to pivot the data. After using the `pivot_*` function, how many columns are in the resulting table? Copy the code below into RStudio and write some data wrangling code to determine this.

.pull-left[

<img src="img./TIDYING_DATA_q8_1.png" width="300" />

]

---

.pull-left[

```r
library(tidyverse)
# Load data
ct_covid_vax_nov_30_22 <- read.csv("https://data.ct.gov/resource/gngw-ukpw.csv?$limit=40000") |>
  filter(dateupdated == "2022-11-30T00:00:00.000") |>
  select(town, age_group, population, fully_vaccinated)
```

]

.pull-right[

* 8 columns
* **9 columns**
* 10 columns
* 11 columns

]

---

* Q15 Which **three** of the following statements are true of tidy data?

.pull-left[

* **Every cell is a single value.**
* **Every row is an observation.**
* **Every column is a variable.**
* Every column is an observation.
* Every row is a variable.

]
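
---

* A minimal sketch related to Q13 and Q14, assuming a small made-up table (`toy` and all of its column names are hypothetical, not the Connecticut data). It illustrates that `pivot_longer()` multiplies the row count by the number of pivoted columns, while `pivot_wider()` adds one column per distinct value of the `names_from` variable.

```r
library(tidyr)

# Hypothetical toy data: 2 towns x 2 age groups, with two "stage" columns
toy <- data.frame(
  town = c("A", "A", "B", "B"),
  age_group = c("5-11", "65+", "5-11", "65+"),
  initiated_pct = c(40, 95, 35, 90),
  fully_pct = c(30, 92, 28, 88)
)

# pivot_longer(): each pivoted column contributes one row per original row
toy_long <- toy |>
  pivot_longer(
    cols = c(initiated_pct, fully_pct),
    names_to = "stage",
    values_to = "pct"
  )
nrow(toy_long)  # 4 original rows x 2 pivoted columns = 8

# pivot_wider(): each distinct age_group value becomes its own column
toy_wide <- toy[, c("town", "age_group", "fully_pct")] |>
  pivot_wider(names_from = age_group, values_from = fully_pct)
ncol(toy_wide)  # town + 2 age-group columns = 3
```

* The same counting logic can be used to sanity-check the row and column counts you get from the real data before picking an answer.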