class: center, middle, inverse, title-slide # Lec 16: Tidy Data ## SDS 192: Introduction to Data Science ###
Shiya Cao
Statistical & Data Sciences
, Smith College
###
Fall 2024
--- # Today's Learning Goals * Be able to pivot longer, pivot wider, and recode values using R. --- # How would you produce this plot? <!-- --> --- # Observations vs. Variables vs. Values .pull-left[ * Observations refer to individual units or cases of the data being collected. * Variables describe something about an observation. * Values refer to the actual value associated with a variable for a given observation. ] .pull-right[  Grolemund, Garrett, and Hadley Wickham. n.d. R for Data Science. Accessed March 31, 2019. https://r4ds.had.co.nz/. ] --- # What Is Tidy Data? * Each variable must have its own column. * Each observation must have its own row. * Each value must have its own cell. --- # Is This Tidy? > What variables are displayed on this plot? ```r Date<-c("03/19/2022","03/18/2022") Nrthmpton_AQI<-c("70","69") NYC_AQI<-c("72","60") Bstn_AQI<-c("43","59") df<-data.frame(Date, Nrthmpton_AQI, NYC_AQI, Bstn_AQI) df ``` ``` ## Date Nrthmptn_AQI NYC_AQI Bstn_AQI ## 1 03/19/2022 70 72 43 ## 2 03/18/2022 69 60 59 ``` --- ```r library(tidyverse) ggplot(data = df, aes(x = Date)) + geom_col(aes(y = Nrthmpton_AQI), fill = "green") + geom_col(aes(y = NYC_AQI), fill = "blue") + geom_col(aes(y = Bstn_AQI), fill = "red") ```
--- # What Will It Look Like When Tidy? ```r df2 <- df |> pivot_longer(cols = -Date, names_to = "City", values_to = "AQI") |> mutate(City = str_replace(City, "_AQI", "")) df2 ``` ``` ## # A tibble: 6 × 3 ## Date City AQI ## <chr> <chr> <dbl> ## 1 03/19/2022 Nrthmptn 70 ## 2 03/19/2022 NYC 72 ## 3 03/19/2022 Bstn 43 ## 4 03/18/2022 Nrthmptn 69 ## 5 03/18/2022 NYC 60 ## 6 03/18/2022 Bstn 59 ``` --- ```r ggplot(data = df2, aes(x = Date, y = AQI, fill = City)) + geom_col(position = "dodge") ``` <!-- --> --- # Pivoting Longer * We use `pivot_longer()` to pivot a dataset from wider to longer format: * `pivot_longer()` takes the following arguments: 1. `cols = `: Identify a series of columns to pivot. * The names of those columns will become repeated rows in the pivoted data frame, and the values in those columns will be stored in a new column. 2. `names_to = `: Identify a name for the column where the column names will be stored. 3. `values_to = `: Identify a name for the column where the values associated with those names will be stored. 4. Various arguments to support transformations to names. --- # Recoding Values * `case_when()` allows us to assign new values to a cell when a certain condition is met. * Vectorized 'if/else' statement. * Argument format: CONDITION ~ REPLACEMENT VALUE WHEN CONDITION MET .pull-left[ ```r df |> pivot_longer(cols = -Date, names_to = "City", values_to = "AQI") |> mutate(City = str_replace(City, "_AQI", "")) |> mutate(City = case_when( City == "Nrthmpton" ~ "Northampton", City == "NYC" ~ "New York City", City == "Bstn" ~ "Boston" )) |> select(Date, City) ``` ] .pull-right[ ``` ## # A tibble: 6 × 2 ## Date City ##<chr><chr> ## 1 03/19/2022 Northampton ## 2 03/19/2022 New York City ## 3 03/19/2022 Boston ## 4 03/18/2022 Northampton ## 5 03/18/2022 New York City ## 6 03/18/2022 Boston ``` ] --- # Pivoting Wider * We use `pivot_wider()` to pivot a dataset from longer to wider format: * `pivot_wider()` takes the following arguments: 1. `names_from = `: Identify the column to get the new column names from. 2. `values_from = `: Identify the column to get the cell values from. 3. Various arguments to support transformations to names. --- # Example .pull-left[ ``` ## # A tibble: 6 × 3 ## Date City AQI ## <chr> <chr> <dbl> ## 1 03/19/2022 Nrthmptn 70 ## 2 03/19/2022 NYC 72 ## 3 03/19/2022 Bstn 43 ## 4 03/18/2022 Nrthmptn 69 ## 5 03/18/2022 NYC 60 ## 6 03/18/2022 Bstn 59 ``` ] .pull-right[ ```r df2 |> pivot_wider(names_from = City, values_from = AQI) ``` ] ``` ## # A tibble: 2 × 4 ## Date Nrthmptn NYC Bstn ## 1 03/19/2022 70 72 43 ## 2 03/18/2022 69 60 59 ``` --- # Renaming Columns > Use `rename(NEW_COLUMN_NAME = EXISTING_COLUMN_NAME)` ```r df2 |> pivot_wider(names_from = Date, values_from = AQI, names_repair = make.names) |> rename(Mar19 = X03.19.2022, Mar18 = X03.18.2022) ``` ``` ## # A tibble: 3 × 3 ## City Mar19 Mar18 ##<chr><dbl><dbl> ## 1 Nrthmptn 70 69 ## 2 NYC 72 60 ## 3 Bstn 43 59 ```