class: center, middle, inverse, title-slide # Lec 13: Wrangle Single Table ## SDS 192: Introduction to Data Science ###
Shiya Cao
Statistical & Data Sciences
, Smith College
###
Fall 2024
--- # Today's Learning Goals * Conduct single table analysis using `select()`, `filter()`, `arrange()`, `summarize()`, `group_by()`, `mutate()`. --- class: center, middle # Data wrangling is a process for transforming a dataset from its original form into a more relevant form to suit your needs. --- # Intro to Data Wrangling The dplyr package presents a grammar for data wrangling (H. Wickham and Francois 2020). This package is loaded when `library(tidyverse)` is run. * `select()` * `filter()` * `arrange()` * `summarize()` * `group_by()` * `mutate()` * `join()` --- # Today's Dataset * Hospital Cost Reports, 2018 ```r library(tidyverse) hospitals <- read.csv("data/Hospital_Cost_Report_2018.csv") ``` --- # `select()` * `select()` enables us to select variables (columns) of interest. ```r hospitals |> select(Provider.CCN, State.Code, Number.of.Beds) |> head() ``` ``` ## Provider.CCN State.Code Number.of.Beds ## 1 10032 AL 34 ## 2 250042 MS 181 ## 3 440235 TN 25 ## 4 50523 CA 145 ## 5 50305 CA 337 ## 6 50043 CA 296 ``` --- # `select()` * `:` enables us to select consecutive variables (columns) of interest. ```r hospitals |> select(Provider.CCN, State.Code:County) |> head() ``` ``` ## Provider.CCN State.Code Zip.Code County ## 1 10032 AL 36278 RANDOLPH ## 2 250042 MS 38614 COAHOMA ## 3 440235 TN 37841 SCOTT ## 4 50523 CA 94509- CONTRA COSTA ## 5 50305 CA 94705- ALAMEDA ## 6 50043 CA 94609- ALAMEDA ``` --- # `select()` * `!` enables us to negate selection of certain variables (columns). * `last_col()` serves as a reference to the last column. ```r hospitals |> select(!(Hospital.Name:last_col())) |> head() ``` ``` ## rpt_rec_num Provider.CCN ## 1 623132 10032 ## 2 628158 250042 ## 3 628833 440235 ## 4 631016 50523 ## 5 631094 50305 ## 6 631292 50043 ``` --- # `select()` * `starts_with()` is a helper function that enables us to select columns by matching string patterns at the start of their names. ```r hospitals |> select(Provider.CCN, starts_with("Fiscal")) |> head() ``` ``` ## Provider.CCN Fiscal.Year.Begin.Date Fiscal.Year.End.Date ## 1 10032 2017-09-30 2017-11-12 ## 2 250042 2017-10-31 2017-12-30 ## 3 440235 2017-10-10 2017-12-30 ## 4 50523 2017-12-31 2018-02-27 ## 5 50305 2017-12-31 2018-02-27 ## 6 50043 2017-12-31 2018-02-27 ``` --- # `select()` * `ends_with()` is a helper function that enables us to select columns by matching string patterns at the end of their names. ```r hospitals |> select(Provider.CCN, ends_with("Improvements")) |> head() ``` ``` ## Provider.CCN Land.Improvements Leasehold.Improvements ## 1 10032 ## 2 250042 ## 3 440235 ## 4 50523 $9,762,451 $3,077 ## 5 50305 $6,040,540 $4,268,998 ## 6 50043 $6,040,540 $6,352,542 ``` --- # `filter()` * `filter()` subsets observations (rows) according to a certain criteria that we provide. * == checks for equivalence and != checks for inequivalence ```r hospitals |> filter(City == "SPRINGFIELD") |> select(Provider.CCN, City, State.Code) ``` ``` ## Provider.CCN City State.Code ## 1 223303 SPRINGFIELD MA ## 2 111306 SPRINGFIELD GA ## 3 144021 SPRINGFIELD IL ## 4 364040 SPRINGFIELD OH ## 5 264024 SPRINGFIELD MO ## 6 263032 SPRINGFIELD MO ``` --- # `filter()` * `&` returns rows that meet **both** criteria. ```r hospitals |> filter(City == "SPRINGFIELD" & State.Code == "MA") |> select(Provider.CCN, City, State.Code) ``` ``` ## Provider.CCN City State.Code ## 1 223303 SPRINGFIELD MA ## 2 220077 SPRINGFIELD MA ## 3 220066 SPRINGFIELD MA ## 4 222046 SPRINGFIELD MA ``` --- # `filter()` * `|` returns rows that meet **either** criteria. ```r hospitals |> filter(City == "SPRINGFIELD" & State.Code == "MA" | State.Code == "VT") |> select(Provider.CCN, City, State.Code) ``` ``` ## Provider.CCN City State.Code ## 1 223303 SPRINGFIELD MA ## 2 220077 SPRINGFIELD MA ## 3 220066 SPRINGFIELD MA ## 4 222046 SPRINGFIELD MA ## 5 471306 SPRINGFIELD VT ``` --- # `filter()` * \>, <, >=, <= checks whether values are greater/less/equal to than other values ```r hospitals |> filter(Number.of.Beds > 1500) |> select(Provider.CCN, City, State.Code, Number.of.Beds) ``` ``` ## Provider.CCN City State.Code Number.of.Beds ## 1 450388 SAN ANTONIO TX 1560 ## 2 100007 ORLANDO FL 2753 ## 3 330101 NEW YORK NY 2272 ``` --- # `filter()` * `%in%` checks for values that match at least one value in a vector of values. ```r hospitals |> filter(CCN.Facility.Type %in% c("STH", "LTCH", "CAH")) |> select(Hospital.Name, City, State.Code) |> head() ``` ``` ## Hospital.Name City State.Code ## 1 WEDOWEE HOSPITAL WEDOWEE AL ## 2 NORTHWEST MS MEDICAL CENTER CLARKSDALE MS ## 3 BIG SOUTH FORK MEDICAL CENTER ONEIDA TN ## 4 SUTTER DELTA MEDICAL CENTER ANTIOCH CA ## 5 ALTA BATES MEDICAL CENTER BERKELEY CA ## 6 SUMMIT MEDICAL CENTER OAKLAND CA ``` --- # `filter()` * `is.na()` checks for `NA` values and `!is.na()` checks for non-`NA` values. ```r hospitals |> filter(is.na(Number.of.Beds)) |> nrow() ``` ``` ## [1] 83 ``` --- # `filter()` * `str_detect()` checks for values with strings that match a pattern we provide. ```r hospitals |> filter(str_detect(Hospital.Name, "SHRINERS")) |> select(Hospital.Name, City, State.Code) |> head() ``` ``` ## Hospital.Name City State.Code ## 1 SHRINERS HOSPITAL FOR CHILDREN ST LOUIS MO ## 2 SHRINERS HOSPITALS CINCINNATI OH ## 3 SHRINERS HOSPITAL FOR CHILDREN CHICAGO IL ## 4 SHRINERS HOSPITAL FOR CHILDREN MINNEAPOLIS MN ## 5 THE SHRINERS HOSPITAL FOR CHILDREN SPRINGFIELD MA ## 6 THE SHRINERS HOSPITAL FOR CHILDREN BOSTON MA ``` --- # `arrange()` * `arrange()` sorts rows according to values in a column. * Defaults to sorting from smallest to largest (numeric) or first character to last character (character). ```r hospitals |> filter(City == "SPRINGFIELD" & State.Code == "MA") |> arrange(Number.of.Beds) |> select(Hospital.Name, Number.of.Beds) ``` ``` Hospital.Name Number.of.Beds VIBRA HOSPITAL OF WESTERN MASSACHUSE 174 THE MERCY HOSPITAL 237 BAYSTATE MEDICAL CENTER 696 THE SHRINERS HOSPITAL FOR CHILDREN NA ``` --- # `arrange()` * `desc()` sorts from largest to smallest (numeric) or last character to first character (character). ```r hospitals |> filter(City == "SPRINGFIELD" & State.Code == "MA") |> arrange(desc(Number.of.Beds)) |> select(Hospital.Name, Number.of.Beds) ``` ``` Hospital.Name Number.of.Beds BAYSTATE MEDICAL CENTER 696 THE MERCY HOSPITAL 237 VIBRA HOSPITAL OF WESTERN MASSACHUSE 174 THE SHRINERS HOSPITAL FOR CHILDREN NA ``` --- # `summarize()` * `summarize()` computes a value across a vector of values and stores it in a new data frame. ```r summary_hospitals_ma <- hospitals |> filter(State.Code == "MA") |> summarize(total_beds = sum(Number.of.Beds)) summary_hospitals_ma ``` ``` ## # A tibble: 1 × 1 ## total_beds ## <int> ## 1 NA ``` > Why did we get an `NA` value? --- # Let's filter to MA ```r hospitals_ma <- hospitals |> filter(State.Code == "MA") ``` --- # `summarize()` * Setting `na.rm` to TRUE ignore any missing values and only returns the summary value for all non-missing values. * `sum(is.na())` computes the total `NA` values in the vector. ```r summary_hospitals_ma <- hospitals_ma |> summarize(total_beds = sum(Number.of.Beds, na.rm = TRUE), total_beds_na = sum(is.na(Number.of.Beds))) summary_hospitals_ma ``` ``` ## # A tibble: 1 × 2 ## total_beds total_beds_na ## <int> <int> ## 1 18804 2 ``` > What if I wanted to know the percentage of rows with `NA` values in the `Number.of.Beds` column? --- # `summarize()` * `n()` computes the total observations in the vector. ```r summary_hospitals_ma <- hospitals_ma |> summarize(total_beds = sum(Number.of.Beds, na.rm = TRUE), percent_beds_na = sum(is.na(Number.of.Beds)) / n() * 100) summary_hospitals_ma ``` ``` ## # A tibble: 1 × 2 ## total_beds percent_beds_na ## <int> <dbl> ## 1 18804 2.04 ``` --- * But there are very different types of hospitals represented in this dataset. Maybe I want to know the number of beds for each type of hospital. --- # `group_by()` |> `summarize()` * `group_by()` groups observations with a shared value in a variable. * When we combine `group_by()` and `summarize()` we can perform operations *within* groups. ```r hospitals_ma |> group_by(CCN.Facility.Type) |> summarize(total_beds = sum(Number.of.Beds, na.rm = TRUE), total_hopsitals = n(), percent_beds_na = sum(is.na(Number.of.Beds)) / n() * 100) ``` --- ``` ## `summarize()` has grouped output by 'CCN.Facility.Type'. ``` ``` ## # A tibble: 7 × 4 ## CCN.Facility.Type total_beds total_hopsitals percent_beds_na ## <chr> <int> <int> <dbl> ## 1 CAH 74 3 0 ## 2 CH 527 4 50 ## 3 LTCH 2416 11 0 ## 4 PH 1704 17 0 ## 5 RH 752 6 0 ## 6 RNMHC 20 1 0 ## 7 STH 13311 56 0 ``` --- # `group_by()` * Grouping only changes the metadata of a data frame. * `ungroup()` removes the grouping structure metadata. ```r hospitals_ma |> group_by(CCN.Facility.Type) |> ungroup() ``` --- # `mutate()` * `mutate()` creates a new variable (column) in a data frame and fills values according to criteria we provide. > What would the following lines of code do? ```r hospitals_ma |> mutate(Address = paste(Street.Address, City, State.Code, Zip.Code)) hospitals_ma |> mutate(Hospital.Name.Lower = tolower(Hospital.Name)) hospitals_ma |> mutate(Percentage.State.Beds = Number.of.Beds/sum(Number.of.Beds, na.rm = TRUE) * 100) ``` --- # `mutate()` ```r hospitals_ma |> mutate(Address = paste(Street.Address, City, State.Code, Zip.Code)) |> select(Street.Address, City, Address) |> head() ``` ``` ## # A tibble: 6 × 3 ## Street.Address City Address ## <chr> <chr><chr> ## 1 ONE HOSPITAL ROAD OAK BLUFFS ONE HOSPITAL ROAD OAK BLUFFS MA 2557 ## 2 2033 MAIN STREET ATHOL 2033 MAIN STREET ATHOL MA 1331 ## 3 450 BROOKLINE AVE BOSTON 450 BROOKLINE AVE BOSTON MA 2215 ## 4 145 WARD HILL AVENUE BRADFORD 145 WARD HILL AVENUE BRADFORD MA 1835 ## 5 222 STATE STREET LUDLOW 222 STATE STREET LUDLOW MA 1056 ## 6 1200 CENTRE STREET ROSLINDALE 1200 CENTRE STREET ROSLINDALE MA 2127 ``` --- # `mutate()` ```r hospitals_ma |> mutate(Hospital.Name.Lower = tolower(Hospital.Name)) |> select(Hospital.Name, Hospital.Name.Lower) |> head() ``` ``` ## # A tibble: 6 × 2 ## Hospital.Name Hospital.Name.Lower ## <chr> <chr> ## 1 MARTHAS VINEYARD HOSPITAL marthas vineyard hospital ## 2 ATHOL MEMORIAL HOSPITAL athol memorial hospital ## 3 DANA-FARBER CANCER INSTITUTE dana-farber cancer institute ## 4 WHITTIER HOSPITAL-BRADFORD whittier hospital-bradford ## 5 HEALTHSOUTH REHABILITATION HOSPITAL healthsouth rehabilitation hospital ## 6 HEBREW REHABILITATION CENTER hebrew rehabilitation center ``` --- # `mutate()` ```r hospitals_ma |> mutate(Percentage.State.Beds = Number.of.Beds/sum(Number.of.Beds, na.rm = TRUE) * 100) |> select(Number.of.Beds, Percentage.State.Beds) |> head() ``` ``` ## # A tibble: 6 × 2 ## Number.of.Beds Percentage.State.Beds ## <int> <dbl> ## 1 25 0.133 ## 2 25 0.133 ## 3 30 0.160 ## 4 60 0.319 ## 5 53 0.282 ## 6 667 3.55 ```