class: center, middle, inverse, title-slide # Lec 15: Join Two Tables + Other Verbs ## SDS 192: Introduction to Data Science ###
Shiya Cao
Statistical & Data Sciences
, Smith College
###
Fall 2024
--- # Today's Learning Goals * Be able to join two tables using R. --- # Joining Two Tables * Up until this point, we have been working with data in single tables. * Sometimes related data can be spread across multiple tables that we wish to bring together for different kinds of analysis. --- # Today's Dataset > What do you notice about the two datasets? Why might they be useful to consider together? ```r library(tidyverse) ct_school_attendance <- read.csv("data/ct_school_attendance.csv") |> filter(studentgroup == "All Students" & reportingdistrictname != "Connecticut") ct_school_learning_model <- read.csv("data/ct_school_learning_model.csv") ``` --- # Keys * To join data tables, we need a column shared across the tables that we can join on. * Often this is some form of a unique key. .pull-left[ ``` ## reportingdistrictcode ## 1 0010011 ## 2 0020011 ## 3 0030011 ## 4 0040011 ## 5 0050011 ## 6 0070011 ``` ] .pull-right[ ``` ## district_code ## 1 2190012 ## 2 80011 ## 3 10011 ## 4 20011 ## 5 30011 ## 6 40011 ``` ] > What might be an issue with joining on this key? --- # Keys * Keys must be formatted the same way and of the same type. .pull-left[ ```r ct_school_attendance <- ct_school_attendance |> mutate(reportingdistrictcode = as.integer(reportingdistrictcode)) ct_school_attendance |> select(reportingdistrictcode) |> head() ``` ``` ## reportingdistrictcode ## 1 10011 ## 2 20011 ## 3 30011 ## 4 40011 ## 5 50011 ## 6 70011 ``` ] .pull-right[ ```r ct_school_learning_model |> select(district_code) |> head() ``` ``` ## district_code ## 1 2190012 ## 2 80011 ## 3 10011 ## 4 20011 ## 5 30011 ## 6 40011 ``` ] --- # Join ```r ct_school_attendance |> inner_join(ct_school_learning_model, by = c("reportingdistrictcode" = "district_code")) |> rename(fully_remote = percent_students_fully_remote) |> #to fit on slide select(reportingdistrictcode, attrate_202021, fully_remote) |> head() ``` ``` ## reportingdistrictcode attrate_202021 fully_remote ## 1 10011 0.9680 13% ## 2 20011 0.8923 32% ## 3 30011 0.9550 30% ## 4 40011 0.9621 19% ## 5 50011 0.9698 12% ## 6 70011 0.9542 13% ``` --- # Different Kinds of Joins * `inner_join()` * `left_join()` * `right_join()` * `full_join()` * `anti_join()` --- # Inner Join  --- # Outer Joins <img src="https://d33wubrfki0l68.cloudfront.net/9c12ca9e12ed26a7c5d2aa08e36d2ac4fb593f1e/79980/diagrams/join-outer.png" width="350" /> --- # Many-to-One Relationships  --- # Anti-Join * Filters to entries in one table that are not in the other table. .pull-left[ ```r ct_school_attendance |> anti_join(ct_school_learning_model, by = c("reportingdistrictcode" = "district_code")) |> select(reportingdistrictcode) ``` ``` ## reportingdistrictcode ## 1 2310018 ## 2 3360015 ## 3 3470015 ``` ] .pull-right[ ```r ct_school_learning_model |> anti_join(ct_school_attendance, by = c("district_code" = "reportingdistrictcode")) |> select(district_code) |> head() ``` ``` ## district_code ## 1 46821 ## 2 46921 ## 3 100161 ## 4 170561 ## 5 170821 ## 6 190161 ``` ]