class: center, middle, inverse, title-slide

# Lec 9: Data Fundamentals
## SDS 192: Introduction to Data Science
### Shiya Cao, Statistical & Data Sciences, Smith College
### Fall 2024
---

# Today's Learning Goals

* Understand rows, columns, and datasets.
* Understand metadata.

---

# Vectors

.pull-left[

* A vector is a data object with several entries.
* We define a vector by listing these entries (separated by commas) in the function `c()`, which is shorthand for *combine*.
* Let's create a vector representing the point values of this hand of cards. What is a good variable name for this vector?
* Next, let's create a vector representing the colors of this hand of cards. What is a good variable name?
* Finally, let's create a vector representing the suits of this hand of cards. What is a good variable name?

]

.pull-right[

By Ron Maijen - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15300536

]

---

# Uniqueness

* We can use the function `unique()` to determine the distinct values in a vector.
* Call `unique()` on the vector of playing card colors you created on the last slide.

```r
# Colors of the five cards in the hand
playing_card_colors <- c("black", "red", "black", "red", "black")
# Return each distinct color once
unique(playing_card_colors)
```

```
## [1] "black" "red"
```

> Challenge: How would we write code to computationally determine the number of unique values in a vector?

---

# Class

* Values in a vector will all be of the same class.
* We can check the class of a vector by calling `class()` and passing the name of the vector as an argument.
* What is the class of each of these vectors?
  * `playing_card_values <- c(1, 2, 3, 4, 5)`
  * `playing_card_colors <- c("black", "red", "black", "red", "black")`
  * `playing_card_values <- c("1", "2", "3", "4", "5")`

---

# Data Frames

* A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).
* Every column in a data frame is a vector. The column name acts as a variable name for that vector.
* We can turn a series of vectors into a data frame by listing them (separated by commas) in `data.frame()`.
```r
# One vector per attribute of the five-card hand
playing_card_values <- c(1, 2, 3, 4, 5)
playing_card_colors <- c("black", "red", "black", "red", "black")
playing_card_suits <- c("spade", "diamond", "spade", "heart", "club")

# Create a new data frame for playing cards
playing_card_data <- data.frame(playing_card_values, playing_card_colors, playing_card_suits)
```

---

# Viewing Data Frames

* The data frame we just created is quite small, so we can enter `playing_card_data` into our console and see the whole thing. This is often impractical with larger datasets.
* Other ways to view data frames:
  * `view(playing_card_data)`: opens the dataset in a viewer (available once the tidyverse is loaded; the base R equivalent is `View()`).
  * `glimpse(playing_card_data)`: prints each column with its type and first few values (available once the tidyverse is loaded).
  * `head(playing_card_data)`: returns up to the first six rows of the dataset.
  * `names(playing_card_data)`: returns the dataset's column names.
  * `nrow(playing_card_data)`: returns the number of rows in the dataset.
  * `ncol(playing_card_data)`: returns the number of columns in the dataset.

---

# Renaming Columns

* The column names in our new data frame are a bit redundant now, since they all live in the data frame `playing_card_data`.
* We can rename them by creating a new vector of column names, and then assigning that vector to the names of the data frame:

```r
# New, shorter column names, in the same order as the existing columns
playing_card_column_names <- c("values", "colors", "suits")
names(playing_card_data) <- playing_card_column_names
```

---

# Accessing Columns

* There are many ways to reference certain columns in a data frame.
* Today we will use the `$` operator to access columns (e.g. `playing_card_data$values`).
* Call the `table()` function on the `suits` column. What does it return?

---

# What is a Dataset?

.pull-left[

Grolemund, Garrett, and Hadley Wickham. n.d. R for Data Science. Accessed March 31, 2019. https://r4ds.had.co.nz/.

]

.pull-right[

* A collection of data points organized into a structured format.
* In this course, we will mainly work with datasets that are structured in a two-dimensional format.
* We will refer to these as *rectangular* datasets.
* Rectangular datasets are organized into a series of rows and columns. Ideally:
  * Rows correspond to *observations*.
  * Columns correspond to *variables*.

]

---

# Observations vs. Variables vs. Values

.right-column[

* Observations refer to individual units or cases of the data being collected.
  * If I were collecting data about each student in this course, one student would be an observation.
  * If I were collecting census data and aggregating it at the county level, one county would be an observation.
* Variables describe something about an observation.
  * If I were collecting data about each student in this course, 'major' might be one variable.
  * If I were collecting county-level census data, 'population' might be one variable.
* Values refer to the actual value associated with a variable for a given observation.
  * If I were collecting data about each student's major in this course, one value might be SDS.
  * If I were collecting data about the population of Hampshire County in MA, the value might be 161,572.

]

.left-column[

Grolemund, Garrett, and Hadley Wickham. n.d. R for Data Science. Accessed March 31, 2019. https://r4ds.had.co.nz/.

]

---

# How do I find out more information about a dataset?

* Metadata can be referred to as "data about data".
* Metadata provides important contextual information to help us interpret a dataset.
* There are two types of metadata associated with datasets:
  * Administrative metadata tells us how a dataset is managed and its *provenance*, or the history of how it came to be in its current form:
      * Who created it?
      * When was it created?
      * When was it last updated?
      * Who is permitted to use it?
  * Descriptive metadata tells us information about the contents of a dataset:
      * What does each row refer to?
      * What does each column refer to?
      * What values might appear in each cell?

---

# Where do I find metadata for a dataset?

* Metadata is often recorded in a dataset codebook or data dictionary.
* These documents provide definitions for the observations and variables in a dataset and tell you the accepted values for each variable.
* Let's say that I have a dataset of student names, majors, and class years. A codebook or data dictionary might tell me that:
  * Each row in the dataset refers to one student.
  * The 'Class Year' variable refers to "the year the student is expected to graduate."
  * Possible values for the 'Major' variable are Economics, SDS, and Psychology.
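
---

# Checking a Dataset Against Its Codebook

A minimal sketch of how the student codebook example could be checked in R. The `student_data` data frame, its column names, and its values are hypothetical and invented for illustration. `%in%` and `all()` are base R functions not otherwise covered in these slides.

```r
# Hypothetical student dataset: one row per student (invented values)
student_data <- data.frame(
  name = c("Student A", "Student B", "Student C"),
  major = c("SDS", "Economics", "Psychology"),
  class_year = c(2025, 2026, 2027)
)

# Accepted values for 'major', as a codebook might list them (assumed)
accepted_majors <- c("Economics", "SDS", "Psychology")

# Descriptive metadata questions we can check computationally
nrow(student_data)              # how many observations (students)?
names(student_data)             # which variables are recorded?
class(student_data$class_year)  # what class are the class-year values?

# Is every recorded major one of the accepted values?
all(student_data$major %in% accepted_majors)
```

If the last line returned `FALSE`, at least one major in the data would fall outside the values the codebook allows, which could point to a typo or an outdated codebook.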