class: center, middle, inverse, title-slide

# Lec 07: Boxplots
## SDS 192: Introduction to Data Science
### Shiya Cao, Statistical & Data Sciences, Smith College
### Fall 2024
---

# Today's Learning Goals

* Understand measures of central tendency and variability.
* Be able to interpret boxplots.
* Create boxplots using `ggplot2`.

---
class: center, middle

# A descriptive statistic is a summary statistic that quantitatively describes or summarizes features from a collection of information.

---

# Measures of Central Tendency

* Mean
* Median

---

# Mean

.pull-left[
* Sum of the values divided by the number of values.
* Takes every value into consideration.
* Heavily influenced by outliers.
]

.pull-right[
<!-- -->
]

---

# Median

.pull-left[
* Middle value of the dataset when all values are lined up from smallest to largest (the average of the two middle values when the count is even).
* Limited influence from outliers.
]

.pull-right[
<!-- -->
]

---

# Normal Distribution

.pull-left[
* Most values huddle around a center line and taper off as we move away from the center.
* Histogram is symmetrical with a normal distribution.
* Median and mean should be about the same; the mean should be a good measure of central tendency.
]

.pull-right[
<!-- -->
]

---

# Skew

.pull-left[
* Histogram is non-symmetrical when there is skew.
* Long tail to the right of center indicates a *right skew*.
* Median becomes a more representative measure of central tendency than the mean.
]

.pull-right[
<!-- -->
]

---

# Ethical Implications

.pull-left[
> The administration pointed out that 92 million Americans would receive an average tax reduction of over $1,000. Would most people be getting a tax cut of around $1,000?
]

.pull-right[
<!-- -->
By Diva Jain - <a rel="nofollow" class="external free" href="https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa">https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa</a>,
CC BY-SA 4.0
]

---

# Measures of Variability

* Range
* Interquartile Range (IQR)

---

# Range

.pull-left[
* Maximum value minus the minimum value.
* Measures the spread of the entire dataset.
]

.pull-right[
<!-- -->
]

---

# Interquartile Range (IQR)

.pull-left[
* 1st quartile is the middle value between the minimum and the median.
  * Separates the lowest 25% of values from the next 25%.
* 3rd quartile is the middle value between the median and the maximum.
  * Separates the third 25% of values from the highest 25%.
* IQR is the 3rd quartile minus the 1st quartile.
  * Spans the middle 50% of values.
]

.pull-right[
<!-- -->
]

---

# Grouped Boxplots

<!-- -->

---

# Interpreting Boxplots Step 1: Check for Outliers

.pull-left[
> How many are there? What do they indicate? Do you assume they are errors in the data? Or do they represent extremes that are important for us to take into consideration?
]

.pull-right[
<!-- -->
]

---

# Interpreting Boxplots Step 2: Compare Medians

.pull-left[
> Do the medians line up? If not, in which groups are the medians higher and in which are they lower?
]

.pull-right[
<!-- -->
]

---

# Interpreting Boxplots Step 3: Compare the Ranges

.pull-left[
> Do certain groups have a wider range of values than others? In other words, are the values more spread out for certain groups than for others? This might indicate a greater degree of disparity in some groups than in others.
]

.pull-right[
<!-- -->
]

---

# Interpreting Boxplots Step 4: Compare the IQRs

.pull-left[
> In which groups do the middle 50% of values tend to huddle around a central value? In which are they more spread out from the center?
]

.pull-right[
<!-- -->
]

---

# Interpreting Boxplots Step 5: Compare the Symmetry

.pull-left[
> Does the median appear to be in the center of the range and the IQR? Is the median closer to the minimum (the bottom whisker) or to the maximum (the top whisker)?
]

.pull-right[
<!-- -->
]
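
---

# Sketch: Computing the Statistics Behind a Boxplot

The definitions from the earlier slides (mean, median, range, quartiles, IQR) can be checked by hand on a small dataset. Below is a minimal Python sketch using only the standard library. The function name `boxplot_summary` and the `incomes` values are hypothetical, made up for illustration. The quartiles follow the slide's definition (median of the lower and upper halves), and the 1.5 × IQR rule for flagging outliers is a common convention assumed here, not something stated in these slides. Plotting libraries may compute quartiles slightly differently.

```python
from statistics import mean, median


def boxplot_summary(values):
    """Return the descriptive statistics discussed in these slides.

    Assumes at least two values so that both halves are non-empty.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2

    # Split around the median; with an odd count the median itself is excluded.
    lower = data[:half]
    upper = data[half + n % 2:]

    q1 = median(lower)  # middle value between minimum and median
    q3 = median(upper)  # middle value between median and maximum
    iqr = q3 - q1

    # Assumption: the common 1.5 * IQR rule for flagging outliers.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr

    return {
        "mean": mean(data),
        "median": median(data),
        "range": data[-1] - data[0],
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical values; one extreme value creates a right skew.
incomes = [20, 25, 30, 35, 40, 45, 50, 55, 400]
summary = boxplot_summary(incomes)
print(summary)
```

In this made-up example the single extreme value pulls the mean to about 77.8 while the median stays at 40, which is the same effect the tax-cut example points at: an average can sit far from what a typical case looks like.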