class: center, middle, inverse, title-slide # Lec 01: Introductions ## SDS 192: Introduction to Data Science ###
Shiya Cao
Statistical & Data Sciences, Smith College
###
Fall 2024
--- # What is Data Science? .pull-left[ * Interdisciplinary field combining computer science, mathematics/statistics, and domain expertise to extract meaningful information from unstructured data points. ] .pull-right[
] --- # Case Study 1: ACLU Fights Discriminatory Housing .pull-left[ * American Civil Liberties Union employs [data scientists](https://medium.com/aclu-tech-analytics/meet-the-aclu-analytics-team-4644d4f20dae) to produce insights regarding discriminatory laws and practices. * Findings are presented in courts, legislatures, and public reports. * In [this study](https://www.aclu.org/blog/racial-justice/race-and-economic-justice/lawsuit-challenges-discriminatory-housing-policy), they use public data to show that excluding people with criminal records from housing can be viewed as a violation of the US Fair Housing Act. ] .pull-right[
] --- # Case Study 2: EPA Tracks Environmental Injustice .pull-left[ * Environmental Protection Agency hires [data scientists](https://www.epa.gov/careers/science-careers-epa) to produce insights regarding environmental health risks. * Findings inform environmental policies, funding allocations, and legal actions against states and industries. * [This tool](https://www.epa.gov/ejscreen) visualizes environmental and demographic indicators to highlight communities experiencing environmental injustices. ] .pull-right[  ] --- # We have had data for a long time. It is not new. However, those studies were not possible 30 years ago. Why is data science so popular today? --- # Big Data * "Between the dawn of civilization and 2003, we only created five exabytes of information; now we're creating that amount every two days." Eric Schmidt, Google (and others) --- # Development of Tools .pull-left[ * 1940s: [Monroe LN-160X](http://www.vintagecalculators.com/html/monroe_ln-160x.html) * 1947: [Victor is the world's largest exclusive manufacturer of adding machines](https://en.wikipedia.org/wiki/Victor_Technology) ] .pull-right[  ] --- .pull-left[ * 1973: [TI SR-10](https://en.wikipedia.org/wiki/Calculator#History) pocket calculator * 1977: [Apple II](https://en.wikipedia.org/wiki/Apple_II) * 1981: [MS-DOS](https://en.wikipedia.org/wiki/MS-DOS) first released * 1986: [SQL](https://en.wikipedia.org/wiki/SQL) becomes a standard of the American National Standards Institute (ANSI) ] .pull-right[  ] --- .pull-left[ * 1990: [Windows 3.0](https://en.wikipedia.org/wiki/Windows_3.0) * 1995: [Excel 7.0](https://en.wikipedia.org/wiki/Microsoft_Excel#Number_of_rows_and_columns) can handle at most 16k rows * 1998: [iMac G3](https://en.wikipedia.org/wiki/IMac_G3) ] .pull-right[  ] --- .pull-left[ * 2003: [Excel 11.0](https://en.wikipedia.org/wiki/Microsoft_Excel#Number_of_rows_and_columns) can handle at most 64k rows * 2005: [MySQL](https://en.wikipedia.org/wiki/MySQL)
powers Google * 2007: [Excel 12.0](https://en.wikipedia.org/wiki/Microsoft_Excel#Number_of_rows_and_columns) can handle at most 1M rows ] .pull-right[ MySQL logo, vectorised from https://labs.mysql.com/common/logos/mysql-logo.svg (fair use)
] --- # Science Paradigms * A thousand years ago: Science was ***empirical***, describing natural phenomena. * Last few hundred years: ***Theoretical*** branch using models and generalizations. * Last few decades: ***Computational*** branch simulating complex phenomena. * Today: ***Data exploration*** (eScience) using experiment and simulation. * Data captured by instruments or generated by simulator * Processed by software * Information/Knowledge stored in computer * Scientist analyzes databases/files using data management and statistics --- # [Data Scientists Outlook](https://www.bls.gov/ooh/math/data-scientists.htm#:~:text=in%20May%202023.-,Job%20Outlook,on%20average%2C%20over%20the%20decade.) (U.S. Bureau of Labor Statistics) # [Geographic Profile for Data Scientists](https://www.bls.gov/oes/current/oes152051.htm) (U.S. Bureau of Labor Statistics) --- * "The ability to take ***data*** -- to be able to ***understand*** it, to ***process*** it, to ***extract value*** from it, to ***visualize*** it, to ***communicate*** it is going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ***ubiquitous data***." Hal Varian, Prof.
Emeritus at UC Berkeley, Chief Economist at Google --- # Data Science for Good Opportunities * [Human Rights Data Analysis Group](https://hrdag.org/) * [Data Science for Social Good](https://www.dssgfellowship.org/) * [Stanford Computational Policy Lab](https://policylab.stanford.edu/) * [AI for Good](https://ai4good.org/) * [Data Kind](https://www.datakind.org/) * [We Count](https://wecount.inclusivedesign.ca/) * [Institute for the Quantitative Study of Inclusion, Diversity, and Equity](https://qsideinstitute.org/) --- # About Me * PhD in Information Technology, MS in System Dynamics and Innovation Management, BA in Management * Disability inclusion researcher * Employment issues * Transition from school to workplace * Applied statistics in disability inclusion using population surveys * The intersection of identity attributes * Participatory mapping for accessibility * Statistical methods to preserve disability data privacy * The integration of disability inclusion components into statistics and data science pedagogy * [Disability Inclusion Analytics Lab (DIAL)](https://scao53.github.io/) * When I have free time, I enjoy reading, writing, practicing piano, watching movies, visiting museums, hiking, and volunteering --- # Introduction Activity (3 minutes) * What’s your favorite summer activity? (Please choose the closest answer!) * a. Walking/hiking * b. Learning something new * c. Reading/writing * d. Spending time with family/friends * e. Going to the beach/kayaking * f.
Having a picnic/BBQ --- # Disability Inclusion Pedagogy * Purpose of this study: * make connections between STEM fields and disability studies so that STEM subjects become more appealing to traditionally underrepresented groups * encourage diversity of thought in approaching data science problems * promote disability awareness in the data science and statistics community * Relevant to you: * use disability inclusion public datasets in our three mini-projects * read some disability inclusion articles to understand the context * participate in a pre-course survey and post-course survey * gain 2 extra credits for completing both the [pre-course survey](https://smithcollege.qualtrics.com/jfe/form/SV_e9B41BJHZHcBzAG) and the post-course survey * students who are younger than 18 years old should let me know privately and will automatically gain 2 extra credits * Your responses will be confidential. Data from this study, with individual identifiers removed, may be used in research papers, presentations, etc. --- # How to Become a Data Scientist? .pull-left[ * Equity, consistency, standardization of R knowledge * SDS 100 * Programming with data * R or Python or SQL * See CSC 110 for Python * Advanced programming: SDS 270 or 271 * Statistical modeling * SDS 201/220 + SDS 291 + SDS 293 * Communication * SDS 109, 235, or 236 * Domain knowledge ] .pull-right[
] --- # Topics Covered in This Course .pull-left[ * Data visualization * Data wrangling * Mapping * Data science ethics * Disability inclusion context ] --- # Usual Structure of a Class * Motivation: * Let you work at your own pace * More coding practice for you * Engage everyone in learning * Make this course more meaningful to your education * Expectations of you: * Doing the assigned reading before each class is critical * Actively practice coding in class * Ask questions in class if you have any * Visit my student hours or drop-in tutoring at the Spinelli Center if you have more questions after class * I'll send a survey to ask for your feedback on this class structure in a few weeks and we may adjust it if needed --- # Usual Structure of a Class * 15 mins: I give a **chalk talk** to go over the more complex topics we need to highlight for each lecture * 10 mins: You work on **the first in-class exercise**, relatively simple and covering the key learning goals; or you can choose to use this time to go over the key topics on the class slides, which are always posted to our
course webpage
* 15 mins: I do **live coding** for the first in-class exercise and you can code with me * 35 mins: You work on **the second in-class exercise**, which includes more practice on the key learning goals. Let me or the in-class assistant know if you have any questions --- # Executive Summary of Syllabus --- # Coding Can Be Intimidating! .pull-left[ * Coding is like learning a new language. When you are first learning it, it all feels completely unfamiliar. I will work to support you in building the vocabulary and syntax to code in R. * Coding can be frustrating. I regularly lose hours of my day trying to find bugs in my code. I will work to give you resources and skills to navigate coding frustrations. * Do start by finishing a
minimally viable product (MVP). In other words: * Done is better than perfect * Don’t let perfect be the enemy of good
] .pull-right[  ]