Recently Published
Chapter 10: Regression Analysis
This chapter introduces the "Mathematical Compass" of Data Science: Regression Analysis. While previous chapters focused on grouping and labeling, students will now learn to predict continuous, numerical outcomes. By learning to quantify how one variable is associated with another, students will move from simply describing data to building models that estimate outcomes such as market trends and housing prices, together with a measure of their uncertainty.
Core Topics covered:
Chapter Overview
Simple Linear Regression
Multiple Linear Regression
Regression Diagnostics
Variable Selection
Regularized Regression
Non-Linear Regression
Regression for Count and Categorical Outcomes
Chapter Lab Activity: Housing Price Regression with the Boston Dataset
Statistics for Data Science (229711) - Chapter 9: Data Classification
This chapter shifts the focus to the most popular domain of Supervised Learning: Data Classification. Students will learn how to build models that can "decide" and "predict" categorical labels for new data. From determining whether an email is spam to diagnosing a medical condition, this chapter provides a robust toolkit for making evidence-based predictions by learning from historical patterns.
Core Topics covered:
Introduction to Classification
Logistic Regression
K-Nearest Neighbors
Decision Trees
Random Forest and Ensemble Methods
Model Evaluation
Model Comparison and Selection
Chapter Lab Activity: Medical Diagnosis Classification with Pima Data
Statistics for Data Science (229711) - Chapter 8: Data Clustering
This chapter introduces the concept of Unsupervised Learning through the lens of Data Clustering. Students will learn how to find "hidden structures" in data without predefined labels, mastering the techniques used to group similar observations together. From identifying customer segments to discovering natural patterns in biology, this chapter provides the tools to make sense of unlabeled datasets by letting the data speak for itself.
Core Topics covered:
Introduction to Clustering
K-Means Clustering
Hierarchical Clustering
DBSCAN
Cluster Validation
Gaussian Mixture Models
Practical Clustering Workflow
Chapter Lab Activity: Customer Segmentation with wholesales Data
Statistics for Data Science (229711) - Chapter 7: Data Dimension Reduction
This chapter explores the "Art of Information Distillation": Dimension Reduction. Students will learn how to navigate the "Curse of Dimensionality," discovering how to condense massive, complex datasets into their most essential structures. The focus is on finding the "signal" within the "noise"—transforming hundreds of variables into a few meaningful dimensions that tell the real story.
Core Topics covered:
The Curse of Dimensionality
Principal Component Analysis (PCA)
Factor Analysis
Linear Discriminant Analysis (LDA)
t-SNE
Feature Selection Methods
Evaluating Dimension Reduction
Chapter Lab Activity: Dimension Reduction Pipeline with decathlon2
Statistics for Data Science (229711) - Chapter 6: Data Preprocessing
This chapter dives into the "engine room" of Data Science: Preprocessing. Students will learn that the quality of a model is determined long before it is trained, focusing on the critical steps required to turn messy, real-world data into a "model-ready" format.
Core Topics covered:
Why Preprocessing Matters
Handling Missing Data
Outlier Detection and Treatment
Data Transformation
Encoding Categorical Variables
Feature Scaling
Data Integration and Reshaping
Chapter Lab Activity: Full Preprocessing Pipeline with msleep
Statistics for Data Science (229711) - Chapter 5: Data Sampling Techniques
This chapter addresses the foundational question of data science: "How do we ensure our data truly represents the world?" It explores the mechanics of selection, the math of sample size, and the power of computational resampling.
Core Topics covered:
Why Sampling Matters
Probability Sampling Methods
Non-Probability Sampling Methods
Sample Size Determination
Sampling Bias and Common Pitfalls
Bootstrap Resampling
Evaluating Sample Quality
Chapter Lab Activity: Exploring Sampling with nhanes-Style Data
Statistics for Data Science (229711) - Chapter 4: Test of Independence of Variables
This chapter explores the statistical frameworks used to detect and quantify relationships between variables. It moves from testing the independence of categorical factors to measuring the strength and direction of associations in both discrete and continuous data.
Core Topics covered:
The Concept of Independence
Chi-Square Test of Independence
Fisher’s Exact Test
Cramér’s V and Effect Size for Categorical Association
Correlation Tests
Point-Biserial and Phi Coefficients
Partial Correlation
Chapter Lab Activity: Exploring Independence with the titanic and mtcars Datasets
Statistics for Data Science (229711) - Chapter 3: Hypothesis Testing
This chapter introduces the core engine of statistical decision-making: Hypothesis Testing. It provides a rigorous framework for making inferences about populations based on sample evidence, a critical skill for any Data Scientist.
Core Topics covered:
The Logic of Hypothesis Testing
One-Sample Tests
Two-Sample Tests
Paired Sample Test
One-Way ANOVA
Non-Parametric Alternatives
Effect Size and Statistical Power
Chapter Lab Activity: Exploring Hypothesis Testing with the ToothGrowth Dataset
Statistics for Data Science (229711) - Chapter 2: Data Distribution and Probability
This chapter serves as the theoretical bridge between descriptive analysis and statistical inference. It introduces the fundamental concepts of probability and explores the mathematical distributions that model real-world data behavior.
Core Topics covered:
Types of Data and Measurement Scales
Probability Fundamentals
Conditional Probability and Bayes’ Theorem
Discrete Probability Distributions
Continuous Probability Distributions
Sampling Distributions and the Central Limit Theorem
Assessing Normality
Chapter Lab Activity: Exploring Distributions with the airquality Dataset
Statistics for Data Science (229711) - Chapter 1: Descriptive Statistics
This document serves as the introductory chapter for the Statistics for Data Science course at the graduate level. It focuses on the fundamental principles of Exploratory Data Analysis (EDA), shifting the focus from simple computation to critical statistical interpretation.
Topics covered:
Measures of Central Tendency
Measures of Dispersion
Measures of Shape: Skewness and Kurtosis
Data Visualization for Descriptive Statistics
Multivariate Descriptive Statistics
Chapter Lab Activity: Exploring the mtcars Dataset
208251_LAB5_Nonparametric Statistics
Students are able to:
1) perform descriptive statistics
2) apply appropriate non-parametric statistical tests to answer research questions of interest.
208251_LAB4_Nonparametric Statistics
Students are able to:
1) perform descriptive statistics
2) apply appropriate non-parametric statistical tests to answer research questions of interest.
208251_LAB3_Model diagnostics
Students are able to use the R language to analyse data using multiple linear regression:
1. perform linear regression analysis
2. check the normality assumption
3. check the constant variance assumption
4. check the independence (autocorrelation) assumption
5. deal with invalid model assumptions
208251_LAB1_SimpleLinearRegression
Students are able to use the R language to:
1. perform descriptive statistics
2. construct a scatterplot of two quantitative variables
3. perform correlation analysis
4. perform linear regression analysis and inference on regression parameters
5. interpret the results
208251_LAB2_MultipleLinearRegression
Students are able to use the R language to analyse data using multiple linear regression:
1. perform descriptive statistics
2. transform qualitative independent variables into dummy variables
3. select independent variables
4. perform linear regression analysis and inference on regression parameters
5. interpret the results