George Mason University
Volgenau School of Information Technology and Engineering
Department of Statistics


STAT 472 Introduction to Statistical Learning

Spring Semester, 2020

Section 001

Tuesdays from 7:20 to 10:00 PM (starting January 21, other dates given below)

Location: room 1108 of the Nguyen Engineering Building


Instructor: Clifton D. Sutton


Text:

An Introduction to Statistical Learning with Applications in R, by James, Witten, Hastie, and Tibshirani (Springer, 2013), which can be downloaded at no cost from the website for the book (which is maintained by the authors).


Software:

R can be downloaded at no cost from the R Project web site (https://www.r-project.org).


Prerequisite:

STAT 456, with a grade of C or higher


Description:

This course covers methods for regression, classification, and clustering that can be applied to "Big Data" problems. Included are traditional statistical methods such as ordinary least squares (OLS) regression, logistic regression, and linear discriminant analysis (LDA), which are relatively simple to understand, as well as more modern computer-intensive methods such as tree-based methods (including bagging, boosting, and random forests), generalized additive models (GAMs), and support vector machines (SVMs). The basics of how and why each method works will be presented, but getting bogged down in the messy details of the algorithms will be avoided. Overarching principles such as the so-called curse of dimensionality and the bias-variance trade-off will be emphasized, as will fairly general model fitting and selection techniques such as regularization and cross-validation. In addition to gaining an overall understanding of how the various methods work (from a statistical point of view), successful students will also obtain experience in applying the methods to real data sets in a wide variety of settings, using the popular software R.

Approximate class-by-class content:

[1] Jan. 21:
introduction to the course, the text's web site, the text's notation, and R (the software to be used)
[Ch. 1 of text, and the first several pages of Ch. 2]
[2] Jan. 28:
some principles of statistical learning; the classification setting and Bayes classifiers
[the rest of Ch. 2 of text]
[3] Feb. 4:
the classification setting; ordinary least squares (OLS) regression (model fitting, and related inferences and assessment)
[Ch. 3 of text through roughly p. 77]
[4] Feb. 11:
more on regression (extensions, dealing with problems)
[continuing Ch. 3 of text, through roughly p. 96]
[5] Feb. 18:
more on OLS regression, and a comparison with a nonparametric approach (KNN regression); linear methods for classification (LDA and logistic regression)
[the rest of Ch. 3 of text, and Ch. 4 of text through roughly p. 141]
[6] Feb. 25:
more on classification (a comparison of linear methods with more flexible methods (QDA, RDA, and KNN))
[continuing Ch. 4 of text, through roughly p. 163]
[7] Mar. 3:
a bit more on classification; resampling methods for assessment (cross-validation and the bootstrap)
[rest of Ch. 4 of text, and Ch. 5 of text]
[**] Mar. 10:
no class due to Spring Break
[8] Mar. 17:
linear model selection, regularization, and dimension reduction (subset selection, shrinkage methods, dimension reduction methods)
[Ch. 6 of text, through roughly p. 236]
[9] Mar. 24:
more on linear methods, and moving beyond linearity (polynomial regression, splines, local regression)
[rest of Ch. 6 of text, and Ch. 7 of text through roughly p. 282]
[10] Mar. 31:
generalized additive models (GAMs); classification and regression trees (CART); bagging
[rest of Ch. 7 of text, and Ch. 8 of text, through roughly p. 319]
[11] Apr. 7:
more on bagging; random forests (RFs); boosting; multivariate adaptive regression splines (MARS); maximal margin classifiers and support vector classifiers
[rest of Ch. 8 of text, and Ch. 9 of text through roughly p. 348]
[12] Apr. 14:
support vector machines (SVMs); unsupervised learning (principal components analysis)
[rest of Ch. 9 of text, and Ch. 10 of text through the material on principal components]
[13] Apr. 21:
clustering
[rest of Ch. 10 of text]
[14] Apr. 28:
review and summary of course
[**] May 12:
Final Exam (note: exam period is from 7:30 to 10:15 PM)
Note: If any classes are cancelled due to weather, power outages, or for any other reason, some of the dates given above may change. Also, you should consider the above schedule to be approximate; this is a relatively new course, and there is some uncertainty associated with the class-by-class schedule of a course when I'm teaching it for only the third time.


Grading:


Additional Comments: