George Mason University
Volgenau School of Information Technology and Engineering
Department of Statistics


STAT 672 Statistical Learning and Data Analytics

Summer Session B, 2018

Section 001

Tuesdays & Thursdays from 7:20 to 10:00 PM (starting June 5, other dates given below)

Location: room 135 of Innovation Hall


Instructor: Clifton D. Sutton


Text:

An Introduction to Statistical Learning with Applications in R, by James, Witten, Hastie, and Tibshirani (Springer, 2013), which can be downloaded at no cost from the website for the book (which is maintained by the authors).


Software:

R can be downloaded at no cost from this web site.


Prerequisite:

STAT 544 & STAT 554, both with a grade of B- or higher


Description:

This course covers methods for regression, classification, and clustering that can be applied to "Big Data" problems. Included are traditional statistical methods such as ordinary least squares (OLS) regression, logistic regression, and linear discriminant analysis (LDA), which are relatively simple to understand, as well as more modern computer-intensive methods such as tree-based methods (including bagging, boosting, and random forests), generalized additive models (GAMs), and support vector machines (SVMs). The basics of how and why each method works will be presented without getting bogged down in the messy details of the algorithms. Overarching principles such as the so-called curse of dimensionality and the bias-variance trade-off will be emphasized, as will general model fitting and selection techniques such as regularization and cross-validation. In addition to gaining an overall understanding of how the various methods work (from a statistical point of view), successful students will gain experience in applying the methods to real data sets in a wide variety of settings, using the popular software R.

Approximate class-by-class content:

[1] June 5:
introduction (to the course, the text's web site, the text's notation, and R (the software to be used))
[Ch. 1 of text, and perhaps the first several pages of Ch. 2]
[2] June 7:
some principles of statistical learning; the classification setting and Bayes classifiers
[the rest of Ch. 2 of text]
[3] June 12:
ordinary least squares (OLS) regression (model fitting, and related inferences and assessment)
[roughly the first half of Ch. 3 of text]
[4] June 14:
more on regression (extensions, dealing with problems)
[continuing Ch. 3 of text, through roughly p. 102]
[5] June 19:
more on OLS regression, and a comparison with a nonparametric approach (KNN regression); linear methods for classification (LDA and logistic regression)
[the rest of Ch. 3 of text, and Ch. 4 of text through roughly p. 144]
[6] June 21:
more on classification (a comparison of linear methods with more flexible methods (QDA, RDA, and KNN))
[continuing Ch. 4 of text, through roughly p. 154]
[7] June 26:
a bit more on classification; resampling methods for assessment (cross-validation and the bootstrap)
[rest of Ch. 4 of text, and Ch. 5 of text]
[8] June 28:
linear model selection, regularization, and dimension reduction (subset selection, shrinkage methods, dimension reduction methods)
[Ch. 6 of text, through roughly p. 236]
[**] July 3:
no class due to Summer Recess
[9] July 5:
more on linear methods, and moving beyond linearity (polynomial regression, splines, local regression, generalized additive models (GAMs))
[rest of Ch. 6 of text, Ch. 7 of text]
[10] July 10:
classification and regression trees (CART), bagging
[Ch. 8 of text, through roughly p. 319]
[11] July 12:
random forests (RFs), boosting, and multivariate adaptive regression splines (MARS)
[rest of Ch. 8 of text, Ch. 9 of text through roughly p. 343]
[12] July 17:
support vector machines (SVMs)
[rest of Ch. 9 of text]
[13] July 19:
unsupervised learning (principal components analysis and clustering)
[Ch. 10 of text]
[14] July 24:
review and summary of the course (and maybe a few extra topics, which will not be covered on the final exam)
[Ch. 10 of text]
[**] July 26:
Final Exam (note: exam period is from 7:30 to 10:15 PM)
Note: If any classes are cancelled due to power outages (or for any other reason), some of the dates given above may be changed.


Grading:


Additional Comments: