George Mason University
Volgenau School of Information Technology and Engineering
Department of Statistics
STAT 672 Statistical Learning and Data Analytics
Summer Session B, 2018
Section 001
Tuesdays & Thursdays from 7:20 to 10:00 PM (starting June 5, other dates given below)
Location: Room 135 of Innovation Hall
Office Hours: 6:15-7:00 & 10:00-10:30 PM on class nights
Text:
An Introduction to Statistical Learning with Applications in R, by James, Witten, Hastie, and Tibshirani (Springer, 2013),
which can be downloaded at no cost from the website for the book (which is maintained by the authors).
Software:
R can be downloaded at no cost from this web site.
Prerequisite:
STAT 544 & STAT 554, both with a grade of B- or higher
Description:
This course covers methods for regression, classification, and clustering which can be applied to "Big Data" problems. Included are
traditional statistical methods such as ordinary least squares (OLS) regression, logistic regression, and linear discriminant analysis (LDA), which are relatively simple to
understand, as well as more modern computer-intensive methods such as tree-based methods (including bagging, boosting, and random forests), generalized additive models (GAMs),
and support vector machines (SVMs). The basics of how and why each method works will be presented, but getting bogged down in the messy details of the algorithms will
be avoided. Also, overarching principles such as the so-called curse of dimensionality and the bias-variance trade-off will be emphasized, as well as
somewhat general model fitting and selection techniques such as regularization and cross-validation.
In addition to gaining an overall understanding of how the various methods work (from a statistical point of view), successful students will also obtain
experience in applying the methods to real data sets in a wide variety of settings, using the popular software R.
Approximate class-by-class content:
- [1] June 5:
- introduction (to the course, the text's web site, the text's notation, and R (the software to be used))
[Ch. 1 of text, and perhaps the first several pages of Ch. 2]
- [2] June 7:
- some principles of statistical learning; the classification setting and Bayes classifiers
[the rest of Ch. 2 of text]
- [3] June 12:
- ordinary least squares (OLS) regression (model fitting, and related inferences and assessment)
[roughly the first half of Ch. 3 of text]
- [4] June 14:
- more on regression (extensions, dealing with problems)
[continuing Ch. 3 of text, through roughly p. 102]
- [5] June 19:
- more on OLS regression, and a comparison with a nonparametric approach (KNN regression); linear methods for classification (LDA and logistic regression)
[the rest of Ch. 3 of text, and Ch. 4 of text through roughly p. 144]
- [6] June 21:
- more on classification (a comparison of linear methods with more flexible methods (QDA, RDA, and KNN))
[continuing Ch. 4 of text, through roughly p. 154]
- [7] June 26:
- a bit more on classification; resampling methods for assessment (cross-validation and the bootstrap)
[rest of Ch. 4 of text, and Ch. 5 of text]
- [8] June 28:
- linear model selection, regularization, and dimension reduction (subset selection, shrinkage methods, dimension reduction methods)
[Ch. 6 of text, through roughly p. 236]
- [**] July 3:
- no class due to Summer Recess
- [9] July 5:
- more on linear methods, and moving beyond linearity (polynomial regression, splines, local regression, generalized additive models (GAMs))
[rest of Ch. 6 of text, Ch. 7 of text]
- [10] July 10:
- classification and regression trees (CART), bagging
[Ch. 8 of text, through roughly p. 319]
- [11] July 12:
- random forests (RFs), boosting, and multivariate adaptive regression splines (MARS)
[rest of Ch. 8 of text, Ch. 9 of text through roughly p. 343]
- [12] July 17:
- support vector machines (SVMs)
[rest of Ch. 9 of text]
- [13] July 19:
- unsupervised learning (principal components analysis and clustering)
[Ch. 10 of text]
- [14] July 24:
- review and summarization of course (and maybe a few extra topics (not to be covered on the final exam))
[Ch. 10 of text]
- [**] July 26:
- Final Exam (note: exam period is from 7:30 to 10:15 PM)
Note: If any classes are canceled due to power outages (or for any other reason), some of the dates given above may be changed.
Grading:
Additional Comments:
- Put STAT 672 in the subject line when you send me e-mail
(due to spam, I sometimes delete messages without reading them, based
on the subject line).
- I can possibly make arrangements to meet with you outside of my
scheduled hours, with Wednesday evening being perhaps the best time each week.
- See comments on this web page regarding (slightly) late submission of homework assignments, and how to submit papers if you don't bring them to class.
*** After the end of the short grace period, late papers will be
considered only if I get them before I grade the papers of other class
members or post the answers. (I really mean this! And a broken fax machine or being
locked out of the Engineering Building does not change things --- if I don't have your paper by
the end of the grace period, I won't grade it if I've already graded the other papers or have posted the answers.) ***
If you bring your paper by my office and I'm not there, the
best procedure is to put it under my office door (Room 1706, not 1707) and then send me an
e-mail or call and tell me that you dropped off your paper.
You can possibly fax your papers to me at (703) 993-1700. (If you do fax your paper,
please notify me by e-mail or phone so that I can look for your paper.
(The entire department shares the same fax machine.)) I cannot be
responsible for late papers put under my door or faxed if for some
reason I don't get them, but in the past I've never had too many problems
getting papers in these ways (although the new fax machine seems to not work as well as the old fax machine did).
Do not e-mail solutions to me.
- All homework should be on paper which is approximately
8.5 inches by 11 inches. All pages should be stapled in the upper left
hand corner. All answers should be clearly indicated. (You need to
choose one answer for each part. Draw a box around your final
answers or highlight them in some way.) Although for a lot of the software-based parts of the homework I'll specify that you don't have to submit
supporting work, for the relatively few theoretical/mathematical problems that I'll assign, you should show adequate
supporting work and not merely give the final answers.
- You are expected to familiarize yourself with the
George Mason University honor code and abide by it. It is
perfectly okay to seek assistance from others on any of the
homework problems (except for extra credit problems, which may occasionally be assigned),
but you should not turn in any work that is
copied from someone else (and so you should be prepared to explain
your solution to me if asked to do so). (While it's okay to briefly discuss homework problems with other students, you should not look at another student's work
while writing up your homework solutions. Nor should one student explain most every step of his/her solution to another student. And any answers that come from
R output should be based on output that you created yourself.)
It will be considered to be a violation of the honor code if you deviate
from this rule concerning homework or if you give or
receive aid on any of the quizzes or the final exam.
- You are expected to take the final exam during the
designated time slot; Incompletes will
not be granted except under very unusual circumstances.
- Please abide by the university policy that cell phone ringers be
turned off while class is in session.
- Please do not make a lot of noise eating during class --- if you
feel that you must eat during class, please choose a soft candy bar
rather than a bag of potato chips (since both the chips and the bag they
come in tend to make too much noise when eaten and handled).
- If you are a student with a disability and desire academic accommodations, please see me during the first two weeks of classes and
contact the Office of Disability Services (ODS). All academic accommodations must be arranged through the ODS.
- Any class meetings canceled by the university due to
snow, sleet, power outage, bombing,
etc. will be made up if possible.
With regard to bad weather, I will
plan to teach class if the university is open and not teach it if the
university is closed. So instead of calling or e-mailing me to find out if I plan
to have class, just find out if the university is open or closed.
- Caveat: The schedule and procedures described here for this course are subject to change (and it is the responsibility of
students to attend all class meetings and keep themselves informed of
any changes).