George Mason University
Volgenau School of Information Technology and Engineering
Department of Statistics
STAT 472 Introduction to Statistical Learning
Spring Semester, 2020
Section 001
Tuesdays from 7:20 to 10:00 PM (starting January 21, other dates given below)
Location: Room 1108 of the Nguyen Engineering Building
Office Hours: 6:15-7:00 & 10:00-10:30 PM
on class nights
(more information)
Text:
An Introduction to Statistical Learning with Applications in R,
by James, Witten, Hastie, and Tibshirani
(Springer, 2013),
which can be downloaded at no cost from
the website for the book (which is maintained by the authors).
Software:
R
can be downloaded at no cost from
this web site.
Prerequisite:
STAT 456, with a grade of C or higher
Description:
This course covers methods for regression, classification, and clustering which can be applied to "Big Data" problems. Included are
traditional statistical methods such as ordinary least squares (OLS) regression, logistic regression, and linear discriminant analysis (LDA), which are relatively simple to
understand, as well as more modern computer-intensive methods such as tree-based methods (including bagging, boosting, and random forests), generalized additive models (GAMs),
and support vector machines (SVMs). The basics of how and why each method works will be presented, but getting bogged down in the messy details of the algorithms will
be avoided. Also, overarching principles such as the so-called curse of dimensionality and the bias-variance trade-off will be emphasized, as well as
somewhat general model fitting and selection techniques such as regularization and cross-validation.
In addition to gaining an overall understanding of how the various methods work (from a statistical point of view), successful students will also obtain
experience in applying the methods to real data sets in a wide variety of settings, using the popular software R.
Approximate class-by-class content:
- [1] Jan. 21:
- introduction (to the course, the text's web site, the text's notation, and R (the software to be used))
[Ch. 1 of text, and the first several pages of Ch. 2]
- [2] Jan. 28:
- some principles of statistical learning; the classification setting and Bayes classifiers
[the rest of Ch. 2 of text]
- [3] Feb. 4:
- the classification setting; ordinary least squares (OLS) regression (model fitting, and related inferences and assessment)
[Ch. 3 of text through roughly p. 77]
- [4] Feb. 11:
- more on regression (extensions, dealing with problems)
[continuing Ch. 3 of text, through roughly p. 96]
- [5] Feb. 18:
- more on OLS regression,
and a comparison with a nonparametric approach (KNN regression);
linear methods for classification (LDA and logistic regression)
[the rest of Ch. 3 of text, and Ch. 4 of text through roughly p. 141]
- [6] Feb. 25:
- more on classification (a comparison of linear methods with more flexible methods (QDA, RDA, and KNN))
[continuing Ch. 4 of text, through roughly p. 163]
- [7] Mar. 3:
- a bit more on classification; resampling methods for assessment (cross-validation and the bootstrap)
[rest of Ch. 4 of text, and Ch. 5 of text]
- [**] Mar. 10:
- no class due to Spring Break
- [8] Mar. 17:
- linear model selection, regularization, and dimension reduction (subset selection, shrinkage methods, dimension reduction methods)
[Ch. 6 of text, through roughly p. 236]
- [9] Mar. 24:
- more on linear methods, and moving beyond linearity (polynomial regression, splines, local regression)
[rest of Ch. 6 of text, and Ch. 7 of text through roughly p. 282]
- [10] Mar. 31:
- generalized additive models (GAMs);
classification and regression trees (CART); bagging
[rest of Ch. 7 of text, and Ch. 8 of text, through roughly p. 319]
- [11] Apr. 7:
- more on bagging, random forests (RFs); boosting; multivariate adaptive regression splines (MARS); maximal margin classifiers, support vector classifiers
[rest of Ch. 8 of text, and Ch. 9 of text through roughly p. 348]
- [12] Apr. 14:
- support vector machines (SVMs);
unsupervised learning (principal components analysis)
[rest of Ch. 9 of text, and the first part of Ch. 10 of text (principal components analysis)]
- [13] Apr. 21:
- clustering
[rest of Ch. 10 of text]
- [14] Apr. 28:
- review and summary of course
- [**] May 12:
- Final Exam (note: exam period is from 7:30 to 10:15 PM)
Note: If any classes are canceled due to weather, power outages, or any other reason, some of the dates given above may be changed. Also, you should
consider the above schedule to be approximate ... this is a relatively new course and there is some uncertainty associated with the class-by-class schedule
of a course when I'm teaching it for only the third time.
Grading:
- 40% for homework assignments
- 30% for quizzes (some closed book, and some open book and notes); only your best 10 of 13 quiz scores will be counted
- 30% for the open book (and notes) final exam
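The weighting above can be sketched as a short calculation. This is only an illustration, not an official grading formula: the function name `course_percentage` is hypothetical, all scores are assumed to be on a 0-100 scale, and the quiz component is assumed to be the simple average of the best 10 quiz scores.

```python
def course_percentage(homework_pct, quiz_scores, final_pct, quizzes_counted=10):
    """Combine the three components using the 40/30/30 weights.

    Assumptions (not stated in the syllabus): every score is a percentage
    from 0 to 100, and the quiz component is the plain average of the
    best `quizzes_counted` quiz scores.
    """
    if not quiz_scores:
        raise ValueError("at least one quiz score is required")
    # Keep only the highest scores; the lowest ones are dropped.
    best = sorted(quiz_scores, reverse=True)[:quizzes_counted]
    quiz_pct = sum(best) / len(best)
    return 0.40 * homework_pct + 0.30 * quiz_pct + 0.30 * final_pct


# Hypothetical student: 13 quizzes, two of them missed (scored 0).
quizzes = [100, 95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 0, 0]
overall = course_percentage(homework_pct=90, quiz_scores=quizzes, final_pct=80)
print(f"{overall:.2f}")  # prints 83.25
```

In this made-up example the two missed quizzes and the score of 50 are dropped, so the quiz component is the average of the remaining ten scores (77.5).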
Additional Comments:
- Put STAT 472 in the subject line when you send me e-mail
(due to spam, I sometimes delete messages without reading them, based
on the subject line).
- Be sure to note that there is not a class meeting scheduled for March 10 (due to Spring
Break). However, if any class meetings are canceled prior to Spring Break (perhaps due to bad weather),
it could be that the Tuesday of Spring Break will be used to make up for the missed class.
(But if only one class
is canceled prior to Spring Break, we may just wait to make it up on May 5, in the week between the last regular lecture and the final exam;
May 5 is currently scheduled to be a "Reading Day" with no class meeting.)
- I can possibly
make arrangements to meet with you outside of my
scheduled hours; however,
on Tuesdays I do not like to be
bothered from 7:00 to 7:20, and on Mondays and Wednesdays I'm often tied up with my other classes until 10:30 PM or so.
This semester, I suspect that Thursday and Friday afternoons will be good times to meet outside of my regular office hours.
Also, I am
willing to stay in the classroom and assist people each Tuesday after
class.
- Please do not leave long messages on my voice-mail,
and since I often don't get around to returning calls until the evening,
you should state what time you plan to go to sleep. (On Mondays and Wednesdays, I'm usually completely tied up with my other classes between 6:15 and 10:30 PM.)
Always leave your
phone number, speaking slowly, even though you might have
given it to me previously. I find it better to communicate with people
in person or via e-mail --- phone tag is frustrating and sometimes the
GMU voice-mail system doesn't work the way it is supposed to.
- See comments on this web page regarding (slightly) late submission of homework assignments, and how to submit papers if you don't bring them to class.
*** After the end of the short grace period, late papers will be
considered only if I get them before I grade the papers of other class
members or post the answers. (I really mean this! And 4 feet of snow,
a broken fax machine, or being locked out of the Engineering Building does not change things: if I don't have your paper by
the end of the grace period, I won't grade it if I've already graded the other papers or have posted the answers.)
***
If you bring your paper by my office and I'm not there, the
best procedure is to put it under my office door (Room 1706, not 1707) and then send me an
e-mail or call and tell me that you dropped off your paper.
You can
possibly fax
your papers to me at (703) 993-1700. (If you do fax your paper,
please notify me by e-mail or phone so that I can look for your paper.
(The entire department shares the same fax machine.)) I cannot be
responsible for late papers put under my door or faxed if for some
reason I don't get them, but in the past I've never had too many problems
getting papers in these ways (although the new fax machine seems to not work as well as the old fax machine did).
Do not e-mail solutions to me.
- All homework should be on paper which is approximately
8.5 inches by 11 inches. All pages should be stapled in the upper left
hand corner. All answers should be clearly indicated. (You need to
choose one answer for each part. Draw a box around your final
answers or highlight them in some way.) Although for a lot of the software-based parts of the homework I'll specify that you don't have to submit
supporting work, for the relatively few theoretical/mathematical problems that I'll assign, you should show adequate
supporting work and not merely give the final answers.
- You are expected to familiarize yourself with the
George Mason University honor code and abide by it. It is
perfectly okay to seek assistance from others on any of the
homework problems (except for extra credit problems, which may occasionally be assigned),
but you should not turn in any work that is
copied from someone else (and so you should be prepared to explain
your solution to me if asked to do so). (While it's okay to briefly discuss homework problems with other students, you should not look at another student's work
while writing up your homework solutions. Nor should one student explain nearly every step of his/her solution to another student. And any answers that come from
R output should be based on output that you created yourself.)
It
will be considered to be a violation of the honor code if you deviate
from this rule concerning homework or if you give or
receive aid on any of the quizzes or the final exam.
- You are expected to take the final exam during the
designated time slot. Incompletes will
not be granted except under very unusual circumstances.
- Please abide by the university policy that cell phone ringers be
turned off while class is in session.
- Please do not make a lot of noise eating during class --- if you
feel that you must eat during class, please choose a soft candy bar
rather than a bag of potato chips (since both the chips and the bag they
come in tend to make too much noise when eaten and handled).
- If you are a student with a disability and desire academic accommodations, please see me during the first two weeks of classes and
contact the
Office of Disability Services (ODS). All academic accommodations must be arranged through the ODS.
- Any class meetings canceled by the university due to
snow, sleet, power outage, bombing,
etc. will be made up if possible.
With regard to bad weather, I will
plan to teach class if the university is open and not teach it if the
university is closed. So instead of calling or e-mailing me to find out if I plan
to have class, just find out if the university is open or closed.
- Caveat: The schedule and procedures described here for this course are subject to change (and it is the responsibility of
students to attend all class meetings and keep themselves informed of
any changes).