Welcome to CSI 772
Statistical Learning
Spring, 2015
Instructor:
James Gentle
Lectures: Thursdays 4:30pm - 7:10pm, Planetary Hall 220
If you send email to the instructor,
please put "CSI 772" in the subject line.
Course Description
"Statistical learning" refers to analysis of data with the objective of
identifying patterns or trends. We distinguish supervised learning,
in which we seek to predict an outcome measure or class based on a sample
of input measures, from unsupervised learning,
in which we seek to identify and describe relationships and patterns among a sample
of input measures. The emphasis is on supervised learning, but
the course addresses the elements of both supervised learning and unsupervised
learning. It covers essential material for developing new statistical
learning algorithms.
Prerequisites
Calculus-level probability and statistics, such as in CSI 672, and
some general knowledge of applied statistics.
Text and other materials
The text is
An Introduction to Statistical Learning,
by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, published by
Springer-Verlag, 2013. ISBN 978-1-4614-7137-0.
The website for the text is
http://www.StatLearning.com/.
The software used in this course is R, which is a free, open-source package that can be
downloaded from the
Comprehensive R Archive Network (CRAN).
It is also available on various GMU computers in student labs.
No prior experience in R is assumed for this course.
A good site for getting started with R, especially for people who are somewhat
familiar with SAS or SPSS, is
Quick R.
The main R packages that we will use are ISLR and MASS.
Lectures
Students are expected to attend class and take notes as they see appropriate.
Lecture notes and slides used in the lectures will usually not be posted.
Grading
Student work in the course, with its relative weight in the overall grade
shown in parentheses, will consist of
homework assignments, mostly exercises in the text (15%)
project (15%)
midterm exam (30%)
final exam (40%)
You are expected to take the final exam during the designated time period.
Incomplete grades will not be granted except under very special circumstances.
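As an illustration of how the weights listed above combine, here is a minimal sketch. It is written in Python purely for illustration (the course software is R), and it assumes that the weights 15, 15, 30, and 40 are percentages and that each component is scored on a 0-100 scale. The names WEIGHTS and overall_score are hypothetical, and this is not the instructor's official grade computation.

```python
# Weights taken from the Grading section; treating them as percentages
# is an assumption (they sum to 100).
WEIGHTS = {"homework": 15, "project": 15, "midterm": 30, "final": 40}


def overall_score(scores):
    """Return the weighted average of component scores.

    scores maps each component name in WEIGHTS to a score on a
    0-100 scale (an assumed convention, not stated in the syllabus).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


if __name__ == "__main__":
    example = {"homework": 90, "project": 80, "midterm": 70, "final": 85}
    print(overall_score(example))
```

Under these assumptions, the final exam counts for as much as the homework, the project, and a third of the midterm combined.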
Homework
Each homework will be graded on a 100-point scale. Five points will be deducted
for each day that the homework is late, and homework will not be accepted if it is
more than 5 days late (weekends count!).
Start each problem on a new sheet of paper and label it clearly.
Homework will not be accepted as computer files (and certainly not as
faxes!); it must be submitted on
paper.
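The late-penalty rule above can be sketched as a small function. It is written in Python purely for illustration (the course software is R). The function name late_adjusted_score, the use of None to mean "not accepted", and the floor at zero points are all assumptions; the syllabus states only the 5-point daily deduction and the 5-day cutoff.

```python
def late_adjusted_score(raw_score, days_late, penalty_per_day=5, max_days_late=5):
    """Return the score after the late deduction, or None if not accepted.

    raw_score is the score out of 100 before any deduction.
    days_late counts calendar days, weekends included.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    if days_late > max_days_late:
        # More than 5 days late: the homework is not accepted.
        return None
    # Assumption: the recorded score cannot drop below zero.
    return max(0, raw_score - penalty_per_day * days_late)


if __name__ == "__main__":
    for days in (0, 2, 5, 6):
        print(days, late_adjusted_score(90, days))
```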
Project
Each student must complete a project in the area of statistical learning.
The project will involve comparison of classification methods using
a dataset from the
University of California at Irvine (UCI) Machine Learning Repository.
Because the available class time is not sufficient to cover even all of
the most common methods of learning, a student may wish to do a project
involving methods that are addressed in the
text but not covered in class.
The project will require a written report and an oral presentation.
More details are
here.
Academic honor
Each student enrolled in this course must assume the
responsibilities of an active participant in GMU's scholarly
community in which everyone's academic work and behavior are
held to the highest standards of honesty. The GMU policy on
academic conduct will be followed in this course.
Make sure that work that is supposed to be yours is indeed your own.
With cut-and-paste capabilities on webpages, it is easy to plagiarize.
Sometimes it is even accidental, because it results from legitimate note-taking.
Some good guidelines are here:
http://ori.dhhs.gov/education/products/plagiarism/
See especially the entry "26 Guidelines at a Glance".
Collaborative work
Students are free to discuss homework problems or other topics
with each other or anyone else, and are
free to use any reference sources. Group work and discussion outside of
class are encouraged, but of course homework solutions
should not be copied outright.
Approximate schedule
The details of the schedule will evolve as the semester progresses.
Week 1, January 22
Course overview; notation; etc.
Supervised learning
General methods of statistics: Decisions, models, linear regression, etc.
The R program.
Assignment 1, due January 29:
In ISL exercises 2.1 and 2.8, and
Supplemental Exercise 1.
Week 2, January 29
Basic properties of random variables and probability.
Linear regression.
Assignment 2, due February 5:
In ISL exercises 3.1, 3.2, 3.3, and 3.8.
Week 3, February 5
- Quick summary of Chapter 3
- More details on linear regression (pick up at slide 28)
- variances of least squares estimators.
- variable selection in regression: least squares and ridge.
- model building: partial least squares, lasso, and LAR.
Assignment 3, due February 12:
In ISL exercises 3.5, 3.7, 3.9, and 3.14.
Week 4, February 12
Classification
Assignment 4, due February 19: In ISL, exercises 4.1 and 4.10 (a) and (b).
(I decided not to include anything extra on separating hyperplanes; it will come
up again later.)
Week 5, February 19
Discuss previous
assignments.
Linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA).
Comparisons of logistic regression, LDA, QDA, and KNN.
Assignment 5, due February 26: In ISL, exercises 4.2, 4.3, 4.4, 4.5,
and 4.10 (c), (d), (e), (f), (g) and (h).
Week 6, February 26
Methods for modeling and classification: review and miscellaneous topics.
Discuss
exercises in Chapter 4.
Properties of high dimensional spaces.
Discuss project.
Project preliminary assignment, due March 26 (originally March 19):
Pick out two datasets in the
UCI repository that are appropriate for classification. For each, give the
name of the dataset, a one or two sentence general description, the list of
variables and their types, and the actual values of
the first observation.
Week 7, March 5
Midterm: mostly Chapters 2, 3, and 4 in ISL, and material on linear
models and operations covered in notes.
Closed book, closed notes, and closed computers except for one sheet (front and back) of
prewritten notes.
March 12
Class does not meet.
Week 7, March 19
Midterm: mostly Chapters 2, 3, and 4 in ISL, and material on linear
models and operations covered in notes.
Closed book, closed notes, and closed computers except for one sheet (front and back) of
prewritten notes.
Week 8, March 26
Assignment 6, due April 2: In ISL, exercises 5.2, 5.3, 5.4, and 5.5.
Week 9, April 2
Linear model selection and regularization.
Assignment 7, due April 9: In ISL, exercises 6.3, 6.4, 6.5, and 6.8.
Week 10, April 9
Chapter 8: Tree-Based Methods.
Assignment 8, due April 16: In ISL, exercises 8.2, 8.3, 8.4, 8.5, and 8.8.
Week 11, April 16
Chapter 9: Support Vector Machines.
Assignment 9, due April 23: In ISL, exercises 9.1, 9.2, 9.3, and 9.5.
Week 12, April 23
Unsupervised Learning.
Assignment 10, due April 30: Read Chapter 10 in ISL,
and then work exercises 10.1, 10.2, and 10.4.
Week 13, April 30
Projects due.
Presentations of projects.
Week 14, Wednesday, May 6
10.1(a)
Presentations of projects.
Review.
May 7
4:30pm - 7:15pm Final Exam.
Closed book, closed notes, and closed computers except for one sheet (front and back) of
prewritten notes.