Welcome to CSI 772 / STAT 772

Statistical Learning

Spring, 2013

Instructor: James Gentle

Lectures: Thursdays 4:30pm - 7:10pm, Acquia Hall 219

If you send email to the instructor, please put "CSI 772" or "STAT 772" in the subject line.

Course Description

``Statistical learning'' refers to analysis of data with the objective of identifying patterns or trends. We distinguish supervised learning, in which we seek to predict an outcome measure or class based on a sample of input measures, from unsupervised learning, in which we seek to identify and describe relationships and patterns among a sample of input measures. The emphasis is on supervised learning, but the course addresses the elements of both supervised learning and unsupervised learning. It covers essential material for developing new statistical learning algorithms.

Prerequisites

Calculus-level probability and statistics, such as in CSI 672/STAT 652, and some general knowledge of applied statistics.

Text and other materials

The text is T. Hastie, R. Tibshirani, and J. Friedman (HTF) The Elements of Statistical Learning, second edition, Springer-Verlag, 2006. ISBN 978-0-387-84857-0. The website for the text is http://www-stat.stanford.edu/ElemStatLearn/.

The course organization and content will closely follow that of the text. The text is quite long, however, and so some topics will be covered very lightly, and some whole chapters will be skipped completely. The main chapters we will cover are 1--4, 7, 9, 10, and 12--15.

The software used in this course is R, which is a freeware package that can be downloaded from the Comprehensive R Archive Network (CRAN). It is also available on various GMU computers in student labs.

No prior experience in R is assumed for this course. A good site for getting started with R, especially for people who are somewhat familiar with SAS or SPSS, is Quick R.

Lectures

Students are expected to attend class and take notes as they see appropriate. Lecture notes and slides used in the lectures will usually not be posted.

Grading

Student work in the course (and the relative weighting of this work in the overall grade) will consist of

homework assignments, mostly exercises in the text (15)

project (15)

midterm exam (30)

final exam (40)

You are expected to take the final exam during the designated time period.

Incomplete grades will not be granted except under very special circumstances.

Homework

Each homework will be graded based on 100 points, and 5 points will be deducted for each day that the homework is late, and will not be accepted if more than 5 days late (weekends count!). Start each problem on a new sheet of paper and label it clearly. Homework will not be accepted as computer files (and certainly not as faxes!); it must be submitted on paper.

Project

Each student must complete a project in the area of statistical learning. The project will involve comparison of classification methods using a dataset from the University of California at Irvine (UCI) Machine Learning Repository.

Because the available time for the class is not sufficient to cover all of even the most common methods of learning, a student may wish to do a project involving methods addressed in the text, but which are not covered in class.

The project will require a written report and, depending on available class time, may involve an oral presentation.

Academic honor

Each student enrolled in this course must assume the responsibilities of an active participant in GMU's scholarly community in which everyone's academic work and behavior are held to the highest standards of honesty. The GMU policy on academic conduct will be followed in this course.

Make sure that work that is supposed to be yours is indeed your own

With cut-and-paste capabilities on webpages, it is easy to plagarize.
Sometimes it is even accidental, because it results from legitimate note-taking.

Some good guidelines are here:
http://ori.dhhs.gov/education/products/plagiarism/
See especially the entry "26 Guidelines at a Glance".

Collaborative work

Students are free to discuss homework problems or other topics with each other or anyone else, and are free to use any reference sources. Group work and discussion outside of class is encouraged, but of course explicit copying of homework solutions should not be done.

Approximate schedule

The details of the schedule will evolve as the semester progresses.

Week 1, January 24

Course overview; notation; etc.
Supervised learning
General methods of statistics: Decisions, models, linear regression, etc.
The R program.
Assignment 1, due January 31: In HTF exercises 2.1, 2.4, and 2.7, and Exercise 1.A.

Week 2, January 31

Basic properties of random variables and probability. Linear regression.
Assignment 2, due February 7: In HTF exercises 2.8, 3.1, 3.2, 3.4, 3.5, 3.6, and 3.7.

Week 3, February 7

Linear regression

variances of least squares estimators.
variable selection in regression: least squares and ridge.
model building: partial least squares, lasso, and LAR.

Linear classification in R; the "vowel data''.
Assignment 3, due February 14: In HTF: exercises 3.9, 3.11, 3.19, 3.23, and 3.27.

Week 4, February 14

Smoothing, overfitting, bias/variance tradeoff.
Criteria for comparing models.
Cp, AIC, BIC, CV, and bootstrap estimation of the prediction error in linear regression models.
Linear methods for classification: discriminant analysis, linear and quadratic.

Assignment 4, due February 21: In HTF: exercises 4.2(a), 4.3, 4.5, 4.6(a), and 7.9.
Because the lecture did not cover enough material, this assignment will not be turned in.

Week 5, February 21

Discuss previous assignments.
Linear methods for classification: discriminant analysis and logistic regression.
Lecture notes.
Assignment 5, due February 28: Access the "vowel data" and develop a classifier from the training data using (1) linear regression (2) logistic regression (3) qda For each classifier, determine the error rate in the test data.

Week 6, February 28

Linear methods for classification: review and miscellaneous topics.
Discuss project.
Project preliminary assignment, due March 21: Pick out two datasets in the UCI repository that are appropriate for classification. For each, give the name of the dataset, a one or two sentence general description, the list of variables and their types, and the actual values of the first observation.

Week 7, March 7

Midterm: mostly Chapters 1 through 4 and 7 in HTF.
Closed book, closed notes, and closed computers except for one sheet (front and back) of prewritten notes.

March 14

Class does not meet.

Week 8, March 21

Additive models and trees
Assignment 6, due March 28: In HTF, read Sections 9.1-9.3. Use a classification tree on the "vowel data" using the training data. (There are different choices you can make for your tree -- any are acceptable, but describe what you do.) Determine the error rate in the test data for your fitted tree.

Week 9, March 28

PRIM, MARS, HME, boosting
Assignment 7, due April 4: In HTF exercise 9.2, 9.5(a)(b)(c)(d) for a regression tree, 10.4(a)(b) using a tree function in R

Week 10, April 4

More on trees: boosting, random forests
Weka. Download.
Reference: Ian Witten, Eibe Frank and Mark Hall (2011) Data Mining: Practical Machine Learning Tools and Techniques , third edition, Morgan Kaufmann Publishers (ISBN: 978-0-12-374856-0)
Assignment 8, due April 11: Analyze the "vowel data".
In each case, develop a classifier using the training data and determine the error rate in the test data for your classifier. In each case, of course, there are choices you can make.
1. Use your implementation of AdaBoost that uses the tree function in R (Exercise 10.4 in HTF).
2. Use AdaBoost (AdaBoostM1) in Weka amd/or RWeka (your choice).
3. Use Random Forests (RandomForest) in Weka amd/or RWeka (your choice).
4. Write a brief summary comparing the methods.

Week 11, April 11

Support vector machines
Variations on LDA
Prototypes and nearest neighbors
Assignment 9, due April 18: In HTF, read/skim Chapters 12 and 13. Work Exercises 12.1, 12.4(a), 12.9, 13.3.

Week 12, April 18

Review/discuss various issues in boosting
Review/discuss various issues in SVM
Nearest neighbors and unsupervised learning
Assignment 10, due April 25: In HTF, read/skim Chapter 14. Work Exercise 14.1.

Week 13, April 25

Unsupervised learning

Week 14, May 2

Projects due. We may spend some time in class discussing them.

May 9

4:30pm - 7:15pm Final Exam.
Closed books, notes, and computers.