**Lectures:** Thursdays 4:30pm - 7:10pm, Acquia Hall 219

If you send email to the instructor, please put "CSI 772" or "STAT 772" in the subject line.

"Statistical learning" refers to the analysis of data with the objective of
identifying patterns or trends. We distinguish *supervised learning,*
in which we seek to predict an outcome measure or class based on a sample
of input measures, from *unsupervised learning,*
in which we seek to identify and describe relationships and patterns among a sample
of input measures. The emphasis is on supervised learning, but
the course addresses the elements of both. It covers essential material
for developing new statistical learning algorithms.

The text is T. Hastie, R. Tibshirani, and J. Friedman (HTF),
*The Elements of Statistical Learning,* second edition,
Springer-Verlag, 2009. ISBN 978-0-387-84857-0.
The website for the text is
http://www-stat.stanford.edu/ElemStatLearn/.

The course organization and content will closely follow that of the text. The text is quite long, however, so some topics will be covered only lightly, and some chapters will be skipped entirely. The main chapters we will cover are 1--4, 7, 9, 10, and 12--15.

The software used in this course is R, a free, open-source software environment that can be downloaded from the Comprehensive R Archive Network (CRAN). It is also available on various GMU computers in student labs.

No prior experience in R is assumed for this course. A good site for getting started with R, especially for people who are somewhat familiar with SAS or SPSS, is Quick R.

Student work in the course (and the relative weighting of this work in the overall grade) will consist of

You are expected to take the final exam during the designated time period.

Incomplete grades will not be granted except under very special circumstances.

Because class time is not sufficient to cover even the most common methods of learning, a student may wish to do a project on methods that are addressed in the text but not covered in class.

The project will require a written report and, depending on available class time, may involve an oral presentation.

Sometimes plagiarism is even accidental, because it results from legitimate note-taking.

Some good guidelines are here:

http://ori.dhhs.gov/education/products/plagiarism/

See especially the entry "26 Guidelines at a Glance".

Supervised learning

General methods of statistics: Decisions, models, linear regression, etc.

The R program.

- Variances of least squares estimators.
- Variable selection in regression: least squares and ridge.
- Model building: partial least squares, lasso, and LAR.
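As a small illustration of the ridge topic above, the following sketch computes the least squares and ridge estimators directly from their closed forms. The simulated data, coefficients, and choice of lambda are assumptions for illustration only, not course materials.

```r
# Hypothetical illustration of ridge shrinkage; data and lambda are assumptions.
set.seed(1)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, 0, -2, 0, 1)
y <- X %*% beta + rnorm(n)

# Ordinary least squares: beta_hat = (X'X)^{-1} X'y
b_ols <- solve(t(X) %*% X, t(X) %*% y)

# Ridge: beta_hat(lambda) = (X'X + lambda I)^{-1} X'y
lambda <- 10
b_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)

# Ridge coefficients are shrunk toward zero relative to least squares
sum(b_ridge^2) < sum(b_ols^2)
```

In practice one would use a package such as glmnet for ridge and lasso fits over a path of lambda values; the closed form above is just to show the shrinkage mechanism.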

- Smoothing, overfitting, bias/variance tradeoff.
- Criteria for comparing models.
- Cp, AIC, BIC, CV, and bootstrap estimation of the prediction error in linear regression models.
- Linear methods for classification: discriminant analysis, linear and quadratic.
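To make the model-comparison criteria concrete, here is a hedged base-R sketch comparing two nested linear models by AIC, BIC, and a simple k-fold cross-validation estimate of prediction error. The mtcars data, the candidate models, and the fold count are assumptions for illustration.

```r
# Compare two nested linear models by AIC, BIC, and 5-fold CV (illustrative).
set.seed(1)
d <- mtcars
m1 <- lm(mpg ~ wt, data = d)
m2 <- lm(mpg ~ wt + hp, data = d)

aic <- c(AIC(m1), AIC(m2))   # smaller is better
bic <- c(BIC(m1), BIC(m2))   # BIC penalizes model size more heavily

# Simple k-fold cross-validation estimate of mean squared prediction error
cv_mse <- function(formula, data, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(data)))
  errs <- sapply(1:k, function(i) {
    fit  <- lm(formula, data = data[folds != i, ])
    pred <- predict(fit, newdata = data[folds == i, ])
    mean((data[[all.vars(formula)[1]]][folds == i] - pred)^2)
  })
  mean(errs)
}
cv <- c(cv_mse(mpg ~ wt, d), cv_mse(mpg ~ wt + hp, d))
```

Unlike AIC and BIC, the CV estimate does not rely on a likelihood-based penalty, which is why the three criteria can disagree on small samples.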

Because the lecture did not cover enough material, this assignment does not need to be turned in.

Linear methods for classification: discriminant analysis and logistic regression.
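As a sketch of these two classifiers, the following fits linear discriminant analysis (MASS::lda) and logistic regression (glm with a binomial family) to two of the iris species; the choice of dataset and predictor variables is an assumption for illustration.

```r
# LDA and logistic regression on two iris species (illustrative choices).
library(MASS)
d <- subset(iris, Species %in% c("versicolor", "virginica"))
d$Species <- factor(d$Species)

# Linear discriminant analysis
fit_lda  <- lda(Species ~ Sepal.Length + Petal.Length, data = d)
pred_lda <- predict(fit_lda)$class

# Logistic regression; classify by thresholding fitted probabilities at 0.5
fit_log  <- glm(Species ~ Sepal.Length + Petal.Length, data = d,
                family = binomial)
pred_log <- ifelse(fitted(fit_log) > 0.5, levels(d$Species)[2],
                   levels(d$Species)[1])

# Training error rates of the two classifiers
err_lda <- mean(pred_lda != d$Species)
err_log <- mean(pred_log != d$Species)
```

Both methods produce linear decision boundaries; they differ in how the boundary is estimated (Gaussian class-conditional assumptions for LDA versus the conditional likelihood for logistic regression).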

Lecture notes.

Discuss project.

Closed book, closed notes, and closed computers except for one sheet (front and back) of prewritten notes.

Weka. Download.

Reference: Ian Witten, Eibe Frank, and Mark Hall (2011), *Data Mining: Practical Machine Learning Tools and Techniques,* third edition, Morgan Kaufmann Publishers. ISBN 978-0-12-374856-0.

In each case, develop a classifier using the training data and determine the error rate in the test data for your classifier. In each case, of course, there are choices you can make.

1. Use your implementation of AdaBoost that uses the tree function in R (Exercise 10.4 in HTF).

2. Use AdaBoost (AdaBoostM1) in Weka and/or RWeka (your choice).

3. Use Random Forests (RandomForest) in Weka and/or RWeka (your choice).

4. Write a brief summary comparing the methods.
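In the spirit of item 1 (Exercise 10.4 in HTF), here is a minimal AdaBoost.M1 sketch following Algorithm 10.1 of the text, using rpart stumps in place of the tree function. The simulated data, number of boosting rounds, and stump settings are assumptions for illustration, not the assignment's specification.

```r
# Minimal AdaBoost.M1 (HTF Algorithm 10.1) with rpart stumps (illustrative).
library(rpart)
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 2), n, 2)
y <- ifelse(x[, 1]^2 + x[, 2]^2 > 2, 1, -1)   # nonlinear class boundary
d <- data.frame(x1 = x[, 1], x2 = x[, 2], y = factor(y))

M <- 25                    # number of boosting rounds
w <- rep(1 / n, n)         # observation weights
alpha  <- numeric(M)
stumps <- vector("list", M)

for (m in 1:M) {
  fit  <- rpart(y ~ x1 + x2, data = d, weights = w,
                control = rpart.control(maxdepth = 1, cp = -1, minsplit = 2))
  pred <- as.numeric(as.character(predict(fit, d, type = "class")))
  err  <- sum(w * (pred != y)) / sum(w)
  err  <- min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0)
  alpha[m] <- log((1 - err) / err)
  w <- w * exp(alpha[m] * (pred != y))      # upweight misclassified points
  w <- w / sum(w)
  stumps[[m]] <- fit
}

# Final classifier: sign of the alpha-weighted vote of the stumps
F <- rowSums(sapply(1:M, function(m) {
  alpha[m] * as.numeric(as.character(predict(stumps[[m]], d, type = "class")))
}))
train_err <- mean(sign(F) != y)
```

For the assignment itself one would of course split into training and test sets and report the test error rate, and compare against the packaged AdaBoostM1 and RandomForest implementations in items 2 and 3.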

Variations on LDA

Prototypes and nearest neighbors

Review/discuss various issues in SVM

Nearest neighbors and unsupervised learning

Closed books, notes, and computers.