Welcome to ISYE 6740

Computational Data Analysis / Machine Learning

Fall 2018

Instructor: James Gentle

Office: B206-B
Office hours: by appointment (email)
especially Tuesday or Thursday morning or early afternoon, or Tuesday after 6:30pm, or
anytime I'm in my office; my door is always open.

Lectures: Generally Tuesdays and Thursdays, although some lectures will be on Fridays.
Lectures begin at 3:10pm from August 20 through September 20, and also on September 27.
Lectures begin at 1:00pm starting September 25, except for September 27.
Lectures beginning at 3:10pm end at 5:00pm, and lectures beginning at 1:00pm end at 3:00pm.

If you send email to the instructor, please put "ISYE 6740" in the subject line.


Course Description

"Machine learning" refers to the use of logical rules of induction and deduction, along with data, to identify salient properties of objects or processes, such as clusters, patterns, or trends. Machine learning is an important part of artificial intelligence, as well as of other areas of data science.

"Statistical learning" is essentially synonymous with machine learning, but the term "statistical" perhaps implies greater emphasis on data. This course will focus on modern methods of statistical data analysis.

We distinguish supervised learning, in which we seek to predict an outcome measure or class based on a sample of input measures, from unsupervised learning, in which we seek to identify and describe relationships and patterns among a sample of input measures. The emphasis in this course is on supervised learning, but the course addresses the elements of both supervised learning and unsupervised learning. It covers essential material for developing new statistical learning algorithms.

Prerequisites

Calculus-level probability and statistics, some general knowledge of applied statistics, and an ability to use a computer to do data analysis. No knowledge of any particular computing system or language is assumed.

Text and other materials

The text is An Introduction to Statistical Learning, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, published by Springer, 2013; ISBN 978-1-4614-7137-0. The website for the text is http://www.StatLearning.com/.

The software used in this course is R, free software that can be downloaded from the Comprehensive R Archive Network (CRAN).

No prior experience in R is assumed for this course. A good site for getting started with R, especially for people who are somewhat familiar with SAS or SPSS, is Quick R.
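
Since no R experience is assumed, here is a tiny illustrative session to give you the flavor; the cars data are built into R, and nothing here is part of an assignment:

    x <- rnorm(100)                 # 100 standard normal variates
    mean(x); sd(x)                  # summary statistics
    hist(x)                         # a histogram
    plot(cars$speed, cars$dist)     # scatter plot of a built-in dataset
    help(plot)                      # documentation for any function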

Lectures

Students are expected to attend class and take notes as they deem appropriate.

Questions from students during lectures are encouraged, and of course questions after class in person or by email are welcome.


Grading

Student work in the course (and the relative weighting of this work in the overall grade) will consist of

  • homework assignments, mostly exercises in the text (20%)
  • project (15%)
  • midterm exam (15%)
  • final exam (50%)

    You are expected to take the exams during the designated time periods.

    Homework

    Each homework will be graded on a 100-point scale. Five points will be deducted for each day the homework is late (weekends count!), and homework more than 5 days late will not be accepted.

    Students may discuss homework, but the work submitted must be the student's own work.

    Project

    Each student must complete a project in the area of statistical learning. The project will involve comparison of classification methods using a dataset from the University of California at Irvine (UCI) Machine Learning Repository.

    Because the available class time is not sufficient to cover even the most common learning methods, a student may wish to do a project involving methods that are addressed in the text but not covered in class.

    The project will require a written report.


    Academic honor

    Each student enrolled in this course must assume the responsibilities of an active participant in the world's scholarly community in which everyone's academic work and behavior are held to the highest standards of honesty.

    Make sure that work that is supposed to be yours is indeed your own.

    With cut-and-paste capabilities on webpages, it is easy to plagiarize.
    Sometimes plagiarism is even accidental, a byproduct of legitimate note-taking.

    Collaborative work

    Students are free to discuss homework problems or other topics with each other or anyone else, and are free to use any reference sources. Group work and discussion outside of class are encouraged, but of course homework solutions must not be copied outright.


    Approximate schedule

    The details of the schedule will evolve as the semester progresses.

    Tuesday, August 21

    Course overview; notation; etc.

    Here is the R code I used to produce the infamous red and green plots in my lecture. I have inserted comments to tell you what I'm doing, but don't try to understand all of the code.
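
    (If you just want to experiment with the general idea on your own, the following sketch, with made-up simulated data, plots two classes in red and green; it is not the code from the lecture.)

        set.seed(1)
        n  <- 100
        x1 <- matrix(rnorm(2*n), n, 2)               # class 1
        x2 <- matrix(rnorm(2*n, mean = 1.5), n, 2)   # class 2, shifted
        plot(rbind(x1, x2), col = rep(c("red", "green"), each = n),
             pch = 19, xlab = "x1", ylab = "x2")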


    Thursday, August 23

    Continue with R; basics of statistical learning

    Complete the questionnaire and email it to me.

    Assignment (due Thursday, Aug 30):

    HW1. Exercises 2.1 and 2.8 in text

    The dataset for Exercise 2.8 is at the website for the book; a PDF of the whole book is also available there.
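
    For example, after downloading the data file for Exercise 2.8 (the College data) into your R working directory, you can read it in with something like the following (the exact file name may differ):

        college <- read.csv("College.csv")   # file from www.StatLearning.com
        rownames(college) <- college[, 1]    # first column holds the college names
        college <- college[, -1]             # drop the name column
        summary(college)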

    Show your R code.

    Please email me your solutions. One PDF file is preferred.
    Solutions, comments
    TeX source


    Tuesday, August 28

    Linear regression

    Thursday, August 30

    Linear regression

    Lecture notes
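
    For quick reference, a minimal illustration of fitting a simple linear regression in R, using the built-in cars data (not one of the homework datasets):

        fit <- lm(dist ~ speed, data = cars)   # stopping distance regressed on speed
        summary(fit)                           # coefficients, standard errors, R^2
        plot(cars$speed, cars$dist)
        abline(fit, col = "red")               # add the fitted line
        predict(fit, newdata = data.frame(speed = 21), interval = "confidence")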

    Assignment (due Friday, Sep 7):

    HW2. Exercises 3.1, 3.8, 3.11, and 3.14 in text

    Show your R code.

    Please email me your solutions in one PDF file.
    Solutions, comments


    Tuesday, September 4

    Classification; logistic regression

    Some stuff about likelihood

    Some notes about R functions for regression
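
    The basic R function for logistic regression is glm with family = binomial. An illustrative sketch on simulated data (the true coefficients 0.5 and 2 are made up for the example):

        set.seed(1)
        x <- rnorm(200)
        p <- 1/(1 + exp(-(0.5 + 2*x)))        # true logistic probabilities
        y <- rbinom(200, 1, p)                # simulated 0/1 responses
        fit <- glm(y ~ x, family = binomial)  # maximum likelihood fit
        summary(fit)$coef                     # estimates should be near 0.5 and 2
        predict(fit, type = "response")[1:5]  # fitted probabilities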


    Thursday, September 6

    Classification; discriminant analysis

    Some stuff about q-q plots

    Lecture notes
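
    For the computations, linear and quadratic discriminant analysis are in the MASS package, which ships with R. A sketch using the built-in iris data (illustrative only):

        library(MASS)
        fit  <- lda(Species ~ ., data = iris)   # linear discriminant analysis
        pred <- predict(fit, iris)$class
        table(pred, iris$Species)               # confusion matrix (training data)
        qfit <- qda(Species ~ ., data = iris)   # quadratic discriminant analysis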

    Assignment (due Tuesday, Sep 11):

    HW3. Exercises 4.2, 4.3, 4.4(a),(b),(c), and 4.10 in text

    Show your R code.

    Please email me your solutions in one PDF file.
    Solutions, comments


    Friday, September 7

    Discriminant analysis
    Resampling methods
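
    Cross-validation and the bootstrap can be done with the boot package, which ships with R. A sketch for a regression model (the cars data here are just for illustration):

        library(boot)
        fit <- glm(dist ~ speed, data = cars)   # fit with glm so cv.glm applies
        cv.glm(cars, fit)$delta                 # leave-one-out CV estimate of test MSE
        cv.glm(cars, fit, K = 10)$delta         # 10-fold cross-validation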

    Tuesday, September 11

    Assignment (due Tuesday, Sep 18):

    HW4. Exercises 5.2, 5.3, 5.4, and 5.5 in text

    Show your R code.

    Please email me your solutions in one PDF file.
    Solutions, comments


    Thursday, September 13

    Exam (through Chapter 4; lecture of September 7)
    Kinds of questions to expect
    Solutions, comments


    Tuesday, September 18

    Variable selection (Chapter 6 in text)
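
    One standard R tool for best-subset and stepwise selection is regsubsets in the leaps package (install it first). A sketch on the built-in swiss data, chosen only for illustration:

        library(leaps)
        fit <- regsubsets(Fertility ~ ., data = swiss)   # best subsets of predictors
        summary(fit)$adjr2                               # adjusted R^2 by model size
        plot(fit, scale = "Cp")                          # models ranked by Mallows' Cp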

    Assignment (due Tuesday, Sep 25):

    HW5. Exercises 6.3, 6.4, 6.5, and 6.8 in text

    Show your R code.

    Please email me your solutions in one PDF file.
    Solutions, comments


    Thursday, September 20

    Shrinkage; dimension reduction
    Lecture notes

    Further comments on ridge and lasso
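
    Ridge and lasso fits are usually done in R with the glmnet package (install it first). An illustrative sketch, again on the built-in swiss data:

        library(glmnet)
        x <- as.matrix(swiss[, -1])        # predictor matrix
        y <- swiss$Fertility
        ridge <- glmnet(x, y, alpha = 0)   # alpha = 0 gives ridge regression
        lasso <- glmnet(x, y, alpha = 1)   # alpha = 1 gives the lasso
        cv <- cv.glmnet(x, y, alpha = 1)   # choose lambda by cross-validation
        coef(cv, s = "lambda.min")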

    Tuesday, September 25 1:00pm

    Nonparametric smoothing (Chapter 7 in text)

    Lecture notes
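
    The basic smoothers of Chapter 7 are available in base R; a sketch (the cars data are for illustration only):

        plot(cars$speed, cars$dist)
        ss <- smooth.spline(cars$speed, cars$dist)   # smoothing spline, df chosen by GCV
        lines(ss, col = "red")
        lo <- loess(dist ~ speed, data = cars)       # local regression
        lines(cars$speed, predict(lo), col = "blue")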

    Assignment (due Tuesday, Oct 9):

    HW6. Exercises 7.1, 7.5, and 7.6 in text

    Show your R code.

    Please email me your solutions in one PDF file.


    Thursday, September 27 3:10pm

    Nonparametric smoothing

    October 1–7: National Holiday



    Tuesday, October 9 1:00pm

    Classification trees (Chapter 8 in text)
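
    Classification trees can be grown with the tree package, the one the text's labs use (install it first). A sketch on the built-in iris data:

        library(tree)
        iris.tr <- tree(Species ~ ., data = iris)   # grow a classification tree
        plot(iris.tr); text(iris.tr)                # draw and label the tree
        cv.tree(iris.tr, FUN = prune.misclass)      # cost-complexity pruning by CV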

    Assignment (due Thursday, Oct 18):

    HW7. Exercises 8.2, 8.3, 8.4, and 8.8 in text

    Show your R code.

    Please email me your solutions in one PDF file.


    Thursday, October 11

    Classification trees
    Lecture notes

    Tuesday, October 16

    More on classification trees; bagging, random forests, boosting
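
    Bagging and random forests are both available through the randomForest package (install it first); setting mtry equal to the number of predictors gives bagging. A sketch on the iris data:

        library(randomForest)
        set.seed(1)
        bag <- randomForest(Species ~ ., data = iris, mtry = 4)   # mtry = p: bagging
        rf  <- randomForest(Species ~ ., data = iris)             # default mtry: random forest
        rf$confusion                                              # out-of-bag confusion matrix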

    Support vector machines (Chapter 9 in text)
    Lecture notes

    Assignment (due Thursday, Oct 25):

    HW8. Exercises 9.1, 9.2, 9.3, and 9.7 in text

    Show your R code.

    Please email me your solutions in one PDF file.


    Thursday, October 18

    Support vector machines
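
    In R, support vector machines are available through the e1071 package, an interface to libsvm (install it first). An illustrative sketch on the iris data:

        library(e1071)
        fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
        table(predict(fit, iris), iris$Species)          # training confusion matrix
        tuned <- tune(svm, Species ~ ., data = iris,
                      ranges = list(cost = c(0.1, 1, 10)))   # CV over the cost parameter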

    Tuesday, October 23

    Principal components analysis (Chapter 10 in text)
    Lecture notes

    Some stuff about linear algebra relevant to PCA
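
    In R, principal components are computed by prcomp (or princomp). A sketch using the USArrests data, as in the text's lab:

        pr <- prcomp(USArrests, scale. = TRUE)   # center and scale the variables
        pr$rotation                              # loadings (eigenvectors)
        summary(pr)                              # proportion of variance explained
        biplot(pr)                               # scores and loadings together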

    Assignment (due Tuesday, Oct 30):

    HW9. Exercises 10.2, 10.3, 10.6, and 10.10 in text

    Show your R code.

    Please email me your solutions in one PDF file.


    Thursday, October 25

    Nearest neighbors
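
    The k-nearest-neighbor classifier is knn in the class package, which ships with R. A sketch with a random split of the iris data (the split and the choice k = 3 are arbitrary):

        library(class)
        set.seed(1)
        train <- sample(150, 100)                    # indices of a training set
        pred  <- knn(iris[train, 1:4], iris[-train, 1:4],
                     iris$Species[train], k = 3)     # 3-nearest-neighbor predictions
        table(pred, iris$Species[-train])            # test-set confusion matrix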

    Friday, October 26

    Hierarchical clustering
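
    Hierarchical clustering is in base R via hclust; a sketch on the scaled USArrests data (the linkage and number of clusters are arbitrary choices):

        d  <- dist(scale(USArrests))           # Euclidean distances on scaled data
        hc <- hclust(d, method = "complete")   # complete linkage
        plot(hc)                               # dendrogram
        cutree(hc, k = 4)                      # cut into 4 clusters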

    Review presentations (not necessarily in this order):

    Other topics:

  • Variable selection in LDA
  • Example of estimation of MSE in cubic spline regression.
  • Leverage of residuals in global models and in local models; examples.

    An old exam


    Tuesday, October 30

    Miscellaneous topics: artificial neural nets
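
    A single-hidden-layer network can be fit with the nnet package, which ships with R. A sketch (the number of hidden units and the decay parameter are arbitrary choices for illustration):

        library(nnet)
        set.seed(1)
        fit <- nnet(Species ~ ., data = iris, size = 4, decay = 0.1,
                    maxit = 500, trace = FALSE)   # 4 hidden units, weight decay
        table(predict(fit, iris, type = "class"), iris$Species)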

    Continue with review presentations.


    Saturday, November 3, 2:40pm, Room A404
    Exam
    (Comprehensive)


    Friday, November 23
    Project due

    There are no strict guidelines for the written report. I would guess it should be anywhere from 10 to 30 pages, depending on the number of graphical displays. You should write it like a "research paper" for a journal (although there may be no original results).

    You should describe the dataset and state your objectives in analyzing it. You can also describe previous work on the dataset.

    Then briefly describe the methods that you used. You should use at least two learning methods.

    Describe how you proceeded with your analysis. You may want to describe your program(s) and show some code. (You do not have to use R.)

    Describe your results and conclusions. There are two types of results for your project.
    One has to do with the analysis itself: what do your results show about the data? (These address the objectives in analyzing this particular dataset.)
    The other type of result concerns the relative performance of the methods you used. Which performed better? Was there any particular characteristic of the data you analyzed that might make a particular method perform better? How would you expect the methods you used to perform on similar learning problems?

    Include any other discussion, conclusions, and references, as relevant.