Welcome to CSI 771 / STAT 751

Computational Statistics

Fall, 2011

Instructor: James Gentle

Lectures: Wednesday, 4:30-7:10pm, Research Building, room 301

Some of the lectures will be based on notes posted on this website. Some lectures will be accompanied only by notes written on the board. This course is about modern, computationally intensive methods in statistics. It emphasizes the role of computation as a fundamental tool for discovery in data analysis, for statistical inference, and for the development of statistical theory and methods.


Topics

The general description of the course is available at mason.gmu.edu/~jgentle/csi771/

Prerequisites:

  • a course in applied statistics such as STAT 554
  • a course in statistical inference such as CSI 672 / STAT 652.

    Text: Elements of Computational Statistics, ISBN 978-1441930248.

    List of probability distributions.

    Computational Software:

    The main computational software that I use is R.

    R is open source and free. It is installed on some GMU computers, but binary executables for various platforms are available at the main R website, and it is best to install it on your own computer.

    A good way to learn R is just to use it for progressively more complicated problems. While there are many books on R, the various PDF manuals that come with the installation (use "Help" on the GUI) should be sufficient.

    Document Development Software:

    The main document development software that I use is TeX.

    TeX is a trademark of the American Mathematical Society. The software is free. There are various implementations, and it is installed on some GMU computers. One version is MiKTeX, which can be downloaded, and it is best to install it on your own computer.

    There are many books on TeX, but a good way to learn TeX is just to use it for progressively more complicated writing problems.

    Email Communication

    The primary means of communication outside of class will be by email.

    Students must use their Mason email accounts to receive important University information, including messages related to this class. (You may, of course, forward email from your Mason email account to one that you check regularly.)

    If you send email to the instructor, please put "CSI 771" or "STAT 751" in the subject line.


    Grading

    Student work in the course (and the relative weighting of this work in the overall grade) will consist of


    Homework

    Each homework will be graded based on 100 points, and 5 points will be deducted for each day that the homework is late.
    Start each problem on a new sheet of paper and label it clearly.
    Homework will not be accepted as computer files; it must be submitted on paper.

    Project

    The course requires each student to complete a project that involves a Monte Carlo study of a statistical method.
    Project will be graded on


    Collaboration and Academic Integrity


    Each student enrolled in this course must assume the responsibilities of an active participant in GMU's scholarly community in which everyone's academic work and behavior are held to the highest standards of honesty. The GMU policy on academic conduct will be followed in this course.

    Collaborative work

    Students are free to discuss homework problems or other topics with each other or anyone else, and are free to use any reference sources. Group work and discussion outside of class is encouraged, but of course explicit copying of homework solutions should not be done.

    Make sure that work that is supposed to be yours is indeed your own.

    With cut-and-paste capabilities on webpages, it is easy to plagiarize.
    Sometimes it is even accidental, because it results from legitimate note-taking; nevertheless, it is plagiarism and it is illegal.

    Although the likelihood of "getting caught" should not influence your ethical standards, you should be aware of the fact that web searches can often identify plagiarism, and that there is even specialized software to facilitate such searches. Whenever I encounter phrases in a student's work that seem to be inconsistent with the usual language that the student uses, I routinely search the web for documents containing those phrases.

    Some good guidelines are here:
    http://ori.dhhs.gov/education/products/plagiarism/
    See especially the entry "26 Guidelines at a Glance".

    Self-Plagiarism

    The definition of "plagiarism" applies to the "work of others", so copying your own work does not fall within the scope of the crime of plagiarism. Generally, of course, you are free to copy what you've written. I do this all the time with class notes, for example. Whenever you reuse any material, except for relatively brief background or supporting material, you should reference your original source. In the case of my class notes that have not appeared in formal publications, I do not reference my earlier work.

    Representing a rehash or restatement of earlier work as original work is wrong. Such self-plagiarism becomes a breach of academic honor, for example, when a paper submitted for credit in one instance is subsequently submitted for credit in another instance.

    Students with disabilities

    Certification of a disability that requires accommodations must be made by the Office of Disability Services (ODS). If you are a student with a disability and desire academic accommodations, please contact ODS and inform me during the first two weeks of classes.

    All academic accommodations must be arranged through the ODS.



    Lectures / assignments / exams schedule


    Week 1, August 31

    Course overview.
    Monte Carlo methods in statistics.

    Brief introduction to R.
    R functions.
    Random number generation in R.
    Saving graphics files in R.

    Assignments: Read Appendix A (pages 337-350).
    Choose two articles in the statistics literature that report Monte Carlo studies and write brief descriptions of them, telling specifically what questions were studied by Monte Carlo.


    Week 2, September 7

    Brief discussion of Monte Carlo methods in statistics and project.
    Objectives and methods of computational statistics.
    Use of the ECDF.
    Statistical methods as optimization problems.
    Optimization methods; EM.

    Assignments: Read Chapter 1.
    Work problems 1.2, 1.3, 1.7, 1.9, and 1.13 to turn in (as hardcopies).


    Week 3, September 14

    Monte Carlo methods for statistical inference.
    Simulation of probability distributions.

    Assignments: Read Chapter 2.
    Work problems 2.2, 2.3, and 2.6 to turn in (as hardcopies).


    Week 4, September 21


    Data partitioning: cross validation.
    Data partitioning: jackknife.

    Assignment: Prepare and write up your plan for your project.
    This includes a brief description of the Monte Carlo study in the paper. (What are the statistical methods being evaluated? What scenarios were studied? What are the "treatments" (that is, methods) you will study? What scenarios (that is, blocks in your experiment) will you study?)
    I expect this write-up to be about 5 to 10 pages long.
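As a rough illustration of the "treatments" and "blocks" language above, here is a minimal sketch of how a small Monte Carlo study might be organized. It is written in Python rather than R purely for illustration. The two estimators, the two sampling scenarios, the sample size, the number of replications, and all function names are arbitrary assumptions chosen for the example; they are not a required design for the project.

```python
import random
import statistics


def simulate_mse(estimator, sampler, n, reps, true_value, rng):
    """Monte Carlo estimate of the mean squared error of an estimator."""
    total = 0.0
    for _ in range(reps):
        sample = [sampler(rng) for _ in range(n)]
        total += (estimator(sample) - true_value) ** 2
    return total / reps


# Treatments: the methods being compared (hypothetical choice).
# Both are used here as estimators of the center of a symmetric distribution.
treatments = {
    "mean": statistics.fmean,
    "median": statistics.median,
}


# Blocks: the scenarios under which the methods are compared (hypothetical).
def normal(rng):
    return rng.gauss(0.0, 1.0)


def contaminated_normal(rng):
    # 90% of draws from N(0, 1), 10% from N(0, 10^2); the center is still 0.
    sd = 10.0 if rng.random() < 0.1 else 1.0
    return rng.gauss(0.0, sd)


scenarios = {
    "normal": normal,
    "contaminated normal": contaminated_normal,
}


def run_study(n=25, reps=2000, seed=771):
    """Apply every treatment in every block and record the estimated MSE."""
    rng = random.Random(seed)  # fixed seed so the study is reproducible
    results = {}
    for s_name, sampler in scenarios.items():
        for t_name, estimator in treatments.items():
            results[(s_name, t_name)] = simulate_mse(
                estimator, sampler, n, reps, 0.0, rng
            )
    return results


if __name__ == "__main__":
    for (s_name, t_name), mse in run_study().items():
        print(f"{s_name:20s} {t_name:7s} estimated MSE = {mse:.4f}")
```

A real study would also report the Monte Carlo standard error of each estimated quantity, so that differences between treatments can be judged against simulation noise.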


    Week 5, September 28

    Monte Carlo methods, tests, use in partitioning (Exercise 2.11).
    Data resampling: bootstrap.

    Assignments: Read Chapters 3, 4.
    Work problems 2.11, 3.1, 3.6, and 4.5 to turn in (as hardcopies).
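For orientation on the bootstrap topic listed for this week, the following is a minimal sketch of a nonparametric bootstrap estimate of a standard error, written in Python for illustration only. The function name `bootstrap_se`, the number of resamples, the seed, and the small data set in the usage lines are all assumptions made for the example; the text's treatment in Chapter 4 is the authoritative one.

```python
import random
import statistics


def bootstrap_se(data, statistic, n_boot=1000, seed=751):
    """Estimate the standard error of `statistic` by resampling `data`.

    Each bootstrap resample draws len(data) observations from `data`
    with replacement; the standard deviation of the statistic over the
    resamples is the bootstrap estimate of its standard error.
    """
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    return statistics.stdev(replicates)


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
    print("bootstrap SE of the mean:  ", bootstrap_se(sample, statistics.fmean))
    print("bootstrap SE of the median:", bootstrap_se(sample, statistics.median))
```

For the sample mean the result can be checked against the usual formula based on the sample standard deviation; the bootstrap is more useful for statistics such as the median, for which no simple formula is available.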


    Week 6, October 5

    Structure in data.
    Linear transformations.

    Week 7, October 12

    Midterm exam. Closed book and closed notes.

    Week 8, October 19

    Vector spaces; representation, transformations, etc.
    Linear structure.
    Assignment 6

    Week 9, October 26

    Methods of approximating and estimating functions (Chapter 6).
    Nonparametric probability density function estimation (Chapter 9).

    Assignments: Read Chapters 6 and 9.
    Work problems 6.6, 6.7, 6.9, 6.10 to turn in November 2.


    Week 10, November 2

    Nonparametric probability density function estimation (Chapter 9).

    Assignment:
    Work problems 9.1, 9.2, 9.9 to turn in November 9.
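As a small illustration of nonparametric density estimation, the sketch below implements a kernel density estimator with a Gaussian kernel in Python. The function name `gaussian_kde`, the fixed bandwidth, and the data in the usage lines are assumptions made for the example; choosing the bandwidth well is a central issue that this sketch does not address.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate.

    The estimate at x is (1 / (n h)) * sum_i K((x - x_i) / h),
    where K is the standard normal density and h is the bandwidth.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


if __name__ == "__main__":
    # Hypothetical data and bandwidth, for illustration only.
    sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]
    f_hat = gaussian_kde(sample, bandwidth=0.5)
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"x = {x:5.1f}   estimated density = {f_hat(x):.4f}")
```

Because each kernel is itself a probability density, the estimate is nonnegative and integrates to one, which is a convenient check on an implementation.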


    Week 11, November 9

    Structure in data; measures of similarity (Chapters 5 and 10).
    Principal components (Chapter 10).

    Assignments: Read Chapter 10.
    Work problems 10.4, 10.5 to turn in November 16.


    Week 12, November 16

    Identifying structure in data.
    Clustering and classification (Chapter 10).

    Assignment:
    Work problems 10.1, 10.2, 10.12 to turn in November 30.


    November 23

    Class does not meet this week

    Week 13, November 30

    Project due
    This should be a hardcopy document that identifies the article you used, describes the problem studied, describes the design of your Monte Carlo study and how it compares with the one in the article, summarizes the results of your study and how they compare with those in the article, and states the conclusions of your study.

    Presentations of projects (computer slides preferred).


    Week 14, December 7

    Presentations of projects (continued).

    December 14

    4:30pm - 7:15pm Final Exam.