Project information
Your project can be a big factor in your final grade, and so it's something that you may want to start working on well
before it is due. Each student is to give a 15 to 20 minute oral presentation of his or her project, as well as turn
in a written report of about 10 to 15 pages of text, tables, and graphs. (Up to 20 pages will be okay if
you have a lot of white space.) You can provide appendices with more
material, but I may not be able to read everything in the appendices, and so the main part of the report should be a
nice stand-alone document.
Since some students have volunteered to give presentations on Friday (to my seminar group), there is no need
for more than one presentation every 30 minutes. However, you should plan on using no more than 20 minutes
(assuming no interruptions for questions). This will allow 5 minutes for questions, answers, and comments at
the end, and another 5 minutes for setting up and clearing out of the way between presentations.
(It's possible that some questions will be asked during the presentations.)
For most of your projects, I'd allow 5 minutes to explain the setting and the nature of the data --- giving some
graphical descriptions. Then describe what methods you applied, along with the strategies and tactics that you
used. (Don't spend time discussing general information that I covered in class --- focus on the analysis of your
chosen data.) Save some time to give results and conclusions.
While students should not work in groups, it is okay for two (or possibly three) students to use the same data set,
as long as their projects have different objectives, and as long as the students do not work too closely with one
another. (I don't want a student to just mimic another student's work.) Because I don't want too many students to
use the same data set, let me know what data you want to use, so that I can approve it.
For your project, you should analyze at least one data set using many (at least 6) different methods.
You should use at least one of the Salford Systems software packages, and R and/or Weka, and you can use
other software too as long as I approve of it. Each project should focus on either classification or regression,
although for some data sets I suppose that both classification and regression methods could be used to address
related issues.
One type of project would be to take a large real data set, and divide it into two parts initially --- one part to
use as a training sample (although it can be further divided to create a validation set), and one part to use only
at the end of your analysis to assess the accuracy of the models/classifiers that you created using the training
data. Then you would use various methods presented in class to create classifiers and regression models, decide
which results seem best (using only the original training data at this point), and then finally use the set-aside
data to get unbiased estimates of the generalization errors of the various classifiers/models that you considered.
(Note: When you initially divide the large data set, you set aside part of it and don't use it at all when you
analyze the rest of the data. The rest of the data (the part not set aside) can be used as training data, and for
some methods cross-validation can be used to help fit models and build and tune classifiers. Alternatively, one can
divide this data into two parts --- one part to now play the role of the training sample, and one part to use as a
validation set to help you decide how to select and tune models and classifiers.)
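As a concrete illustration of the splitting, here is a minimal sketch in R, using the Boston housing data from
the MASS package as a stand-in for whatever data set you choose (the seed, the proportions, and the object names
below are just placeholder choices, not requirements):

    ## Minimal sketch: split off the set-aside generalization data first,
    ## then (optionally) split the rest into training and validation parts
    library(MASS)                      # provides the Boston data frame
    set.seed(101)                      # makes the split reproducible
    n <- nrow(Boston)
    idx <- sample(n, round(0.25 * n))  # indices for the set-aside part
    gen.data <- Boston[idx, ]          # untouched until the very end
    train    <- Boston[-idx, ]         # all fitting and tuning happens here

    ## Optionally split the training portion again, with one part used
    ## as a validation set for selecting and tuning models:
    m <- nrow(train)
    v.idx   <- sample(m, round(0.3 * m))
    val.set <- train[v.idx, ]
    fit.set <- train[-v.idx, ]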
For the most part, data sets should have between 300 and 1500 cases, and at least 6 (maybe 5) explanatory variables.
I recommend using data sets that have no (or only very few) missing values.
- Here are some possibilities for classification data sets:
- Stress Echocardiography data (from the Vanderbilt site)
- Pima Indians Diabetes data (from the UCI site)
- George Forman spam data (from the UCI site)
- Egyptian Skulls data (from StatLib, though it may be hard to find; this data set may also be too small)
- Here are some possibilities for regression data sets:
- (corrected) Boston Housing data (from the StatLib site (& elsewhere))
Another type of project would be to do a study which extends one of the examples from HTF using generated data.
For a project like this, one wouldn't work with just a single data set, but rather with a group of related data
sets. Each data set in the group could have a different sample size, or a different number of variables, but the
overall settings for the different data sets could be quite similar.
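For example, here is one simple way in R to generate a group of related data sets that differ only in sample size
(the signal used below is purely a placeholder, not one of the HTF examples --- you would substitute the model
from whichever example you're extending):

    ## Generate related data sets with different sample sizes
    set.seed(101)
    make.data <- function(n, p = 5, sigma = 1) {
      x <- matrix(rnorm(n * p), n, p)       # p standard normal predictors
      y <- x[, 1] + 0.5 * x[, 2]^2 + rnorm(n, sd = sigma)  # placeholder signal
      data.frame(y = y, x)
    }
    data.sets <- lapply(c(100, 300, 1000), make.data)  # one per sample size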
For a regression project, you should show some residual plots (plotting the residuals against the predicted values).
First, do one based on the raw data --- using no transformations. Use OLS with the response untransformed, and use
each of the explanatory variables, untransformed, in a first-order model. If you transform the response,
give another residual plot based on the transformed response, but just using untransformed explanatory variables in a
1st-order model. Then, once you've identified a final model (whether it be from MARS, using PPR in R, or using
some other method), be sure to give a residual plot based on it.
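In R, the first two of these plots might be produced along the following lines (again using the Boston data, with
response medv, as a stand-in for your own data, and a log transformation as a stand-in for whatever transformation
you might choose):

    ## Residual plot for a 1st-order OLS fit with everything untransformed
    library(MASS)
    fit.raw <- lm(medv ~ ., data = Boston)    # untransformed response
    plot(fitted(fit.raw), resid(fit.raw),
         xlab = "predicted values", ylab = "residuals",
         main = "1st-order OLS fit, raw data")
    abline(h = 0, lty = 2)

    ## If the response is transformed, repeat the plot, still using
    ## untransformed explanatory variables in a 1st-order model:
    fit.log <- lm(log(medv) ~ ., data = Boston)
    plot(fitted(fit.log), resid(fit.log),
         xlab = "predicted values", ylab = "residuals",
         main = "1st-order OLS fit, log-transformed response")
    abline(h = 0, lty = 2)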
For a classification project, it'd be nice to have some sort of graphic to show the two (or more) classes.
Here are two possibilities: (1) plot the classes in different colors using axes based on two dominant explanatory
variables; (2) plot the classes in different colors using axes based on the first two principal components of the
explanatory variables (representing categorical variables with binary dummy variables).
(The second type of plot is preferred unless it only takes two explanatory variables to do a good job of separating
the classes.)
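Here is a sketch of the second type of plot in R, using the Pima Indians diabetes training data from the MASS
package as a stand-in (your own explanatory variables and class labels would replace those below):

    ## Plot the classes in the space of the first two principal components
    library(MASS)
    x  <- scale(Pima.tr[, -8])      # standardize the explanatory variables
    pc <- prcomp(x)                 # principal components
    plot(pc$x[, 1], pc$x[, 2],
         col = ifelse(Pima.tr$type == "Yes", "red", "blue"),
         xlab = "1st principal component", ylab = "2nd principal component",
         main = "Classes plotted in the first two PC directions")
    legend("topright", legend = levels(Pima.tr$type),
           col = c("blue", "red"), pch = 1)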
For both classification and regression projects, explain how you can pick the model which you think ought to be the
best predicting model, using only the data that you pretended you had available to fit and select models (i.e.,
don't use the generalization data set that you should have initially set aside). Identify your second and third
choices for the best-predicting model.
Also, if the best predicting model is from a method which makes interpretation and understanding of the phenomenon difficult,
identify a weaker model which does provide some information about the relationships between the explanatory
variables and the response variable.
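One way to carry out this sort of comparison, using the training data only, is with cross-validation. Here is a
rough sketch in R (the two models compared are just placeholders for whatever candidates you're considering):

    ## Compare candidate models by 10-fold cross-validation on the
    ## training data only (the set-aside data is never touched here)
    library(MASS)
    set.seed(101)
    train <- Boston[sample(nrow(Boston), round(0.75 * nrow(Boston))), ]

    cv.mse <- function(form, data, K = 10) {
      folds <- sample(rep(1:K, length.out = nrow(data)))
      errs <- sapply(1:K, function(k) {
        fit <- lm(form, data = data[folds != k, ])
        mean((data$medv[folds == k] - predict(fit, data[folds == k, ]))^2)
      })
      mean(errs)                    # estimated prediction MSE
    }

    cv.mse(medv ~ ., train)                          # 1st-order OLS model
    cv.mse(medv ~ . + I(lstat^2) + I(rm^2), train)   # add two squared terms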
Finally, use the generalization data to estimate the generalization errors of the three methods that you guessed
would be the best ones, based on what you could learn without using the generalization data in any way. In
addition, estimate the generalization errors for some inferior and simpler methods. (With regression, this could be
a 2nd- or 3rd-order polynomial model (or a 1st-order polynomial model if a 2nd-order model happens to be the
winner), using only variables which are statistically significant in an initial 1st-order model. With
classification, this could be a 1st-order logistic regression model using only statistically significant predictors,
or it could be a stepwise QDA model (or a stepwise LDA model if a stepwise QDA model happens to be the winner).)
Also, if the method you think is the winner is not a purely local method (like k nearest neighbors for
classification, or loess for regression), estimate the generalization error of a good local method in order to see
how accuracy suffers.
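This final step amounts to computing, for each chosen model, its error on the set-aside data. A minimal sketch in
R (the two fits shown are placeholders for whatever models you settled on during the training phase):

    ## Estimate generalization errors on the set-aside data
    library(MASS)
    set.seed(101)
    idx   <- sample(nrow(Boston), round(0.25 * nrow(Boston)))
    gen   <- Boston[idx, ]          # set-aside generalization data
    train <- Boston[-idx, ]

    fit1 <- lm(medv ~ ., data = train)                  # stand-in model 1
    fit2 <- lm(medv ~ . + I(lstat^2) + I(rm^2), train)  # stand-in model 2
    mean((gen$medv - predict(fit1, gen))^2)   # estimated generalization MSE
    mean((gen$medv - predict(fit2, gen))^2)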
Scheduling of the presentations
The syllabus shows that the projects are to be presented on Thursday 7/21 and Tuesday 7/26. But if you'd like to
present your project on Friday 7/22 or Friday 7/29, during the 3 PM to 5 PM time period (when I have a room reserved
in Innovation Hall for my Friday afternoon seminar meetings), that would be great. If enough people volunteer to
present their projects on Friday, then each person can have a bit more time, as we spread the projects over more
days. Please let me know as soon as possible what day you'd prefer for your presentation; Friday time slots will be
assigned as people request them --- up to four students can present on each of the two Fridays.
(On Friday afternoons the audience would consist of perhaps a half dozen participants in my Friday afternoon seminar.)
Students who present their projects the week before the final won't be expected to use any methods that I only
covered that week. Also, since all written reports are due on the night of the final exam, presenting early gives
you more time to adjust your written report to address any major flaws that might be discovered during your oral
presentation.
Based on input received, here is the schedule for the presentations. (Up to three students can switch to 7/29 if
problems are encountered and more time is needed, but be aware that the time slot on Friday 7/29 is 3-5 PM.)
All written reports for the projects are due at the time of the final exam (on 7/28).
- Thursday, July 21 (regular classroom, regular time)
- J. Dean (regression, analysis of college tuition costs)
- F. Pecjak (regression, analysis of generated data (to investigate some specific issues))
- H. Zheng (regression, analysis of abalone ages)
- B. Ferris (regression, analysis of U.S. county data)
- E. Leeds (classification, analysis of George Forman spam data)
- Friday, July 22 (133 Innovation Hall, 3-5 PM)
- V. Plamadea (classification and regression, analysis of university freshman performance based on admissions data)
- N. Nu (regression and classification, analysis of FAA flight departure delays)
- X. Liu (classification, analysis of George Forman spam data)
- Tuesday, July 26 (regular classroom, regular time)
- T. Al-Fouxan (regression, analysis of NBA salaries)
- P. Nayak (regression, analysis of housing prices and school district information)
- S. Surina (classification, analysis of Indonesian birth control choices)
- A. Dey (classification, analysis of stress echocardiography data)
- Friday, July 29 (133 Innovation Hall, 3-5 PM)
- S. Pinguli (regression, analysis of Boston housing data)
- A. Keesee (regression, analysis of stock performance)