1st Class Meeting (Ch. 1 of text)


Comments about the examples given on pp. 1-5


Comments about the wages example given on pp. 1-2

Looking at the plots in FIGURE 1.1, many of the wages in the sample seem rather high, with the average wage being near 100K. This makes me wonder how the data were collected ... surely it's not a representative random sample of all men living in the Atlantic region. Also notice that the box plots in the right plot of the figure indicate that nearly 25% of the men with the lowest education level have a salary of about 99K or more (since the 75th percentile seems to be about 99K). Once you have base R installed and load the ISLR library/package, if you enter
?Wage
you'll get some documentation about the Wage data set. Unfortunately, although it gives a little information about where the data came from, it doesn't say much about how the 3000 men were selected.
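
As a quick check of the claim about the 75th percentile, something along these lines should work once the package is installed (the wage values appear to be recorded in thousands of dollars):

library(ISLR)   # install.packages("ISLR") first if you don't have it
summary(Wage$wage)                                  # overall distribution of wages
tapply(Wage$wage, Wage$education, quantile, 0.75)   # 75th percentile within each education level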

The second-to-last sentence of the example (on the bottom portion of p. 2) suggests that one should use a method that can account for the non-linear relationship between salary and age. But even though the left plot of FIGURE 1.1 suggests a non-linear relationship, we really don't know whether the age contribution will be non-linear once we adjust for the other variables. It might be that the average wage drops for older people because most of the older workers are ones with the lowest level of education, and that's primarily what accounts for their low wages ... and if we were to restrict attention to only workers having the same level of education, we wouldn't see wages go down as age goes up. That's just one guess as to what the story may be, and further exploration (described on pp. 283-285 of the book) indicates that it's an incorrect guess. But the point is, until we make adjustments for the other variables, we really don't know that the contribution to wage due to age is highly non-linear like the leftmost plot suggests.
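
One rough way to investigate this in R is to compare a fit using age alone with a fit that also adjusts for education. This is just a sketch, not the book's exact analysis (which uses the more flexible models of Ch. 7), and the degree-4 polynomial in age is an arbitrary choice here:

library(ISLR)
fit.marginal <- lm(wage ~ poly(age, 4), data = Wage)             # age alone
fit.adjusted <- lm(wage ~ poly(age, 4) + education, data = Wage) # age, adjusting for education
summary(fit.adjusted)   # check whether the polynomial terms in age remain significant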

Comments about the stock market example given on pp. 2-4

They indicate that a good model can correctly predict whether the market will go up or go down 60% of the time. But a trader could still lose money if the losses from the incorrect predictions are larger than the gains from the correct predictions, even though correct predictions are more frequent. So perhaps it'd be wiser to build a regression model for a numerical response (the percent change).
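
Here's a toy illustration of that point in R, with completely made-up numbers for the average gain and loss:

p.correct <- 0.60   # hypothetical accuracy
avg.gain  <- 1.0    # hypothetical average % gain when the call is right
avg.loss  <- 1.8    # hypothetical average % loss when the call is wrong
p.correct * avg.gain - (1 - p.correct) * avg.loss   # expected % return: 0.6*1.0 - 0.4*1.8 = -0.12, a loss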

Comments about the gene expression example given on pp. 4-5

On p. 4, Z1 and Z2 are described as the "first two principal components" of the data; however, the values actually plotted in FIGURE 1.4 are the scores of the first two principal components for the 64 data points (with each data point corresponding to a cell line). Technically, the entire data set of 64 points (aka cases) yields two vectors, one called the 1st principal component and one called the 2nd principal component. These two vectors are then used to obtain scores of the first two principal components for each of the 64 data points. It is the scores that are plotted, not the first two principal components themselves (which are two high-dimensional vectors).
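
R's prcomp function makes the distinction explicit, and something along these lines (similar in spirit to the book's Ch. 10 lab) shows both pieces:

library(ISLR)
pca <- prcomp(NCI60$data, scale. = TRUE)
dim(pca$rotation)   # the principal component vectors themselves: one long column per component
dim(pca$x)          # the scores: one row per cell line
plot(pca$x[, 1], pca$x[, 2], xlab = "Z1", ylab = "Z2")   # the scores are what FIGURE 1.4 plots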

Looking at FIGURE 1.4 on p. 5, one might make a case for there being more than 4 clusters, or fewer than 4 clusters. (E.g., if the two leftmost points are their own cluster, why can't the two leftmost green points be their own cluster?) I wonder if things would be clearer if we used the first three principal components instead of just the first two.
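
It's easy to peek at this in R; here's a short sketch (building the same pca object as in the snippet above):

library(ISLR)
pca <- prcomp(NCI60$data, scale. = TRUE)
pve <- 100 * pca$sdev^2 / sum(pca$sdev^2)           # percent of variance explained by each component
head(pve, 3)                                        # how much the first three components capture
pairs(pca$x[, 1:3], labels = c("Z1", "Z2", "Z3"))   # pairwise plots of the first three score vectors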

How do we decide if one clustering is better than another clustering? At this point in the course, we don't have any measures of "goodness" that we can apply. What would allow us to say that a clustering is correct (or even just sensible)?



Comments about the history of statistical learning given on pp. 5-6


I'm providing a lot of information in a hardcopy handout distributed at our first class meeting, and so I'll be brief here.

Near the middle of p. 6, two computer-intensive methods that were developed in the 1980s are mentioned: classification and regression trees (CART) and generalized additive models (GAMs). Both are good methods, but CART (and variations of it) became much more popular than GAMs. (However, while classification trees are frequently constructed, this does not seem to be the case with regression trees.)



Comments about notation and matrix algebra given on pp. 9-12


I'll go over some of this in class, but I'll comment here that the book is inconsistent about the use of p. On the bottom of p. 9 it's indicated that p is the number of variables (and so p = 12 for the Wage data set). But on the top half of p. 11, where yi is used to denote the value of the response variable for the ith observation and xi is used to denote the vector of predictor values for the ith observation, it's indicated that p is the number of predictor variables (and so p = 11 for the Wage data set, since one of the variables (wage) is used as the response variable). This second use of p seems to be the one used most often in the book.
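
You can see the two counts directly in R (note that the number of columns reported for the package's version of Wage may not match the book's count exactly, depending on the package version):

library(ISLR)
dim(Wage)                           # rows = n, columns = total number of variables
y <- Wage$wage                      # the response
X <- Wage[, names(Wage) != "wage"]  # the predictors
ncol(X)                             # p under the "number of predictors" convention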

Don't worry if you don't know a lot about matrix algebra. The book uses it sparingly, and I'll do the same in my lectures. It's convenient for expressing some models and statistics concisely, but you won't have to do any matrix algebra by hand for the course's graded work.
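
To give one example of the conciseness (just an illustration, not something you'll need to produce yourself): the multiple linear regression model can be written compactly as y = X beta + epsilon, and R will happily build the X matrix for you:

library(ISLR)
X <- model.matrix(wage ~ age + education, data = Wage)
head(X)   # first column is the intercept; the education factor becomes dummy columns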




I'll address some of the other things referred to in Ch. 1 as I go through the course syllabus with you, but I'll mention one more thing here before ending this web page.
I'll close with something a bit funny (to me, anyway) pertaining to the "beer and diapers" urban legend. (This isn't the best web page I found about the beer and diapers story, but it's got an interesting photo on it. I encourage you to google "beer and diapers data mining urban legend" to hopefully find a bit more information.)