Some Comments About Ch. 12 of Text



I like the clustering examples in the last paragraph of Sec. 12.1 on p. 498 of the text. I know some of the web sites I use for online shopping frequently make recommendations to me based on the shopping habits of people who they guess have similar preferences.

A related application of clustering is to use demographic information to group zip codes into clusters in order to design effective marketing strategies for companies that do a lot of mail-order business, so that catalogs can be sent to people in areas where they might be more likely to purchase certain products. (This activity is known as market segmentation.)

The example in the last paragraph on p. 528 of the text indicates a situation where a correlation-based distance may be good to use for clustering online shoppers.
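
Here is a small sketch of why a correlation-based distance might suit shopper data. The purchase counts and the shopper names below are made up for illustration (they are not from the text), and the working assumption is that we care about the pattern of what someone buys rather than how much they buy.

```python
import math

def euclidean(a, b):
    """Ordinary Euclidean distance between two purchase profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """One minus the Pearson correlation between two purchase profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (sd_a * sd_b)

# Hypothetical purchase counts over five product categories.
light_buyer = [1, 0, 2, 0, 1]
heavy_buyer = [10, 0, 20, 0, 10]  # same pattern as light_buyer, ten times the volume
other_light = [0, 2, 0, 1, 1]     # similar volume to light_buyer, different pattern

for name, other in [("heavy_buyer", heavy_buyer), ("other_light", other_light)]:
    print(name,
          "euclidean:", round(euclidean(light_buyer, other), 3),
          "correlation-based:", round(correlation_distance(light_buyer, other), 3))
```

With Euclidean distance the two low-volume shoppers look close even though they buy different things. With the correlation-based distance the two shoppers with the same preference pattern look close regardless of volume.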



For the three-dimensional data set displayed in FIGURE 12.2 of the text, the three marginal distributions of the original variables would all show the green, orange, and blue groups overlapping. But the two-dimensional scatterplot based on the first two principal components does a nice job of showing the three groups as not overlapping.
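
The same effect can be reproduced with synthetic data. The sketch below does not use the data behind FIGURE 12.2. It generates three made-up groups whose centers lie along the direction (1, 1, 1), with wide scatter across that direction and narrow scatter along it. To stay within the Python standard library, the principal components are found by power iteration with deflation, which is a stand-in for what a library routine would do. All function and variable names are my own.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def make_point(center):
    """Point near (center, center, center): wide scatter across the
    direction (1, 1, 1), narrow scatter along it."""
    z = [rng.gauss(0, 1.2) for _ in range(3)]
    zbar = sum(z) / 3
    along = rng.gauss(0, 0.3) / math.sqrt(3)
    return [center + (zi - zbar) + along for zi in z]

# Three synthetic groups of 50 points, centered at -2, 0, and 2 in every coordinate.
labels = [g for g in range(3) for _ in range(50)]
data = [make_point(2.0 * (g - 1)) for g in labels]

# Center the data and form the 3 x 3 sample covariance matrix.
n = len(data)
means = [sum(row[j] for row in data) / n for j in range(3)]
centered = [[row[j] - means[j] for j in range(3)] for row in data]
cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(3)]
       for i in range(3)]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def leading_eigenvector(m, iters=1000):
    """Power iteration for the dominant eigenvector of a symmetric matrix."""
    v = [1.0, 0.5, 0.25]  # arbitrary starting vector
    for _ in range(iters):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

pc1 = leading_eigenvector(cov)
lam1 = sum(a * b for a, b in zip(pc1, matvec(cov, pc1)))
# Deflate: remove the first component, then repeat to get the second.
deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(3)]
            for i in range(3)]
pc2 = leading_eigenvector(deflated)

# Scores on the first two principal components (what one would scatterplot).
scores = [(sum(r[j] * pc1[j] for j in range(3)),
           sum(r[j] * pc2[j] for j in range(3))) for r in centered]

def groups_overlap(values, group_labels):
    """True if the value ranges of any two groups intersect."""
    spans = sorted(
        (min(v for v, l in zip(values, group_labels) if l == g),
         max(v for v, l in zip(values, group_labels) if l == g))
        for g in set(group_labels))
    return any(spans[k][1] >= spans[k + 1][0] for k in range(len(spans) - 1))

marginal_overlap = [groups_overlap([row[j] for row in data], labels)
                    for j in range(3)]
pc1_overlap = groups_overlap([s[0] for s in scores], labels)
print("groups overlap on each original variable:", marginal_overlap)
print("groups overlap on the first principal component:", pc1_overlap)
```

In this construction the first principal component alone is enough to pull the groups apart, so a scatterplot of the first two components shows them as separate even though every one of the original variables, viewed by itself, mixes them.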



The example in the first full paragraph on p. 525 of the text is a nice one, showing how a good three-group clustering need not be nested in a good two-group clustering. That is, the three clusters cannot be formed by dividing one of the clusters of the two-group clustering.
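
To make "nested" concrete, here is a minimal check of whether one clustering can be obtained by splitting clusters of another. The label vectors are hypothetical and the function name is my own; nothing here is taken from the text's example.

```python
def is_nested(fine, coarse):
    """True if every cluster in `fine` lies entirely inside one cluster of `coarse`."""
    parent = {}
    for f, c in zip(fine, coarse):
        # Record the coarse cluster first seen for this fine cluster,
        # and fail if the same fine cluster later shows up in a different one.
        if parent.setdefault(f, c) != c:
            return False
    return True

# Hypothetical cluster labels for six observations.
two_group = ["A", "A", "A", "B", "B", "B"]
three_nested = [1, 1, 2, 3, 3, 3]  # splits cluster A in two, keeps B whole
three_other = [1, 2, 3, 1, 2, 3]   # each of the three clusters cuts across A and B

print(is_nested(three_nested, two_group))
print(is_nested(three_other, two_group))
```

The second three-group labeling is the kind of situation the text's example describes: it may be a perfectly good clustering, but no amount of dividing A or B will produce it.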



Since "the curse of dimensionality" might create problems if there are a lot of variables (particularly if some of the variables don't seem to contribute much to the clustering), doing dimension reduction, either by variable selection or by using principal components, may be worth trying at times. There is no standard way to do variable selection when clustering. One method would be to use all of the variables initially, and then use graphics to try to identify those that don't seem to be very useful. E.g., if the mean value of a given variable is about the same for each of the clusters, and a plot shows that the clusters overlap a lot with respect to that variable, then one might consider omitting the variable. One can look at one-dimensional scatter plots, or perhaps two-dimensional scatter plots, since sometimes the clusters appear to overlap a lot with respect to a variable in a one-dimensional scatter plot, but a two-dimensional scatter plot shows that groups form in two dimensions when another variable is also plotted.
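
A numerical companion to the graphical check might look like the sketch below. It compares the cluster means of each variable, one variable at a time. The data, the cluster labels, and the separation_score function are all made up for illustration, and the score is only an informal screening aid, not a standard method.

```python
import statistics

def separation_score(values, labels):
    """Range of the cluster means, in units of the variable's overall standard deviation."""
    clusters = sorted(set(labels))
    means = [statistics.mean([v for v, l in zip(values, labels) if l == c])
             for c in clusters]
    return (max(means) - min(means)) / statistics.stdev(values)

# Hypothetical cluster labels for nine observations and two variables.
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
x1 = [1.0, 1.2, 0.9, 5.1, 4.8, 5.0, 9.0, 9.2, 8.9]  # cluster means differ a lot
x2 = [3.1, 2.8, 3.0, 3.0, 3.2, 2.9, 2.9, 3.1, 3.0]  # cluster means nearly equal

for name, values in [("x1", x1), ("x2", x2)]:
    print(name, round(separation_score(values, labels), 2))
```

A low score flags a variable like x2 as a candidate for omission, but only as a candidate. Because the check looks at one variable at a time, it can miss a variable that separates the clusters only in combination with another, so a look at two-dimensional scatter plots is still worthwhile before dropping anything.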