Data Mining CRISPDM

CRISP-DM

The second widely used data mining approach is CRISP-DM (CRoss-Industry Standard Process for Data Mining). It divides a data mining project into six phases:

Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment

Chang (2006) describes the use of the CRISP-DM process in freshman admissions at a large, unnamed state university. The study used the data mining and modeling software Clementine, now known as SPSS Modeler. It is a product of SPSS, an analytical software company that is now part of IBM.

In the Business Understanding phase, the basic questions to be answered are defined. In the Data Understanding phase, the available data is examined to determine what can be mined and how. Chang examined demographic, academic, and communications activity data. Fifteen predictor variables (high school GPA, gender, ethnicity, etc.) and one outcome variable (enrollment status) were chosen. The Data Preparation phase includes cleansing, combining, and transforming the data.
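The three Data Preparation steps named above (cleansing, combining, transforming) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Chang's actual procedure: the record layout, the field names (hs_gpa, gender, emails_opened), the 4.0 GPA scale, and the helper functions cleanse, combine, and transform are all hypothetical.

```python
# Hypothetical raw records; field names and values are illustrative only.
admissions = [
    {"id": 1, "hs_gpa": "3.6", "gender": "F"},
    {"id": 2, "hs_gpa": "", "gender": "M"},  # missing GPA
    {"id": 3, "hs_gpa": "2.9", "gender": "M"},
]
contacts = [
    {"id": 1, "emails_opened": 4},
    {"id": 3, "emails_opened": 0},
]


def cleanse(rows):
    """Drop rows with a missing GPA and convert the GPA text to a float."""
    cleaned = []
    for row in rows:
        if row["hs_gpa"].strip():
            cleaned.append({**row, "hs_gpa": float(row["hs_gpa"])})
    return cleaned


def combine(rows, extra, key="id"):
    """Left-join `extra` onto `rows` by key; unmatched rows get 0 contacts."""
    lookup = {r[key]: r for r in extra}
    combined = []
    for row in rows:
        match = lookup.get(row[key], {})
        combined.append({**row, "emails_opened": match.get("emails_opened", 0)})
    return combined


def transform(rows):
    """Encode gender as a 0/1 indicator and rescale GPA (4.0 scale assumed)."""
    return [
        {
            "id": r["id"],
            "hs_gpa_scaled": r["hs_gpa"] / 4.0,
            "is_female": 1 if r["gender"] == "F" else 0,
            "emails_opened": r["emails_opened"],
        }
        for r in rows
    ]


prepared = transform(combine(cleanse(admissions), contacts))
for record in prepared:
    print(record)
```

Dropping incomplete rows is the simplest cleansing choice; a real study might impute missing values instead, since discarding applicants can bias the sample.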

In the Modeling phase, various models are implemented, and in the Evaluation phase the models are tested for validity. Chang used three different modeling techniques: classification and regression tree (C&RT), neural networks, and logistic regression. As in the SEMMA approach, part of the data is used to build the model while the remaining part is held back for the validity test. Results indicated that enrollment could be predicted to some degree. Finally, validated models are put into practice in the Deployment phase; in Chang's study, the models were then used to predict enrollment in future years.
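The idea of building a model on one part of the data and testing it on the rest can be sketched as follows. This is a minimal illustration under stated assumptions: the applicants are synthetic, the two predictors (GPA and a campus visit flag) and their coefficients are invented, and the hand-written logistic regression stands in for Clementine, which the study actually used. The function names holdout_split, fit_logistic, predict, and accuracy are hypothetical.

```python
import math
import random


def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and return (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(train, epochs=200, lr=0.1):
    """Fit weights [bias, w1, ..., wk] by stochastic gradient descent."""
    n_features = len(train[0][0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        for features, label in train:
            z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
            error = sigmoid(z) - label
            weights[0] -= lr * error
            for i, x in enumerate(features):
                weights[i + 1] -= lr * error * x
    return weights


def predict(weights, features):
    """Return 1 (enrolls) if the predicted probability is at least 0.5."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return 1 if sigmoid(z) >= 0.5 else 0


def accuracy(weights, rows):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(weights, f) == y for f, y in rows)
    return hits / len(rows)


# Synthetic applicants: (high school GPA, campus visit flag) -> enrolled 0/1.
# The generating coefficients below are invented for illustration.
rng = random.Random(42)
applicants = []
for _ in range(400):
    gpa = rng.uniform(2.0, 4.0)
    visited = rng.choice([0, 1])
    p_enroll = sigmoid(-4.0 + 1.0 * gpa + 1.5 * visited)
    enrolled = 1 if rng.random() < p_enroll else 0
    applicants.append(((gpa, visited), enrolled))

train, test = holdout_split(applicants)
weights = fit_logistic(train)          # Modeling: uses the training part only
test_accuracy = accuracy(weights, test)  # Evaluation: uses the held-out part
print(f"train={len(train)} test={len(test)} accuracy={test_accuracy:.2f}")
```

Because the test rows play no part in fitting, the accuracy on them gives a fairer estimate of how the model would perform on future applicants than accuracy on the training rows would.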