Competency 5.1: Learn to conduct prediction modeling effectively and appropriately.
I think this competency can be achieved if we are able to complete the given activity in RapidMiner. It is quite difficult for a newbie, but it's well described in the course and definitely doable :)
We were asked to build a set of detectors predicting the variable ONTASK for the given data file using RapidMiner 5.3. I had previously installed RapidMiner 6.1, so I used that instead. The activity moves on to the next question only when you answer the current one correctly, and there were 13 questions, most of which require you to enter the kappa of the model you ran. There were a few difficulties along the way (for which I will try to give some useful tips), and it was a huge relief when I finally got this screen ;)
1) Build a decision tree (using operator W-J48 from the Weka Extension Pack) on the
entire data set. What is the non-cross-validated kappa?
You can follow the RapidMiner walkthrough video to answer this question. The steps are almost the same, but in the last stage of the data import from Excel you need to change the variable types if they were not guessed correctly. My RapidMiner 6.1 guessed the data types correctly, but if you are using 5.3, you will probably need to change the types of the polynominal variables that were incorrectly guessed as integer. (You can open the input Excel file to check what kind of values each column holds.)
You should set attribute name = “ONTASK” and target role = label (ONTASK is the value to be predicted in this exercise), then add the operators W-J48, Apply Model and Performance (Binary Classification). The rest should be fine.
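Since almost every question asks for a kappa, it may help to see what that number measures. Below is a minimal sketch of Cohen's kappa in plain Python, assuming two equal-length lists of actual and predicted labels. The function name and the sample labels are made up for illustration; this is not how RapidMiner computes it internally, only the standard formula.

```python
from collections import Counter

def cohens_kappa(actual, predicted):
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    n = len(actual)
    # Observed agreement: fraction of rows where the prediction matches the label.
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    # Agreement expected by chance, from how often each class occurs on each side.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(
        actual_counts[c] * predicted_counts[c] for c in actual_counts
    ) / (n * n)
    if expected == 1.0:
        # Degenerate case (only one class on both sides); returning 1.0 is a
        # convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)

# Made-up labels for illustration only (not from the course data set).
actual = ["Y", "Y", "Y", "N", "N", "Y", "N", "Y"]
predicted = ["Y", "Y", "N", "N", "Y", "Y", "N", "Y"]
kappa = cohens_kappa(actual, predicted)
```

A kappa of 1 means perfect agreement and 0 means no better than chance, which is why a model can have high accuracy but a modest kappa when one class dominates.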
2) The kappa value you just obtained is artificially high -
the model is over-fitting to which student it is. What is the
non-cross-validated kappa, if you build the model (using the same operator),
excluding student?
There are two ways to exclude a field: delete it from the data file, or use the Select Attributes operator. The latter is better because the source data stays intact and the change is easy to undo. For this question, add a "Select Attributes" operator, set attribute filter type = single and attribute = StudentID, and check invert selection (since we are asked to exclude student).
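For readers who think in code, the effect of Select Attributes with invert selection can be sketched roughly as follows. This is only an illustration in plain Python: the function name and the `Activity` column are hypothetical, while `StudentID` and `ONTASK` are the names used in the exercise.

```python
def exclude_attributes(rows, excluded):
    """Return copies of the rows without the excluded columns.

    The original rows are left untouched, which mirrors filtering in the
    process instead of deleting the field from the data file.
    """
    excluded = set(excluded)
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]

# Made-up rows; "Activity" is a hypothetical column for illustration.
rows = [
    {"StudentID": "s1", "Activity": "reading", "ONTASK": "Y"},
    {"StudentID": "s2", "Activity": "chatting", "ONTASK": "N"},
]
filtered = exclude_attributes(rows, ["StudentID"])
```

Passing several names in the list corresponds to the subset filter type, where more than one attribute is excluded at once.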
3) Some other features in the data set may make your model
overly specific to the current data set. Which data features would not apply
outside of the population sampled in the current data set? Select all that
apply.
For this question, you need to select the options which will not generalise outside your population. The system will assist you if you are wrong.
4) What is the non-cross-validated kappa, if you build the W-J48
decision tree model (using the same operator), excluding student and the
variables from Question 3?
For this, in addition to the StudentID we already excluded, we need to exclude all the variables from Q3 that do not apply to the population outside our sample data. Change the Select Attributes settings to filter type = subset and select the attributes to be excluded in the next window. Make sure invert selection is checked.
5) What is the non-cross-validated kappa, for the same set of
variables you used for question 4, if you use Naive Bayes?
Replace the W-J48 operator (Weka's decision tree) with the Naive Bayes operator.
6) What is the non-cross-validated kappa, for the same set of variables you used for
question 4, if you use W-JRip?
Replace the Naive Bayes operator with the W-JRip operator.
7) What is the non-cross-validated kappa, for the same set of
variables you used for question 4, if you use Logistic Regression? (Hint: You
will need to transform some variables to make this work; RapidMiner will tell
you what to do)
Add the operators "Nominal to Numerical" and "Logistic Regression". The conversion step is needed because Logistic Regression cannot handle polynominal attributes or a polynominal label.
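To give a feel for what a nominal-to-numerical conversion does, here is a small sketch of dummy coding in plain Python. It is an assumption that dummy coding is the relevant mode (RapidMiner's operator offers more than one coding type), and the function name, the `Activity` column and the naming of the new columns are hypothetical.

```python
def dummy_code(rows, column):
    """Replace one nominal column with 0/1 indicator columns, one per distinct value."""
    values = sorted({row[column] for row in rows})
    coded = []
    for row in rows:
        # Copy every other column unchanged.
        new_row = {k: v for k, v in row.items() if k != column}
        # Add one indicator per value: 1 if this row has that value, else 0.
        for value in values:
            new_row[f"{column} = {value}"] = 1 if row[column] == value else 0
        coded.append(new_row)
    return coded

# Made-up rows for illustration only.
rows = [
    {"Activity": "reading", "ONTASK": "Y"},
    {"Activity": "chatting", "ONTASK": "N"},
    {"Activity": "reading", "ONTASK": "Y"},
]
coded = dummy_code(rows, "Activity")
```

After this step every predictor is a number, which is what a regression-style learner needs.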
This is the question I spent the most time on, and I still couldn't get through it. I'm not sure whether the kappa I got was wrong or the value the system expects is wrong. Anyway, since I couldn't afford more than a day on that issue and was on the verge of quitting the activity, I had to get past this question using Ryan's answer from the discussion forum of Quickhelper. That's very unfortunate, but I hardly had a choice :(
8) What is the non-cross-validated kappa, for the same set of variables you used for
question 4, if you use Step Regression (called Linear Regression)?
Just replace the Logistic Regression operator with the Linear Regression operator.
9) What is the non-cross-validated kappa, for the same set of
variables you used for question 4, if you use k-NN instead of W-J48? (We will
discuss the results of this test later).
Replace the Linear Regression operator with the k-NN operator.
10) What is the kappa, for the same set of variables you used
for question 4, if you use W-J48, and conduct 10-fold stratified-sample
cross-validation?
To cross-validate the model, you can refer to the RapidMiner walkthrough. You need to add the X-Validation operator. Remove the W-J48, Apply Model and Performance operators from the main process and add them inside the X-Validation operator: W-J48 goes in the training part, Apply Model and Performance go in the testing part.
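What 10-fold stratified cross-validation does with the rows can be sketched like this. It is a simplified illustration in plain Python, not RapidMiner's implementation: the function names and the 70/30 label mix are made up, and the rows are dealt out in order without the shuffling a real tool would normally apply.

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign each row index to one of k folds, keeping class proportions similar."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        # Deal the rows of each class round-robin, continuing where the
        # previous class stopped, so fold sizes stay balanced.
        for index in by_class[label]:
            folds[position % k].append(index)
            position += 1
    return folds

def train_test_splits(folds):
    """Yield (train_indices, test_indices); each fold is held out exactly once."""
    for held_out, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test

# Made-up labels: 70 on-task rows and 30 off-task rows.
labels = ["Y"] * 70 + ["N"] * 30
folds = stratified_folds(labels, k=10)
splits = list(train_test_splits(folds))
```

The model is trained ten times, each time on nine folds, and scored only on the fold it has not seen, so the reported kappa reflects performance on unseen rows.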
11) Why is the kappa lower for question 10 (cross-validation)
than question 4 (no cross-validation)?
K-NN predicts a point using itself when cross-validation is turned
off, and that’s bad.
You should be able to answer this question, otherwise the system will help.
12) What is the kappa, for the same set of variables you used for question 4, if you use
k-NN, and conduct 10-fold stratified-sample cross-validation?
Replace W-J48 with k-NN inside the training part of the X-Validation operator.
13) k-NN and W-J48 got almost the same Kappa when compared using
cross-validation. But the kappa for k-NN was much higher (1.000) when
cross-validation was not used. Why is that?
You should be able to answer this question as well, else the system will help.
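If you want to see the effect behind this question for yourself, here is a tiny sketch of nearest-neighbour prediction in plain Python, assuming k = 1 and squared Euclidean distance. The points and labels are made up; the only purpose is to compare predicting rows that are in the training set with predicting rows that have been held out.

```python
def nearest_label(point, train_points, train_labels):
    """1-NN: return the label of the training point closest to `point`."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(point, other)) for other in train_points
    ]
    return train_labels[distances.index(min(distances))]

# Made-up data for illustration only.
points = [(1.0, 1.0), (1.2, 0.9), (3.0, 3.1), (1.1, 1.0), (2.9, 3.0), (3.1, 1.0)]
labels = ["Y", "Y", "N", "N", "N", "Y"]

# No cross-validation: each point is in the training set, so its nearest
# neighbour is itself (distance 0) and it gets its own label back.
resubstitution = [nearest_label(p, points, labels) for p in points]

# Leave-one-out: the point being predicted is removed from the training set first.
held_out = []
for i, p in enumerate(points):
    rest_points = points[:i] + points[i + 1:]
    rest_labels = labels[:i] + labels[i + 1:]
    held_out.append(nearest_label(p, rest_points, rest_labels))

train_accuracy = sum(a == b for a, b in zip(labels, resubstitution)) / len(labels)
held_out_accuracy = sum(a == b for a, b in zip(labels, held_out)) / len(labels)
```

Comparing `train_accuracy` with `held_out_accuracy` shows why a perfect score without cross-validation says little about how the model behaves on new data.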
I wanted to post a tutorial with pictures to help newcomers, but I didn't have time for that since I haven't started Week 6 yet, which is already running. I hope my tips help. All the best!