Data Mining with Weka (2.3: Repeated training and testing)
Hello again! In the last lesson, we looked
at training and testing. We saw that we can evaluate a classifier on
an independent test set, or using a percentage split, with a certain percentage of the dataset
used to train and the rest used for testing, or — and this is generally a very bad idea — we
can evaluate it on the training set itself, which gives misleadingly optimistic performance
figures. In this lesson, we’re going to look a little
bit more at training and testing. What we’re going to do is repeatedly
train and test using a percentage split. Now, in the last lesson, we saw that if you
simply repeat the training and testing, you get the same result each time, because Weka initializes the random number generator before each run. That’s deliberate: it makes sure you know what’s going on, so that if you rerun the same experiment tomorrow you’ll get the same result.
But there is a way of overriding that, so we will be using independent random numbers on different occasions to produce a percentage split of the dataset into a training and test
set. I’m going to open the segment-challenge data
again. That’s what we used before. Notice there are 1500 instances here; that’s quite a lot. I’m going to go to Classify. I’m going to choose J48, our standard method,
I guess. I’m going to use a percentage split, and because
we’ve got 1500 instances, I’m going to choose 90% for training and just 10% for testing. I reckon that 10% — that’s 150 instances — for
testing is going to give us a reasonable estimate, and we might as well train on as many as we
can to get the most accurate classifier. I’m going to run this, and the accuracy figure
I get, the same as in the last lesson, is 96.6667%. As we’ll see, that single figure is misleadingly high. I’m going to call it 96.7%, or 0.967. And then I’m going to do it again and see how much variation we get from that figure when we initialize the random number generator with a different seed each time.
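To make concrete what that seed controls, here’s a minimal sketch of the same experiment using Weka’s Java API. The class name and the path to segment-challenge.arff are placeholder assumptions; the split logic mirrors what the Explorer’s percentage-split option does, namely shuffle the data with the seed and then cut off the first 90% for training.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PercentageSplit {
    // Evaluate J48 on a 90%/10% percentage split, shuffled with the given seed.
    static double accuracy(Instances data, int seed) throws Exception {
        Instances shuffled = new Instances(data);   // copy, so the original stays untouched
        shuffled.randomize(new Random(seed));       // the "random seed" option controls this shuffle
        int trainSize = (int) Math.round(shuffled.numInstances() * 0.9);
        Instances train = new Instances(shuffled, 0, trainSize);
        Instances test  = new Instances(shuffled, trainSize,
                                        shuffled.numInstances() - trainSize);

        J48 tree = new J48();
        tree.buildClassifier(train);                // train on 90% (1350 instances)

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);             // test on the held-out 10% (150 instances)
        return eval.pctCorrect();                   // percentage of correct classifications
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("segment-challenge.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);               // class is the last attribute
        System.out.printf("seed 1: %.4f%%%n", accuracy(data, 1));
    }
}
```

With seed 1 this should reproduce the 96.6667% figure; changing the seed changes which 150 instances end up in the test set, which is the whole point of what follows.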
If I go to the More Options menu, I get a
number of options here which are quite useful: outputting the model (we’re doing that); outputting statistics; outputting different evaluation measures; outputting the confusion matrix (we’re doing that too); storing the predictions for visualization; outputting the predictions themselves if we want; doing a cost-sensitive evaluation; and setting the random seed for cross-validation
or percentage split. That’s set by default to 1. I’m going to change that to 2, a different
random seed. We could also output the source code for the
classifier if we wanted, but I just want to change the random seed. Then I want to run it again. Before, we got 0.967; this time we get 0.94, that is, 94%. Quite different, you see. If I were then to change this again to, say, 3, and run it again, again I get 94%. If I change it to 4 and run it again,
I get 96.7%. Let’s do one more. Change it to 5, run it again, and now I get
95.3%. Here’s a table with these figures: so far, seeds 1 through 5 have given 0.967, 0.940, 0.940, 0.967, and 0.953. If we run it 10 times, we get this set of
results.
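Rather than clicking through the Explorer ten times, we could loop over the seeds. A sketch, reusing the hypothetical accuracy() method from the earlier snippet:

```java
// Repeat the 90%/10% split experiment with seeds 1 through 10.
double[] results = new double[10];
for (int seed = 1; seed <= 10; seed++) {
    results[seed - 1] = accuracy(data, seed) / 100.0;  // store as a fraction, e.g. 0.967
    System.out.printf("seed %d: %.3f%n", seed, results[seed - 1]);
}
```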
Given this set of experimental results, we
can calculate the mean and standard deviation. The sample mean is the sum of all of these
error figures — or these success rates, I should say — divided by the number, 10 of
them. That’s 0.949, about 95%. That’s really what we would expect to get. That’s a better estimate than the 96.7% that
we started with; a more reliable estimate. We can calculate the sample variance: we subtract the mean from each of these numbers, square the differences, add them up, and divide, not by n, but by n – 1. That might surprise you. The reason for the n – 1 is that we’ve calculated the mean from this very sample, which uses up one degree of freedom; dividing by n – 1 corrects for that and gives a slightly larger variance estimate than dividing by n would. We take the square root of that, and in this
case, we get a standard deviation of 1.8%.
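Written out, with x_1, ..., x_n the n = 10 observed success rates, the quantities we just computed are:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \approx 0.949,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2,
\qquad
s = \sqrt{s^2} \approx 0.018
```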
Now you can see that the real performance
of J48 on the segment-challenge dataset is approximately 95% accuracy, plus or minus
approximately 2%: anywhere, let’s say, between 93% and 97% accuracy. The figures that Weka puts out for you can be misleading if you read them too literally. You need to be careful how you interpret them, because the true performance is certainly not exactly 95.3333%, just because one run printed that. There’s a lot of variation in these
figures. Remember, the basic assumption is that the training and test sets are sampled independently from an infinite population, so you should expect some variation in results, perhaps more than just a slight variation. You can estimate that variation by repeating the experiment with different random-number seeds, and then calculate the mean and the standard deviation experimentally, which is what we just did.
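In code, that experimental mean-and-standard-deviation step might look like this; a sketch assuming the results array of ten success rates collected in the loop earlier (note the n – 1 in the variance):

```java
// Sample mean of the ten success rates.
double sum = 0;
for (double r : results) sum += r;
double mean = sum / results.length;                      // about 0.949 for the figures above

// Sample variance: squared deviations from the mean, divided by n - 1, not n.
double sumSq = 0;
for (double r : results) sumSq += (r - mean) * (r - mean);
double stdDev = Math.sqrt(sumSq / (results.length - 1)); // about 0.018, i.e. 1.8%

System.out.printf("mean %.3f, std dev %.3f%n", mean, stdDev);
```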
Off you go now, and do the activity associated
with this lesson. I’ll see you in the next lesson. Bye!