The aim of this exercise is to become familiar with classification, i.e. automatic learning from a set of
examples, using data analysis tools and selected machine learning methods.
The assignment should be carried out with a data mining program such as WEKA or RapidMiner. It can also be done
with a different tool for statistical analysis or data mining, such as the statistical package in MATLAB.
The main part of the assignment is to exercise at least two selected methods of inductive machine learning on a
good set of statistical data describing some real phenomenon. What makes a data set "good", and what it means "to
exercise", is explained below.
- 1. Data
-
The data should describe a statistical phenomenon as a number of cases (samples), each expressed as a vector of
attribute values. Ideally, the data can be understood, at least partially, by a layman. The data must be real, and
the report should cite its source.
Artificially generated data are generally not suitable for this exercise. In a statistical set the same samples can
be repeated, and usually not all possible combinations of values occur in the set (some can be quite unusual, or
even physically unrealizable). A set of all possible combinations of attribute values, with one sample of each, is
in practice never a statistic.
Note that data sets on automobiles are a poor choice: it is almost impossible to perform a correct analysis on such
a set. There is something in human thinking about cars that does not lend itself to machine learning. Besides, most
such sets found on the Internet are not statistics.
Please also avoid the most commonly used data sets: the irises and the fungi (mushrooms). The latter is a good
collection of statistical data; however, due to its popularity and its descriptions in many studies, it is
difficult to perform a truly independent exercise with it.
Some sets of statistical data are so good that... they are too good: they have been optimized for classification.
For example, they comprise already-filtered attributes, none of which can be removed without lowering the quality
of classification. Sometimes an optimized discretization of attribute values has been performed, converting the
attribute(s) into discrete ones. Such a set can be recognized by the fact that any operation on the attributes
only worsens the classification. A valuable, informative exercise on such a set is impossible. Sometimes the
original, non-optimized version of such a set can be found, and that version would be fit for the exercise.
Note 1:
Rough quantitative requirements: a minimum of 5 attributes and a minimum of 500 data samples, although a
comfortable minimum is closer to 1,000 samples, and above 10,000 samples the interesting work begins. The number of
samples needed for a meaningful analysis depends on the number of attributes and their value sets. 200 samples may
be a lot of data for five binary attributes (2^5 = 32 possible combinations), but for five attributes with 5
possible values each (5^5 = 3125 combinations) even 2,000 samples is too little to achieve any learning.
Note 2:
If the data are numeric, especially floating point, discretization is necessary: dividing the values into ranges.
This discretization can be done manually before the analysis starts, if you know how to do it properly. Some
machine learning algorithms automatically find significant range boundaries for a set. An analysis of the
efficiency of discretization is also an important part of data mining, and the programs offer a number of methods
for this purpose.
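As a small illustration outside WEKA or RapidMiner (an assumption of this sketch: Python with scikit-learn and synthetic data), equal-width and equal-frequency discretization of a numeric attribute can be compared like this:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# A numeric attribute, e.g. ages of 1000 people (synthetic, for illustration).
rng = np.random.default_rng(0)
ages = rng.normal(40, 12, size=(1000, 1))

# Equal-width bins: simple, but sensitive to skew and outliers.
width = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
# Equal-frequency (quantile) bins: each range holds about the same number of samples.
freq = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")

w = width.fit_transform(ages)
f = freq.fit_transform(ages)
print(np.bincount(w.astype(int).ravel()))  # uneven counts per bin
print(np.bincount(f.astype(int).ravel()))  # roughly 200 samples per bin
```

Which strategy is better depends on the data and the learning algorithm, which is exactly what the discretization experiments should establish.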
Note 3:
In a real data set there may be samples with missing attribute values. Samples may also be contradictory, i.e., the
same combination of attribute values may appear with different classes. Neither case precludes the analysis of such
a set, although the effectiveness of learning will be reduced. Programs for data analysis usually deal with missing
values, whereas contradictory samples can be detected with a separate program and dealt with properly, e.g. removed
from the set, at least for the preliminary analysis.
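Detecting contradictory samples amounts to finding attribute combinations that occur with more than one class. A minimal sketch of such a separate program in Python (the toy samples are hypothetical):

```python
from collections import defaultdict

# Toy data: each sample is (attribute values..., class).
samples = [
    ("sunny", "hot", "yes"),
    ("sunny", "hot", "no"),    # contradicts the sample above
    ("rainy", "mild", "no"),
    ("rainy", "mild", "no"),   # a duplicate, but not contradictory
]

# Group the classes seen for each combination of attribute values.
classes_by_attrs = defaultdict(set)
for *attrs, cls in samples:
    classes_by_attrs[tuple(attrs)].add(cls)

# A combination is contradictory if it appears with more than one class.
contradictory = {a for a, cs in classes_by_attrs.items() if len(cs) > 1}
print(contradictory)  # {('sunny', 'hot')}
```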
Note that every large enough set of real data may contain erroneous samples, which are generally impossible to
detect (except in some obvious cases, e.g. when the age of a person falls outside the range 0..130, or some
percentage outside the range 0..100). They reduce the efficiency and accuracy of machine learning, but do not
prevent it altogether.
- 2. Preliminary analysis
-
When selecting a set of statistical data for a machine learning experiment, one should check such aspects as the
number of attributes, the types of their values, the specific value sets, abnormal values, missing and conflicting
data, etc. It is also worth examining the detailed distribution of each attribute (e.g. its histogram). This
analysis is often useful at the later stages, when various operations are performed on the value sets.
For some data sets it is obvious which attribute is the class (for a classification experiment) or the function
value (for a regression experiment). Otherwise an attribute must be chosen arbitrarily, and you can experiment with
different choices, achieving different results.
Having selected a data set and the distinguished class attribute, a good next step is to review the histograms of
each attribute with the samples divided into classes.
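In tabular form, such a class-split histogram of one attribute is simply a contingency table. A minimal Python sketch with pandas (the tiny data frame is hypothetical):

```python
import pandas as pd

# A hypothetical data set with one attribute and the class attribute "play".
df = pd.DataFrame({
    "outlook": ["sunny", "sunny", "rainy", "rainy", "overcast", "sunny"],
    "play":    ["no",    "no",    "yes",   "yes",   "yes",      "yes"],
})

# Per-class histogram of the attribute: counts of each value, split by class.
hist = pd.crosstab(df["outlook"], df["play"])
print(hist)
```

An attribute whose value distribution differs strongly between classes is a promising one for classification; one with near-identical rows carries little information about the class.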
Another basic experiment is to determine a reference level for the target classification accuracy. One way to
determine this level is the inverse of the cardinality of the class attribute's value set. For two values the
reference level is 1/2, but for a larger number of values, say six, it is much lower: 1/6.
Another approach is to take into account the most numerous class. Even if there are many class values (e.g. 16),
if the most frequent value covers e.g. 45% of the samples, a simple majority vote reaches 45% accuracy, and that
may be the reference level. The majority-vote classifier is sometimes called ZeroR.
Still other approaches to calculating the reference level can be taken when there is a default class, or a default
classification model.
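The majority-vote (ZeroR) reference level can be computed in a few lines of Python (the label counts below are hypothetical):

```python
from collections import Counter

# Hypothetical class labels of a data set.
labels = ["healthy"] * 45 + ["flu"] * 30 + ["cold"] * 25

# ZeroR: always predict the most frequent class.
majority_class, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)
print(majority_class, baseline_accuracy)  # healthy 0.45
```

Any trained classifier should beat this number before its accuracy can be called a result of learning.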
Knowing the reference level of classification accuracy is important for evaluating the classification results. If
the accuracy of classification does not exceed the reference level, it must be concluded that the classification
cannot be learned automatically from the given set of attributes.
- 3. Classification - the first experiment
-
For the first classification experiment the Naive Bayes classifier is often used. Although it makes assumptions
that are not satisfied by many data sets, it nonetheless often works very well, in many cases as well as much more
advanced algorithms. At the same time, both the construction of the classifier and its application are extremely
simple, which makes it an attractive algorithm.
For the first experiment, the classification typically uses all the attributes and the maximum set of samples (it
is worth reducing the sample set only for very large sets, say more than 100,000 samples, to lower the computation
time). For numeric attributes it is best to rely on the automatic discretization carried out by the program.
The purpose of this analysis is to determine the classification error on the test set and the other learnability
metrics useful in the more sophisticated experiments.
The work should start with calculating the classification error on the training set, using the whole set of
samples. This measures the learnability of the whole set, which depends on the properties of the set, the errors
it contains, etc. At the same time it sets an upper bound on the learnability measured on a testing set.
The next stage should be experiments calculating the classification error on a testing set. The cross-validation
error may be the first indicator of the real learnability of the set. Typically it is slightly larger than the
error on the training set. If it is much larger than the training-set error, this may indicate poor learnability
of the given data set; for example, the set may be too small.
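Both measurements can be sketched in Python with scikit-learn (an assumption of this sketch, since the assignment itself uses a tool like WEKA or RapidMiner; the data set here is synthetic and would be replaced by your own samples):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for a real data set: 1000 samples, 5 attributes.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           random_state=0)

nb = GaussianNB().fit(X, y)
train_error = 1 - nb.score(X, y)                        # error on the training set
cv_error = 1 - cross_val_score(nb, X, y, cv=10).mean()  # 10-fold cross-validation error
print(f"training error: {train_error:.3f}, CV error: {cv_error:.3f}")
```

The CV error is usually slightly larger than the training error; a large gap between the two is the warning sign discussed above.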
If very much data is available, you can start with a small training set and a large testing set. If necessary, the
training set can be enlarged until the classifier reaches a certain good level of results. Conversely, if the data
are few (e.g. a few hundred samples), it often makes sense to start learning with a large training set (e.g. 99% of
all samples), and then shrink it, to see how big a training set is necessary to train the classifier. Always check
whether increasing the training set improves the classification accuracy.
Note:
The classification error obtained for various data sets may have widely different absolute values. Sometimes it is
very small, a fraction of a percent; sometimes 5-10%; and sometimes much more, like 30-40%. The latter situation
indicates weak learnability, but such accuracy is not necessarily useless, and the success of the experiment should
not be judged on it alone.
- 4. Minimizing error
-
The aim of this task is to achieve the smallest possible testing error. Having determined the initial value of the
error in the first analysis, consider how it can be reduced. The following methods often turn out to be useful:
- selection of a subset of attributes
Some attributes may be less important or completely irrelevant. Some may seem to have an impact on the
classification, but this impact may be only apparent. Such attributes should be eliminated. Even if certain
attributes do have an impact on the classification, the principle of minimalism often prevails: a simple selection
rule may be more effective, and a smaller set of attributes may give better (or no worse) results.
There are algorithms which automatically select a subset of attributes. Take advantage of them, but make the final
decision on the selected attribute set manually.
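A minimal sketch of automatic attribute selection in Python with scikit-learn (assumed tooling; the data set is synthetic): score every attribute, keep the best few, and compare the cross-validated accuracy before and after.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic data: 8 attributes, only 3 of them informative.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           n_redundant=2, random_state=0)

# Rank attributes by an ANOVA F-score and keep the top 3.
selector = SelectKBest(f_classif, k=3).fit(X, y)
X_top = selector.transform(X)

acc_all = cross_val_score(GaussianNB(), X, y, cv=10).mean()
acc_top = cross_val_score(GaussianNB(), X_top, y, cv=10).mean()
print(f"all 8 attributes: {acc_all:.3f}, best 3: {acc_top:.3f}")
```

As the text advises, such a ranking is a starting point: the final attribute set should be chosen manually after inspecting the candidates.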
- discretization of numerical attributes
Numeric attributes, and those with many values (e.g. >> 10), must be discretized. There are algorithms to optimize
this discretization. It is advisable to experiment with them, and then choose the optimal method. In some cases the
best results can only be obtained by discretizing some variable manually, by pre-processing the data with a custom
program.
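One way to run such an experiment, sketched in Python with scikit-learn (assumed tooling; synthetic data), is to put the discretizer and the classifier in a pipeline and let cross-validation judge each discretization strategy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           random_state=0)

# Compare discretization strategies by the resulting cross-validated accuracy.
scores = {}
for strategy in ("uniform", "quantile", "kmeans"):
    model = make_pipeline(
        KBinsDiscretizer(n_bins=5, encode="ordinal", strategy=strategy),
        # min_categories keeps the NB model valid even if a fold misses a bin.
        CategoricalNB(min_categories=5),
    )
    scores[strategy] = cross_val_score(model, X, y, cv=10).mean()
print(scores)
```

Fitting the discretizer inside the pipeline, rather than on the whole set beforehand, keeps the cross-validation estimate honest.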
- selecting a machine learning algorithm
Some programs have many machine learning methods implemented. The basic, frequently used ones are: decision trees,
k nearest neighbors, the multilayer neural network (also referred to as the feed-forward neural network, or the
Multi-Layer Perceptron, MLP), and support vector machines. Experimenting with other methods is also acceptable and
desirable.
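Comparing the listed methods on one data set can be sketched in Python with scikit-learn (assumed tooling; synthetic data) as a simple loop over candidate classifiers:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           random_state=0)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-NN":          KNeighborsClassifier(n_neighbors=5),
    "MLP":           MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000,
                                   random_state=0),
    "SVM":           SVC(),
}
# 10-fold cross-validated accuracy for each method.
results = {name: cross_val_score(clf, X, y, cv=10).mean()
           for name, clf in candidates.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

The hyperparameters above (k=5, one hidden layer of 20 units, etc.) are arbitrary starting points; tuning them is part of the experiment.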
- special methods, resulting from the analysis of the problem
In the analysis of a problem we face a number of special situations which require unusual approaches; a few of them
are mentioned below. Additionally, it is worth considering the specifics of the problem at hand and possible custom
procedures appropriate for it.
- Sometimes the problem is a sample with a non-standard attribute value. The procedure may differ depending
on the circumstances. Individual samples with abnormal values of some attribute may simply be wrong, and should
then be removed from the data set to obtain better learning.
However, sometimes unusual attribute values are important from the point of view of the classifier's application,
and such samples cannot be removed. We need to find a way for them to be effectively considered by the ML
algorithm. For example, a specific attribute value widely different from the typical range of values causes linear
discretization to put most samples into one discretization bin, effectively eliminating this attribute from the
classification. Non-linear discretization can be used then, for example with a range from some x to infinity.
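The effect, and the fix with an open-ended top range, can be sketched in Python with pandas (the income values are hypothetical):

```python
import numpy as np
import pandas as pd

# Incomes: mostly 25k-90k, plus one extreme outlier.
income = pd.Series([25_000, 40_000, 55_000, 70_000, 90_000, 5_000_000])

# Equal-width (linear) bins: the outlier pushes almost everything into one bin.
linear = pd.cut(income, bins=3)
# Manually chosen edges, with an open-ended top range "from x to infinity".
manual = pd.cut(income, bins=[0, 40_000, 80_000, np.inf],
                labels=["low", "mid", "high"])
print(linear.value_counts().max())  # 5 of the 6 samples land in the first bin
print(manual.tolist())              # ['low', 'low', 'mid', 'mid', 'high', 'high']
```

With the manual edges the attribute keeps its discriminating power, while the outlier simply joins the "high" range.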
- Sometimes the job of the classifier is to detect a rare phenomenon. Imagine X-ray screening of the lungs,
where most people are healthy, but in rare cases tuberculosis is detected. The data set may contain, for example,
100,000 healthy subjects and 100 cases of tuberculosis. A classifier that recognizes the tuberculosis samples
poorly, e.g. with an 80% error, but recognizes the normal cases very well, e.g. with a 0.1% error, will have about
a 0.1% overall error and could be regarded as satisfactory. (The same result is obtained by the trivial ZeroR
classifier, which considers all people healthy.) But the real purpose of the screening, and of the classification,
is precisely to detect the tuberculosis cases. The quality measure of such a classifier should really be the error
on the tuberculosis cases alone, but few machine learning programs accept such a criterion. A workaround may be,
for example, to replicate the tuberculosis samples so that their number is comparable with the healthy subjects.
- Generalizing the above case: if a class has few samples, but is important for the classification, we can
replicate the samples of this class in the data set to increase their importance for the ML algorithm.
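The replication workaround is a few lines of Python (the sample identifiers and counts below are hypothetical, scaled down from the screening example):

```python
from collections import Counter

# Hypothetical screening data: (sample id, class label).
samples = [("h%d" % i, "healthy") for i in range(1000)] + \
          [("t%d" % i, "tb") for i in range(10)]

counts = Counter(label for _, label in samples)
factor = counts["healthy"] // counts["tb"]   # 1000 // 10 = 100

# Replicate each rare-class sample so both classes carry comparable weight.
balanced = [s for s in samples if s[1] == "healthy"]
balanced += [s for s in samples if s[1] == "tb"] * factor

print(Counter(label for _, label in balanced))  # healthy: 1000, tb: 1000
```

Some tools achieve the same effect more cleanly with per-class sample weights; replication is simply the workaround that works everywhere.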
- Sometimes there are a number of classes and the classifier identifies some of them quite well, but others
less so, or certain classes seem indistinguishable. This can be seen by analyzing the confusion matrix.
What can be done? Perhaps distinguishing the indistinguishable classes is not essential for the final result, and
they can be combined to obtain a good classification accuracy. If distinguishing the "indistinguishable" classes
is important, however, it is possible to build a second-level classifier, trained in a separate experiment to
distinguish only between the samples of these classes.
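Producing and reading the confusion matrix can be sketched in Python with scikit-learn (assumed tooling; synthetic three-class data):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data set.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)

# Out-of-fold predictions give an honest confusion matrix.
pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=10)
cm = confusion_matrix(y, pred)
print(cm)  # rows: true class, columns: predicted class
```

A pair of classes with large symmetric off-diagonal entries is the "indistinguishable" case discussed above: a candidate for merging, or for a dedicated second-level classifier.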
- 5. Summary of the results
-
After all the error-minimizing experiments, using the optimal training configuration (or several alternative
configurations), calculate the error on the training and testing sets (usually using cross-validation), compare
the results obtained, and formulate conclusions.
The last step should be determining the minimum number of samples required to train the classifier. Repeat the
learning experiment with increasingly smaller training sets, calculating the average classification error on the
testing set (e.g. using cross-validation). This way we can determine the boundary size of the training set below
which the error begins to increase significantly; that is the minimum required number of samples.
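This procedure is exactly a learning curve, and can be sketched in Python with scikit-learn (assumed tooling; synthetic data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           random_state=0)

# Cross-validated test accuracy for 10 training-set sizes, from 5% to 100%.
sizes, _, test_scores = learning_curve(
    GaussianNB(), X, y, cv=10,
    train_sizes=np.linspace(0.05, 1.0, 10),
    shuffle=True, random_state=0)

for n, s in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:5d} training samples -> CV accuracy {s:.3f}")
```

The smallest training size whose accuracy is still close to the full-set accuracy is the minimum required number of samples.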
References:
UCI Machine Learning Repository
CMU Statistical datasets
RapidMiner tutorial