The aim of this exercise is to become familiar with classification, i.e., automatic learning from a set of examples, using data analysis tools and selected machine learning methods.

The assignment should be carried out using a data mining program such as WEKA or RapidMiner. It can also be done with another tool for statistical analysis or data mining, such as the statistical package in Matlab.

The main part of the assignment is to exercise at least two selected methods of inductive machine learning on a good set of statistical data describing some real phenomenon. How to choose a "good" data set, and what "to exercise" means, is explained below.

1. Data

The data should describe a statistical phenomenon: a collection of cases (samples), each described by a vector of attributes. Ideally, the data should be at least partially understandable to a layman. The data must be real, and the report should cite its source.

Artificially generated data are generally not suitable for this experiment. In a statistical set the same samples may be repeated, and usually not all possible combinations of attribute values occur in the set (some may be quite unusual, or even physically unrealizable). A set containing all possible combinations of attribute values, one sample of each, is in practice never a statistical sample.

Beware of data sets about automobiles - it is almost impossible to perform a correct analysis on such a set. There is something in human thinking about cars that does not lend itself to machine learning. Besides, most such sets found on the Internet are not statistical samples.

Please also do not use the most common data sets: irises or mushrooms. The latter is a good collection of statistical data; however, due to its popularity and the descriptions available in many studies, it is difficult to perform a truly independent exercise with it.

Some sets of statistical data are so good that ... they are too good: they have already been optimized for classification. For example, they contain only pre-filtered attributes, none of which can be removed without lowering the quality of classification. Sometimes an optimized discretization of attribute values has already been performed, converting numeric attributes into discrete ones. Such a set can be recognized by the fact that any operation on its attributes only worsens the classification. Carrying out a valuable, informative exercise on such a set is impossible. Sometimes the original, non-optimized version of such a set can be found, and that version would be suitable for the exercise.

Note 1:
Rough quantitative requirements: at least 5 attributes and at least 500 samples. A comfortable minimum is closer to 1,000 samples, and above 10,000 samples the interesting work begins. The number of samples needed for a meaningful analysis depends on the number of attributes and the sizes of their value sets: with five binary attributes there are only 2^5 = 32 possible attribute combinations, so 200 samples may be plenty of data, but with five attributes of 5 possible values each there are 5^5 = 3125 combinations, and 2,000 samples may be too few to achieve any learning.

Note 2:
If the data are numeric, especially floating point, discretization is necessary, i.e., dividing the values into ranges. Discretization can be done manually before the analysis starts, if you know how to do it properly. Some machine learning algorithms automatically find significant range boundaries for a set. Analyzing the effect of discretization is also an important part of data mining, and the programs offer a number of methods for this purpose.
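If the exercise is done in Python rather than WEKA/RapidMiner, manual discretization can be sketched as follows (the file name and the numeric column 'age' are hypothetical; equal-width and equal-frequency binning are shown as two common variants):

    import pandas as pd

    df = pd.read_csv("data.csv")   # hypothetical input file

    # Equal-width binning: 5 intervals of the same length
    df["age_eqwidth"] = pd.cut(df["age"], bins=5)

    # Equal-frequency binning: 5 intervals with roughly the same number of samples
    df["age_eqfreq"] = pd.qcut(df["age"], q=5, duplicates="drop")

    print(df["age_eqwidth"].value_counts().sort_index())
    print(df["age_eqfreq"].value_counts().sort_index())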

Note 3:
In a real data set there may be samples with missing attribute values. Samples may also be contradictory, i.e., the same combination of attribute values may appear with different classes. Neither case precludes the analysis of such a set, although the effectiveness of learning will be reduced. Programs for data analysis usually deal with missing values, whereas contradictory samples can be detected with a separate program and dealt with appropriately, e.g., removed from the set, at least for the preliminary analysis.
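Such a separate check for contradictory samples could look, for instance, like this (a sketch in Python/pandas; the file name and the class column 'class' are assumptions):

    import pandas as pd

    df = pd.read_csv("data.csv")   # hypothetical input file
    attribute_cols = [c for c in df.columns if c != "class"]

    # Count distinct class values per attribute combination;
    # more than one distinct class for the same combination means a contradiction.
    classes_per_combo = df.groupby(attribute_cols, dropna=False)["class"].nunique()
    contradictory_combos = classes_per_combo[classes_per_combo > 1]
    print(f"Contradictory attribute combinations: {len(contradictory_combos)}")

    # Optionally remove the contradictory samples for the preliminary analysis
    mask = df.set_index(attribute_cols).index.isin(contradictory_combos.index)
    df_clean = df[~mask]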

Note that any sufficiently large set of real data may contain erroneous samples, which are generally impossible to detect (except for some obvious cases, e.g., when the age of a person falls outside the range 0..130, or a percentage outside the range 0..100). They reduce the efficiency and accuracy of machine learning, but do not prevent it altogether.
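The obvious cases can be caught with simple range checks, for example (a sketch assuming hypothetical columns 'age' and 'percentage'):

    import pandas as pd

    df = pd.read_csv("data.csv")   # hypothetical input file

    # Flag samples with physically impossible values
    suspicious = df[(df["age"] < 0) | (df["age"] > 130) |
                    (df["percentage"] < 0) | (df["percentage"] > 100)]
    print(f"Suspicious samples: {len(suspicious)}")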

2. Preliminary analysis

When selecting a set of statistical data for a machine learning experiment, one should check such aspects as the number of attributes, the types of their values, the specific value sets, abnormal values, missing and contradictory data, etc. It is also worth examining the detailed distribution of each attribute (e.g., its histogram). This analysis is often useful at later stages, when various operations are performed on the attribute values.
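In Python/pandas such a preliminary overview could look roughly as follows (a sketch; the file name is hypothetical):

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("data.csv")

    print(df.shape)                      # number of samples and attributes
    print(df.dtypes)                     # type of each attribute
    print(df.isna().sum())               # missing values per attribute
    print(df.describe(include="all"))    # basic statistics per attribute

    # Histogram of every numeric attribute
    df.hist(bins=20)
    plt.show()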

For some data sets it is obvious which attribute is the class (for a classification experiment) or the function value (for a regression experiment). In other cases an attribute must be chosen arbitrarily, and you can experiment with different attributes, obtaining different results.

Having selected a data set and designated the class attribute, a good next step is to review the histogram of each attribute with the samples split by class.
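A per-class histogram of a single attribute can be produced, for example, like this (a sketch assuming hypothetical columns 'age' and 'class'):

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("data.csv")

    # Overlaid histograms of one attribute, one series per class value
    for class_value, group in df.groupby("class"):
        plt.hist(group["age"], bins=20, alpha=0.5, label=str(class_value))
    plt.legend()
    plt.xlabel("age")
    plt.show()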

Another basic experiment is to determine a reference level for the classification accuracy. One way to determine this level is to take the inverse of the number of values of the class attribute. For two class values the reference level is 1/2, but for a larger number of values, e.g., six, the reference level is much lower: 1/6.

Another approach to determining the reference level takes into account the most numerous class. Even if there are many class values (e.g., 16), but the most frequent value covers, say, 45% of the samples, a simple majority vote already achieves 45% accuracy, and that may be taken as the reference level. The majority-vote classifier is sometimes called ZeroR.
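Both reference levels can be computed in a few lines (a sketch assuming the class attribute is in a hypothetical column 'class'):

    import pandas as pd

    df = pd.read_csv("data.csv")
    class_counts = df["class"].value_counts()

    # Reference level 1: inverse of the number of class values
    print("1/k baseline:", 1 / len(class_counts))

    # Reference level 2: ZeroR, i.e., the relative frequency of the most numerous class
    print("ZeroR baseline:", class_counts.max() / len(df))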

Still other approaches to calculating the reference level can be taken when there is a default class or a default classification model.

Knowing the reference level of classification accuracy is important for evaluating the results of classification. If the classification accuracy does not exceed the reference level, it can be concluded that the classification cannot be learned automatically from the given set of attributes.

3. Classification - the first experiment

For the first classification experiment the Naive Bayes classifier is often used. Although it makes assumptions that are not satisfied in many data sets, it nonetheless often works very well, in many cases as well as much more advanced algorithms. At the same time, both the construction of the classifier and its application are extremely simple, which makes it an attractive algorithm.

The first classification experiment typically uses all the attributes and the maximum set of samples (it is worth reducing the set of samples only for a very large set, say more than 100,000 samples, to lower the computation time). For numeric attributes it is best to rely on the automatic discretization carried out by the program.
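If the exercise is done outside WEKA/RapidMiner, e.g., in Python with scikit-learn, a first Naive Bayes run could look roughly like this (a sketch; the file name, the 'class' column, the one-hot encoding of categorical attributes and the choice of GaussianNB are assumptions, not the only possible setup):

    import pandas as pd
    from sklearn.naive_bayes import GaussianNB

    df = pd.read_csv("data.csv")
    X = pd.get_dummies(df.drop(columns=["class"]))   # encode categorical attributes
    y = df["class"]

    model = GaussianNB()
    model.fit(X, y)

    # Classification accuracy measured on the training set itself
    print("Training-set accuracy:", model.score(X, y))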

The purpose of this analysis is to determine the classification error on the test set, together with other learnability indicators useful for the more sophisticated experiments.

The work should start with calculating the classification error on the training set, using the whole set of samples. This measures the learnability of the whole set, which depends on the properties of the set, the errors it contains, etc. At the same time it sets an upper bound on the learnability achievable on the testing set.

The next stage should be experiments with calculating the classification error on the testing set. The cross-validation error may be the first indicator of the real learnability of the set. Typically it is slightly larger than the error on the training set. If it is much larger than the training-set error, this may indicate poor learnability of the data set; for example, the set may be too small.
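With scikit-learn the cross-validation error can be estimated, for instance, as follows (a sketch continuing the previous assumptions):

    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    df = pd.read_csv("data.csv")
    X = pd.get_dummies(df.drop(columns=["class"]))
    y = df["class"]

    # 10-fold cross-validation; the mean score estimates accuracy on unseen data
    scores = cross_val_score(GaussianNB(), X, y, cv=10)
    print("Cross-validation accuracy:", scores.mean())
    print("Cross-validation error:", 1 - scores.mean())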

If a lot of data is available, you can start with a small training set and a large testing set. If necessary, the training set can be enlarged until the classifier reaches a sufficiently good result. Conversely, if there are few data (e.g., a few hundred samples), it often makes sense to start learning with a large training set (e.g., 99% of all samples) and then shrink it, to see how large a training set is necessary to train the classifier. Always check whether increasing the training set further improves the classification accuracy.

Note:
The classification error obtained for various data sets may have widely different absolute values. Sometimes it can be very small, such as a fraction of a percent, sometimes 5-10%, and sometimes much more, e.g., 30-40%. The latter situation indicates weak learnability, but such a result is not necessarily useless, and the success of the experiment should not be judged on this basis alone.

4. Minimizing error

The aim of this part of the work is to achieve the smallest possible testing error. Having determined the initial value of the error in the first analysis, consider how it can be improved (reduced). The following methods often turn out to be useful:

5. Summary of the results

After all the error-minimization experiments, using the optimal training configuration (or several alternative configurations), calculate the error on the training and testing sets (usually: using cross-validation), compare the results obtained, and formulate conclusions.

The last step should be to determine the minimum number of samples required to train the classifier. Repeat the learning experiment for increasingly smaller training sets, calculating the average testing-set classification error (e.g., using cross-validation). In this way we can determine the boundary size of the training set, below which the error begins to increase significantly. That is the minimum required number of samples.
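One way to carry out this step in scikit-learn is a learning-curve experiment (a sketch under the same assumptions as before; the training-set fractions are arbitrary):

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import learning_curve
    from sklearn.naive_bayes import GaussianNB

    df = pd.read_csv("data.csv")
    X = pd.get_dummies(df.drop(columns=["class"]))
    y = df["class"]

    # Train on increasing fractions of the data; test scores come from cross-validation
    sizes, train_scores, test_scores = learning_curve(
        GaussianNB(), X, y, cv=10,
        train_sizes=np.linspace(0.05, 1.0, 10))

    for n, score in zip(sizes, test_scores.mean(axis=1)):
        print(f"{n:6d} training samples -> CV accuracy {score:.3f}")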

References:

UCI Machine Learning Repository

CMU Statistical datasets

RapidMiner tutorial