The information gain measure is used to select the test attribute at each node in the
tree. We refer to such a measure as an attribute selection measure or a measure of
goodness of split. The algorithm computes the information gain of each attribute. The
attribute with the highest information gain is chosen as the test attribute for the given
set [18].
Consequently, the data must be preprocessed to select a subset of attributes to use in
learning. Learning schemes themselves try to select attributes appropriately and
ignore irrelevant and redundant ones, but in practice their performance can frequently
be improved by preselection. For example, experiments show that adding useless
attributes causes the performance of learning schemes such as decision trees and
rules, linear regression, instance-based learners, and clustering methods to deteriorate
[18].
In the tree building stage, the most important step is the selection of the test
attribute. The information gain measure is used to select the test attribute at each node in
the tree. Such a measure is referred to as an attribute selection measure or a measure
of the goodness of split. The attribute with the highest information gain (or greatest
entropy reduction) is chosen as the test attribute for the current node. This attribute
minimizes the information needed to classify the samples in the resulting partitions and
reflects the least randomness or “impurity” in these partitions.
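The selection step described above can be illustrated with a minimal sketch. It assumes nominal (already discretized) attribute values and uses hypothetical attribute names and data. It is not the implementation used by Weka, which also handles numeric attributes through threshold splits.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    base = entropy([row[target] for row in rows])
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(part) / len(rows) * entropy(part) for part in partitions.values())
    return base - remainder


def best_attribute(rows, attributes, target):
    """Attribute with the highest information gain, i.e. the test attribute for the node."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical, already discretized learner data (illustration only).
rows = [
    {"tests": "many", "time": "short", "result": "high"},
    {"tests": "many", "time": "long", "result": "high"},
    {"tests": "few", "time": "short", "result": "low"},
    {"tests": "few", "time": "long", "result": "low"},
]
chosen = best_attribute(rows, ["tests", "time"], "result")
```

In this toy data set, splitting on "tests" yields pure partitions while splitting on "time" leaves the classes evenly mixed, so "tests" is selected.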
Finally, the cross-validation evaluation technique measures the number of correctly and
incorrectly classified instances. We consider that if more than 80% of the instances are
correctly classified, then the data is good enough. The obtained model is
further used to analyze the learner’s goals and to obtain recommendations. The aim of
the LAS is to “guide” the learner along the correct path in the decision tree so that the
learner reaches the desired class.
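The evaluation criterion can be sketched as follows. The number of folds, the fold assignment, and the stand-in learner `train_lookup` are assumptions made for illustration only. The text does not state them, and the actual evaluation is performed on the decision tree model.

```python
from collections import Counter, defaultdict


def k_fold_accuracy(instances, labels, train, k=10):
    """Fraction of instances classified correctly across k folds.

    `train(X, y)` must return a function mapping one instance to a predicted label.
    """
    n = len(instances)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th instance belongs to this fold
        train_x = [instances[i] for i in range(n) if i not in test_idx]
        train_y = [labels[i] for i in range(n) if i not in test_idx]
        predict = train(train_x, train_y)
        correct += sum(predict(instances[i]) == labels[i] for i in test_idx)
    return correct / n


def train_lookup(X, y):
    """Hypothetical stand-in learner: majority label seen for each instance value."""
    by_value = defaultdict(Counter)
    for x, label in zip(X, y):
        by_value[x][label] += 1
    default = Counter(y).most_common(1)[0][0]

    def predict(x):
        return by_value[x].most_common(1)[0][0] if x in by_value else default

    return predict


def is_good_enough(accuracy, threshold=0.80):
    """The criterion from the text: more than 80% correctly classified."""
    return accuracy > threshold


# Hypothetical data: one nominal value per learner and its class label.
instances = ["many", "few"] * 10
labels = ["high", "low"] * 10
accuracy = k_fold_accuracy(instances, labels, train_lookup, k=5)
```

The threshold comparison is strict because the text requires more than 80% of the instances to be classified correctly.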
The main characteristic of the LAS is that it uses a machine learning algorithm for
obtaining knowledge regarding learners. The e-Learning environment produces data
regarding the activity of learners and passes this data to LAS. The LAS creates and
maintains a learner’s model based on data received from the e-Assessment tool. This
architecture allows the usage of LAS along with any e-Learning platform as long as
the data is in the accepted format.
The raw data is dumped by the e-Learning platform into a log file, activity.log. The log
file, together with the database relations, represents the raw data available for the analysis
process. Because we use Weka [19], the data is extracted and translated into a standard
format called ARFF, for Attribute Relation File Format [20, 21]. This involves taking
the physical log file and database relations and processing them through a series of
steps to generate an ARFF dataset.
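As an illustration of the final step, the sketch below renders already computed rows as ARFF text. The relation name, the attribute list, the class values, and the data are placeholders rather than the actual dataset, and the function `to_arff` is hypothetical.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string "numeric"
    or a list of nominal values. rows: lists of values in attribute order.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        # Nominal attributes are declared as a brace-enclosed list of values.
        spec = kind if isinstance(kind, str) else "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Placeholder attributes and values (illustration only).
arff_text = to_arff(
    "activity",
    [("noOfTests", "numeric"), ("avgFinalResults", "numeric"), ("level", ["low", "high"])],
    [[12, 8.5, "high"], [3, 4.0, "low"]],
)
```

The header declares one `@attribute` line per feature, and each line after `@data` holds one instance as comma-separated values.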
At this phase, the most important decision concerns feature selection for
instances. A large number of features that describe the activity of a student may be
derived. Choosing the attributes depends heavily on the available data, domain
knowledge, and experience. For our classification we choose four attributes:
noOfTests – the number of tests taken, avgTimeForTesting – the average time spent on
testing, avgResultsOnTests – the average results obtained in testing, and avgFinalResults –
the average of the final results. For each registered student, the values of these attributes are
determined based on the raw data from the log files and database relations. Each
student is referred to as an instance within the classification process.
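The per-student aggregation described above might look like the following sketch. The record layout (student identifier, seconds spent, score) and the key names are hypothetical, since the structure of the log file and database relations is not given here. Only the three test-related values are derived, because the final results come from a separate source.

```python
from collections import defaultdict
from statistics import mean


def student_features(test_records):
    """Aggregate per-test records into one feature row (instance) per student.

    test_records: iterable of (student_id, seconds_spent, score) tuples;
    this layout is an assumption made for illustration.
    """
    by_student = defaultdict(list)
    for student_id, seconds, score in test_records:
        by_student[student_id].append((seconds, score))
    return {
        student_id: {
            "no_of_tests": len(tests),
            "avg_time": mean([seconds for seconds, _ in tests]),
            "avg_result": mean([score for _, score in tests]),
        }
        for student_id, tests in by_student.items()
    }


# Hypothetical records (illustration only).
records = [("s1", 60, 8), ("s1", 120, 6), ("s2", 90, 10)]
features = student_features(records)
```

Each entry of `features` corresponds to one instance of the classification process.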
The values of the attributes are computed for each instance by a custom-developed
offline Java application. The outcome of running the application is in the