Centroid-based Clustering for Student Models in Computer-based Multiple Language Tutoring

Maria Virvou, Efthymios Alepis and Christos Troussas

Department of Informatics, University of Piraeus, 80, Karaoli and Dimitriou str., Piraeus, Greece

Keywords: User Modelling, User Clustering, Multiple Language Learning, Intelligent Tutoring Systems, K-means Algorithm.

Abstract: This paper proposes an approach for the initialization and construction of student models in an intelligent tutoring system that teaches multiple foreign languages. The basic concept is to assign each new student to an initial user model with similar characteristics. Since a tutoring system has little information about its new users, our aim is to infer as much information as possible about each user from the user’s initial data. To this end, a machine learning algorithm, namely k-means, creates clusters from the system’s pre-stored past data, and each new entry is then assigned to the nearest centroid.

1 INTRODUCTION

In the past few years, there has been an increasing focus on the use of the Internet, which allows greater flexibility in all aspects of modern life, especially with the spread of unmetered high-speed connections. Educational material at all levels is available on the Internet, and it has never been easier for people to access educational information at any level from any place. The low cost and nearly instantaneous sharing of ideas, knowledge and skills have made distance learning feasible for people with little spare time. Not only can a group communicate and share ideas cheaply, but the wide reach of the Internet also allows such groups to form easily and efficiently. Hence, the development of web-based applications has become commonplace.

Moreover, the emerging needs of modern life accentuate the importance of learning foreign languages (Virvou and Troussas, 2011). Within the scientific area of Intelligent Tutoring Systems (ITSs), there is an increasing interest in computer-assisted foreign language instruction. Students may benefit further when these systems offer the possibility of learning multiple languages at the same time (Virvou et al., 2000).

An issue of great importance in e-learning is personalization, since it is quite difficult to monitor users’ learning patterns (Licchelli et al., 2004). Personalization is performed through student modeling, which consists of the analysis of students’ behavior and the prediction of their future behavior and learning performance. One solution is to exploit automatic tools for the generation and discovery of user profiles in order to obtain an effective student model based on the student’s learning performance and preferences, which in turn makes it possible to create a personalized educational environment.

Adaptive personalized e-learning systems could accelerate the educational process by revealing the strengths and weaknesses of each student. Most student models are concerned with representing the student’s ability on portions of the domain (Beck and Woolf, 2000). However, the way of mapping this low-level knowledge to higher-level teaching actions is not always obvious.

In view of the above, in this paper we propose a machine learning architecture which permits the initialization of students’ models. Our framework uses an innovative combination of stereotypes and the k-means clustering algorithm, which partitions multiple observations into k clusters such that each observation belongs to the cluster with the nearest mean. Each cluster is represented by a single mean vector. In particular, a student is first assigned to a stereotype category on the basis of his/her background knowledge level in the multiple foreign languages taught. This assignment is based on the student’s performance on a preliminary test given at his/her first interaction with the system. Then, the k-means algorithm takes as input multiple student characteristics, which are described below, and serves as a means for initializing the new student model based on recognized similarities between the new student and past students who belong to the same stereotype category.

Virvou M., Alepis E. and Troussas C.
Centroid-based Clustering for Student Models in Computer-based Multiple Language Tutoring.
DOI: 10.5220/0004128201980203
In Proceedings of the International Conference on Signal Processing and Multimedia Applications and Wireless Information Networks and Systems (SIGMAP-2012), pages 198-203
ISBN: 978-989-8565-25-9
© 2012 SCITEPRESS (Science and Technology Publications, Lda.)

This paper is organized as follows. Section 2 presents the related scientific work. Sections 3 and 4 discuss our system’s architecture, namely machine learning in student modelling and the k-means clustering algorithm. Finally, Section 5 discusses the usability of centroid-based clustering for user models and presents our future plans.

2 RELATED WORK

Teaching languages through computer-assisted approaches is a significant field in language learning. User modeling has already been applied in a wide variety of scientific areas, including educational software for language instruction. Machine learning techniques have been applied to user modeling problems for acquiring models of users. In this section, we outline the scientific progress in student modeling with respect to Machine Learning and CALL (Computer Assisted Language Learning).

Basile et al. (2011) proposed the exploitation of machine learning techniques to improve and adapt the set of user model stereotypes by making use of user log interactions with the system. To do this, a clustering technique is exploited to create a set of user model prototypes; then, an induction module is run on these aggregated classes in order to improve a set of rules aimed at classifying new and unseen users. Their approach exploited the knowledge extracted by the analysis of log interaction data without requiring explicit feedback from the user. Niño (2009) presented a snapshot of what has been investigated in terms of the relationship between machine translation (MT) and foreign language (FL) teaching and learning. Moreover, the author outlined some of the implications of the use of MT and of free online MT for FL learning. Friaz-Martinez et al. (2007) investigated which human factors are responsible for the behavior and the stereotypes of digital library users, so that these human factors can be justifiably considered for personalization. To achieve this aim, the authors studied whether there is a statistically significant relationship between the stereotypes created by robust clustering and each human factor, including cognitive styles, levels of expertise and gender differences. Virvou and Chrysafiadi (2006) described a web-based educational application for individualized instruction in the domain of programming and algorithms. Their system incorporates a user model which relies on stereotypes, the determination of which is based on the knowledge level of the learner. Licchelli et al. (2004) focused on machine learning approaches for inducing student profiles, based on Inductive Logic Programming and on methods using numeric algorithms, to be exploited in e-learning environments. Moreover, the authors carried out an experimental session comparing the effectiveness of these methods, along with an evaluation of their efficiency, in order to decide how best to exploit them in the induction of student profiles. Tsiriga and Virvou (2004) introduced the ISM framework for the initialization of the student model in Web-based ITSs, a methodology that uses an innovative combination of stereotypes and the distance-weighted k-nearest neighbor algorithm to set initial values for all aspects of the student model. SignMT was implemented by Ditcharoen et al. (2010) to translate sentences/phrases from different sources in four steps, which are word transformation, word constraint, word addition and word ordering. Finally, Virvou and Troussas (2011) described a ubiquitous e-learning tutoring system for multiple language learning, called CAMELL (Computer-Assisted Multilingual E-Language Learning). It is a post-desktop model of human-computer interaction in which students “naturally” interact with the system in order to get used to electronically supported learning. Their system presents advances in user modeling, error proneness and user interface design.

However, after a thorough investigation of the related scientific literature, we found no implementation of a multilingual educational system that combines student modeling and machine learning. Hence, we implemented a prototype system which incorporates intelligence in its diagnostic component, models students’ error proneness, and provides error diagnosis and advice based on students’ needs.

3 MACHINE LEARNING IN USER MODELING

Student modeling can undoubtedly benefit from machine learning, given that machine learning consists of the induction of knowledge, normally leading to improvements in classifying objects in a specific domain. Thus, our system’s student model can extend or compile its background knowledge, namely its bug library, so that the resulting student model can be more accurate and efficient. User modelers that deal with simple student behaviors are able to collect a set of behaviors from which to induce a student model. The task of constructing the student model from a set of multiple behaviors can be regarded as an inductive learning task, and therefore machine learning techniques can be used to address it.

Constructing student models in multiple language

learning environments is quite complex, since

student behaviors are likely to be inconsistent and

incomplete, which can be due to any of the

following reasons (Virvou and Troussas, 2011):

I. Accidental slips.

II. Quick elimination of old knowledge errors.

III. Recurrence of old knowledge errors.

IV. Sudden appearance of new knowledge errors.

Except for accidental slips, all the above error categories that our student model can predict may change over time in unforeseen ways. This causes problems in the generation of student models because the predictions become less accurate. The prediction accuracy of the user models can be improved by the use of unsupervised inductive learning techniques.

The incorporation of the algorithm into the resulting multilingual system (Fig. 3) involves the following basic steps:
i. For the initialization of the system, the k-means algorithm receives as input pre-stored data or data from empirical studies. It uses several fundamental characteristics which tend to influence the educational procedure:

a. the age of students,

b. their level of knowledge in one of the foreign languages taught,

c. the degree of carefulness when answering

questions and

d. the error proneness of the student in each

concept of the domain knowledge.

These characteristics have been found quite

significant in past language learning applications

(Tsiriga and Virvou, 2004).

ii. Machine learning techniques are used as a next

step in order to describe efficiently the cognitive

processes that underlie the student’s actions along

with the student’s behavioral patterns and

preferences.

iii. Based on the aforementioned characteristics, the

system creates clusters of the already existing

students. These clusters contain valuable

information about their members, considering their

behavior, their preferences and generally their

interaction with the system.
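The four characteristics listed in step i can be pictured as a numeric feature vector per student. The following is a minimal sketch under stated assumptions: the names `StudentRecord` and `min_max_scale` are hypothetical, the sample values are invented for illustration, error proneness is collapsed into a single score rather than one value per domain concept, and the min-max rescaling is an added assumption (the paper does not describe any scaling) so that age does not dominate the Euclidean distance.

```python
from dataclasses import dataclass


@dataclass
class StudentRecord:
    """Hypothetical container for the four characteristics named in the text."""
    age: float              # in years
    knowledge_level: float  # assumed score in [0, 1]
    carefulness: float      # assumed score in [0, 1]
    error_proneness: float  # assumed single score in [0, 1]

    def as_vector(self):
        return [self.age, self.knowledge_level,
                self.carefulness, self.error_proneness]


def min_max_scale(vectors):
    """Rescale each feature column to [0, 1] (an assumption, not from the paper)."""
    columns = list(zip(*vectors))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(vector, lows, highs)]
        for vector in vectors
    ]


# Invented example data, for illustration only.
students = [
    StudentRecord(18, 0.2, 0.9, 0.7),
    StudentRecord(30, 0.8, 0.5, 0.1),
    StudentRecord(24, 0.5, 0.7, 0.4),
]
features = min_max_scale([s.as_vector() for s in students])
```

Each row of `features` can then serve as one data point for the clustering step.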

Our system uses this model to support students

while studying the theory and solving exercises. In

particular, based on the information that emanates

from the knowledge level of the student in each

concept of the domain knowledge, the system

provides personalized help and support when s/he

navigates through the curricula. The error proneness

of the student supported by the student modeler is

used for error diagnosis. In particular, this

information is used in cases where the system has to

disambiguate between competing hypotheses that

concern the cause of students’ mistakes (Tsiriga and

Virvou, 2004).

4 K-MEANS CLUSTERING ALGORITHM

K-means clustering is a well-known machine learning algorithm that is widely used to classify or group objects, based on their attributes/features, into k groups, where k is a positive integer. The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroid. Thus, the purpose of k-means clustering is to classify the data. Each object, represented by one attribute point, is an example to the algorithm and is assigned automatically to one of the clusters. This constitutes unsupervised learning, as the algorithm classifies each object automatically based only on the criterion of minimum distance to the centroid. The learning process depends on the training examples with which the algorithm is fed.
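The sum-of-squares criterion can be written compactly. The notation here is assumed for illustration and does not appear in the original text: $C_j$ is the set of objects in cluster $j$, $\mu_j$ is its centroid, and $x_i$ is the attribute vector of an object.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Under this notation, k-means looks for the assignment of objects to clusters that makes $J$ as small as it can, with each centroid being the mean of its cluster's members.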

There are two choices in this learning process:
i. Infinite training. Every data point that is fed to the algorithm is automatically considered a training example.
ii. Finite training. Once training is considered finished, the algorithm starts classifying new points into clusters. This is done simply by assigning each point to the nearest centroid without recalculating the centroids. Thus, after training has finished, the centroids are fixed points.
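The finite-training case corresponds to how a new student could be placed in an existing cluster. The sketch below is illustrative only: the function name `nearest_centroid` and all numeric values are hypothetical, and the feature vectors are assumed to be already scaled to comparable ranges.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    return distances.index(min(distances))


# Hypothetical centroids obtained from past students; they stay fixed
# because training is considered finished.
fixed_centroids = [
    [0.1, 0.2, 0.8, 0.7],
    [0.5, 0.5, 0.6, 0.4],
    [0.9, 0.8, 0.4, 0.1],
]

# Hypothetical feature vector of a new student.
new_student = [0.45, 0.6, 0.5, 0.35]
cluster_index = nearest_centroid(new_student, fixed_centroids)
```

The centroids are only read, never updated, which is what distinguishes this mode from infinite training.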

The basic steps of k-means clustering are simple. In the beginning, we determine the number of clusters K and assume the centroids, or centers, of these clusters. We can take any random objects as the initial centroids, or the first K objects in sequence can also serve as the initial centroids.
Then the k-means algorithm repeats the three steps below until convergence is reached, namely until no object changes group, as illustrated in Figure 1:
i. Determine the centroid coordinates (initially chosen randomly from the data set).
ii. Determine the distance of each object to the centroids.
iii. Group the objects based on minimum distance, forming k clusters, and take the mean of each cluster as its new centroid.

Figure 1: Steps of K-means algorithm.

More specifically, the above steps can be summarized as follows (Teknomo, 2006):
i. Step 1. Begin with a decision on the value of k, the number of clusters.
ii. Step 2. Choose any initial partition that classifies the data into k clusters. The training samples may be assigned randomly, or systematically as follows:
a. Take the first k training samples as single-element clusters.
b. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
iii. Step 3. Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroids of both the cluster gaining the sample and the cluster losing it.
iv. Step 4. Repeat step 3 until convergence is achieved, namely until a pass through the training samples causes no new assignments.
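Steps 1 to 4 can be sketched in Python as follows. This is one possible reading of the procedure rather than the authors' implementation: the function names are hypothetical, the sample data are invented, and the guard that prevents a cluster from losing its last member is an added assumption that the text does not mention.

```python
import math


def mean_vector(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: math.dist(point, centroids[j]))


def sequential_kmeans(samples, k, max_passes=100):
    # Step 1: k is chosen by the caller.
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")

    # Step 2a: the first k samples start as single-element clusters.
    labels = list(range(k)) + [None] * (len(samples) - k)
    centroids = [list(samples[j]) for j in range(k)]

    def members(j):
        return [s for s, label in zip(samples, labels) if label == j]

    # Step 2b: assign each remaining sample, updating the gaining centroid.
    for i in range(k, len(samples)):
        j = nearest(samples[i], centroids)
        labels[i] = j
        centroids[j] = mean_vector(members(j))

    # Steps 3 and 4: switch samples until a full pass causes no change.
    for _ in range(max_passes):
        switched = False
        for i, sample in enumerate(samples):
            old, new = labels[i], nearest(sample, centroids)
            if new == old or labels.count(old) == 1:
                continue  # in place already, or the move would empty a cluster
            labels[i] = new
            centroids[old] = mean_vector(members(old))
            centroids[new] = mean_vector(members(new))
            switched = True
        if not switched:
            break
    return labels, centroids


# Invented two-dimensional data, for illustration only.
samples = [[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
           [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]]
labels, centroids = sequential_kmeans(samples, k=2)
```

Unlike the batch formulation, this variant updates the affected centroids immediately after every single switch.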

The key idea of k-means is simple and is described as follows. In the initialization phase, the number of clusters k is determined. Then the algorithm assumes the centroids, or centers, of these k clusters. These centroids can be randomly selected or deliberately designed. If the number of data points is less than the number of clusters, then each data point is assigned as the centroid of its own cluster, and each centroid is given a cluster number. If the number of data points is greater than the number of clusters, the algorithm computes the Euclidean distance between each object and all centroids to find the minimum distance. The object then belongs to the cluster whose centroid is at minimum distance from it. Given that the location of the real centroid is unknown during the process, the algorithm needs to revise the centroid locations according to the updated information (i.e., the minimum distances between the objects and the centroids). After the centroid values are updated, all the objects are reallocated to the k clusters. The process is repeated until the assignment of objects to clusters no longer changes significantly, or until the centroids move by negligible distances in successive iterations. Mathematically, the iteration can be proved to be convergent.

Since the location of the centroid cannot be fixed or prearranged, it is adjusted based on the current data. Then all the data points are reassigned to the updated centroids. This process is repeated until no data point moves to another cluster. The loop always converges because the following two conditions are satisfied:
i. Each switch in step 3 decreases the sum of squared distances from each training sample to that training sample's group centroid.
ii. There are only finitely many partitions of the training examples into k clusters.
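A sketch of why these two conditions suffice, in assumed notation that is not part of the original text: let $J^{(t)}$ denote the quantity in condition i after the $t$-th switch (taken here as the sum of squared distances, matching the criterion k-means minimizes), and let $n$ be the number of training samples.

```latex
J^{(0)} > J^{(1)} > J^{(2)} > \dots \ge 0,
\qquad
\#\{\text{partitions of } n \text{ samples into } k \text{ clusters}\} \le k^{n} < \infty
```

Because $J$ strictly decreases with every switch, no partition can be visited twice; since there are at most $k^{n}$ partitions, the sequence of switches must stop after finitely many steps.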

In order to clarify the clustering process, we provide the pseudo-code of the k-means algorithm:

Input: A dataset D, a user specified number k
Output: k clusters
Initialize cluster centroids (randomly);
While not convergent
    For each object o in D do
        Find the cluster c whose centroid is closest to o;
        Allocate o to c;
    End
    For each cluster c do
        Recalculate the centroid of c based on the objects allocated to c;
    End
End
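The pseudo-code above can be turned into a short runnable sketch. The version below is an illustration rather than the system's actual implementation: the function name `kmeans`, the `seed` and `max_iterations` parameters, the sample data, and the choice to keep the previous centroid when a cluster becomes empty are all assumptions introduced here.

```python
import math
import random


def kmeans(dataset, k, max_iterations=100, seed=0):
    """Batch k-means following the pseudo-code: assign, then recalculate."""
    rng = random.Random(seed)
    # Initialize cluster centroids randomly from the data set.
    centroids = [list(point) for point in rng.sample(dataset, k)]
    assignment = None

    for _ in range(max_iterations):
        # For each object, find the cluster whose centroid is closest.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(obj, centroids[j]))
            for obj in dataset
        ]
        # Convergence: no object changed cluster since the last pass.
        if new_assignment == assignment:
            break
        assignment = new_assignment

        # Recalculate each centroid from the objects allocated to it.
        for j in range(k):
            members = [obj for obj, a in zip(dataset, assignment) if a == j]
            if members:  # assumed policy: an empty cluster keeps its centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]

    return assignment, centroids


# Invented data: two well-separated groups of three points each.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
assignment, centroids = kmeans(data, k=2)
```

The iteration cap is a practical safeguard; on data like this the loop stops by itself once a pass leaves every assignment unchanged.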

Centroid-basedClusteringforStudentModelsinComputer-basedMultipleLanguageTutoring

201

The two key features of k-means which make it

efficient are often regarded as its biggest drawbacks:

i. Euclidean distance is used as a metric and

variance is used as a measure of cluster scatter.

ii. The number of clusters k is an input parameter:

an inappropriate choice of k may yield poor results.

For this reason, when performing k-means, it is important to run diagnostic checks for determining the number of clusters in the data set. The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in the data set and the clustering resolution desired by the user. In addition, increasing k without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when k equals the number of data points, n). Intuitively then, the optimal choice of k will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster. If an appropriate value of k is not apparent from prior knowledge of the properties of the data set, it must be estimated from the data. There are several categories of methods for making this decision. One simple rule of thumb, which we incorporated in our implementation of the k-means algorithm, sets the number to (Mardia et al., 1979):

$$k \approx \sqrt{n/2} \qquad (1)$$

with n as the number of objects (data points). In our case, given that there are 32 data points, which are the outcome of empirical studies, we obtain k=4. Figure 2 illustrates a snapshot of our system, specifically a report of k-means showing the initial user data, the resulting k-means vectors, and the number and members of the clusters.
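As a small worked check, assume the rule referred to here is the common rule of thumb k ≈ sqrt(n/2), which is consistent with 32 data points giving k=4. The function name `rule_of_thumb_k` is hypothetical, and rounding to the nearest integer (with a minimum of 1) is an added assumption, since the text does not say how non-integer values are handled.

```python
import math


def rule_of_thumb_k(n):
    """Estimate the number of clusters as sqrt(n / 2), rounded (assumed)."""
    if n < 1:
        raise ValueError("n must be a positive number of data points")
    return max(1, round(math.sqrt(n / 2)))


# For the 32 data points mentioned in the text: sqrt(32 / 2) = sqrt(16) = 4.
k = rule_of_thumb_k(32)
```

For other data set sizes the square root is generally not an integer, so the rounding policy becomes a genuine design choice.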

Figure 2: Snapshot of k-means algorithm.

5 CONCLUSIONS

In this paper we have presented our approach for improving student models in the initialization phase of an educational system. We have already developed our own implementation of the k-means machine learning algorithm, as well as a tutoring system for multiple language learning. After processing users’ personal data, we obtain more sophisticated user models containing stereotypic information that is based on similarities with other user groups, seen as clusters. We expect this approach to produce good results, since it uses well-known techniques that have already been applied in similar scientific areas with quite promising reports. As a next step, we plan to give the resulting system to real students to use as a supplement to their language learning courses, in order to evaluate it and test its usefulness as an educational tool.

REFERENCES

Niño, A., 2009. Machine translation in foreign language

learning: Language learners and tutors perceptions of

its advantages and disadvantages. In ReCALL. Vol. 21,

pp. 241-258.

Friaz-Martinez, E., Chen, S. Y., Macredie, R. D., Liu, X.,

2007. The role of human factors in stereotyping

behavior and perception of digital library users: a

robust clustering approach. In User Modelling and

User-Adapted Interaction. Vol. 13, pp. 305-337.

Webb, G. I., Pazzani, M. J., Billsus, D., 2001. Machine

Learning for User Modeling. In User Modelling and

User-Adapted Interaction. Vol. 11, pp. 19-29.

Virvou, M., Troussas, C., 2011. CAMELL: Towards a

ubiquitous multilingual e-learning system. In CSEDU

2011 - Proceedings of the 3rd International

Conference on Computer Supported Education. Vol.

2, pp. 509-513.

Virvou, M., Troussas, C., 2011. Web-based student modeling for learning multiple languages. In International Conference on Information Society, i-Society 2011. Article number 5978484, pp. 423-428.

Virvou, M., Maras, D., Tsiriga, V., 2000. Student modelling in an intelligent tutoring system for the passive voice of English language. In Educational Technology and Society.

Virvou, M., Chrysafiadi, K., 2006. A web-based

educational application for teaching of programming:

Student modeling via stereotypes. In Proceedings -

Sixth International Conference on Advanced Learning

Technologies, ICALT 2006. Vol. 2006, pp. 117-119.

Ditcharoen, N., Naruedomkul, K., Cercone, N., 2010.

SignMT: An alternative language learning tool. In

SIGMAP2012-InternationalConferenceonSignalProcessingandMultimediaApplications

202

Computers and Education, Vol. 55, pp. 118-130.

Licchelli, O., Basile, T. M. A., Di Mauro, N., Esposito, F., Semeraro, G., Ferilli, S., 2004. Machine Learning Approaches for inducing Student Models. In Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science). Vol. 3029, pp. 935-944.

Basile, T., Esposito, F., Ferilli, S., 2011. Improving User Stereotypes through Machine Learning. In Communications in Computer and Information Science. Vol. 249, pp. 38-48.

Tsiriga, V., Virvou, M., 2004. A framework for the

initialization of student models in web-based

intelligent tutoring systems. In User Modelling and

User-Adapted Interaction. Vol. 14, pp. 289-315.

Mardia, K. et al., 1979. Multivariate Analysis. Academic Press.

Teknomo, K., 2006. K-Means Clustering Tutorials.

Centroid-basedClusteringforStudentModelsinComputer-basedMultipleLanguageTutoring

203