Dynamical Creation of Policy Trees for a POMDP-based Intelligent Tutoring System

Fangju Wang

University of Guelph, 50 Stone Road East, Guelph, Ontario, Canada

Keywords:

Intelligent Tutoring System, Computer Supported Education, Partially Observable Markov Decision Process,

Computational Complexity.

Abstract:

In this paper, we discuss a new technique for creating policy trees in an intelligent tutoring system (ITS) that is based on a partially observable Markov decision process (POMDP). The POMDP model is a useful tool for dealing with uncertainties. With a POMDP, an ITS may choose optimal teaching actions even when uncertainties exist. Great computational complexity in solving a POMDP has been a major obstacle to applying the POMDP model to intelligent tutoring. The technique of policy trees is considered a less expensive approach. However, policy trees are still too expensive for building ITSs that teach practical subjects. In our research, we develop a new technique of policy trees, in which trees are grouped and dynamically created. This technique offers better time and space efficiency, and enables us to build more efficient ITSs. In particular, the technique makes it possible to build ITSs on platforms that have limited storage capacity and computing power.

1 INTRODUCTION

Computational complexity is a key issue in build-

ing an interactive intelligent tutoring system (ITS).

An ITS must be able to reside on a computing plat-

form, and respond to student questions or requests in

a timely fashion. However, many mathematical mod-

els underlying ITSs are computationally intractable.

Huge space consumption and lengthy computing time

have been major obstacles to applying the models to

intelligent tutoring. The partially observable Markov

decision process (POMDP) model is one of them.

The POMDP model may enable an ITS to choose

optimal actions in teaching a student, even when un-

certainties exist. A major goal for building an ITS is

adaptive teaching. To be adaptive, in each tutoring

step a system should be able to choose the action that

is most beneﬁcial to the student it teaches. Mathemat-

ically, adaptive tutoring can be modeled by a Markov

decision process (MDP), in which the agent makes

optimal decisions considering the current states. In an

MDP, the agent observes states clearly and knows ex-

actly what the current states are. However, in a tutor-

ing process, the teacher is often uncertain about stu-

dent states (Woolf, 2009).

A POMDP is an extension of an MDP for deal-

ing with uncertainties. In a POMDP, the task of

choosing an optimal action is referred to as solving

the POMDP. This task is computationally expensive.

A simpliﬁed, less expensive technique for POMDP-

solving is to use policy trees, in which decision mak-

ing involves evaluating a set of trees and choosing an

optimal one. However, the technique of policy trees

is still too expensive to be used in practical applica-

tions. In making a decision, the number of trees to

evaluate is exponential, and the number of operations

in evaluating a tree is also exponential. To apply the

POMDP model to intelligent tutoring, we must ad-

dress the problems of computational complexity.

In our research, we develop a new technique of

policy trees for POMDP-solving in an ITS. In this

technique, policy trees are subdivided into small tree

sets. When making a decision, the POMDP agent

evaluates the trees in a set, instead of all the possi-

ble trees. A tree set is dynamically created when the

agent needs to evaluate it. The technique has advan-

tages of better time and space efﬁciencies.

In this paper, we ﬁrst introduce the technical back-

ground of the POMDP model that is needed for dis-

cussing our technique, followed by reviewing the ex-

isting work related to our research. Then we present

our techniques for grouping trees and dynamic tree

creation, and discuss some experimental results.

Wang, F.

Dynamical Creation of Policy Trees for a POMDP-based Intelligent Tutoring System.

DOI: 10.5220/0006774601370144

In Proceedings of the 10th International Conference on Computer Supported Education (CSEDU 2018), pages 137-144

ISBN: 978-989-758-291-2


2 PARTIALLY OBSERVABLE MARKOV DECISION PROCESS

2.1 MDP and POMDP

A POMDP is an extension of an MDP, as mentioned

in the previous section. An MDP can model a deci-

sion process in which different actions can be chosen

in different states to maximize rewards. The core of

an MDP includes S, A, and ρ, which are a set of states,

a set of actions, and a reward function. In a decision

step, the agent is in $s \in S$, takes $a \in A$ that is available in s, enters $s' \in S$, and receives reward $\rho(s, a, s')$. The MDP model is stochastic. An MDP includes T, which is a set of state transition probabilities. $P(s'|s,a) \in T$ is the probability that the agent enters $s'$ after taking a in s. Another core component in an MDP is policy π(s). It guides the agent to choose the optimal action available in s to maximize rewards.

As an extension of an MDP, a POMDP has two

additional core components: O and Z, which are a set

of observations and a set of observation probabilities.

A POMDP can model a decision process in which

the agent is not able to see states completely. In a

POMDP, the agent infers information about states and

represents the information about states by a belief, denoted by b. In a decision step, the agent is in $s \in S$ that it is not able to see, chooses $a \in A$ based on its current belief b, enters $s' \in S$ that it is not able to see either, observes $o \in O$, and infers information about $s'$ by using $P(o|a,s') \in Z$ and $P(s'|s,a) \in T$.

Belief b is defined as

$$b = [b(s_1), b(s_2), \ldots, b(s_Q)] \qquad (1)$$

where $s_i \in S$ (1 ≤ i ≤ Q) is the ith state in S, Q is the number of states in S, $b(s_i)$ is the probability that the agent is in $s_i$, and $\sum_{i=1}^{Q} b(s_i) = 1$.

In a POMDP, the policy is π(b). In a decision step,

it guides the agent to choose an action considering the

current belief b to maximize the long term reward.

2.2 Policy Trees for POMDP-Solving

For a given b, an optimal π returns an optimal ac-

tion. In a POMDP, ﬁnding the optimal π is referred

to as solving the POMDP. For most practical appli-

cation problems, POMDP-solving is a task of great

complexity (Carlin and Zilberstein, 2008; Rafferty

et al., 2011). A simpliﬁed, less expensive technique

for POMDP-solving is to use policy trees.

In a policy tree, nodes are actions, and edges are

observations. The action at the root is called the root

action. An action node has observation edges to ac-

tions at the next level. After an action is taken, the

next action to take is one of the actions at the next

level, depending on what the agent observes. Figure 1 illustrates the general structure of a policy tree, in which the node at the top is the root action, every node a is an action, and K is the number of possible observations. Note that an action node has edges of all the possible observations to the next level.

Figure 1: The general structure of a policy tree.

When a technique of policy trees is used for

POMDP-solving, ﬁnding the optimal policy is to

identify the optimal tree. In each decision step, the

agent ﬁnds the optimal policy tree considering its cur-

rent belief, and takes the root action of the tree. In the

following, we discuss a method to ﬁnd the optimal

tree.

Each policy tree is associated with a value func-

tion, which evaluates the long term reward of taking

the tree (policy). Let τ be a policy tree. The value function of state s given τ is

$$V^{\tau}(s) = \mathcal{R}(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) \sum_{o \in O} P(o|a,s') V^{\tau(o)}(s') \qquad (2)$$

where a is the root action of τ, $s'$ is the next state, i.e. the state that the agent enters after taking a, γ is a discounting factor (0 ≤ γ ≤ 1), o is the observation after a is taken, τ(o) is the subtree in τ which is connected to the root by the edge of o, and $\mathcal{R}(s,a)$ is the expected immediate reward after a is taken in s, calculated as

$$\mathcal{R}(s,a) = \sum_{s' \in S} P(s'|s,a) \mathcal{R}(s,a,s') \qquad (3)$$

where $\mathcal{R}(s,a,s')$ is the expected immediate reward after the agent takes a in s and enters $s'$. The second term on the right hand side of Eqn (2) is the discounted expected value.

From Eqns (1) and (2), we have the value function of belief b given τ:

$$V^{\tau}(b) = \sum_{s \in S} b(s) V^{\tau}(s). \qquad (4)$$


Thus we have π(b) returning the optimal policy tree for b:

$$\pi(b) = \hat{\tau} = \arg\max_{\tau \in \mathcal{T}} V^{\tau}(b), \qquad (5)$$

where $\mathcal{T}$ is the set of trees to evaluate in making the decision.
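The evaluation in Eqns (2) to (5) can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the class `PolicyTree`, the function names, and the two-state toy model (states, probabilities, and rewards) are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyTree:
    """A policy tree: a root action plus one subtree per observation."""
    action: str
    children: dict = field(default_factory=dict)  # observation -> PolicyTree


def expected_reward(s, a, states, P, R):
    """Eqn (3): expected immediate reward of taking a in s."""
    return sum(P.get((s, a, s2), 0.0) * R.get((s, a, s2), 0.0) for s2 in states)


def state_value(tree, s, states, observations, P, Z, R, gamma):
    """Eqn (2): value of state s when the agent follows the tree."""
    a = tree.action
    value = expected_reward(s, a, states, P, R)
    if not tree.children:  # a leaf has no discounted future term
        return value
    future = 0.0
    for s2 in states:
        for o in observations:
            sub = state_value(tree.children[o], s2, states, observations, P, Z, R, gamma)
            future += P.get((s, a, s2), 0.0) * Z.get((o, a, s2), 0.0) * sub
    return value + gamma * future


def belief_value(tree, belief, states, observations, P, Z, R, gamma):
    """Eqn (4): value of a belief as the belief-weighted sum of state values."""
    return sum(
        belief[s] * state_value(tree, s, states, observations, P, Z, R, gamma)
        for s in states
    )


def best_tree(trees, belief, states, observations, P, Z, R, gamma):
    """Eqn (5): the tree with the highest value for the belief."""
    return max(
        trees,
        key=lambda t: belief_value(t, belief, states, observations, P, Z, R, gamma),
    )


# Toy model; every number below is an illustrative assumption.
STATES = ["not_understood", "understood"]
OBSERVATIONS = ["ask", "accept"]
P = {
    ("not_understood", "teach", "understood"): 0.8,
    ("not_understood", "teach", "not_understood"): 0.2,
    ("understood", "teach", "understood"): 1.0,
    ("not_understood", "wait", "not_understood"): 1.0,
    ("understood", "wait", "understood"): 1.0,
}
Z = {}
for action in ("teach", "wait"):
    Z[("accept", action, "understood")] = 0.9
    Z[("ask", action, "understood")] = 0.1
    Z[("accept", action, "not_understood")] = 0.1
    Z[("ask", action, "not_understood")] = 0.9
R = {("not_understood", "teach", "understood"): 1.0}
GAMMA = 0.9

teach_leaf = PolicyTree("teach")
wait_leaf = PolicyTree("wait")
teach_twice = PolicyTree("teach", {"ask": PolicyTree("teach"), "accept": PolicyTree("teach")})
belief = {"not_understood": 0.5, "understood": 0.5}
chosen = best_tree([teach_leaf, wait_leaf], belief, STATES, OBSERVATIONS, P, Z, R, GAMMA)
```

In this toy model the agent would take `chosen.action` as its next step. The nested loops over states and observations at every node show where the exponential evaluation cost discussed below comes from.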

The size of a policy tree depends on the number

of possible observations and the horizon. When the horizon is H, the number of nodes in a tree is

$$\sum_{t=0}^{H-1} |O|^t = \frac{|O|^H - 1}{|O| - 1} \qquad (6)$$

where |O| is the size of O. At each node, the number of possible actions is |A|. Therefore, the total number of all possible H-horizon policy trees is

$$|A|^{\frac{|O|^H - 1}{|O| - 1}}. \qquad (7)$$

Both numbers are exponential.
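To give a feel for how quickly these counts grow, the following small sketch computes the node count of Eqn (6) and the tree count of Eqn (7). The function names are illustrative, and the sample sizes in the usage lines are arbitrary assumptions.

```python
def nodes_in_tree(num_observations, horizon):
    """Eqn (6), left-hand side: sum of |O|^t for t = 0 .. H-1."""
    return sum(num_observations ** t for t in range(horizon))


def nodes_closed_form(num_observations, horizon):
    """Eqn (6), right-hand side: (|O|^H - 1) / (|O| - 1), valid for |O| > 1."""
    return (num_observations ** horizon - 1) // (num_observations - 1)


def number_of_trees(num_actions, num_observations, horizon):
    """Eqn (7): one of |A| actions may be placed at every node."""
    return num_actions ** nodes_in_tree(num_observations, horizon)


# Arbitrary sample sizes, chosen only for illustration.
small_nodes = nodes_in_tree(3, 4)
small_trees = number_of_trees(2, 2, 3)
```

Even with only two actions, two observations, and a horizon of three, there are already 128 distinct trees, and each extra level multiplies the exponent by roughly |O|.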

3 RELATED WORK

Researchers in the field of ITSs have seen the great

potential of the POMDP model in building ITSs.

Extensive research has been conducted in applying

POMDPs to intelligent tutoring (Cassandra, 1998;

Williams et al., 2005; Williams and Young, 2007;

Theocharous et al., 2009; Rafferty et al., 2011; Chi-

naei et al., 2012; Folsom-Kovarik et al., 2013). In the

work related to applying the model to ITSs, POMDPs

were used to model student states, and to customize

and optimize teaching. In a commonly used structure,

student states had a boolean attribute for each of the

subject contents, actions available to a tutoring agent

were various types of teaching techniques, and observations were results of tests given periodically. The goals were to teach as many of the contents as possible in a finite amount of time, or to minimize the time required to learn the entire subject. In the following, we review some work in which policy trees were used for

POMDP-solving in ITSs.

Rafferty and co-workers created a POMDP-based

system for teaching concepts (Rafferty et al., 2011).

A core component of the system was a technique

of fast teaching by POMDP planning. The tech-

nique was for computing approximate POMDP poli-

cies, which selected actions to minimize the expected

time for the learner to understand concepts. The

researchers framed the problem of optimally select-

ing teaching actions by using a decision-theoretic ap-

proach, and formulated teaching as a POMDP plan-

ning problem. In the POMDP, the states represented

the learners’ knowledge, and the transitions mod-

eled how teaching actions stochastically changed the

learners’ knowledge.

For solving the POMDP, the researchers devel-

oped a method of forward trees, which are variations

of policy trees. A forward tree is constructed by in-

terleaving branching on actions and observations. For

the current belief, a forward tree was constructed to estimate the value of each pedagogical action, and the best action was chosen. The learner's response, plus the action chosen, was used to update the belief, and then a new forward search tree was constructed for selecting a new action for the updated belief. The cost of searching the full tree is exponential in the task horizon, and requires $O(|S|^2)$ operations at each node.

To reduce the number of nodes to search through, the

researchers restricted the tree by sampling only a few

actions, and limited the horizon to control the depth

of the tree.

In the work reported in (Wang, 2016), an exper-

imental ITS was developed for teaching concepts in

computer science. A POMDP was used in the sys-

tem to model processes of intelligent tutoring. In

the POMDP, states, actions, and observations mod-

eled student knowledge states, system tutoring ac-

tions, and student actions, respectively. A method of

policy trees was proposed for POMDP-solving. In the

method, policy trees were created and stored in a tree

database. To choose an optimal action to respond to a

given student query, the agent searched the database

and evaluated a set of trees. For reducing the costs

in making a decision, techniques were developed to

minimize the tree sizes and decrease the number of

trees to evaluate. A major disadvantage of the pro-

posed policy tree method was its space complexity.

The techniques of policy trees for improving

POMDP-solving have made good progress towards

building practical POMDP-based ITSs. However,

they were still too costly to use. For example, as the

authors of (Rafferty et al., 2011) concluded, computa-

tional challenges existed in their technique of forward

trees, despite sampling only a fraction of possible ac-

tions and using short horizons. Also, how to sample

actions and how to shorten a horizon are challenging

problems. Computational complexity has been a bot-

tleneck in applying the POMDP model to intelligent

tutoring.

4 GROUPING OF POLICY TREES

4.1 An Overview

To address the problems of computational complexity

in applying the method of policy trees in a POMDP-

based ITS, we develop a new technique, in which policy trees are grouped and trees are dynamically created. With this technique, the agent evaluates a small

set of trees when making a decision. The trees in the

set are dynamically created for better space efﬁciency.

In this section, we discuss the grouping of trees, and

in the next section, the dynamic tree creation.

The discussion is in the context of an experimen-

tal ITS, which we developed for testing our tech-

niques. The ITS teaches concepts in software basics.

A POMDP helps the ITS choose optimal teaching actions. We cast the ITS onto the POMDP, by using POMDP states to represent student knowledge states, and actions to represent system tutoring actions. We treat student actions (asking questions, accepting answers, etc.) as observations.

4.2 State Space Partitioning

To have small tree sets, we partition the state space

into subspaces, and then group trees in each subspace

into tree sets. Before the tree grouping technique, we

discuss our method for state space partitioning.

We define the states in terms of concepts in the instructional subject. In software basics, concepts include program, instruction, algorithm, and many others. We associate each state with a state formula, which is of the form:

$$(\mathcal{C}_1 \mathcal{C}_2 \mathcal{C}_3 \ldots \mathcal{C}_N), \qquad (8)$$

where $\mathcal{C}_i$ is the variable for the ith concept $C_i$, taking a value $\surd C_i$ or $\neg C_i$ (1 ≤ i ≤ N), and N is the number of concepts in the subject. We use $\surd C_i$ to represent that the student understands $C_i$, and $\neg C_i$ to represent that the student does not. A formula is a representation of a student knowledge state. For example, formula $(\surd C_1 \surd C_2 \neg C_3 \ldots)$ is a representation of the state in which the student understands $C_1$ and $C_2$, but not $C_3$, and so on. States thus defined have the Markov property.

When there are N concepts in an instructional subject, the number of state formulae is $2^N$. This implies that the number of possible states is $2^N$. As can be

seen in Eqn (2), the cost for evaluating a value func-

tion is proportional to the size of state space. To re-

duce the cost, we partition the state space into smaller

subspaces. The partitioning technique is based on

prerequisite relationships between concepts.

Prerequisite relationships are pedagogical orders

of concepts. A concept may have zero or more pre-

requisites, and a concept may serve as a prerequisite

of zero or more concepts. For example, in mathemat-

ics, derivative has prerequisites function, limit and so

on, and function is a prerequisite of derivative, inte-

gral, and so on. To understand a concept well, a stu-

dent should understand all its prerequisites ﬁrst. For

a set of concepts, the prerequisite relationships can be

represented by a directed acyclic graph (DAG). In this

paper, when a concept is a prerequisite of another, we

call the latter a successor of the former.
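A prerequisite DAG of this kind can be represented as a mapping from each concept to its direct prerequisites. The sketch below is illustrative only: the function names are made up, and the edge from function to limit is an assumption added so that the example contains an indirect prerequisite.

```python
# Direct prerequisites of each concept (a small mathematics example).
# The entry "limit": {"function"} is an illustrative assumption.
PREREQUISITES = {
    "function": set(),
    "limit": {"function"},
    "derivative": {"function", "limit"},
    "integral": {"function"},
}


def all_prerequisites(concept, prerequisites):
    """Collect the direct and indirect prerequisites of a concept."""
    found = set()
    stack = list(prerequisites[concept])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(prerequisites[current])
    return found


def direct_successors(concept, prerequisites):
    """Concepts that list the given concept as a direct prerequisite."""
    return {c for c, pre in prerequisites.items() if concept in pre}


derivative_prerequisites = all_prerequisites("derivative", PREREQUISITES)
function_successors = direct_successors("function", PREREQUISITES)
```

Storing only the direct prerequisites keeps the graph compact; successors and indirect prerequisites are derived when they are needed.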

In the ﬁrst step of our space partitioning tech-

nique, we subdivide concepts such that concepts hav-

ing prerequisite relationships are in the same group.

Some very “basic” concepts may be in two or more

groups. In the second step, for each group, we cre-

ate a state subspace by using concepts in the group to

deﬁne states, in the way just discussed. In the third

step, we eliminate invalid states. For details of the

partitioning technique, please see (Wang, 2015). Af-

ter space partitioning, we create policy trees for each

subspace. In a tree, the nodes and edges concern con-

cepts in the subspace only.
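The second and third steps can be illustrated with a short sketch. The validity rule used here is an assumption made for the example (a state is treated as invalid when a concept is marked as understood while one of its prerequisites is not); the actual rule is described in (Wang, 2015). The concept names and helper functions are hypothetical.

```python
from itertools import product


def all_states(concepts):
    """Enumerate every truth assignment over the concepts (2^N states)."""
    return [
        dict(zip(concepts, values))
        for values in product([False, True], repeat=len(concepts))
    ]


def is_valid(state, prerequisites):
    """Assumed rule: an understood concept needs all its prerequisites understood."""
    return all(
        state[p]
        for concept, understood in state.items()
        if understood
        for p in prerequisites[concept]
    )


def valid_states(concepts, prerequisites):
    """Keep only the states that satisfy the assumed validity rule."""
    return [s for s in all_states(concepts) if is_valid(s, prerequisites)]


# Hypothetical group of three concepts forming a chain c1 -> c2 -> c3.
CONCEPTS = ["c1", "c2", "c3"]
PREREQUISITES = {"c1": set(), "c2": {"c1"}, "c3": {"c2"}}

subspace = valid_states(CONCEPTS, PREREQUISITES)
```

For this chain, only 4 of the 8 possible assignments survive, which hints at why the subspaces are much smaller than the full state space.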

This partitioning technique is based on our obser-

vation that in a window in a tutoring process, stu-

dent questions likely concern concepts that have pre-

requisite relationships with each other. The observa-

tion suggests that we could localize the computing for

choosing an optimal teaching action within a smaller

state subspace deﬁned by concepts having prerequi-

site relationships.

The total number of states in the subspaces is

much smaller than the number of states in the space

deﬁned by using all the concepts. In addition, the

number and sizes of policy trees in subspaces are

much smaller because the sets of actions and obser-

vations are smaller.

4.3 Policy Tree Grouping

The cost for making a decision depends on the num-

ber of trees to evaluate, that is, the size of T in Eqn

(5). For lower costs, we group the trees in each sub-

space into small tree sets. When choosing an optimal

teaching action, the agent evaluates trees in a single

set. For discussing the grouping method, we ﬁrst de-

ﬁne optimal action and tutoring session.

In science and mathematics subjects, many con-

cepts have prerequisites. When the student asks about

a concept, the system should decide whether it would

start with teaching a prerequisite for the student to

make up some required knowledge, and, if so, which

one to teach. The optimal action is to teach the concept that the student needs to make up in order to understand the originally asked concept, and that the student can understand without first making up other concepts.

A tutoring session is a sequence of interleaved

student and system actions, starting with a question

about a concept, possibly followed by answers and

questions concerning the concept and its prerequi-

sites, and ending with a student action accepting the


answer to the original question. If, before the acceptance action, the student asks about a concept that has no

prerequisite relationship with the concept originally

asked, we consider that a new tutoring session starts.

We classify questions in a session into the original

question and current questions. The original question

starts the session, concerning the concept the student

originally wants to learn. We denote the original question by $(?C^o)$, where $C^o$ is the concept concerned in the question and superscript o stands for “original”. A current question is the question to be answered by the agent at a point in the session, usually for the student to make up some knowledge. A current question may be asked by the student, or made by the agent. We denote a current question by $(?C^c)$, where $C^c$ is the concept concerned in the question, and superscript c stands for “current”. Concept $C^c$ is in $(\wp_{C^o} \cup C^o)$, where $\wp_{C^o}$ is the set of all the direct and indirect prerequisites of $C^o$.

In the following, we discuss an example, which

involves concepts database and ﬁle. We assume that

ﬁle is a prerequisite of database. At a point in a tu-

toring process, the student asks question “What is a

database?” If database has no prerequisite relation-

ship with the concepts asked/taught right before the

question, we consider the question starts a new tu-

toring session, and it is the original question of the

session. If the agent believes that the student already

understands all the prerequisites of database, and an-

swers the question directly, the question is also the

current question when the agent answers it. If the

agent teaches database in terms of file, and then the student asks question “What is a file?”, the system action of teaching database is not optimal because the student needs to make up a prerequisite. At this point the question about file is the current question. If the agent answers the question about file and the student is satisfied with the answer, the system action is optimal.

Now we consider the grouping of trees. When the agent has current question $(?C^c)$ to answer, it needs to choose an optimal action. The optimal action may be to teach $C^c$ or to teach one of the prerequisites of $C^c$, depending on the agent's belief about the student's knowledge state. Recall that in a tutoring step, the agent evaluates a set of trees and chooses the root action of the tree that has the highest value. The set of trees to evaluate to answer $(?C^c)$ should include trees in which root actions are to teach $C^c$ or prerequisites of $C^c$. Since the ultimate goal of answering $(?C^c)$ is to answer the original question $(?C^o)$, actions to answer $(?C^o)$ should be included in the trees in the set.

Based on the above consideration, we have our grouping strategy: for each possible pair of $(?C^o)$ and $(?C^c)$, we create tree set $\mathcal{T}^{C^o}_{C^c}$. In a tutoring session with original question $(?C^o)$, to choose an optimal action to answer current question $(?C^c)$, the agent evaluates trees in $\mathcal{T}^{C^o}_{C^c}$. Since $\mathcal{T}^{C^o}_{C^c}$ is normally much smaller than the set of all the possible trees, the cost for choosing an optimal tree can be significantly reduced. The tree structure will be discussed in the next section.
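The grouping strategy can be pictured as an index keyed by (original concept, current concept) pairs. The following sketch enumerates those keys and the root concepts of each set for the database and file example above. The function names and the dictionary layout are assumptions for illustration, not the system's data structures.

```python
# Direct prerequisites, following the example in which file is a
# prerequisite of database.
PREREQUISITES = {"database": {"file"}, "file": set()}


def all_prerequisites(concept, prerequisites):
    """Direct and indirect prerequisites of a concept."""
    found = set()
    stack = list(prerequisites[concept])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(prerequisites[current])
    return found


def tree_set_keys(prerequisites):
    """One key per possible (original concept, current concept) pair."""
    keys = []
    for original in prerequisites:
        candidates = all_prerequisites(original, prerequisites) | {original}
        for current in sorted(candidates):
            keys.append((original, current))
    return keys


def root_concepts(current, prerequisites):
    """Concepts taught by the root actions of the trees in one tree set."""
    return all_prerequisites(current, prerequisites) | {current}


keys = tree_set_keys(PREREQUISITES)
roots_for_database = root_concepts("database", PREREQUISITES)
```

With this layout, answering a current question only requires looking up one key and evaluating the handful of trees whose roots are listed for it.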

5 DYNAMIC CREATION OF POLICY TREES

5.1 Structure of the Policy Trees

In the following, we denote the system action for

teaching concept C by (!C), and denote a student ac-

ceptance action by (Θ). An acceptance action can be

something like “I understand”, and “I see”.

As just discussed, in a state subspace, for each possible pair of original and current questions, denoted by $(?C^o)$ and $(?C^c)$, we create tree set $\mathcal{T}^{C^o}_{C^c}$. The optimal action to answer current question $(?C^c)$ may be to teach $C^c$, or to teach one of the prerequisites of $C^c$. Therefore, the trees in $\mathcal{T}^{C^o}_{C^c}$ have root actions $(!C^c)$, or $(!C)$ where $C \in \wp_{C^c}$. For each $C \in \wp_{C^c}$, we have a tree in $\mathcal{T}^{C^o}_{C^c}$, of which the root is $(!C)$. We denote a tree in $\mathcal{T}^{C^o}_{C^c}$ with root action $(!C)$ by $\mathcal{T}^{C^o}_{C^c}.\tau_C$.

In a tutoring session started by $(?C^o)$, the ultimate goal of the agent is to teach $C^o$. In a policy tree for answering a current question, any leaf node must be a system action to terminate the session, after a student acceptance action that accepts $(!C^o)$. A path in the tree includes possible questions and answers concerning prerequisites of $C^o$. Also, since possible student questions in the session concern prerequisites of $C^o$, we limit the observation set O to questions concerning concepts in $\wp_{C^o}$ only. (Student actions are treated as observations.)

Figure 2: The general structure of policy tree $\mathcal{T}^{C^o}_{C^c}.\tau_C$.

Figure 2 illustrates the general structure of policy tree $\mathcal{T}^{C^o}_{C^c}.\tau_C$. The root of $\mathcal{T}^{C^o}_{C^c}.\tau_C$ is $(!C)$, i.e. an action for teaching $C \in (\wp_{C^c} \cup C^c)$. When C has M prerequisites $C_1, \ldots, C_M$, the root has M + 1 children. The observations $o_1, \ldots, o_M$ are student actions $(?C_1), \ldots, (?C_M)$. The first M children are sub-trees connected by the observation edges. Actions at the sub-tree roots are $(!C_1), \ldots, (!C_M)$. Note that edge $o_i$ (1 ≤ i ≤ M) may connect to any one of them. The last child is a sub-tree rooted by the action that teaches one of the direct successors of C. When C has more than one direct successor, the successor taught is the one on the path between $C^o$ and C. The semantics of such a root-children structure is that after $(!C)$, if the student accepts $(!C)$, the agent teaches that direct successor; if the student asks about a prerequisite of C, the agent teaches a prerequisite. The prerequisite to teach is dynamically selected. The selection will be discussed in the next subsection.

In a policy tree, each sub-tree is structured in the same way. That is, the root has one edge for each prerequisite of the concept in the root action, plus an acceptance edge $(\Theta)$, as illustrated at the second level of Figure 2. However, if a prerequisite has been taught in the path from the tree root, the edge is not included. If a root is $(!C^o)$ for answering the original question, its acceptance edge connects to an action terminating the session.
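The structural rules above can be sketched as a recursive builder. This is a simplified illustration under stated assumptions: nodes are plain dictionaries, the edge for question (?C_i) is always connected to (!C_i) (the actual selection is belief-based and is discussed in the next subsection), and the names `build_tree` and `successor_towards` are hypothetical.

```python
def all_prerequisites(concept, prerequisites):
    """Direct and indirect prerequisites of a concept."""
    found = set()
    stack = list(prerequisites[concept])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(prerequisites[current])
    return found


def successor_towards(concept, original, prerequisites):
    """A direct successor of concept that lies on a path to the original concept."""
    on_path = all_prerequisites(original, prerequisites) | {original}
    for candidate in sorted(prerequisites):
        if concept in prerequisites[candidate] and candidate in on_path:
            return candidate
    return None


def build_tree(concept, original, prerequisites, taught=frozenset()):
    """Build one tree rooted at the action that teaches concept."""
    taught = taught | {concept}
    node = {"action": f"(!{concept})", "edges": {}}
    # One question edge per prerequisite not yet taught on this path.
    for pre in sorted(prerequisites[concept]):
        if pre not in taught:
            node["edges"][f"(?{pre})"] = build_tree(pre, original, prerequisites, taught)
    # Acceptance edge: terminate after the original concept, else move up.
    if concept == original:
        node["edges"]["accept"] = {"action": "terminate", "edges": {}}
    else:
        nxt = successor_towards(concept, original, prerequisites)
        node["edges"]["accept"] = build_tree(nxt, original, prerequisites, taught)
    return node


# Example from Section 4.3: file is a prerequisite of database.
PREREQUISITES = {"database": {"file"}, "file": set()}
tree = build_tree("database", "database", PREREQUISITES)
```

In the resulting tree, the path through the question about file teaches file, returns to database without offering the file edge again, and then terminates on acceptance.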

5.2 Creation of the Policy Trees

When the student asks question $(?C^o)$ starting a new tutoring session, the agent goes to the subspace that contains $C^o$, and contains all the prerequisites of $C^o$ as well. To answer current question $(?C^c)$ in the session, it evaluates all the trees in tree set $\mathcal{T}^{C^o}_{C^c}$. We have developed a new technique to dynamically create the tree set when it is evaluated. This technique has better space efficiency than the method of storing a tree database.

As discussed in the previous subsection, in trees in $\mathcal{T}^{C^o}_{C^c}$, root actions teach concepts in $(\wp_{C^c} \cup C^c)$. In the general structure of a policy tree (illustrated in Figure 1), each edge (observation) connects to all the possible actions (in different trees). With this structure, for each $C \in (\wp_{C^c} \cup C^c)$ the number of trees with root $(!C)$ is exponential in the number of possible observations (see Eqn (7)). That is, the number of trees having the same root action is exponential.

To reduce the cost for evaluating policy trees, in $\mathcal{T}^{C^o}_{C^c}$ we create only one tree for each $C \in (\wp_{C^c} \cup C^c)$ and use it to approximate an exponential number of trees. To have only one tree for each $C \in (\wp_{C^c} \cup C^c)$, we connect each edge to one action, instead of all the possible actions. For example, when creating the tree in Figure 2, we select one action for edge $o_1$, one action for $o_2$, and so on for every observation edge in the tree.

In our research, we discovered that in a state only

a small number of actions have large enough chances

to be taken. In computing Eqn (2) for evaluating trees,

most actions contribute little to tree values. This sug-

gests that we would not lose much information when

ignoring the actions that are less likely to be taken

and contribute less. In the following, we discuss the

selection of an action for each edge by using the tree

in Figure 3.

Figure 3: The general structure of policy tree $\mathcal{T}^{C^o}_{C^c}.\tau_C$.

Assume at time step t, the agent has belief $b_t$, and will evaluate tree set $\mathcal{T}^{C^o}_{C^c}$, and assume the tree in Figure 3 is $\mathcal{T}^{C^o}_{C^c}.\tau_C$. In creating the tree, we select $(!C)$ as the root, which is a possible action to take at t, and then select an action for each edge based on an updated belief at the next level, and so on. For example, we need to select action a for edge o based on the updated belief $b_{t+1}$, select $a'$ for $o'$ based on the updated $b_{t+2}$, and so on.

A belief is a set of probabilities (see Eqn (1)). To update a belief, we update each of the probabilities. The following is the formula to calculate element $b'(s')$ in updated belief $b'$:

$$b'(s') = \sum_{s \in S} b(s) P(s'|s,a) P(o|a,s') / P(o|a) \qquad (9)$$

where $P(s'|s,a) \in T$ and $P(o|a,s') \in Z$ are the transition probability and observation probability, and $P(o|a)$ is the total probability for the agent to observe o after a is taken, calculated as

$$P(o|a) = \sum_{s \in S} b(s) \sum_{s' \in S} P(s'|s,a) P(o|a,s'). \qquad (10)$$

$P(o|a)$ is used in Eqn (9) as a normalization. Using Eqn (9) we can calculate $b_{t+1}$ from $b_t$, $(!C)$, and o.

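Let

$$b_{t+1} = [b_{t+1}(s_1), b_{t+1}(s_2), \ldots, b_{t+1}(s_Q)]. \qquad (11)$$

In $b_{t+1}$ we can find the j such that $b_{t+1}(s_j) \geq b_{t+1}(s_k)$ for all the k ≠ j (1 ≤ j, k ≤ Q). Assume the state formula of $s_j$ is

$$(\surd C_1 \ldots \surd C_{l-1} \neg C_l \ldots \neg C_N). \qquad (12)$$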


The belief and state formula indicate that most likely the student does not understand $C_l$, but understands all of its prerequisites. Therefore, $\rho(s_j, (!C_l))$ returns a high value. Considering a single step, we select $(!C_l)$ as an optimal action at $b_{t+1}$. Thus we select $(!C_l)$ as a, and connect the edge of o to it.

6 EXPERIMENTAL RESULTS AND DISCUSSION

In this section, we present two sets of experimental

results. The ﬁrst set includes the results of evaluating

adaptive teaching of the system, and the second set in-

cludes the results of testing the technique for dynamic

creation of trees. The data set used in the experiments

included 90 concepts in software basics. Each con-

cept had zero to ﬁve prerequisites.

We used a two-sample t-test method to evaluate the system performance in adaptive teaching. The test method was the independent-samples t-test. Thirty students participated in the experiment, all adults without formal training in computing. The students

were randomly divided into two equal size groups.

Group 1 studied with the ITS with the POMDP turned

off, and Group 2 studied with the POMDP turned on.

Each student studied with the ITS for about 45 min-

utes. The student asked questions about concepts in

the subject, and the ITS taught the concepts. The performance parameter was rejection rate, which was

the ratio of the number of system actions rejected by

a student to the total number of system actions for

teaching the student.

For each student, we calculated a rejection rate. For the two groups, we calculated mean rejection rates $\bar{X}_1$ and $\bar{X}_2$. The two sample means were used to represent population means $\mu_1$ and $\mu_2$. The alternative and null hypotheses are:

$$H_a: \mu_1 - \mu_2 \neq 0, \qquad H_0: \mu_1 - \mu_2 = 0$$

The means and variances calculated for the two groups are listed in Table 1. In the experiment, $n_1 = 15$ and $n_2 = 15$, thus the degree of freedom is (15 − 1) + (15 − 1) = 28. With alpha at 0.05, the two-tailed $t_{crit}$ is 2.0484 and we calculated $t_{obt}$ = +8.6690. Since the $t_{obt}$ is far beyond the non-reject region defined by $t_{crit}$ = 2.0484, we could reject $H_0$ and accept $H_a$. The analysis suggested that the POMDP could significantly reduce the rejection rate. This implies that the POMDP helped the system significantly improve adaptive teaching.
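As a rough check of the arithmetic, the statistic can be recomputed from the rounded summary values in Table 1 with the standard pooled-variance formula for independent samples. The function name is illustrative. Because the table values are rounded to four decimals, the result comes out at about 8.66, close to but not exactly the reported 8.6690.

```python
import math


def pooled_t_statistic(mean1, var1, n1, mean2, var2, n2):
    """Independent-samples t statistic with a pooled variance estimate."""
    df = (n1 - 1) + (n2 - 1)
    pooled_variance = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    standard_error = math.sqrt(pooled_variance * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / standard_error, df


# Rounded summary statistics taken from Table 1.
t_obt, df = pooled_t_statistic(0.5966, 0.0158, 15, 0.2284, 0.0113, 15)
T_CRIT = 2.0484  # two-tailed critical value at alpha = 0.05, df = 28
reject_null = abs(t_obt) > T_CRIT
```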

Table 1: Number of students, mean and estimated variance of each group.

                      Group 1                Group 2
Number of students    $n_1$ = 15             $n_2$ = 15
Sample mean           $\bar{X}_1$ = 0.5966   $\bar{X}_2$ = 0.2284
Estimated variance    $s_1^2$ = 0.0158       $s_2^2$ = 0.0113

We tested the dynamic tree creation technique with the same data set of software basics, on a desktop computer with an Intel Core i5 3.2 GHz 64 bit processor and 16GB RAM. For comparison, we also tested a static tree creation technique. Both the static and dynamic tree creation techniques have the same algorithm for grouping trees. The difference is that in the static technique all the tree sets were created and stored in a database before the ITS started teaching students, while in the dynamic technique a tree set was created right before it was evaluated. In both techniques, the state space was subdivided into six subspaces. The largest subspace included 27 concepts, 4,970 valid states, 170 tree sets, and 688 trees.

Table 2: Comparison between static and dynamic tree creation methods.

                          Static creation   Dynamic creation
Permanent space usage     1.078GB           0
Max space usage           1.078GB           215.68MB
Database creation time    36,888ms          -
Max tree creation time    -                 158ms
Belief update time        525ms             518ms
Max decision time         147ms             152ms
Max response time         669ms             828ms

Table 2 lists the results of the two techniques. The space usage counts only the tree database (static technique) or the tree sets (dynamic technique). The maximum tree creation time was for creating the largest tree set, of which the size was 215.68MB. The maximum decision time was for evaluating the trees in the largest tree set and choosing the optimal tree. Response time included the time for calculating a new belief and for evaluating a tree set to choose an optimal tree. The maximum response time was recorded when the largest tree set was evaluated.
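The decision step described above, evaluating every tree in a set and choosing the best one, can be sketched as follows. The sketch assumes the standard POMDP formulation in which each policy tree is summarized by a vector of per-state values and its value under a belief is the belief-weighted sum of those values. The function names, tree names, and numbers are hypothetical and are not taken from the system described here.

```python
from typing import Dict, List


def tree_value(belief: List[float], state_values: List[float]) -> float:
    """Expected value of one policy tree under a belief: sum over s of b(s) * V(s)."""
    return sum(b * v for b, v in zip(belief, state_values))


def choose_optimal_tree(belief: List[float], tree_set: Dict[str, List[float]]) -> str:
    """Return the name of the tree with the highest expected value under the belief."""
    return max(tree_set, key=lambda name: tree_value(belief, tree_set[name]))


# Toy example with three states and three candidate trees (made-up values).
belief = [0.7, 0.2, 0.1]
tree_set = {
    "tree_a": [1.0, 0.0, 0.0],
    "tree_b": [0.2, 0.9, 0.4],
    "tree_c": [0.0, 0.5, 1.0],
}

best = choose_optimal_tree(belief, tree_set)
print(best)
```

The cost of this step grows with the number of trees in the set and the number of states, which is why the maximum decision time in Table 2 is measured on the largest tree set.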

The experimental results suggest that the dynamic tree creation technique is effective for building space-efficient ITSs. Its maximum space usage (215.68MB) was about one fifth of that of the static technique (1.078GB), and it required no permanent storage. In terms of time efficiency, the dynamic technique was comparable to the static technique. Since the time for dynamically creating a tree set is short (at most 158ms), the maximum response time was only slightly longer than that of the static technique (828ms versus 669ms). As can be seen in Table 2, the maximum response time with dynamic tree creation was less than a second. For a tutoring system, such a response time could be considered acceptable. Time efficiency can be further improved with a cache of tree sets.
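One possible form of such a cache is a small least-recently-used (LRU) store that keeps a few tree sets in memory and rebuilds the rest on demand. This is only a sketch under assumptions: `TreeSetCache`, `build_tree_set`, and the capacity are hypothetical names and values, and the builder below is a stand-in for the actual dynamic tree creation procedure.

```python
from collections import OrderedDict


class TreeSetCache:
    """LRU cache that keeps the most recently used tree sets in memory."""

    def __init__(self, build_tree_set, capacity=4):
        self._build = build_tree_set   # function that creates a tree set on demand
        self._capacity = capacity      # maximum number of tree sets kept in memory
        self._cache = OrderedDict()
        self.builds = 0                # number of dynamic creations performed

    def get(self, key):
        if key in self._cache:
            # Cache hit: mark as most recently used, skip tree creation.
            self._cache.move_to_end(key)
            return self._cache[key]
        # Cache miss: create the tree set dynamically and store it.
        tree_set = self._build(key)
        self.builds += 1
        self._cache[key] = tree_set
        if len(self._cache) > self._capacity:
            # Evict the least recently used tree set to bound memory use.
            self._cache.popitem(last=False)
        return tree_set


def build_tree_set(key):
    # Hypothetical stand-in for dynamic tree set creation.
    return [f"{key}-tree-{i}" for i in range(3)]


cache = TreeSetCache(build_tree_set, capacity=2)
for key in ["A", "A", "B", "C", "A"]:
    cache.get(key)
print(cache.builds)
```

The capacity trades memory for time: a larger cache avoids more repeated creations but moves the peak space usage closer to that of a stored tree database.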


7 CONCLUDING REMARKS

We have developed a new technique to address the space complexity problem in building POMDP-based ITSs. In this technique, policy trees are dynamically created when they are evaluated, and no space is required to store a tree database. The technique is especially useful for building ITSs on handheld devices, which usually have limited storage space. While significantly improving space efficiency, the technique does not sacrifice much time efficiency. In some cases, it may even have advantages in time efficiency. For example, for a system without durable storage, the technique may largely reduce the time to start, since the lengthy tree database creation is not needed.

ACKNOWLEDGEMENTS

This research is supported by the Natural Sci-

ences and Engineering Research Council of Canada

(NSERC).


CSEDU 2018 - 10th International Conference on Computer Supported Education
