technique to practical tutoring problems.
We develop a novel technique of policy trees, aiming to minimize the number of trees that must be evaluated in making a decision and to reduce the cost of evaluating individual trees. The technique is based on information about the pedagogical order of the content in the instructional subject. In this paper, we first provide background on ITSs and POMDPs and review existing work on using POMDPs to build ITSs, with emphasis on POMDP solving; we then present our technique of policy trees, and finally we discuss some experimental results.
2 INTELLIGENT TUTORING SYSTEMS
Two major features of an ITS are knowledge tracking
and adaptive instruction: An ITS should be able to
store and track a student’s knowledge states during
a tutoring process, and choose the optimal tutoring
actions accordingly.
The core modules in an ITS include a domain
model, a student model, and a tutoring model. The
domain model stores the domain knowledge, which
is basically the knowledge in the instructional sub-
ject. For a subject, an ITS may teach concepts or
problem-solving skills, or both. In the domain model,
the knowledge for the former is usually declarative,
while the knowledge for the latter is procedural.
The student model contains information about students. There are two types of student information: information about how students in general behave when studying the subject, and information about the current state of the particular student being tutored. The tutoring model represents the system's tutoring strategies.
In each tutoring step, the agent accesses the student model to obtain information about the student's current state; based on this information, it applies the tutoring model to choose a tutoring action and retrieves the knowledge to teach from the domain model. After taking the action, it updates the student model, chooses and takes the next action based on the updated model, and so on, until the tutoring session ends.
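To make this loop concrete, the following is a minimal sketch of one tutoring session under the three-module architecture described above; the module interfaces and method names are hypothetical, not those of any particular system.

# Hypothetical sketch of the tutoring loop; the module interfaces are illustrative.
def tutoring_session(student_model, tutoring_model, domain_model, deliver):
    while not student_model.session_finished():
        # 1. Obtain the student's current (estimated) state from the student model.
        state = student_model.current_state()
        # 2. Apply the tutoring model (the tutoring strategy) to choose an action.
        action = tutoring_model.choose_action(state)
        # 3. Retrieve from the domain model the knowledge needed for that action.
        content = domain_model.lookup(action)
        # 4. Deliver the action to the student and record the response.
        response = deliver(action, content)
        # 5. Update the student model before choosing the next action.
        student_model.update(action, response)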
The above discussion suggests that intelligent tutoring can be modeled as a Markov decision process (MDP). In an MDP, the decision agent is in some state at any point in time. Based on information about the state, it chooses and takes the action it considers optimal. After the action, the agent receives a reward and enters a new state, where it chooses the next action, and so on. In an MDP, states are completely observable to the decision agent, so the agent knows exactly what the current state is. However, as mentioned before, in a tutoring process a student's states are not always completely observable. A partially observable Markov decision process (POMDP) is therefore a more suitable modeling tool for intelligent tutoring processes.
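For contrast with the partially observable case developed next, a single MDP decision step can be sketched as follows; the states, actions, rewards, and transition probabilities are invented for illustration only.

import random

# In an MDP the agent observes the true state directly, so a policy can be a
# plain mapping from states to actions; all values here are illustrative.
policy = {"confused": "give_hint", "progressing": "pose_problem"}
rewards = {("confused", "give_hint"): 1.0, ("progressing", "pose_problem"): 2.0}
transitions = {  # P(s'|s, a) represented as nested dictionaries
    ("confused", "give_hint"): {"confused": 0.4, "progressing": 0.6},
    ("progressing", "pose_problem"): {"confused": 0.2, "progressing": 0.8},
}

state = "confused"                      # fully observable current state
action = policy[state]                  # the agent knows s exactly
reward = rewards[(state, action)]       # reward received after the action
next_probs = transitions[(state, action)]
next_state = random.choices(list(next_probs), weights=list(next_probs.values()))[0]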
3 PARTIALLY OBSERVABLE MARKOV DECISION PROCESS
The major components of a POMDP are S, A, T, ρ, O, and Z, where S is a set of states, A is a set of actions, T is a set of state transition probabilities, ρ is a reward function, O is a set of observations, and Z is a set of observation probabilities. At a point of time, the decision agent is in state s ∈ S; it takes action a ∈ A, then enters state s′ ∈ S, observes o ∈ O, and receives reward r = ρ(s, a, s′). The probability of transitioning from s to s′ after a is P(s′|s, a) ∈ T. The probability of observing o in s′ after a is P(o|a, s′) ∈ Z. Since the states are not completely observable, the agent infers state information from its observations, and makes decisions based on its inferred beliefs about the states.
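As a simplified illustration, the six components can be collected in a small container; the field names and dictionary layout below are our own, not a standard interface.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A simplified container for the POMDP components S, A, T, rho, O, Z.
# transitions[(s, a)][s'] = P(s'|s, a); obs_probs[(a, s')][o] = P(o|a, s').
@dataclass
class POMDP:
    states: List[str]                                      # S
    actions: List[str]                                     # A
    transitions: Dict[Tuple[str, str], Dict[str, float]]   # T
    reward: Callable[[str, str, str], float]               # rho(s, a, s')
    observations: List[str]                                # O
    obs_probs: Dict[Tuple[str, str], Dict[str, float]]     # Z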
An additional major component of a POMDP is the policy, denoted by π. It is used by the agent to choose
an action based on its current belief:
a = π(b) (1)
where b is the belief, which is defined as
b = [b(s_1), b(s_2), ..., b(s_Q)]    (2)
where s_i ∈ S (1 ≤ i ≤ Q) is the ith state in S, Q is the number of states in S, b(s_i) is the probability that the agent is in s_i, and ∑_{i=1}^{Q} b(s_i) = 1.
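The following small sketch shows a belief represented this way, together with a placeholder policy π mapping a belief to an action; the action names and the policy rule are purely illustrative.

# A belief is a probability distribution over the Q states; it must sum to 1.
def make_belief(weights):
    total = sum(weights)
    return [w / total for w in weights]    # normalize so that sum of b(s_i) is 1

b = make_belief([2.0, 5.0, 3.0])           # b = [b(s_1), b(s_2), b(s_3)]
assert abs(sum(b) - 1.0) < 1e-9

# A placeholder pi(b): choose the action associated with the most likely state.
def pi(belief, actions=("review", "explain", "advance")):
    return actions[belief.index(max(belief))]

a = pi(b)                                  # a = pi(b), as in Equation (1)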
Given a belief b, an optimal π returns an optimal
action. For a POMDP, finding the optimal π is called
solving the POMDP. For most applications, solving
a POMDP is a task of great computational complex-
ity. A practical method for POMDP-solving is using
policy trees. In a policy tree, nodes are actions and
edges are observations. Based on a policy tree, after
an action (at a node) is taken, the next action is deter-
mined by what is observed (at an edge). Thus a path
in a policy tree is a sequence of “action, observation,
action, observation, ..., action”.
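A policy tree of this form can be represented directly as nodes holding actions, with outgoing edges indexed by observations; the sketch below, including the action and observation names, is only illustrative.

from dataclasses import dataclass, field
from typing import Dict

# A policy-tree node: the node stores an action; each outgoing edge is
# labeled with an observation and leads to the next action node.
@dataclass
class PolicyTreeNode:
    action: str
    children: Dict[str, "PolicyTreeNode"] = field(default_factory=dict)

# A tiny two-level tree: the observation after "explain" decides what follows.
root = PolicyTreeNode("explain", {
    "correct":   PolicyTreeNode("advance"),
    "incorrect": PolicyTreeNode("give_hint"),
})

# Following one path yields the sequence: action, observation, action, ...
path = [root.action, "correct", root.children["correct"].action]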
In the method of policy trees, a decision consists of choosing the optimal policy tree and taking its root action. Each policy tree is associated with a value func-
tion. Let τ be a policy tree and s be a state. The value
function of s given τ is
V^τ(s) = R(s, a) + γ ∑_{s′ ∈ S} P(s′|s, a) ∑_{o ∈ O} P(o|a, s′) V^{τ(o)}(s′)    (3)
where a is the root action of τ, γ is a discounting fac-
tor, o is the observation after the agent takes a, τ(o) is