Figure 2: Lattice graph with walls.
count values whose complexity is $O(n)$. The complexity for the comparison of two vectors is also $O(n)$.
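For concreteness, such a comparison can be sketched in Python as follows (an illustration, not code from this paper). It assumes that objective vectors are kept sorted in ascending order, as they are displayed in the examples of the next section, so that vleximax compares the largest cost values first; the tie-break that prefers the shorter vector when one is a prefix of the other is also an assumption.

```python
def vleximax_less(a, b):
    """Return True if vector a precedes vector b under vleximax.

    Assumes a and b are sorted in ascending order, so the largest
    cost values are compared first by scanning from the back.  As an
    assumed tie-break, a vector that is a prefix of the other (fewer
    cost values) is the smaller one.  Runs in O(n) time.
    """
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            return x < y
    return len(a) < len(b)

def vmin(a, b):
    """Return the smaller of two vectors under vleximax."""
    return a if vleximax_less(a, b) else b
```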
4 INCREMENTAL OPTIMIZATION
Next we focus on how the real-time search algorithm can be generalized with the leximax criterion. Unfortunately, a direct generalization is impossible due to a problematic monotonicity on cyclic paths, as the following example shows.
Consider the case shown in Fig. 2, where the agent starts from node 1. For the nodes adjacent to node 1, $h(2) + w_{1,2} = [\,] + [1] = [1]$ and $h(4) + w_{1,4} = [\,] + [2] = [2]$. With the vleximax and the rules based on the LRTA* shown in Section 2.3, the agent moves to node 2 and updates $h(1)$ to $[1]$. Then for the nodes adjacent to node 2, $h(1) + w_{1,2} = [1] + [1] = [1,1]$, $h(3) + w_{2,3} = [\,] + [2] = [2]$, and $h(5) + w_{2,5} = [\,] + [2] = [2]$. Therefore, the agent returns to node 1 and updates $h(2)$ to $[1,1]$. In the third step, for the nodes adjacent to node 1, $h(2) + w_{1,2} = [1,1] + [1] = [1,1,1]$ and $h(4) + w_{1,4} = [\,] + [2] = [2]$. Therefore, the agent returns to node 2 again and repeats this round trip forever, appending cost value 1 to $h(1)$ and $h(2)$ at each step.
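The round trip can be replayed with a small sketch using vleximax_less from above. The encoding is hypothetical: only the edges named in the example are taken from Fig. 2, and vector addition is assumed to be concatenation followed by re-sorting, consistent with the sums shown above.

```python
def vadd(a, b):
    """Add two cost vectors: concatenate and re-sort ascending."""
    return sorted(a + b)

# Edge-cost vectors and adjacency around nodes 1 and 2, as named above.
w = {(1, 2): [1], (1, 4): [2], (2, 3): [2], (2, 5): [2]}
w.update({(j, i): c for (i, j), c in list(w.items())})
neighbors = {1: [2, 4], 2: [1, 3, 5]}

h = {n: [] for n in range(1, 10)}  # all estimates start as empty vectors
pos = 1
for step in range(6):
    # Evaluate f(n) = h(n) + w for each neighbor, pick the vleximax-least.
    best, f_best = None, None
    for n in neighbors[pos]:
        f_n = vadd(h[n], w[(pos, n)])
        if best is None or vleximax_less(f_n, f_best):
            best, f_best = n, f_n
    h[pos] = f_best  # LRTA*-style update of the current node
    pos = best       # greedy move
    print(step, pos, h[1], h[2])  # h(1), h(2) gain another cost value 1 each round
```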
The above example reveals the need for other exploration approaches in the case of sorted objective vectors of variable lengths when cyclic paths can occur. Such cyclic paths can be detected with a threshold on path length, and the incorrectly grown vectors can then be replaced by appropriate vectors that break the cyclic movements. However, such an approach might be problematic, since the invariance of vleximin no longer holds, which may affect the correctness of the dynamic programming.
4.1 Episode-based Approach
Here we address safer approaches based on a relatively direct extension of conventional search algorithms. Since the dynamic programming itself is correct, we employ episode-based learning, where the learning phase is separated from the exploration phase. This approach is also called off-line learning.
Assume that a complete path between the start and goal nodes has been obtained from an exploration phase. Cyclic paths are allowed, since they increase the learning opportunities.
Then the path is scanned from the goal node back to the initial start node, updating the corresponding estimated values $h(s_k)$, except for the goal node. Note that here $s_k$ denotes the $k$-th node from the initial start node on the path:
1. The agent evaluates $f(s_{k+1}) = w_{i,j} + h(s_{k+1})$, where $w_{i,j}$ corresponds to the edge $e_{i,j}$ between $s_k$ and $s_{k+1}$.
2. If $h(s_k)$ has not been updated yet, it is updated by $f(s_{k+1})$. Otherwise, it is updated by $\min(f(s_{k+1}), h(s_k))$.
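A direct transcription of these rules might look as follows (a sketch under the same assumptions as above). Representing the not-yet-updated state by None rather than by the empty vector is an implementation choice here, since the empty vector would otherwise win every minimization.

```python
def update_episode(path, w, h):
    """One learning phase: scan an episode backward from the goal.

    path: the visited nodes from the initial start node to the goal.
    w:    edge-cost vectors indexed by node pairs.
    h:    estimated cost vectors, updated in place; nodes that have
          never been updated hold None, and the goal node holds [].
    """
    for k in range(len(path) - 2, -1, -1):
        s_k, s_next = path[k], path[k + 1]
        f = vadd(w[(s_k, s_next)], h[s_next])      # step 1: f(s_{k+1})
        # step 2: a first update replaces h; later ones keep the smaller
        h[s_k] = f if h[s_k] is None else vmin(f, h[s_k])
```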
In the example of Fig. 2, assume that an episode of nodes $(1, 2, 3, 6, 5, 2, 3, 6, 9)$ has been performed in the initial trial. Since node 9 is the goal node, $h(9)$ holds the empty vector $[\,]$. Then $h(6)$ is updated by $f(9) = [\,] + [1] = [1]$. Similarly, for the preceding part of the path, nodes $(5, 2, 3)$, the values $h(3) = [1,1]$, $h(2) = [1,1,2]$, and $h(5) = [1,1,2,2]$ are updated. However, for the preceding node 6, $h(6)$ keeps its vector, since $\min(f(5), h(6)) = \min([1,1,1,2,2], [1]) = [1]$. The values $h(3) = [1,1]$ and $h(2) = [1,1,2]$ are also unchanged for the preceding part of nodes $(3, 2)$. Finally, $h(1)$ is updated to $[1,1,1,2]$.
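Running the sketch above on this episode reproduces these values. The edge costs $w_{3,6} = w_{5,6} = w_{6,9} = [1]$ are inferred from the worked numbers rather than stated explicitly:

```python
w = {(1, 2): [1], (2, 3): [2], (2, 5): [2],
     (3, 6): [1], (5, 6): [1], (6, 9): [1]}
w.update({(j, i): c for (i, j), c in list(w.items())})

h = {n: None for n in range(1, 10)}
h[9] = []  # the goal node holds the empty vector
update_episode([1, 2, 3, 6, 5, 2, 3, 6, 9], w, h)
print(h[1], h[2], h[3], h[5], h[6])
# -> [1, 1, 1, 2] [1, 1, 2] [1, 1] [1, 1, 2, 2] [1]
```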
The above $h(s_k)$ is an upper bound of the optimal cost value from $s_k$ to the goal node, since it is updated by propagation from the goal node. When the agent's explorations are sufficient, $h(s_k)$ converges to the optimal value, since the algorithm exactly performs partial updates of the dynamic programming.
4.2 Boundaries of Paths
While the above episode-based approach needs complete paths to the goal nodes, it converges with appropriate exploration strategies. Another problem with the above simple update rule is that it does not employ the information of neighboring nodes that are not on the path. In addition, the algorithm cannot evaluate the lower-bound cost values that could be employed by best-first strategies.
Here we address the lower bound $\underline{h}(s_i)$ and the upper bound $\overline{h}(s_i)$ of the optimal cost value. The boundaries $\underline{h}(s_i)$ and $\overline{h}(s_i)$ of the estimated cost values are initialized as follows:
1. Except for the goal node, $\underline{h}(s_i)$ and $\overline{h}(s_i)$ are initialized to $[\,]$ and $[\top \cdots \top]$, respectively, where $\top$ denotes the maximum cost value. $\overline{h}(s_i)$ must contain a sufficient number of duplicates of the maximum cost value to exceed the other objective vectors in the manner of vleximin.