Although leximax aggregation can be decomposed exactly with dynamic programming approaches in several cases (Matsui et al., 2018a; Matsui et al., 2018c; Matsui et al., 2018b), such an exact decomposition does not hold here. We instead assume that fair policies are more easily improved than unfair policies when additional actions are aggregated. Whether this holds depends on the degrees of freedom of the problem, which must allow such a greedy approach. Although the approach is only a heuristic, it should work reasonably well when there are many opportunities to improve the partial costs of earlier actions in the leximax sense.
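For concreteness, the following sketch illustrates the leximax order referred to in this discussion; the function names are ours, and the comparison of descending-sorted cost vectors is the standard leximax convention rather than a fragment of our implementation.

```python
def leximax_key(cost_vector):
    """Sort the individual costs in descending order so that the
    largest (bottleneck) cost is compared first."""
    return tuple(sorted(cost_vector, reverse=True))

def leximax_min(vectors):
    """Select the cost vector that is minimal in the leximax order,
    i.e., whose descending-sorted costs are lexicographically smallest."""
    return min(vectors, key=leximax_key)

# (3, 3, 3) is preferred to (1, 1, 5) even though its sum is larger,
# because its worst individual cost is smaller.
print(leximax_min([(1, 1, 5), (3, 3, 3)]))  # -> (3, 3, 3)
```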
In most non-deterministic cases, the learning process will not converge. However, after a sufficient number of updates with an appropriate learning rate, the Q-vectors capture statistical information about the optimal policy, similarly to the deterministic case.
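As a rough illustration of such updates, the sketch below moves a stored Q-vector toward a sampled target cost vector with a fixed learning rate; the element-wise averaging is a simplification of the update rules in this paper, and the constant ALPHA is an assumed value.

```python
ALPHA = 0.1  # learning rate (illustrative value)

def update_q_vector(q_vector, sampled_target):
    """Move each component of the stored Q-vector toward the
    corresponding component of a sampled target cost vector.
    With a suitable (e.g., decaying) learning rate the vector drifts
    toward a statistical summary of the sampled targets even when
    the transitions are non-deterministic."""
    return [(1.0 - ALPHA) * q + ALPHA * t
            for q, t in zip(q_vector, sampled_target)]
```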
Since leximax relies on operations over sorted vectors, the computational overhead of the proposed approach is significant. While our main interest in this work is how the information about bottlenecks and fairness is mapped to Q-values, in practical implementations the overhead should be mitigated with techniques such as caching the sorted vectors.
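A minimal sketch of one such mitigation, assuming the cost vectors are stored as hashable tuples; the memoization shown here is an illustration, not part of our implementation.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def sorted_view(cost_vector):
    """Cache the descending sort of a cost vector so that repeated
    leximax comparisons of the same vector do not re-sort it."""
    return tuple(sorted(cost_vector, reverse=True))
```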
4 EVALUATION
We experimentally evaluated our proposed approach
for deterministic and non-deterministic processes by
employing the example domain of the pursuit prob-
lem.
4.1 Settings
As shown in Section 3.1, we employ a pursuit problem with four hunters and one target. The torus world is a 5 × 5 or 7 × 7 grid for the deterministic domain, and a 5 × 5 grid for the non-deterministic domain owing to the computation time required by the learning process.
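For reference, the following sketch shows how moves wrap around in such a torus world; the move set and coordinate convention are illustrative assumptions.

```python
GRID = 5  # 5 x 5 torus (7 x 7 in the larger deterministic setting)

# Candidate moves: stay in place or step in one of four directions.
MOVES = {"stay": (0, 0), "up": (0, -1), "down": (0, 1),
         "left": (-1, 0), "right": (1, 0)}

def step(position, move):
    """Apply a move on the torus: coordinates wrap around the edges."""
    x, y = position
    dx, dy = MOVES[move]
    return ((x + dx) % GRID, (y + dy) % GRID)
```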
After the learning process, policy selection is performed and the individual total cost values of the hunters are evaluated. Here the total cost value corresponds to the total number of moves of each hunter. In this experiment, the initial locations of the four hunters are the four corner cells (Figure 1); since the grid world is a torus, this is equivalent to gathering the hunters in a single area. The initial location of the target is, in turn, set to every cell except the hunters' initial locations. We performed ten trials for each setting and averaged the results.
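The evaluation instances can be enumerated as in the sketch below, assuming the four corner cells of the 5 × 5 grid as the hunters' initial locations; the concrete coordinates are illustrative.

```python
import itertools

GRID = 5
# Assumed corner coordinates for the hunters (illustrative).
HUNTER_STARTS = [(0, 0), (0, GRID - 1), (GRID - 1, 0), (GRID - 1, GRID - 1)]

# Every remaining cell is used once as the target's initial location:
# 5 * 5 - 4 = 21 instances per trial.
TARGET_STARTS = [cell for cell in itertools.product(range(GRID), repeat=2)
                 if cell not in HUNTER_STARTS]

assert len(TARGET_STARTS) == GRID * GRID - len(HUNTER_STARTS)
```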
We compared the following three methods.
• sum: a single-objective reinforcement learning method, shown in Equations (6) and (10), that minimizes the total cost over all the hunters.
• lxm: a multi-objective reinforcement learning method that minimizes the individual total cost values of all the hunters with the minimum leximax filtering.
• sumlxm: a multi-objective reinforcement learning method that minimizes the total cost over all the hunters, while the policy selection is the same as in ‘lxm’. Note that we employed update rules that resemble Equations (8) and (11), with the filtering criterion replaced by minimum summation, so that the Q-vectors are compatible with ‘lxm’.
Here ‘sumlxm’ is employed to separate the effect of the Q-vectors of the minimum-leximax approach from that of the action selection method; the difference between the selection criteria is sketched below.
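The following is a minimal sketch of the two selection criteria, assuming a table q_table that maps joint state-action pairs to cost vectors; the names and data layout are ours.

```python
def select_action_sum(q_table, state, actions):
    """'sum': choose the joint action whose Q-vector has the
    smallest total cost."""
    return min(actions, key=lambda a: sum(q_table[(state, a)]))

def select_action_leximax(q_table, state, actions):
    """'lxm' and 'sumlxm': choose the joint action whose Q-vector is
    minimal in the leximax order (largest individual cost compared first)."""
    return min(actions,
               key=lambda a: tuple(sorted(q_table[(state, a)], reverse=True)))
```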
4.2 Results
4.2.1 Deterministic Domain
We first evaluated the proposed approach for the deterministic decision process, without any randomness in the moves of the hunters and the target. The learning process repeatedly updated the Q-vectors of all the joint state-action pairs, in a manner similar to the Bellman-Ford algorithm. In the deterministic case, the Q-vectors converged after a few sweeps over all of them. Because of the infinite cost vector assigned to the all-stop joint actions, there were no infinite cyclic policies.
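The structure of this learning loop can be sketched as follows, where update_one is a hypothetical callback that applies one Q-vector update and reports whether the stored vector changed.

```python
def sweep_until_converged(states, joint_actions, update_one):
    """Repeatedly update every joint state-action pair, in the spirit of
    the Bellman-Ford algorithm, until a full sweep changes no Q-vector.
    update_one(state, action) is assumed to return True when the stored
    Q-vector for that pair was modified."""
    sweeps = 0
    changed = True
    while changed:
        changed = False
        for s in states:
            for a in joint_actions:
                if update_one(s, a):
                    changed = True
        sweeps += 1
    return sweeps
```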
Table 1 shows the total cost values of the different methods in the 5 × 5 world. The size of the joint state-action space to be learned is (5 × 5 × 2)^4 = 6,250,000 for this problem. In the policy selection experiment, there are 5 × 5 − 4 = 21 initial locations for the target and thus 210 instances over ten trials. The results show that ‘sum’ minimizes the total cost over all the hunters, which is equivalent to minimizing the average cost. On the other hand, ‘lxm’ minimizes the maximum total cost among the individual hunters and also reduces the Theil index, which is a criterion of unfairness. Since there is a trade-off between efficiency and fairness, the total (i.e., the average) cost over all hunters under ‘lxm’ is not less than that under ‘sum’.
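For reference, the sketch below computes the Theil index in its standard form; the exact variant reported in Table 1 may differ, so this is an illustration only.

```python
import math

def theil_index(costs):
    """Theil index of individual total costs: 0 when all costs are
    equal, larger when the costs are more unequal."""
    mean = sum(costs) / len(costs)
    return sum((c / mean) * math.log(c / mean)
               for c in costs if c > 0) / len(costs)

# Equal costs give 0; an unbalanced split gives a positive value.
print(theil_index([10, 10, 10, 10]))  # 0.0
print(theil_index([4, 4, 4, 28]))     # roughly 0.45
```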
The result of ‘sumlxm’, which combines the learning of ‘sum’ with the action selection of ‘lxm’, shows a mismatch between the learning and the action selection. It also reveals that both the learning and the action selection of ‘lxm’ contribute to its improvements.
In addition, the policy length of ‘lxm’ is somewhat greater than that of ‘sum’. A possible reason is that