5 DISCUSSION
Since the proposed approach is essentially an extension of shortest path problems, it can be applied to similar classes of problems in the context of reinforcement learning, including the case of positive rewards on acyclic state transitions. In other cases, such as those based on a potential field of rewards, more consideration is necessary to model the problems.
As shown in (Matsui, 2019), applying the idea of leximin/leximax to the optimization of joint policies among multiple agents, in order to improve the equality among individual agents, is difficult because the aggregation of sorted objective vectors cannot be well decomposed along the direction of episodes. On the other hand, the class of problems in this study is relatively more tractable, since the decomposition of a real-valued histogram is based on dynamic programming. However, more investigation is necessary to determine whether this approach can appropriately fit certain classes of multiagent or multi-objective problems.
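For illustration, the following is a minimal sketch of the leximax comparison of sorted cost vectors that underlies such criteria, assuming a descending sort with lexicographic comparison in which smaller sorted vectors are preferred; the function names are ours and do not appear in the paper.

```python
# Minimal sketch of a leximax order on per-step cost vectors (illustrative
# names; assumes descending sort and lexicographic comparison, smaller
# sorted vectors preferred).

def leximax_key(costs):
    """Sort a vector of per-step costs in descending order."""
    return sorted(costs, reverse=True)

def leximax_better(costs_a, costs_b):
    """Return True if costs_a is preferred to costs_b under leximax:
    its largest cost is smaller, ties broken by the next largest, etc."""
    return leximax_key(costs_a) < leximax_key(costs_b)

# Example: both episodes have total cost 10, but the second avoids the
# bottleneck cost of 7 and is therefore preferred under leximax.
print(leximax_better([7, 2, 1], [4, 3, 3]))  # False
print(leximax_better([4, 3, 3], [7, 2, 1]))  # True
```

Because the sort is a global operation over all values in an episode, such a comparison cannot be evaluated step by step, which is the source of the decomposition difficulty noted above.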
The vleximax criterion is based on the assumption that the prediction of future optimal episodes is possible. Therefore, it cannot be employed in highly stochastic problems. On the other hand, where a statistically optimal policy can be computed with real-valued histograms, there may be opportunities to improve policies with vleximax.
We employed a simple definition of the weighted average of real-valued histograms with scalar weight values. It may be possible to use weight vectors to emphasize part of the discrete cost values. In addition, other aggregation operators that generate a histogram from the two original histograms in different ways could affect the solution quality.
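To make this concrete, the sketch below illustrates a scalar-weighted average of two histograms and a hypothetical vector-weighted variant, under our own simplifying assumption that a real-valued histogram is represented as a mapping from discrete cost values to real-valued counts; the names and the exact blending rule are illustrative and may differ from the actual implementation.

```python
# Sketch of scalar- and (hypothetical) vector-weighted averaging of
# real-valued histograms, represented as dicts {cost value: real count}.

from collections import defaultdict

def weighted_average(hist_a, hist_b, alpha):
    """Blend two histograms entry-wise with a single scalar weight alpha."""
    merged = defaultdict(float)
    for cost in set(hist_a) | set(hist_b):
        merged[cost] = (1.0 - alpha) * hist_a.get(cost, 0.0) \
                       + alpha * hist_b.get(cost, 0.0)
    return dict(merged)

def vector_weighted_average(hist_a, hist_b, alpha_per_cost, default_alpha):
    """Hypothetical variant: a per-cost weight vector emphasizes part of
    the discrete cost values (e.g., large bottleneck costs)."""
    merged = defaultdict(float)
    for cost in set(hist_a) | set(hist_b):
        a = alpha_per_cost.get(cost, default_alpha)
        merged[cost] = (1.0 - a) * hist_a.get(cost, 0.0) \
                       + a * hist_b.get(cost, 0.0)
    return dict(merged)

# Usage: blend an old estimate with a newly observed histogram.
old = {1: 0.5, 3: 1.5}
new = {1: 1.0, 5: 1.0}
print(weighted_average(old, new, alpha=0.1))
print(vector_weighted_average(old, new, {5: 0.5}, default_alpha=0.1))
```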
6 CONCLUSIONS
We investigated applying a leximin-based criterion to the Q-learning method in order to consider the worst case and the equality among individual cost/reward values in an episode. The experimental results show that the criterion is effective in several classes of problems, while more consideration is necessary to apply it to other problem classes.
Our future work will include improvement of the proposed method and investigations into applying the criterion to practical multi-objective and multiagent problems.
ACKNOWLEDGEMENTS
This work was supported in part by JSPS KAKENHI
Grant Number JP19K12117 and Tatematsu Zaidan.
REFERENCES
Barto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learn-
ing to act using real-time dynamic programming. Ar-
tificial Intelligence, 72(1-2):81–138.
Bouveret, S. and Lemaître, M. (2009). Computing leximin-optimal solutions in constraint networks. Artificial Intelligence, 173(2):343–364.
Greco, G. and Scarcello, F. (2013). Constraint satisfac-
tion and fair multi-objective optimization problems:
Foundations, complexity, and islands of tractability.
In Proc. 23rd International Joint Conference on Arti-
ficial Intelligence, pages 545–551.
Hart, P. E., Nilsson, N. J., and Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107.
Hart, P. E., Nilsson, N. J., and Raphael, B. (1972). Correction to 'A formal basis for the heuristic determination of minimum cost paths'. SIGART Newsletter, (37):28–29.
Marler, R. T. and Arora, J. S. (2004). Survey of
multi-objective optimization methods for engineer-
ing. Structural and Multidisciplinary Optimization,
26:369–395.
Matsui, T. (2019). A study of joint policies considering bot-
tlenecks and fairness. In 11th International Confer-
ence on Agents and Artificial Intelligence, volume 1,
pages 80–90.
Matsui, T., Silaghi, M., Hirayama, K., Yokoo, M., and Mat-
suo, H. (2014). Leximin multiple objective optimiza-
tion for preferences of agents. In 17th International
Conference on Principles and Practice of Multi-Agent
Systems, pages 423–438.
Matsui, T., Silaghi, M., Hirayama, K., Yokoo, M., and Mat-
suo, H. (2018). Study of route optimization consider-
ing bottlenecks and fairness among partial paths. In
10th International Conference on Agents and Artifi-
cial Intelligence, pages 37–47.
Matsui, T., Silaghi, M., Okimoto, T., Hirayama, K., Yokoo,
M., and Matsuo, H. (2015). Leximin asymmetric mul-
tiple objective DCOP on factor graph. In 18th In-
ternational Conference on Principles and Practice of
Multi-Agent Systems, pages 134–151.
Russell, S. and Norvig, P. (2003). Artificial Intelligence: A
Modern Approach (2nd Edition). Prentice Hall.
Sen, A. K. (1997). Choice, Welfare and Measurement. Har-
vard University Press.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.