
near-optimal solutions in terms of cost (7.13). Coupled with only a 0.46% difference relative to the GR model, the DiCE Mish model emerges as an efficient alternative for routing problems, balancing solution quality with computational efficiency.
5 CONCLUSION AND FUTURE WORK
This work presented a novel approach to solving the CVRP using GATs trained under the RL paradigm with the DiCE estimator. Our primary contribution is the elimination of the dual-actor structure commonly employed in traditional methods such as REINFORCE with a GR baseline, resulting in lower computational costs without compromising solution quality.
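To make the single-actor objective concrete, the sketch below shows one minimal way the DiCE surrogate loss can be implemented in PyTorch for a batch of sampled CVRP routes. The function names, tensor shapes, and the assumption that the decoder exposes per-step log-probabilities are illustrative, not the exact interface of our model.

```python
import torch


def magic_box(tau: torch.Tensor) -> torch.Tensor:
    """DiCE MagicBox operator: equals 1 in the forward pass, while its
    gradient w.r.t. the policy parameters is that of tau itself."""
    return torch.exp(tau - tau.detach())


def dice_loss(log_probs: torch.Tensor, route_cost: torch.Tensor) -> torch.Tensor:
    """Single-actor surrogate loss for a batch of sampled CVRP routes.

    log_probs:  (batch, steps) log-probabilities of the decoder's actions.
    route_cost: (batch,) total length of each sampled route.
    """
    tau = log_probs.sum(dim=1)                      # joint log-probability per route
    surrogate = magic_box(tau) * route_cost.detach()
    return surrogate.mean()                         # minimised by gradient descent
```

The forward value of this surrogate is simply the mean sampled route cost, while its first-order gradient coincides with the score-function (REINFORCE) gradient without a baseline, so no critic network and no second greedy decoding pass are required.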
The experiments indicate that, by using the DiCE estimator, the developed GAT models obtain near-optimal solutions while reducing training time and computational cost compared to more traditional techniques such as the actor-critic model. Specifically, the architectures employing the DiCE estimator required training times approximately 30% lower than those of REINFORCE with Greedy Rollout. Moreover, the DiCE method not only makes the model more efficient in terms of time but also simplifies the training process by eliminating the need to optimise an actor and a critic simultaneously.
This line of research opens up important challenges for future work. One significant development would be the implementation of a warm-start strategy, which has the potential to reduce the computation time of exact models by providing sub-optimal solution values during initialisation, as sketched below. This strategy is particularly relevant in large-scale optimisation, where a good initial solution can significantly reduce the problem's search space. Additionally, combining traditional optimisation techniques, such as dynamic programming and branch and bound, with machine learning models could lead to hybrid solutions that leverage the strengths of each approach. Finally, techniques such as Transfer Learning could be used to apply knowledge gained from solving smaller or less complex instances to larger or more complex ones without requiring new training.
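To illustrate the warm-start idea, the sketch below injects routes produced by a learned model into the OR-Tools routing solver (Google, 2023) as an initial assignment before the search is launched. The data layout and the helper solve_with_warm_start are illustrative assumptions rather than part of the present implementation.

```python
from ortools.constraint_solver import pywrapcp


def solve_with_warm_start(distance_matrix, demands, vehicle_capacities, initial_routes):
    """Warm-start the OR-Tools CVRP search with routes from a learned model.

    initial_routes: one list of customer nodes per vehicle (depot excluded).
    Distances and demands are cast to int, as required by OR-Tools.
    """
    manager = pywrapcp.RoutingIndexManager(
        len(distance_matrix), len(vehicle_capacities), 0)  # depot = node 0
    routing = pywrapcp.RoutingModel(manager)

    def distance_cb(from_index, to_index):
        i, j = manager.IndexToNode(from_index), manager.IndexToNode(to_index)
        return int(distance_matrix[i][j])

    transit = routing.RegisterTransitCallback(distance_cb)
    routing.SetArcCostEvaluatorOfAllVehicles(transit)

    def demand_cb(from_index):
        return int(demands[manager.IndexToNode(from_index)])

    demand_idx = routing.RegisterUnaryTransitCallback(demand_cb)
    routing.AddDimensionWithVehicleCapacity(
        demand_idx, 0, vehicle_capacities, True, "Capacity")

    params = pywrapcp.DefaultRoutingSearchParameters()
    params.time_limit.FromSeconds(10)

    # Inject the learned solution as the starting assignment and refine it.
    routing.CloseModelWithParameters(params)
    initial = routing.ReadAssignmentFromRoutes(initial_routes, True)  # may be None if infeasible
    return routing.SolveFromAssignmentWithParameters(initial, params)
```

In this setting the local search starts from the learned solution instead of a constructive heuristic, so the time limit is spent improving an already feasible, near-optimal assignment.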
REFERENCES
Applegate, D. L., Bixby, R. E., Chvátal, V., and Cook, W. J. (2006). The traveling salesman problem. http://www.math.uwaterloo.ca/tsp/concorde.html. Accessed: October, 2023.
Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. (2018). Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
Bello, I., Pham, H., Le, Q. V., Norouzi, M., and Bengio, S. (2016). Neural combinatorial optimization with reinforcement learning. CoRR, abs/1611.09940.
Christofides, N. (1976). Worst-case analysis of a new heuristic for the travelling salesman problem. Report 388.
Dai, H., Khalil, E. B., Zhang, Y., Dilkina, B., and Song, L. (2017). Learning combinatorial optimization algorithms over graphs. CoRR, abs/1704.01665.
Foerster, J., Farquhar, G., Al-Shedivat, M., Rocktäschel, T., Xing, E. P., and Whiteson, S. (2018). DiCE: The infinitely differentiable Monte Carlo estimator.
Google (2023). OR-Tools. https://developers.google.com/optimization. Accessed: October, 2023.
Gori, M., Monfardini, G., and Scarselli, F. (2005). A new model for learning in graph domains. Volume 2, pages 729–734.
Helsgaun, K. (2000). An effective implementation of the Lin–Kernighan traveling salesman heuristic. European Journal of Operational Research, 126(1):106–130.
Hopfield, J. and Tank, D. (1985). Neural computation of decisions in optimization problems. Biological Cybernetics, 52:141–152.
Kool, W., van Hoof, H., and Welling, M. (2019). Attention, learn to solve routing problems!
Lei, K., Guo, P., Wang, Y., Wu, X., and Zhao, W. (2021). Solve routing problems with a residual edge-graph attention neural network. CoRR, abs/2105.02730.
Misra, D. (2020). Mish: A self regularized non-monotonic activation function.
Schulman, J., Heess, N., Weber, T., and Abbeel, P. (2016). Gradient estimation using stochastic computation graphs.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks.
Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. 12:1057–1063.
Talbi, E.-G. (2009). Metaheuristics: From Design to Implementation. Wiley Publishing.
Toth, P. and Vigo, D. (2014). Vehicle Routing: Problems, Methods, and Applications, Second Edition. Society for Industrial and Applied Mathematics, USA.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2023). Attention is all you need.
Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph attention networks. ArXiv, abs/1710.10903.
Vinyals, O., Fortunato, M., and Jaitly, N. (2017). Pointer networks.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.