Figure 11: Moves of the RL Agents (20 Steps per Episode) in the Machine Space. The Episode Trajectories Are Randomly Sampled across the Three Trained RL Agents.
The learned optimisation strategy is visualised in Figure 11. As in the case with 110 steps per episode, the RL agent first moves the WP out of the axis collision area in large steps and then gradually approaches the optimal points.
Further decreasing the number of training steps per episode reduces the stability of the results: depending on the WP initialisation point, the RL agent may no longer be able to reach the optimal point reliably.
5 CONCLUSION AND OUTLOOK
We formalised the problem of finding an optimal WP clamping position as an MDP, set up a training environment for an RL agent using an approximation of the SinuTrain milling simulation built from stacked ML models, and demonstrated that an RL agent can solve the WP clamping position optimisation task. Through a number of evaluations we showed that a trained RL agent generalises to new, previously unseen WPs of similar geometry. In the case of a 3-axis machine tool, it finds a valid and near-optimal clamping position for an unseen WP within a limited number of optimisation steps and without additional training, provided that the WP geometry is explicitly described as part of the state space.
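To make this setup concrete, the following is a minimal sketch of such a training environment with an OpenAI-Gym-style interface. It is not the authors' implementation: the class name ClampingPositionEnv, the surrogate callable standing in for the stacked ML approximation of the SinuTrain simulation, the two-dimensional relative-shift action, and the simple collision/cost reward are illustrative assumptions.

```python
import numpy as np
import gym
from gym import spaces


class ClampingPositionEnv(gym.Env):
    """Sketch of the clamping-position MDP: the state is the current WP
    position on the machine table plus (optional) WP geometry features,
    an action shifts the position, and the reward comes from a surrogate
    of the milling simulation (stacked ML models)."""

    def __init__(self, surrogate, wp_features, max_steps=20):
        super().__init__()
        # Callable approximating the simulation: (position, wp_features) -> (collision, cost)
        self.surrogate = surrogate
        self.wp_features = np.asarray(wp_features, dtype=np.float32)
        self.max_steps = max_steps
        # Action: normalised relative (dx, dy) shift of the WP on the table.
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        # Observation: current clamping position plus WP geometry features.
        obs_dim = 2 + self.wp_features.size
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(obs_dim,), dtype=np.float32)

    def reset(self):
        self.position = np.random.uniform(-1.0, 1.0, size=2).astype(np.float32)
        self.steps = 0
        return self._obs()

    def step(self, action):
        shift = 0.1 * np.asarray(action, dtype=np.float32)
        self.position = np.clip(self.position + shift, -1.0, 1.0)
        self.steps += 1
        collision, cost = self.surrogate(self.position, self.wp_features)
        # Penalise axis collisions, otherwise reward a low predicted cost.
        reward = -10.0 if collision else -float(cost)
        done = self.steps >= self.max_steps
        return self._obs(), float(reward), done, {}

    def _obs(self):
        return np.concatenate([self.position, self.wp_features]).astype(np.float32)
```

An environment of this form can then be trained with standard continuous-control algorithms, for example SAC as provided by Stable Baselines.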
This paper is a proof-of-concept work demonstrat-
ing that RL can be applied in complex optimisation
tasks from the field of mechanical engineering, such
as the search for an optimal WP clamping position.
In this study, we considered only a simple 3-axis milling machine use case. In future work, the demonstrated results can be transferred to more complex WPs requiring processing on a 5-axis milling machine. Another important direction for future research is a solution able to handle a variety of WPs without requiring a set of hand-crafted features describing the WP. For example, we can define the optimisation task as a Partially Observable Markov Decision Process (POMDP) in which WP-related features are learned indirectly by the RL agent from simulation feedback at the beginning of the optimisation process. Possible solutions for such a POMDP optimisation task include frame stacking (sketched below), RL with Recurrent Neural Networks, or meta-learning methods. To progress towards a production-capable solution, we would need to improve the RL training efficiency and the transferability of RL-learned solutions to new WPs and to machine tools with complex axis movements, with little or no extra training required.
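As an illustration of the frame-stacking option, the following is a minimal sketch of an observation wrapper that stacks the last k observations, assuming a Gym-style environment such as the one sketched above; the class name FrameStackWrapper and the stack size are illustrative assumptions, not part of the original study.

```python
from collections import deque

import numpy as np
import gym
from gym import spaces


class FrameStackWrapper(gym.ObservationWrapper):
    """Stack the last k observations so that a memoryless policy can infer
    hidden, WP-related information from recent simulation feedback."""

    def __init__(self, env, k=4):
        super().__init__(env)
        self.k = k
        self.frames = deque(maxlen=k)
        low = np.tile(env.observation_space.low, k)
        high = np.tile(env.observation_space.high, k)
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(obs)
        return np.concatenate(self.frames).astype(np.float32)

    def observation(self, obs):
        # Called automatically by gym for every step() observation.
        self.frames.append(obs)
        return np.concatenate(self.frames).astype(np.float32)
```

Alternatively, a recurrent policy (e.g. an LSTM-based actor-critic) or a vectorised frame-stacking wrapper from an RL library could serve the same purpose.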