Figure 11: Moves of the RL Agents (20 Steps per Episode) in the Machine Space. The Episode Trajectories Are Randomly Sampled across the Three Trained RL Agents.
The learned optimisation strategy is visualised in Figure 11. As in the case with 110 steps per episode, the RL agent first moves the WP out of the axis collision area in large steps and then gradually approaches the optimal points.
Further decreasing the number of training steps per episode reduces the stability of the results: depending on the WP initialisation point, the RL agent may no longer be able to reach the optimal point reliably.
5 CONCLUSION AND OUTLOOK
We formalised the problem of finding an optimal WP clamping position as an MDP, set up a training environment for an RL agent using an approximation of the SinuTrain milling simulation built from stacked ML models, and demonstrated that an RL agent can solve the WP clamping position optimisation task. Through a number of evaluations we showed that a trained RL agent generalises to new, previously unseen WPs of similar geometry. In the case of a 3-axis machine tool, it finds a valid and near-optimal clamping position for an unseen WP within a limited number of optimisation steps and without additional training, provided that the WP geometry is explicitly described as part of the state space.
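To make this setup concrete, the following is a minimal sketch of such a training environment with an OpenAI-Gym-style interface. It is not the authors' implementation: the class name ClampingPositionEnv, the surrogate callable standing in for the stacked ML approximation of the SinuTrain simulation, the two-dimensional relative-shift action, and the simple collision/cost reward are illustrative assumptions.

```python
import numpy as np
import gym
from gym import spaces


class ClampingPositionEnv(gym.Env):
    """Sketch of the clamping-position MDP: the state is the current WP
    position on the machine table plus (optional) WP geometry features,
    an action shifts the position, and the reward comes from a surrogate
    of the milling simulation (stacked ML models)."""

    def __init__(self, surrogate, wp_features, max_steps=20):
        super().__init__()
        # Callable approximating the simulation: (position, wp_features) -> (collision, cost)
        self.surrogate = surrogate
        self.wp_features = np.asarray(wp_features, dtype=np.float32)
        self.max_steps = max_steps
        # Action: normalised relative (dx, dy) shift of the WP on the table.
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        # Observation: current clamping position plus WP geometry features.
        obs_dim = 2 + self.wp_features.size
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(obs_dim,), dtype=np.float32)

    def reset(self):
        self.position = np.random.uniform(-1.0, 1.0, size=2).astype(np.float32)
        self.steps = 0
        return self._obs()

    def step(self, action):
        shift = 0.1 * np.asarray(action, dtype=np.float32)
        self.position = np.clip(self.position + shift, -1.0, 1.0)
        self.steps += 1
        collision, cost = self.surrogate(self.position, self.wp_features)
        # Penalise axis collisions, otherwise reward a low predicted cost.
        reward = -10.0 if collision else -float(cost)
        done = self.steps >= self.max_steps
        return self._obs(), float(reward), done, {}

    def _obs(self):
        return np.concatenate([self.position, self.wp_features]).astype(np.float32)
```

An environment of this form can then be trained with standard continuous-control algorithms, for example SAC as provided by Stable Baselines.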
This paper is a proof-of-concept work demonstrat-
ing that RL can be applied in complex optimisation
tasks from the field of mechanical engineering, such
as the search for an optimal WP clamping position.
In this study, we considered only a simple 3-axis milling machine use case. In future work, the demonstrated results can be transferred to more complex WPs requiring processing on a 5-axis milling machine. Another important direction for future research is a solution able to handle a variety of WPs without requiring a set of hand-crafted features describing the WP. For example, we can define the optimisation task as a Partially Observable Markov Decision Process (POMDP) in which WP-related features are learned indirectly by the RL agent from simulation feedback at the beginning of the optimisation process. Possible solutions for such a POMDP optimisation task include frame stacking (sketched below), RL with Recurrent Neural Networks, or meta-learning methods. To progress towards a production-capable solution, we would need to improve the RL training efficiency and the transferability of RL-learned solutions to new WPs and to machine tools with complex axis movements, with little or no extra training required.
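As an illustration of the frame-stacking option, the following is a minimal sketch of an observation wrapper that stacks the last k observations, assuming a Gym-style environment such as the one sketched above; the class name FrameStackWrapper and the stack size are illustrative assumptions, not part of the original study.

```python
from collections import deque

import numpy as np
import gym
from gym import spaces


class FrameStackWrapper(gym.ObservationWrapper):
    """Stack the last k observations so that a memoryless policy can infer
    hidden, WP-related information from recent simulation feedback."""

    def __init__(self, env, k=4):
        super().__init__(env)
        self.k = k
        self.frames = deque(maxlen=k)
        low = np.tile(env.observation_space.low, k)
        high = np.tile(env.observation_space.high, k)
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(obs)
        return np.concatenate(self.frames).astype(np.float32)

    def observation(self, obs):
        # Called automatically by gym for every step() observation.
        self.frames.append(obs)
        return np.concatenate(self.frames).astype(np.float32)
```

Alternatively, a recurrent policy (e.g. an LSTM-based actor-critic) or a vectorised frame-stacking wrapper from an RL library could serve the same purpose.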