ACKNOWLEDGEMENTS
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC-2023 Internet of Production – 390621612.
REFERENCES
Aydin, M. and Öztemel, E. (2000). Dynamic job-shop scheduling using reinforcement learning agents. Robotics and Autonomous Systems, 33(2-3):169–178.
Buker, D. W. (2001). Inventory management and control. In Maynard, H. B. and Zandin, K. B., editors, Maynard's industrial engineering handbook, McGraw-Hill standard handbooks, pages 1591–1614. McGraw-Hill, New York.
Duffie, N., Bendul, J., and Knollmann, M. (2017). An analytical approach to improving due-date and lead-time dynamics in production systems. Journal of Manufacturing Systems, 45:273–285.
ElMaraghy, H., Schuh, G., ElMaraghy, W., Piller, F., Schönsleben, P., Tseng, M., and Bernard, A. (2013). Product variety management. CIRP Annals, 62(2):629–652.
Gabel, T. (2009). Multi-agent reinforcement learning approaches for distributed job-shop scheduling problems.
Gabel, T. and Riedmiller, M. (2008). Adaptive reactive job-shop scheduling with reinforcement learning agents. International Journal of Information Technology and Intelligent Computing, 24(4):14–18.
Garey, M. R. and Johnson, D. S. (1979). Computers and intractability, volume 174. Freeman, San Francisco.
Gyulai, D., Pfeiffer, A., Nick, G., Gallina, V., Sihn, W., and Monostori, L. (2018). Lead time prediction in a flow-shop environment with analytical and machine learning approaches. IFAC-PapersOnLine, 51(11):1029–1034.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
Howard, R. A. (1960). Dynamic programming and Markov processes.
Jacobs, F. R. (2011). Manufacturing planning and control for supply chain management. McGraw-Hill, New York, APICS/CPIM certification edition.
Rinnooy Kan, A. H. G. (2012). Machine scheduling problems: classification, complexity and computations. Springer Science & Business Media.
Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014.
Kurbel, K. (2016). Enterprise Resource Planning und Supply Chain Management in der Industrie: Von MRP bis Industrie 4.0. De Gruyter Studium. De Gruyter, Berlin/Boston, 8th, fully revised and expanded edition.
Laterre, A., Fu, Y., Jabri, M. K., Cohen, A.-S., Kas, D., Hajjar, K., Dahl, T. S., Kerkeni, A., and Beguir, K. (2018). Ranked reward: Enabling self-play reinforcement learning for combinatorial optimization.
Lödding, H. (2013). Handbook of Manufacturing Control. Springer Berlin Heidelberg, Berlin, Heidelberg.
Mather, H. and Plossl, G. (1978). Priority fixation versus throughput planning. Journal of Production and Inventory Management, (19):27–51.
Perron, L. and Furnon, V. (2020). OR-Tools.
Qu, S., Wang, J., Govil, S., and Leckie, J. O. (2016). Optimized adaptive scheduling of a manufacturing process system with multi-skill workforce and multiple machine types: An ontology-based, multi-agent reinforcement learning approach. Procedia CIRP, 57:55–60.
Rey, D. and Neuhäuser, M. (2011). Wilcoxon-Signed-Rank Test. Springer Berlin Heidelberg.
Schneckenreither, M. and Haeussler, S. (2018). Reinforcement learning methods for operations research applications: The order release problem. In International Conference on Machine Learning, Optimization, and Data Science, pages 545–559.
Schuh, G., Prote, J.-P., Sauermann, F., and Franzkoch, B. (2019). Databased prediction of order-specific transition times. CIRP Annals, 68(1):467–470.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Sotskov, Y. and Shakhlevich, N. V. (1995). NP-hardness of shop-scheduling problems with three jobs. Discrete Applied Mathematics, 59(3):237–266.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
Waschneck, B., Reichstaller, A., Belzner, L., Altenmüller, T., Bauernhansl, T., Knapp, A., and Kyek, A. (2018). Optimization of global production scheduling with deep reinforcement learning. Procedia CIRP, 72(1):1264–1269.
Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, King's College, Cambridge.
Zhang, W. and Dietterich, T. G. (1995). A reinforcement learning approach to job-shop scheduling. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2, pages 1114–1120.
Zhang, W. and Dietterich, T. G. (1996). High-performance job-shop scheduling with a time-delay TD(λ) network. In Advances in neural information processing systems, pages 1024–1030.
Zijm, H. and Regattieri, A. (2019). Manufacturing planning and control systems. In Zijm, H., Klumpp, M., Regattieri, A., and Heragu, S., editors, Operations, Logistics and Supply Chain Management, pages 251–271. Springer International Publishing, Cham.