
cal optima, further research suggests that approaches based on machine utilization may hold more promise for optimizing JSPs, particularly in terms of computational efficiency.
ACKNOWLEDGEMENTS
This work has been supported by the FAIRWork project (www.fairwork-project.eu) and has been funded within the European Commission’s Horizon Europe Programme under contract number 101069499. This paper expresses the opinions of the authors and not necessarily those of the European Commission. The European Commission is not liable for any use that may be made of the information contained in this paper.
REFERENCES
Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. (2016). Unifying count-based exploration and intrinsic motivation. Advances in Neural Information Processing Systems, 29.
Błażewicz, J., Ecker, K. H., Pesch, E., Schmidt, G., Sterna, M., and Weglarz, J. (2019). Handbook on scheduling: From theory to practice. Springer.
Fisher, H. (1963). Probabilistic learning combinations of local job-shop scheduling rules. Industrial scheduling, pages 225–251.
Kemmerling, M. (2024). Job shop scheduling with neural Monte Carlo Tree Search. PhD thesis, Rheinisch-Westfälische Technische Hochschule Aachen.
Kemmerling, M., Abdelrazeq, A., and Schmitt, R. H. (2024a). Solving job shop problems with neural Monte Carlo tree search. In ICAART (3), pages 149–158.
Kemmerling, M., Lütticke, D., and Schmitt, R. H. (2024b). Beyond games: a systematic review of neural Monte Carlo tree search applications. Applied Intelligence, 54(1):1020–1046.
Nasuta, A., Kemmerling, M., Lütticke, D., and Schmitt, R. H. (2023). Reward shaping for job shop scheduling. In International Conference on Machine Learning, Optimization, and Data Science, pages 197–211. Springer.
Oren, J., Ross, C., Lefarov, M., Richter, F., Taitler, A., Feldman, Z., Di Castro, D., and Daniel, C. (2021). Solo: search online, learn offline for combinatorial optimization problems. In Proceedings of the International Symposium on Combinatorial Search, volume 12, pages 97–105.
Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. (2017). Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, pages 2778–2787. PMLR.
Samsonov, V., Kemmerling, M., Paegert, M., Lütticke, D., Sauermann, F., Gützlaff, A., Schuh, G., and Meisen, T. (2021). Manufacturing control in job shop environments with reinforcement learning. In ICAART (2), pages 589–597.
Savinov, N., Raichuk, A., Marinier, R., Vincent, D., Pollefeys, M., Lillicrap, T., and Gelly, S. (2018). Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274.
Schmidhuber, J. (1991). A possibility for implementing curiosity and boredom in model-building neural controllers.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Tassel, P. P. A., Gebser, M., and Schekotihin, K. (2021). A reinforcement learning environment for job-shop scheduling. In 2021 PRL Workshop–Bridging the Gap Between AI Planning and Reinforcement Learning.
Zhang, C., Song, W., Cao, Z., Zhang, J., Tan, P. S., and Chi, X. (2020). Learning to dispatch for job shop scheduling via deep reinforcement learning. Advances in Neural Information Processing Systems, 33:1621–1632.
APPENDIX
# Discount factor
gamma: 0.99013
# Factor for the trade-off of bias vs. variance for the
# Generalized Advantage Estimator
gae_lambda: 0.9
# Whether to normalize the advantage or not
normalize_advantage: True
# Number of epochs when optimizing the surrogate loss
n_epochs: 28
# The number of steps to run for each environment
# per update
n_steps: 432
# The maximum value for the gradient clipping
max_grad_norm: 0.5
# The learning rate of the PPO algorithm
learning_rate: 6e-4
policy_kwargs:
  net_arch:
    # Hidden layers of the policy network
    pi: [90, 90]
    # Hidden layers of the value function network
    vf: [90, 90]
  # Whether to use orthogonal initialization or not
  ortho_init: True
  # Activation function of the networks
  activation_fn: torch.nn.ELU
  optimizer_kwargs:
    # For the Adam optimizer
    eps: 1e-7
Listing 1: PPO hyperparameter tuning results for the ft06
instance.
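The parameter names in Listing 1 follow the interface of the Stable-Baselines3 PPO implementation (Schulman et al., 2017). As a minimal sketch, not the training script used in this work, they could be passed to the library as follows; env stands for a job shop scheduling environment such as the one of Tassel et al. (2021), and build_ppo is a hypothetical helper introduced only for illustration.

import torch
from stable_baselines3 import PPO


def build_ppo(env):
    # Hypothetical helper: passes the Listing 1 hyperparameters to the
    # Stable-Baselines3 PPO constructor. "env" is assumed to be a job
    # shop scheduling Gym environment.
    policy_kwargs = dict(
        net_arch=dict(pi=[90, 90], vf=[90, 90]),  # policy / value hidden layers
        ortho_init=True,                          # orthogonal weight initialization
        activation_fn=torch.nn.ELU,               # activation of both networks
        optimizer_kwargs=dict(eps=1e-7),          # epsilon of the Adam optimizer
    )
    return PPO(
        "MlpPolicy",
        env,
        gamma=0.99013,
        gae_lambda=0.9,
        normalize_advantage=True,
        n_epochs=28,
        n_steps=432,
        max_grad_norm=0.5,
        learning_rate=6e-4,
        policy_kwargs=policy_kwargs,
    )


# Example usage (training budget chosen arbitrarily, not taken from the paper):
# model = build_ppo(env)
# model.learn(total_timesteps=100_000)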