Offline Feature-Based Reinforcement Learning with Preprocessed Image Inputs for Liquid Pouring Control

Stephan Pareigis¹, Jesus Eduardo Hermosilla-Diaz², Jeeangh Jennessi Reyes-Montiel², Fynn Luca Maaß³, Helen Haase¹, Maximilian Mang¹ and Antonio Marin-Hernandez²

¹ Department of Computer Science, HAW Hamburg, Berliner Tor 7, 20099 Hamburg, Germany
² Artificial Intelligence Research Institute, Universidad Veracruzana, Calle Paseo No. 112, Xalapa, Mexico
³ Department of Computer Science, Graz University, Inffeldgasse 16, 8010 Graz, Austria
Keywords:
Offline Reinforcement Learning, Pouring Liquid, Artificial Neural Network, Robust Control, UR5 Robot
Manipulator.
Abstract:
A method for the creation of a liquid pouring controller is proposed, based on experimental data gathered from
a small number of experiments. In a laboratory configuration, a UR5 robot arm equipped with a camera near
the end effector holds a container. The camera captures the liquid pouring from the container as the robot
adjusts its turning angles to achieve a specific pouring target volume.
The proposed controller applies image analysis in a preprocessing stage to determine the liquid volume pouring
from the container at each frame. This calculated volume, in conjunction with an estimated target volume in
the receiving container, serves as input for a policy that computes the necessary turning angles for precise
liquid pouring. The data recorded on the physical system is used as Monte-Carlo episodes for training an
artificial neural network using a policy gradient method.
Experiments with the proposed method are conducted using a simple simulation. Convergence proves to be
fast and the achieved policy is independent of initial and goal volumes.
1 INTRODUCTION
Developing an optimal control for a non-linear sys-
tem usually requires a detailed model of the plant or
process. Reinforcement learning (Sutton and Barto, 2018) is a method for developing a controller, provided the controlling agent has the opportunity to explore and experiment with different actions and receives corresponding rewards.
Creating sufficiently detailed models of the sys-
tem to be controlled is often costly and difficult. Often, the optimal control (policy) developed using a
simulation of the system does not work in practice
due to imprecise modeling of the system dynamics
and sensor data.
There are different approaches to mitigate the so-
called simulation to reality gap (Zhao et al., 2020).
One method may be to explicitly model the sim-to-reality gap, e.g. by including artificial sensor noise.
However, this requires detailed knowledge about the
system as a whole to anticipate possible differences
between the simulation and the physical system.
Given this problem, it is usually better to collect data and run experiments directly on the physical system. However, this is often not feasible: physical systems are much slower than simulations, so the large number of experiments typically required in reinforcement learning cannot be carried out. In addition, safety aspects often play a role, for example in autonomous driving, robotics or aviation.
Strategies have been developed to cope with the problem that simulations sometimes lack the necessary precision while physical systems lack the speed to run many experiments. One class of methods relies on a teacher demonstrating the task. Inverse reinforcement learning (IRL) (Arora and Doshi, 2021), for example, tries to derive a reward function
from the demonstrations and use it to train a con-
troller.
We introduce a laboratory setup in which a few
predefined experiments are carried out in the task of
pouring liquids. The data from those experiments is
stored accordingly as described in section 3.1. Later,
this data is used to train a reinforcement learning con-
troller in an offline manner. As the training algorithm
does not have direct access to the system, exploration
becomes impossible.
In the given laboratory setup, a camera is mounted
on top of the end effector of a UR5 robot arm. A con-
tainer with a liquid is held in the end effector. The task
is to turn the end effector such that a precise volume
of liquid is poured out. The system therefore con-
sists of a camera image as input space, and angles of
the end effector (one-dimensional) as output space. A
maximal reward is awarded if a given output volume
is met precisely.
The proposed controller uses two steps: In the first
step, the camera image is analyzed using image pro-
cessing software. A measure for the liquid quantity
leaving the container is calculated. This measure can
be considered to be a feature of the liquid as seen in
the image.
In the second step, this feature is input into an ar-
tificial neural network (ANN) that serves as a policy
model for determining pouring angles. This two-step
process, utilizing pre-processed images, offers the ad-
vantage of a reduced input dimension for the ANN.
Consequently, a smaller ANN can be selected, re-
sulting in faster training and inference compared to
a high-dimensional input space.
As action space, we use discretized relative turn-
ing angles of the end effector. The fact that the an-
gles are chosen to be relative enables the method
to be independent of the initial volume in the pour-
ing container. This approach of relative and discretized rotation angles builds on an idea from our work on robust autonomous driving (Pareigis and Maaß, 2023).
There is a variety of publications on pouring liquids. We mention only a few that are closely related to our approach.
(Schenck and Fox, 2017) presents a solution for
pouring specific amounts of liquids based on imagery,
independent of the initial volume of liquid in the
source container. A camera films the target container
and the images are fed into a two-stage neural net-
work.
(Moradi et al., 2021) apply Soft Actor-Critic (SAC) with a Convolutional Neural Network (CNN) as a reward approximator. The task consists of picking up, moving and pouring the entire liquid from one container to another.
An alternative approach employing the Actor-Critic method is presented by (Tamosiunaite et al., 2011). They integrate goal learning through an approximate function and shape learning, using a Natural Actor-Critic (NAC) and a Policy Improvement with Path Integrals (PI2) technique.
While the approaches mentioned earlier demon-
strate success within their specific contexts, chal-
lenges arise when applying them in real-world sce-
narios. For instance, the method proposed by (Ta-
mosiunaite et al., 2011) relies on prior knowledge of
the quantity of liquid held by the robot, which can vary. To ad-
dress this limitation, we will employ discretized rela-
tive angles as action space, ensuring that the pouring
process remains independent of the initial volume in
the source container.
Another limitation of both (Schenck and Fox,
2017) and (Moradi et al., 2021) is their reliance on
volume estimation of liquid in the recipient container
through imagery. Consequently, it becomes essen-
tial that the recipient container remains visible from
a specific angle or is transparent to allow for accurate
liquid volume assessment. This requirement parallels
related research on estimating liquid volumes from
single images, as observed in works such as (Cobo
et al., 2022) and (Liu et al., 2023).
The benefit of observing only the liquid flowing out of the source container, as in the proposed setup, is that liquid can be poured into containers that are not or only partially observable. Additionally, it could be poured onto surfaces or soil, where no accumulation of liquid would be visible, for example when watering plants.
The experimental setup is described in section 2.
Section 3 describes the proposed algorithm in three
steps: Data acquisition in section 3.1, the controller
architecture in section 3.2, and training the controller
in section 3.3. Simple experiments were performed
which are described briefly in section 4. Section 5
concludes with a summary and remarks on current
work.
2 EXPERIMENTAL SETUP AND
REQUIREMENTS
A UR5 robot manipulator is equipped with an end ef-
fector holding a container, denoted as C, containing
an initial liquid volume of V_init. As the end effector is
rotated, the liquid pours out. It is assumed that the re-
ceiving container, designated as R, is adequately sized
to capture all of the poured liquid.
In Figure 1, the side view of the setup is illus-
trated, along with a reference coordinate system. The
Intel RealSense camera is mounted on the robotic
wrist and is aligned to capture a continuous video of
the liquid as it exits the container.
Figure 2 shows the system from the top and from
the front.
Figure 1: A local reference frame, denoted as O_L, is positioned at the center of the container base. The wrist is initially in position 0°.
Figure 2: Experimental design: (a) top view of the materials used and (b) front view of the pouring process.
The goal is to pour out a certain volume V_G of liquid. This requires an appropriate function α(t) for steering the angle of the end effector in time. This function cannot be computed ahead of time (open-loop control) because the initial volume of the liquid in the container C is not known.
Consequently, container C needs to be rotated until the liquid starts pouring out of it. Based on the volume observed by the camera positioned atop the end effector, the angle α must be continuously adjusted so as to meet the desired goal volume, denoted as V_G.
A scale is used to measure the weight w of the liq-
uid in the receiving container R at the end of the pour-
ing sequence. This way, the received volume of the
liquid can be measured to within a certain precision. The current weight of the liquid on the scale may not be used in real time by the feedback algorithm. It is as-
sumed that the scale may only be read after the pour-
ing process. The reason for this is that an ordinary
kitchen scale is used which has a delay and no digital
output.
It is furthermore assumed that a simulation of the setup is not available with reasonable precision. Only basic experiments with the real setup can be performed, and these experiments are slow to carry out. However, experiments with the real setup provide a ground-truth dependency between images, pouring angles and output volume.
The task shall be to design a set of experiments
with the laboratory setup and use this data in an offline
manner to develop a controller which will then be able
to pour out liquids with a given goal volume V_G.
3 CONTROLLER
ARCHITECTURE
First, the method to collect data from the real labora-
tory setup is described. In section 3.2 the architecture
of the controller is presented. Section 3.3 describes
the training and setup of the controller.
3.1 Data Acquisition
The laboratory setup as shown in figure 2 is prepared
with an arbitrary amount V_init of initial liquid in the
container C inside the gripper of the UR5. The angle
α is initially set to 0. The recipient (a regular coffee
cup) is initially empty and the scale is set to zero.
A pre-programmed movement of the end effector
is applied. The movement results from the function
α(t) := λ · e^(−γ·(t−β)²)    (1)
applied to the angle of the end effector. λ, γ and
β are parameters to change the height and duration of
the curve. Figure 3 shows different angle curves in
time for various parameters λ, γ and β.
Basically, any kind of angle curve in time which
starts at angle 0 and returns to angle 0 could be used
for these experiments, e.g. λ·sin(γ·t). However, equa-
tion (1) has the advantage that the acceleration is slow
so as not to create unnecessary turbulence in the liquid
while pouring out.
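As a brief illustration, the following Python sketch samples such a bell-shaped angle trajectory from equation (1); the parameter values and the 30 fps sampling rate are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def angle_curve(t, lam=60.0, gamma=0.5, beta=6.0):
    """Bell-shaped angle profile alpha(t) = lam * exp(-gamma * (t - beta)^2).

    Starts near 0 deg, peaks at lam degrees at t = beta, and returns to ~0 deg.
    The parameter values here are purely illustrative.
    """
    return lam * np.exp(-gamma * (t - beta) ** 2)

# Sample the trajectory at an assumed camera frame rate of 30 fps over 12 s.
fps = 30
t = np.arange(0.0, 12.0, 1.0 / fps)
alpha = angle_curve(t)                      # wrist angle per frame
delta_alpha = np.diff(alpha, prepend=0.0)   # relative angles, later discretized
```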
After returning to zero, the resulting weight on the
scale is read and registered. In the following, a single experiment consisting of a sequence of angles, camera frames and a resulting weight shall be called an experimental episode E.
During each experimental episode the images
taken from the camera together with the respective an-
gle α are stored. This way each experimental episode
E leads to a list L_E of the following format:

L_E = {(Image_i, α(i)) | i = 0, 1, 2, ...}.
Figure 3: Different angle functions in time as applied to the end effector. These curves increase the angle very slowly in order to generate as few waves as possible.
A typical frame rate for storing images from the cam-
era and respective angle could be 30 fps.
In each experimental episode the resulting weight
as read on the scale, w_E, is also stored, such that a data set

E = (L_E, w_E)

is obtained for every experimental episode.
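A minimal sketch of how such an episode could be stored is given below; the class and field names are our own, and images are assumed to be available as NumPy arrays.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class Episode:
    """One experimental episode E = (L_E, w_E)."""
    frames: List[Tuple[np.ndarray, float]] = field(default_factory=list)  # L_E: (Image_i, alpha(i))
    final_weight: float = 0.0                                             # w_E read from the scale

    def record(self, image: np.ndarray, alpha: float) -> None:
        """Append one (Image_i, alpha(i)) pair, e.g. at 30 fps."""
        self.frames.append((image, alpha))
```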
3.2 Controller Architecture
We propose a two-stage setup. In the first stage, the volume V_e (estimated volume) of liquid leaving the container as seen in a frame is calculated. OpenCV is used to perform this preprocessing step.
As can be seen in figure 4 (c) and (d), the liquid
leaving the container has different forms depending
on the volume of the stream. The container always
has the same position in the image because the camera
is attached to the end effector. The chess pattern in
the background is used for demonstration purposes to
show the real angle of the end effector.
We propose a measure to describe the amount of
liquid leaving the container. Since the real flow of
liquid cannot be measured, a 2-dimensional geomet-
ric approach is chosen. The details are described in
section 3.3.1.
In the second stage, the estimated volume V_e (as described above) of the liquid in each frame is used as an input for a policy network. The policy network receives the estimated volume V_e and a required goal volume D to be filled up as inputs. The output of the policy network is a discretized relative angle Δα, where α is the angle of the actuator of the robot arm which holds the container with the liquid. The control in each time step is then applied as α ← α + Δα.
Figure 4: Frames (a) to (d) show the pouring process from the eye-in-hand perspective. Pictures (c) and (d) show the different forms of the pouring stream.
The total feedback control setup is then

Image → OpenCV → V_e

extracting the information V_e from the image, and then using V_e and D to obtain Δα,

(V_e, D) → MLP → Δα

where MLP is a Multilayer Perceptron.
In each frame, the remaining volume D to be poured into the receiving container is reduced by the volume V_e of the triangle:

D ← D − V_e    (2)

As the current volume in the receiving container cannot be seen by the camera, the variable D serves as an estimate of the remaining volume to be filled up.
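A minimal sketch of this two-stage feedback loop is shown below. The image-processing stage is only a placeholder, since the OpenCV pipeline is described in section 3.3.1; the `camera`, `policy` and `set_wrist_angle` callables are assumptions for illustration.

```python
import numpy as np

def estimate_outflow_volume(frame: np.ndarray) -> float:
    """Placeholder for the OpenCV preprocessing stage: returns the measure V_e
    of liquid leaving the container in this frame (see section 3.3.1)."""
    raise NotImplementedError

def pouring_control_loop(policy, camera, goal_volume, set_wrist_angle):
    """Two-stage feedback loop: Image -> OpenCV -> V_e, then (V_e, D) -> MLP -> delta_alpha."""
    alpha, d = 0.0, goal_volume
    for frame in camera:                       # one iteration per camera frame
        v_e = estimate_outflow_volume(frame)   # stage 1: image preprocessing
        delta_alpha = policy(v_e, d)           # stage 2: policy network output
        alpha += delta_alpha                   # alpha <- alpha + delta_alpha
        d -= v_e                               # equation (2): D <- D - V_e
        set_wrist_angle(alpha)
```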
3.3 Training of Controller
The data collected from the experimental setup serves
as the training data for both components of the con-
troller: first, for estimating the liquid volume in each
frame as V_e, and second, for training the angle controller, which takes into account the frame's volume V_e and the remaining goal volume D.
3.3.1 Volumes from Images
OpenCV is used to approximate the volume of the
poured liquid in each frame. This is done by calcu-
lating the 2D area of the triangle as shown in figure
4 (c) and (d). The y-axis aligns with the left side of
the container. The x-axis runs along the top edge of
the glass. The yellow dotted line runs along the liquid (intersecting the x-axis and y-axis). We denote by V_e the area of the triangle (yellow, red and green lines) in figure 4 (c) and (d). This way, a finite sequence of values {V_e^1, V_e^2, ..., V_e^k} is obtained during an experimental episode. All of the values V_e^i contribute to the final output volume w_E of an experimental episode. The contribution of each V_e^i depends on the number of frames and the frame rate applied. We define a normalized contribution volume V̂_e^i of each frame i as

V̂_e^i := w_E · V_e^i / Σ_{j=1}^{k} V_e^j

Note that the normalized contribution volumes add up to the total volume:

Σ_{i=1}^{k} V̂_e^i = w_E
These normalized contribution volumes are used as an
input to the MLP described in the next section.
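A small sketch of this normalization step, assuming the per-frame triangle areas V_e^i and the final scale reading w_E are already available; the function name and example numbers are our own.

```python
import numpy as np

def normalized_contribution_volumes(triangle_areas, final_weight):
    """Scale the per-frame triangle areas so that they sum to the measured
    output volume w_E of the episode (section 3.3.1):
        V_hat_i = w_E * V_i / sum_j V_j
    """
    areas = np.asarray(triangle_areas, dtype=np.float64)
    return final_weight * areas / areas.sum()

# Example: five frames with triangle areas (in pixels^2) and 120 g on the scale.
v_hat = normalized_contribution_volumes([0.0, 350.0, 900.0, 400.0, 50.0], 120.0)
assert np.isclose(v_hat.sum(), 120.0)   # the contributions add up to w_E
```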
3.3.2 Offline RL for Angle Control
Policy. The second part of the algorithm consists of a
multi-layer perceptron (MLP) which works as a pol-
icy
π : O → {−10, ..., 10},   (v_t, v_{t−1}, v_{t−2}, d) ↦ Δα
mapping the observation space O to a discretized rel-
ative angle ∆α of the end effector.
Observation Space. The observation space is de-
fined as
O := [0, V_max]³ × [−D_max, D_max]    (3)

The observation space O ⊂ R³ × R is 4-dimensional (four real-valued input neurons), where V_max in equation 3 is the maximal normalized contribution volume as measured in phase 1 of the algorithm, and D_max is the maximal volume of the receiving container.
The first three arguments v_t, v_{t−1}, v_{t−2} of the observation space are a sequence of the last three normalized contribution volumes as described in section 3.3.1.
The fourth argument d of the observation space stands for the volume which remains to be filled up. Initially, if d = 0, the controller shall do nothing and stay in the initial position. To activate the controller and start the pouring process, d is set to the desired goal volume

d ← V_G    (4)

to be poured into the receiving container. In each frame d is reduced according to equation 2. If the remaining volume to be poured eventually reaches 0, the controller shall rotate the joint back to its initial position.
Since it may happen that the receiving container receives too much liquid, negative values for d shall also be allowed.

Figure 5: Distribution of the discretization of the relative angles of the end effector. Relative angles close to zero are discretized finer, relative angles close to −5° and 5° are discretized coarser.
Action Space. The action space consists of the fol-
lowing 21 output neurons
{i | i = −10, ..., 10}.

Each output neuron is mapped to a discretized relative angle Δα_i according to figure 5. The relative angle will then be applied to the angle α of the end effector according to α ← α + Δα_i.
Figure 5 shows the discretization of the relative angles. Relative angles close to zero are discretized finer to allow more precise control.
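As an illustration, one possible non-uniform discretization in the spirit of figure 5 is sketched below (finer near zero, coarser towards ±5°). The exact bin values of the paper are not reproduced here, so these numbers are assumptions.

```python
import numpy as np

# Illustrative non-uniform bins: finer near 0 deg, coarser towards +/-5 deg.
_positive_bins = [0.1, 0.2, 0.3, 0.5, 0.7, 1.0, 1.5, 2.5, 3.5, 5.0]
ANGLE_BINS = np.array([-b for b in reversed(_positive_bins)] + [0.0] + _positive_bins)
assert len(ANGLE_BINS) == 21   # one relative angle per output neuron i = -10, ..., 10

def action_to_delta_alpha(i: int) -> float:
    """Map a discrete action i in {-10, ..., 10} to a relative angle in degrees."""
    return float(ANGLE_BINS[i + 10])

# Applying the control in one time step: alpha <- alpha + delta_alpha_i
alpha = 0.0
alpha += action_to_delta_alpha(3)   # a small positive tilt increment
```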
Reward. A total reward function denoted as r is es-
tablished for every completed pouring process. It is
presumed that the pouring process concludes either
after a predefined number of frames or when all the
liquid has been emptied from container C. Let d rep-
resent the remaining volume that needs to be filled by
the end of the pouring process. When d reaches 0, it
signifies that the target volume V_G has been precisely achieved.
Define r(d) as
r(d) := e^(−δ·d²)    (5)
where δ < 1 is typically a small number which defines
the width of the reward function around the desired
goal value. Note that a reward is given only at the end
of the sequence.
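The terminal reward of equation (5) is straightforward to compute; a tiny sketch, with the value of δ chosen arbitrarily for illustration:

```python
import numpy as np

def terminal_reward(d_final: float, delta: float = 0.01) -> float:
    """Equation (5): r(d) = exp(-delta * d^2); maximal (1.0) when the remaining
    volume d is exactly zero at the end of the pouring sequence."""
    return float(np.exp(-delta * d_final ** 2))
```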
Training. With the defined observation space, action
space, and reward structure, a feedback controller can
be trained using reinforcement learning techniques.
In this scenario, only offline data, as outlined in sec-
tion 3.1, is accessible, and direct interaction between
ICINCO 2023 - 20th International Conference on Informatics in Control, Automation and Robotics
324
the agent and the physical system is not possible.
Consequently, traditional exploration methods are not
viable. Our proposed approach involves employing a
policy gradient method based on Monte-Carlo policy
gradient techniques, akin to the REINFORCE method
as described in (Williams, 1992).
We enrich the sequences obtained from the exper-
iments with the physical system described in section
3.1 to generate training sequences.
Training Data Generation. We propose the fol-
lowing method to generate a sufficiently rich set of
training data. A sequence from a single experimen-
tal episode E may be used as multiple episodes for
training. Given an experimental episode

L_E = {(Image_i, α(i)) : i}

together with a final volume w_E, training sequences are generated as follows:
1. From each Image_i calculate the normalized contribution volume of the liquid pouring from the container as described in section 3.3.1 to obtain a sequence {V̂_e^1, ..., V̂_e^k}.
2. For each i = 1, 2, ..., k, calculate the relative angles as Δα(i) := α(i) − α(i−1). Apply the discretization of the relative angles as explained in figure 5 to obtain a sequence of values Δα_i ∈ {−10, ..., 10}.
3. Choose an initial goal volume V_G for this experimental episode according to the following rule:

V_G(κ) := κ · w_E,   κ = 0.5, ..., 1.5.    (6)

For each κ a different goal volume is created. Therefore, for each κ a different training sequence is created. E.g. for κ = 1.0, the goal volume V_G(κ) is equal to the volume w_E from this experimental episode. Therefore a perfect pouring process is obtained which meets exactly the goal volume V_G. For κ = 0.5, the goal volume is V_G = w_E/2. This creates a training sequence where too much liquid is poured into the goal container: w_E/2 is desired, but w_E is obtained. For κ = 1.5 we obtain a training sequence in which too little liquid is poured into the receiving container.
4. Calculate the sequence d_i of remaining liquid volume to be poured. For i = 0 choose d_0 = V_G(κ) from equation 6 and calculate

d_{i+1} := d_i − V̂_e^i

to obtain a sequence {d_0, d_1, ..., d_k}.
5. Calculate the reward of the sequence using definition 5:

R_E := r(d_k)
6. To create a baseline, take all total rewards R_{E,i} from all experimental episodes and calculate their mean R̄_E and standard deviation σ_E to receive normalized rewards

R̂_{E,i} := (R_{E,i} − R̄_E) / σ_E    (7)
7. For every state ω_i and action Δα_i from the experimental episodes, take the one-hot encoded action I_i = (0, ..., 0, 1, 0, ..., 0) ∈ {0, 1}^21, where the 1 stands at the position which represents the action taken in the state ω_i in the experimental episode.
8. Apply a softmax function to the output of the multi-layer perceptron to receive a probability distribution p_i over the 21 possible actions. Train the artificial neural network in the respective state ω_i, applying R̂_{E,i} · I_i, i.e. set the desired output Y (label) of the artificial neural network to

Y := (1 − λ · R̂_{E,i}) · p_i + λ · R̂_{E,i} · I_i    (8)

where λ ∈ (0, 1) is a learning factor. Use the categorical cross-entropy loss function to train the neural network.
Observe that equation 8 describes a function {−10, ..., 10} → R in which all values add up to 1, and negative values may occur. This is due to p_i and I_i being probability distributions, while R̂_{E,i} may be negative.
The training of the multi-layer perceptron is done in suitably sized batches, with samples taken randomly from the experimental episodes.
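A condensed sketch of steps 6-8 is given below, using tf.keras for the MLP. The layer sizes, optimizer, λ = 0.1, and the small ε added to σ_E are illustrative assumptions, and the states ω_i are assumed to be prepared as 4-dimensional observation vectors per equation (3).

```python
import numpy as np
import tensorflow as tf

N_ACTIONS = 21   # discrete relative-angle actions i = -10, ..., 10

def build_policy(n_inputs: int = 4) -> tf.keras.Model:
    """Small MLP with a softmax head over the 21 actions, trained with
    categorical cross-entropy as in step 8."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_inputs,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(N_ACTIONS, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model

def normalize_rewards(total_rewards: np.ndarray) -> np.ndarray:
    """Step 6 / equation (7): baseline-normalize the total rewards of all
    training sequences (a small epsilon guards against a zero deviation)."""
    return (total_rewards - total_rewards.mean()) / (total_rewards.std() + 1e-8)

def train_on_sequence(model, observations, actions, reward_hat, lam=0.1):
    """Steps 7-8 for one training sequence with normalized reward reward_hat:
    Y = (1 - lam * R_hat) * p_i + lam * R_hat * I_i, fitted with cross-entropy."""
    p = model.predict(observations, verbose=0)                        # softmax outputs p_i
    one_hot = tf.keras.utils.to_categorical(np.asarray(actions) + 10, N_ACTIONS)  # I_i
    targets = (1.0 - lam * reward_hat) * p + lam * reward_hat * one_hot
    model.train_on_batch(observations, targets)
```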
Observe that REINFORCE is actually an on-policy method. To use REINFORCE with offline data, a correction using importance sampling usually has to be made to account for the distributional shift, because the sampling data is taken from a different distribution than the one being optimized, see e.g. (Liu et al., 2019), (Levine et al., 2020), (Kallus and Uehara, 2020).
In the case described in this paper, the samples from the experimental episodes do not correspond to a particular policy. Therefore such corrections cannot be made, or are difficult to introduce. The method is expected to work similarly to cross-entropy methods, see e.g. (Kroese et al., 2005).
4 EXPERIMENTAL RESULTS
Experiments were conducted using a simple simula-
tion for pouring liquids. The general properties and
functioning of the algorithm could thus be demonstrated.
A simulation has been implemented which gener-
ates volumes V (corresponding to V_e as described in
section 3) of triangles depending linearly on the angle
α and the remaining volume in the container.
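A toy simulator of this kind could look as follows; the linear outflow coefficient, termination rule, and default volumes are our own assumptions, not the values used in the experiments.

```python
import numpy as np

def simulate_episode(policy, v_init=200.0, goal=80.0, rate_per_deg=0.05, max_steps=500):
    """Toy pouring simulation: the per-frame outflow V depends linearly on the
    tilt angle alpha and on the liquid remaining in the source container."""
    alpha, remaining, d = 0.0, v_init, goal
    for _ in range(max_steps):
        outflow = rate_per_deg * max(alpha, 0.0) * (remaining / v_init)
        outflow = min(outflow, remaining)
        remaining -= outflow
        d -= outflow                                   # equation (2)
        alpha = max(0.0, alpha + policy(outflow, d))   # policy returns a relative angle
        if remaining <= 0.0:
            break
    return d   # remaining goal volume; d = 0 means the target was met exactly
```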
A multi-layer perceptron with two hidden layers
and 2000 parameters is trained. To simplify the ex-
periments, only two actions are used: Δα ∈ {−1, 1}.
A single experimental episode of the robot arm is
used as an offline training sequence, creating multiple
training sequences by applying random goal volumes
according to equation 6.
Experiments show a fast convergence. The result-
ing policy network pours liquid within the simulation
matching an arbitrary given goal volume.
5 CONCLUSION
A method has been outlined for constructing a con-
troller tasked with pouring a specified quantity of liq-
uid into a receiving container.
The approach comprises two primary steps: first, a
preprocessing phase that extracts relevant image fea-
tures, followed by the implementation of a policy net-
work. Importantly, the policy network operates with
a low input dimension, as image preprocessing is ap-
plied using a separate image processing tool.
In addition, only ground truth data measured from
the real laboratory setup is used. Therefore, no simu-
lation of the setup is required to train the policy net-
work. The policy network is trained in an offline man-
ner using the data from the laboratory setup.
A valuable aspect of the proposed approach is its
capacity to derive multiple training sequences from a
single experimental sequence.
The method presented in this paper exhibits ro-
bustness in handling variations in the initial volumes
within the source container, achieved through control
of relative pouring angles. Moreover, the method ex-
clusively measures the liquid exiting the source con-
tainer, enabling its applicability in scenarios where
the liquid within the target container is not visible or
measurable, such as when watering plants.
The method has been implemented within a pour-
ing liquid simulation and shows fast convergence and
independence of goal volumes. Next, the data from
the physical system will be included to generate a
controller for the UR5 robot arm.
REFERENCES
Arora, S. and Doshi, P. (2021). A survey of inverse
reinforcement learning: Challenges, methods and
progress. Artificial Intelligence, 297:103500.
Cobo, M., Heredia, I., Aguilar, F., Lloret Iglesias, L.,
García, D., Bartolomé, B., Moreno-Arribas, M. V., Yuste, S., Pérez-Matute, P., and Motilva, M.-J. (2022).
Artificial intelligence to estimate wine volume from
single-view images. Heliyon, 8(9):e10557.
Kallus, N. and Uehara, M. (2020). Statistically efficient off-
policy policy gradients.
Kroese, D., Mannor, S., and Rubinstein, R. (2005). A tu-
torial on the cross-entropy method. Annals of Opera-
tions Research, 134.
Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline
reinforcement learning: Tutorial, review, and perspec-
tives on open problems. CoRR, abs/2005.01643.
Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E.
(2019). Off-policy policy gradient with state distribu-
tion correction.
Liu, Z., Liu, F., Zeng, Q., Yin, X., and Yang, Y. (2023). Es-
timation of drinking water volume of laboratory ani-
mals based on image processing. Scientific Reports,
13.
Moradi, H., Masouleh, M. T., and Moshiri, B. (2021).
Robots learn visual pouring task using deep reinforce-
ment learning with minimal human effort. pages 504–
510. Institute of Electrical and Electronics Engineers
Inc.
Pareigis, S. and Maaß, F. L. (2023). Improved Robust Neu-
ral Network for Sim2Real Gap in System Dynamics
for End-To-End Autonomous Driving. LNEE Series
published by Springer, to appear.
Schenck, C. and Fox, D. (2017). Visual closed-loop control
for pouring liquids. In 2017 IEEE International Con-
ference on Robotics and Automation (ICRA), pages
2629–2636.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learn-
ing: An Introduction. The MIT Press, second edition.
Tamosiunaite, M., Nemec, B., Ude, A., and Wörgötter, F.
(2011). Learning to pour with a robot arm combining
goal and shape learning for dynamic movement prim-
itives. Robotics and Autonomous Systems, 59:910–
922.
Williams, R. J. (1992). Simple statistical gradient-following
algorithms for connectionist reinforcement learning.
Mach. Learn., 8(3–4):229–256.
Zhao, W., Peña Queralta, J., and Westerlund, T. (2020).
Sim-to-Real Transfer in Deep Reinforcement Learn-
ing for Robotics: a Survey. arXiv e-prints, page
arXiv:2009.13303.