RoboMorph: In-Context Meta-Learning for Robot Dynamics Modeling

Manuel Bianchi Bazzi¹, Asad Ali Shahid², Christopher Agia³, John Alora³, Marco Forgione², Dario Piga², Francesco Braghin¹, Marco Pavone³ and Loris Roveda²,³

¹Department of Mechanical Engineering, Politecnico di Milano, Italy
²Istituto Dalle Molle di Studi Sull'Intelligenza Artificiale (IDSIA USI-SUPSI), Scuola Universitaria Professionale della Svizzera Italiana, DTI, Italy
³Stanford University, U.S.A.
Keywords: Transformers, In-Context Learning, Meta-Learning, Transfer Learning, Deep Learning, Isaac Gym, Robot Dynamics Modeling.
Abstract: The landscape of Deep Learning has experienced a major shift with the pervasive adoption of Transformer-based architectures, particularly in Natural Language Processing (NLP). Novel avenues for physical applications, such as solving Partial Differential Equations and computer vision, have been explored. However, in challenging domains like robotics, where high non-linearity poses significant challenges, Transformer-based applications are scarce. While Transformers have been used to provide robots with knowledge about high-level tasks, few efforts have been made to perform system identification. This paper proposes a novel methodology to learn a meta-dynamical model of a high-dimensional physical system, such as the Franka robotic arm, using a Transformer-based architecture without prior knowledge of the system's physical parameters. The objective is to predict quantities of interest (end-effector pose and joint positions) given the torque signals for each joint. This prediction can be useful as a component of Deep Model Predictive Control frameworks in robotics. The meta-model establishes the correlation between torques and positions and predicts the output for the complete trajectory. This work provides empirical evidence of the efficacy of the in-context learning paradigm, suggesting future improvements in learning the dynamics of robotic systems without explicit knowledge of physical parameters. Code, videos, and supplementary materials can be found at the project website.
1 INTRODUCTION
The field of Deep Learning has undergone a significant transformation with the widespread adoption of Transformer-based architectures (Vaswani et al., 2017), particularly impacting Natural Language Processing (NLP) in generating text (Touvron et al., 2023). Recently, new applications have emerged to solve partial differential equations and computer vision tasks. However, in complex areas such as robotics (Shahid et al., 2022), where systems are highly non-linear, the implementation of Transformer-based solutions remains limited. Transformers have shown success in providing high-level task knowledge to robots, but there has been little progress in using them for learning system dynamics or performing system identification. This paper introduces a new approach to learning a meta-dynamical model for high-dimensional physical systems, such as the Franka robot arm, using a Transformer-based architecture without prior knowledge of the system's physical parameters. The goal is to accurately predict key quantities (end-effector pose and joint positions) based on the torque signals for each joint. Such predictions are valuable for integration into Deep Model Predictive Control frameworks, which are increasingly utilized in robotics. The meta-model is given an initial context that establishes the relation between torques and positions, and it predicts the output for the complete trajectory. Using massively parallel simulations, large datasets representing different robot dynamics are generated in a simulated physics environment (Isaac Gym) to train the meta-model. The effectiveness of this learned model is demonstrated across various types of control inputs. This work demonstrates the use of transformer-based models in learning robot dynamics without any explicit knowledge of physical parameters.
1.1 Related Work
In the domain of Partial Differential Equation (PDE) solving, Transformers are gaining momentum over Physics-Informed Neural Networks (PINNs) and recurrent structures such as LSTMs (Schmidhuber et al., 1997). Transformer-based architectures, coupled with Fourier Neural Operators (Li et al., 2020), demonstrate promising capabilities (Yang et al., 2023). Despite the limited existing literature on meta-learning in robotics, (Gupta et al., 2022) demonstrates the adaptability of Transformers in learning general controllers. Furthermore, Transformers and meta-learning have notably found applications in the domain of fault diagnosis (Chen et al., 2023). (Lin et al., 2022) discusses considerations regarding Transformers and highlights a key challenge: their inefficiency in processing long sequences, primarily due to the computational and memory complexity of the self-attention module.
1.2 Contribution
This paper proposes an approach that utilizes transformer-based architectures to learn a meta-dynamical model of a robot arm. Diverse training datasets are generated using massively parallel simulations in the Isaac Gym simulation framework (Makoviychuk et al., 2021). To generate trajectories for robots in simulation, specific families of Operational Space Control (OSC) tasks were selected, with domain randomization applied to certain parameters. After learning the meta-model, this work investigates the following aspects:

1. Hyper-Parameters and Context: evaluation of how hyper-parameters, such as the number of multi-attention heads, the dimension of the compressed information ($d_{model}$), and the context length, influence prediction accuracy.

2. In-Distribution and Out-of-Distribution Performance: examination of the meta-model's ability to generalize in an out-of-distribution regime, and of the training dataset size required to achieve acceptable results.

3. Transfer Learning on Out-of-Distribution Scenarios: analysis of the impact of pre-training on the transfer of knowledge across different distributions of control actions, including zero-shot and few-shot learning scenarios.
2 METHODS
This paper aims to develop an approach that relies on black-box, model-free simulation of a system, without any prior information about the robotic system.
2.1 Preliminaries
Black Box Model: Black box models describe the
input-output behavior of a system without explicitly
modeling its internal mechanisms. These models are
estimated directly from experimental data.
Simulation Model: Simulation models predict sys-
tem outputs solely from inputs, without relying on
past measured data. They are valuable for designing
controllers and emulating physical systems.
Model-Free Simulation: In this approach, no specific model of the system at hand is needed; the meta-model $\mathcal{M}_\phi$ receives input/output sequences $(u^{(i)}_{1:m}, y^{(i)}_{1:m})$ up to time step $m$ (the context) and a test input sequence (the query) $u^{(i)}_{m+1:N}$ from time $m+1$ to $N$, to produce the corresponding output sequence $\hat{y}^{(i)}_{m+1:N}$:

$$\hat{y}^{(i)}_{m+1:N} = \mathcal{M}_\phi\big(u^{(i)}_{1:m},\, y^{(i)}_{1:m},\, u^{(i)}_{m+1:N}\big). \tag{1}$$
The meta-model is trained by minimizing the mean squared error (MSE) between the open-loop simulated prediction and the ground truth.
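As a concrete reading of Eq. (1), the sketch below shows how the context/query split and the MSE objective fit together; names and tensor shapes are illustrative assumptions, not the authors' code.

```python
import torch

def simulation_loss(meta_model, u, y, m):
    """MSE between open-loop prediction and ground truth, per Eq. (1).
    u: (batch, N, n_u) torques; y: (batch, N, n_y) outputs; m: context length."""
    u_ctx, y_ctx = u[:, :m], y[:, :m]          # context: matched input/output pairs
    u_query = u[:, m:]                          # query: inputs only, steps m+1..N
    y_hat = meta_model(u_ctx, y_ctx, u_query)   # open-loop prediction of y_{m+1:N}
    return torch.nn.functional.mse_loss(y_hat, y[:, m:])
```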
2.2 Model Architecture
The standard Transformer architecture has been adapted to handle real-valued input/output sequences generated by dynamical systems, instead of the sequences of symbols (word tokens) typically used in natural language modeling. Specifically, compared to plain GPT-2 (Radford et al., 2019), the initial token embedding layer is replaced by a linear layer with $n_u + n_y$ inputs and $d_{model}$ outputs, while the final layer is replaced by a linear layer with $d_{model}$ inputs and $n_y$ outputs. This modification allows the Transformer to effectively process continuous-valued sequences, typical of dynamical systems, aligning the model architecture with the requirements of system identification tasks rather than natural language processing.

The architecture is visualized in Figure 1 and consists of (i) an encoder that processes $u_{1:m}, y_{1:m}$ (without causality restriction) and generates an embedding sequence $\zeta_{1:m}$; (ii) a decoder that processes $\zeta_{1:m}$ and the test input $u_{m+1:N}$ (the latter with causality restriction) to produce the sequence of predictions $\hat{y}_{m+1:N}$. A minimal sketch of this adaptation is given after Figure 1.
Figure 1: Encoder-decoder Architecture (Forgione et al.,
2023).
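The following PyTorch sketch illustrates the adaptation described above. It is an assumption-laden stand-in: nn.Transformer replaces the GPT-2-style blocks of (Forgione et al., 2023), positional encodings are omitted for brevity, and all names and defaults are illustrative.

```python
import torch
import torch.nn as nn

class MetaDynamicsModel(nn.Module):
    """Encoder-decoder Transformer over real-valued sequences (sketch)."""
    def __init__(self, n_u=7, n_y=14, d_model=192, n_heads=8, n_layers=12):
        super().__init__()
        # Linear layers replace GPT-2's token embedding and LM head
        self.embed_ctx = nn.Linear(n_u + n_y, d_model)
        self.embed_query = nn.Linear(n_u, d_model)
        self.transformer = nn.Transformer(d_model, n_heads, n_layers, n_layers,
                                          batch_first=True)
        self.head = nn.Linear(d_model, n_y)

    def forward(self, u_ctx, y_ctx, u_query):
        # Encoder: embed matched (u, y) context pairs, no causal mask
        zeta = self.embed_ctx(torch.cat([u_ctx, y_ctx], dim=-1))
        # Decoder: causally masked query inputs, cross-attending to zeta
        tgt = self.embed_query(u_query)
        T = tgt.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        return self.head(self.transformer(zeta, tgt, tgt_mask=causal))
```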
3 EXPERIMENTAL
FRAMEWORK
3.1 Overview
The strength of the proposed approach lies in its flexi-
bility, allowing us to arbitrarily choose the dimensions
of inputs and outputs. In the analyzed experimental
framework, we learn a mapping from a 7-dimensional
input to a 14-dimensional output:
Input: joint torques applied to the 7 degrees-of-freedom (DoF) Franka robot, in [Nm].
Output:
  - end-effector (EE) Cartesian coordinates (x, y, z) in [m];
  - EE quaternion (X, Y, Z, W);
  - joint positions (q_0 ... q_6) in [rad].
Context: 20% of the entire simulation.
Total Time Window: 1000 steps (16.7 s).
The core idea is to train a meta-model by varying the range of parameters influencing the dynamics of the Franka robot. Figure 2 shows the context and the input used to generate the predictions for the complete trajectory.

Figure 2: The proposed meta-model uses the context and the input to perform the prediction. The context length is highlighted in green.
3.2 Domain Randomization
The classic domain randomization approach simulates a broad distribution across domains or Markov Decision Processes (MDPs), aiming to train a model robust enough to perform well in real-world conditions (Hospedales et al., 2020). Our proposed approach applies this idea to modeling a real Franka robot arm: handling real-world scenarios involves addressing deviations from nominal values, such as uncertainties in joint friction and damping characteristics. Below, we highlight some foundational aspects of the proposed domain randomization:
Each link’s mass is randomized according to a
uniform distribution. This strategy increases the
complexity of the problem, prevents overfitting in
the model, and enhances overall robustness.
Initial joint positions are randomized to facili-
tate meta-learning of the system. This variation
aims to enhance the explorability of the robot’s
workspace, rather than strictly mapping specific
control actions or trajectories of the robotic arm.
Unlike typical robotic controllers in the Franka Control Interface (FCI), which compensate for gravity terms and internal joint frictions by default, our control actions include those components. We virtually handle the torque from the motor side, emphasizing that the model needs to gain a deeper understanding of the underlying physical system.
3.2.1 Object of Randomization

Initial Joint Positions: randomized around the mid-positions.
Mass of the Links: each link's mass is uniformly distributed around its nominal value, with a symmetric variation of a certain percentage (±x%).
Joint Stiffnesses: stiffness and damping of the joints are randomized and constrained to fixed values.
Center of Masses: the center of mass is varied along three dimensions.
Frequency (for Random Inputs): this parameter serves as the main frequency component of the overall signal. Its value ranges from 0.1 to 0.25 Hz. A sampling sketch for these quantities follows.
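The sketch below samples one robot instance under these randomizations; all numeric bounds (mass percentage, stiffness values, CoM offsets) are illustrative assumptions, not the exact values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize_robot(nominal_mass):
    """One randomized Franka instance; bounds below are illustrative only."""
    n_links = len(nominal_mass)
    return {
        "q_init": rng.uniform(-0.5, 0.5, 7),                        # around mid-positions [rad]
        "link_mass": nominal_mass * rng.uniform(0.9, 1.1, n_links),  # symmetric +/- 10 % (assumed)
        "joint_stiffness": rng.choice([40.0, 80.0, 120.0], 7),       # drawn from fixed values (assumed)
        "com_offset": rng.uniform(-0.01, 0.01, (n_links, 3)),        # CoM shift along x, y, z [m]
        "f_m": rng.uniform(0.1, 0.25),                               # main frequency component [Hz]
    }
```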
3.3 Task Definition
Tasks are categorized into two main families: synthetic inputs, or trajectories derived from Operational Space Control (OSC), spanning from direct torque commands to desired Cartesian positions. Figure 3 shows example Cartesian trajectories generated with the two types of control inputs.
3.3.1 Synthetic Random Input
In this case, the torque is generated directly by a spe-
cific mathematical function and applied in Nm. Two
types of torque inputs, namely multi-sinusoidal and
chirp, are considered to generate torque profiles for
each joint:
Multi-sinusoidal:

$$u_i = \begin{bmatrix} A_0 & A_1 & A_2 & A_3 \end{bmatrix}_i \begin{bmatrix} \cos(w_0 t) \\ \sin(w_1 t) \\ \cos(w_2 t) \\ \sin(w_3 t) \end{bmatrix}_i, \tag{2}$$

$$\begin{bmatrix} w_0 & w_1 & w_2 & w_3 \end{bmatrix} = \begin{bmatrix} w_0 & 1.5\,w_0 & 2\,w_0 & 3\,w_0 \end{bmatrix}, \tag{3}$$

$$w_0 = 2\pi f_0, \quad f_0 \in \left[\tfrac{f_m}{1.5},\; 1.5\,f_m\right], \tag{4}$$

$$A_0, \dots, A_3 \in \left[-15\,f_m,\; 15\,f_m\right]. \tag{5}$$
Chirp:

$$u_i = A_i \cos\!\left(w_1 \left(1 + \tfrac{1}{4}\cos(w_2 t)\right) t + \varphi\right), \tag{6}$$

$$\varphi \in [-\pi, \pi], \quad q_0 \in [-0.5, 0.5], \quad A_i \in [-4, 4], \tag{7}$$

$$w_1 = 2\pi f_1, \quad w_2 = 2\pi f_2, \tag{8}$$

$$f_1 \in \left[f_m,\; 1.5\,f_m\right], \quad f_2 \in \left[\tfrac{f_m}{1.5},\; 2\,f_m\right]. \tag{9}$$
Figure 4 illustrates random multi-sinusoidal control actions for two joints, with each color representing a different robot. It is important to note that joint 1 experiences higher gravity compensation than joint 0; as a result, the randomness appears less pronounced, and it also varies considerably across the workspace. A generation sketch for both signal types is given below.
Figure 3: 3D trajectories in Cartesian coordinates [m].
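The following sketch generates both signal types from Eqs. (2)-(9); the helper names and the example f_m value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_sinusoidal(t, f_m):
    """Torque profile for one joint per Eqs. (2)-(5)."""
    w0 = 2 * np.pi * rng.uniform(f_m / 1.5, 1.5 * f_m)      # Eq. (4)
    w = w0 * np.array([1.0, 1.5, 2.0, 3.0])                  # Eq. (3)
    A = rng.uniform(-15 * f_m, 15 * f_m, 4)                  # Eq. (5)
    basis = np.stack([np.cos(w[0] * t), np.sin(w[1] * t),
                      np.cos(w[2] * t), np.sin(w[3] * t)])   # Eq. (2)
    return A @ basis

def chirp(t, f_m):
    """Torque profile for one joint per Eqs. (6)-(9)."""
    w1 = 2 * np.pi * rng.uniform(f_m, 1.5 * f_m)             # Eq. (9), f_1
    w2 = 2 * np.pi * rng.uniform(f_m / 1.5, 2.0 * f_m)       # Eq. (9), f_2
    A = rng.uniform(-4.0, 4.0)                               # Eq. (7)
    phi = rng.uniform(-np.pi, np.pi)
    return A * np.cos(w1 * (1 + 0.25 * np.cos(w2 * t)) * t + phi)  # Eq. (6)

t = np.linspace(0.0, 16.7, 1000)   # 1000 steps over the 16.7 s window
tau = np.stack([multi_sinusoidal(t, f_m=0.2) for _ in range(7)])  # one row per joint
```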
3.3.2 Operational Space Control (OSC)

OSC tasks depend on two control gains, $K_p$ and $K_d$, which regulate the responsiveness of the control action in relation to the error with respect to the desired trajectory $x_d$. The control input $u$ is defined as:

$$u = g(q) + J_{ee}^{T}(q)\left[M_{ee}\,\ddot{x}_d + C_{ee}\,\dot{x}_d + K_p\,(x_d - x) + K_d\,(\dot{x}_d - \dot{x})\right].$$
Two operational space tasks are defined, as illustrated in Figure 3 (a transcription of the control law follows the list):

Circle Task: circular trajectories on the YZ plane, with different radii and the same frequency.
Spiral Task: spiral trajectories on the XY plane, performed top-down or bottom-up, at different frequencies and with different radii.
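Below is a direct, hedged transcription of the OSC law above; the dynamics terms (g, J_ee, M_ee, C_ee) are assumed to be supplied by the simulator or an external model, and the argument names are illustrative.

```python
import numpy as np

def osc_torque(g, J_ee, M_ee, C_ee, Kp, Kd, x, dx, x_des, dx_des, ddx_des):
    """Task-space feedforward plus PD feedback, mapped to joint torques."""
    f = M_ee @ ddx_des + C_ee @ dx_des + Kp @ (x_des - x) + Kd @ (dx_des - dx)
    return g + J_ee.T @ f   # gravity term plus Jacobian-transpose mapping
```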
3.4 Data Generation in Isaac Gym
Isaac Gym (Makoviychuk et al., 2021) is NVIDIA's prototype physics simulation environment for RL research. It allows developers to experiment with end-to-end GPU-accelerated RL for physical systems. To the best of the authors' knowledge, no previous contribution mentions Isaac Gym as a simulation environment outside of the RL domain; the use of Isaac Gym in this work to generate large datasets for transformer-based model training is therefore quite novel. Ensuring a trade-off between randomness and feasibility regarding workspace position and required torque is crucial to prevent issues during the training stage. For these reasons, three features/sub-functions have been incorporated to handle spurious robot instances:
Self-Collision and Floor-Collision Detection: to manage self-collision and floor-collision scenarios effectively, a net contact force tensor is employed to filter out colliding robots from the data buffer. Figure 5 shows the detection of such collisions.

Figure 4: Multi-sinusoidal torque profiles for joint 0 and joint 1, respectively, for 20 robots each.

Figure 5: Collision detection visualization.
Position and Torque Saturation Check: training data should ideally consist of numerical signals free from singularities or unrealistically high values. At each time step, simulations in which joint positions reach their limits or torques saturate are excluded from the training dataset.

Exclusion of Quaternion Errors: in certain simulations, unexpected jumps in the acquired quaternion have been observed, possibly due to internal reference shifts within Isaac Gym or singularities. These sudden changes occurred infrequently, and the affected simulations were straightforwardly excluded from the dataset.
The data generation algorithm is shown in Algorithm 1.

Data: num_robots, timesteps
Result: generated data, black-listed robots
Set up simulation parameters;
Generate random torques for all robots' DoFs;
for i ← 1 to num_robots do
    Randomize the dynamical parameters;
end
while t ≤ timesteps do
    Apply torques and step simulation;
    if saturation or collision or self_collision then
        Add robot idx to the black-list;
    else
        Store full pose and torques in buffer;
    end
end
Remove black-listed robots;
Save tensors;
Algorithm 1: Data Generation - Higher Level.
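A Python rendering of Algorithm 1 is sketched below. The `sim` object and its methods (randomize_dynamics, apply_torques, step, saturated, collided, full_pose) are hypothetical stand-ins for the Isaac Gym tensor API, not its actual calls.

```python
import numpy as np

def generate_dataset(sim, num_robots, timesteps, torques):
    """torques: (num_robots, timesteps, 7). Returns data for surviving robots."""
    blacklist = set()
    buffer = np.zeros((num_robots, timesteps, 21))   # 7 torques + 14 outputs
    for i in range(num_robots):
        sim.randomize_dynamics(i)                    # masses, stiffness, q_init, ...
    for t in range(timesteps):
        sim.apply_torques(torques[:, t])
        sim.step()
        for i in range(num_robots):
            if sim.saturated(i) or sim.collided(i):  # incl. self-collision
                blacklist.add(i)                     # spurious instance, drop later
            else:
                buffer[i, t, :7] = torques[i, t]
                buffer[i, t, 7:] = sim.full_pose(i)  # EE pose + joint positions
    keep = [i for i in range(num_robots) if i not in blacklist]
    return buffer[keep]
```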
3.5 Dataset Composition
Figure 6 illustrates the composition of datasets used
for training and fine-tuning based on the type of con-
trol action and the number of simulations. Each
dataset has specific mass and joint randomization
bounds.
Figure 6: Dataset composition.
3.6 Training Loss
As highlighted by (Kirsch et al., 2024), Transformer training often exhibits a substantial plateau problem; one solution the authors propose is to increase the batch size. In this work, each "simulation batch" is composed of over 3000 classes of Franka robots (each with different dynamics), while each training batch consists of 16 robots. This approach aims to promote learning-to-learn as opposed to mere memorization. The loss function for an example training run is shown in Figure 7. A sketch of one training step follows.
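One training step might look like the following sketch, reusing the simulation_loss defined in Section 2.1; the batch assembly across robot classes is the point here, the rest is standard and illustrative.

```python
import torch

def train_step(meta_model, optimizer, dataset_u, dataset_y, m=200, batch_size=16):
    """dataset_u/dataset_y: (num_classes, N, n_u)/(num_classes, N, n_y) tensors."""
    idx = torch.randint(0, dataset_u.shape[0], (batch_size,))  # 16 robot classes per batch
    u, y = dataset_u[idx], dataset_y[idx]
    loss = simulation_loss(meta_model, u, y, m)   # MSE objective of Section 2.1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```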
4 EVALUATION APPROACH
Each model trained on a specific dataset is tested on different simulations, which may be in-distribution or out-of-distribution with respect to the training dataset. Different metrics have been used to compare the prediction performances.
Figure 7: Loss function of an example dataset (for every
100 iterations).
Figure 8: First evaluation approach.
4.1 Metrics
Indexes are calculated over time for each coordinate, then averaged across the output dimensions $n_y$, and finally averaged across a batch of X robots. This type of evaluation is illustrated in Figure 8. The following indexes are used:

Coefficient of Determination: $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$;

Root Mean Square Error: $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$;

Normalized Root Mean Square Error: $NRMSE = \frac{RMSE}{\sigma_y}$;

Fit Index: $FI = 100 \cdot \left(1 - \sqrt{\frac{\sum_{i=1}^{n}(y(t) - \hat{y}(t))^2}{\sum_{i=1}^{n}(y(t) - \bar{y}(t))^2}}\right)$.
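For reference, the four indexes can be computed for a single coordinate's trajectory as in the sketch below (1-D NumPy arrays; merging across trajectories, as in the second evaluation approach described next, amounts to concatenating y and y_hat before the call).

```python
import numpy as np

def metrics(y, y_hat):
    """R^2, RMSE, NRMSE, and fit index for one coordinate's trajectory."""
    sse = np.sum((y - y_hat) ** 2)          # sum of squared errors
    sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
    r2 = 1.0 - sse / sst
    rmse = np.sqrt(np.mean((y_hat - y) ** 2))
    nrmse = rmse / y.std()
    fit_index = 100.0 * (1.0 - np.sqrt(sse / sst))
    return r2, rmse, nrmse, fit_index
```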
The main drawback of this approach is that $R^2$ for "flat" and relatively small signals tends to penalize the single-robot prediction metrics. In the OSC tasks, most of the joints experience small variations compared to the multi-sinusoidal and chirp signals; for this reason, averaging over the output dimensions $n_y$ and then over batches leads to very low mean $R^2$ values. A second evaluation approach was therefore used to investigate the performance on OSC tasks. Instead of averaging over the output dimensions $n_y$ and then over different robots, several trajectories are merged sequentially, coordinate by coordinate, and the indexes are computed on these merged trajectories, as shown in Figure 9. Predictions are performed separately but are considered merged only for the calculation of the metrics. The aim is to better demonstrate the overall capability in different scenarios rather than evaluating the local precision, as in the first evaluation approach.
Figure 9: Second evaluation approach.

$R^2$ tends to assume values very close to 1.0 on the coordinates that already show good capabilities locally. To better evaluate the prediction performance of the models, the fit index has been chosen as the primary metric, due to its better differentiation even when $R^2$ values are particularly high and similar.
5 RESULTS
Influence of the Training Context: tests were performed for contexts ranging from 5% to 50%, with 20% used as the reference for all tests. The quadratic dependence of computational and memory complexity on sequence length in self-attention is a limiting factor for Transformers. Generally, Transformers can accept context lengths different from those used during training, provided that the test context lengths are smaller than the training ones. Accordingly, it is also possible to use the same context length while reducing the prediction horizon, effectively changing the context length in percentage terms.

Varying the prediction horizon while keeping the context window the same as in training reveals an initial transient error. However, overall performance improves as the test horizon approaches the training horizon, as shown in Figures 10 and 11, indicating that the model requires knowledge of whole-trajectory patterns to perform optimally.

Figure 10: Different prediction horizons with the same test context as in training.
Figure 11: $R^2$ in a 200-step context.
Hyper-Parameter Tests: the hyper-parameter values tuned through trial and error are given below; a summary of the grid follows the list. Most other parameters were fixed as in (Forgione et al., 2023). Batch size is not treated as an independent parameter, due to computational constraints.
Loss Function: MSE / Huber;
Model Dimension ($d_{model}$): 192 / 384;
Number of Multi-Attention Heads: 8 / 12;
Number of Layers: 10 / 12 / 16.
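Collected as a configuration grid (illustrative form only; the values are those listed above):

```python
# Hyper-parameter grid explored by trial and error.
search_space = {
    "loss": ["MSE", "Huber"],
    "d_model": [192, 384],
    "n_heads": [8, 12],
    "n_layers": [10, 12, 16],
}
```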
Given training and test datasets from the same distribution, it is possible to investigate how these parameters affect the quality of predictions. Table 1 shows an example demonstrating the impact of the number of layers on performance, using the first evaluation approach. The results highlight that the best performance is achieved with the 12-layer model. Varying the model dimension and the number of multi-attention heads showed comparable results, without significant performance differences. The results are available in the appendix on the project website.
Analysis of Model Performance: the performance analysis of the optimal model is presented in Table 2 (with the related color convention in Table 3), with further details provided below:
In Distribution: for bounded tasks, good results can be achieved on in-distribution tests within 2 hours of training. However, as the number of tasks and frequencies increases, the required training time and model dimensions naturally increase.
Table 1: Impact of the number of layers on prediction accuracies.

10 Layers
Test    R²      σ_R²    RMSE    σ_RMSE
test0   0.682   0.343   0.0530  0.0236
test1   0.750   0.292   0.0465  0.0218
test2   0.755   0.293   0.0458  0.0212

12 Layers
Test    R²      σ_R²    RMSE    σ_RMSE
test0   0.713   0.287   0.0521  0.0242
test1   0.787   0.219   0.0454  0.0219
test2   0.793   0.212   0.0448  0.0213

16 Layers
Test    R²      σ_R²    RMSE    σ_RMSE
test0   0.700   0.303   0.0531  0.0258
test1   0.761   0.262   0.0468  0.0239
test2   0.767   0.255   0.0462  0.0234
Table 2: Comparison between zero-shot, fine-tuning on the spiral, and training from scratch.

        Zero-Shot           FT Spiral (2.5h)    Scratch (9h)
        R²       Fit Index  R²      Fit Index   R²      Fit Index
x       0.98     86         0.998   96.09       0.988   88.92
y       0.99     89.78      1       97.89       0.996   93.47
z       -3.014   -100.34    0.867   63.57       -1.254  -50.13
X       0.986    88.01      0.999   97.56       0.982   86.54
Y       0.989    89.51      1       98.04       0.998   95.72
Z       0.137    7.11       0.953   78.37       0.493   28.76
W       0.191    10.05      0.965   81.39       0.572   34.54
q_0     0.999    96.77      1       98.71       0.998   95.15
q_1     0.966    81.56      0.983   86.8        0.875   64.67
q_2     0.847    60.94      0.997   94.55       0.991   90.37
q_3     -0.669   -29.18     0.969   82.44       0.627   38.9
q_4     0.988    88.96      1       97.87       0.996   93.44
q_5     0.871    64.13      0.999   96.77       0.838   59.81
q_6     0.981    86.11      1       98.05       0.997   94.28
Table 3: Fit index color convention (ranges used for color coding).

Fit Index Range
≥ 90
80 - 89.99
60 - 79.99
30 - 59.99
< 30
Slightly Out of Distribution: the model showed good results, supporting the hypothesis of generalization rather than mere memorization of the training dataset. Regarding mass randomization specifically, its impact appears secondary to the frequency of the tested signal. The meta-model generalizes better to frequencies above its training range than to frequencies below it.
Out of Distribution: The first column of Table 2
displays the performance of the model, which was
trained on a mixed dataset containing both multi-
sinusoidal and chirp signals, and tested on a spiral
task. The performance varies significantly across
different output dimensions, with some showing
good results and others performing poorly.
Fine-Tuning on Variable Spiral: subsequent examinations involved a model trained extensively for 16 hours on a range of different frequencies, including both multi-sinusoidal and chirp signals. This model was then fine-tuned on a Spiral task for 2.5 hours (following the ideas in (Piga et al., 2024)). Results were compared to a model trained from scratch for 9 hours (Table 2, third column). The results show that the fine-tuned model performs significantly better than the one trained from scratch.
Additional Analysis: even after fine-tuning on the Spiral task, challenges persist in the z-coordinate, likely due to the intrinsic complexity of the problem. Spirals, varying in both upward and downward directions, exhibit distinct characteristics in terms of frequencies and elevation rates: combinations of larger radii and slow elevations differ significantly from tasks involving fast elevation changes and small radii.

Fine-tuning on the same task family (multi-sinusoidal) with controlled joint randomization led to a significant increase in overall prediction accuracy, depending on the number of examples provided.

Fine-tuning on Spiral tasks and zero-shot testing on the Circle task resulted in poor performance. This underscores the general difficulty of predicting OSC tasks, particularly during initial transient phases. Direct fine-tuning on the Circle task yielded better results than fine-tuning on Spiral tasks and testing on the Circle task. This observation suggests that the model's effectiveness is constrained to in-distribution (ID) and slightly out-of-distribution (OOD) tasks.
6 CONCLUSIONS
This work tackles the problem of learning a meta-
model of robot dynamics using an encoder-decoder
Transformer architecture. The challenge lies in the
simulation domain, where the meta-model accurately
predicts complex systems over long sequences based
on a 20% context and 80% prediction of the over-
all trajectory. The results indicate that Transformer-
based models can learn dynamics in a zero-shot or
few-shot fashion within control action distributions,
suggesting their potential for use in robotics. The
results also highlight that fine-tuning is advantageous in these scenarios and more practical than training such models from scratch.
Current limitations are mainly related to generalization. Following a black-box approach, generalizing a single robotic arm independently of the type of control action appears structurally unfeasible. Results show clear distinctions between in-distribution (ID) and out-of-distribution (OOD) control actions, highlighting the critical role of the model's inputs.
For future work, the results presented pave the
way for pre-compensating control actions in unknown
systems, particularly where estimating parameters
such as payload, joint stiffness, and damping is chal-
lenging. This approach can be extended beyond
control distributions to encompass Transfer Learning
across diverse robot morphologies.
ACKNOWLEDGMENTS
This paper has received funding from the Hasler
Foundation under the GENERAI (GENerative
Robotics AI) Project.
REFERENCES
Chen, C., Wang, T., Liu, C., Liu, Y., and Cheng, L.
(2023). Lightweight convolutional transformers en-
hanced meta-learning for compound fault diagnosis of
industrial robot. IEEE Transactions on Instrumenta-
tion and Measurement, 72:1–12.
Forgione, M., Pura, F., and Piga, D. (2023). From sys-
tem models to class models: An in-context learning
paradigm. IEEE Control Systems Letters, 7:3513–
3518.
Gupta, A., Fan, L., Ganguli, S., and Fei-Fei, L. (2022).
Metamorph: Learning universal controllers with
transformers. arXiv preprint arXiv:2203.11931.
Hospedales, T., Antoniou, A., Micaelli, P., and Storkey, A.
(2020). Meta-learning in neural networks: A survey.
Kirsch, L., Harrison, J., Sohl-Dickstein, J., and Metz,
L. (2024). General-purpose in-context learning
by meta-learning transformers. arXiv preprint
arXiv:2212.04458.
Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhat-
tacharya, K., Stuart, A., and Anandkumar, A. (2020).
Fourier neural operator for parametric partial differen-
tial equations. arXiv preprint arXiv:2010.08895.
Lin, T., Wang, Y., Liu, X., and Qiu, X. (2022). A survey of
transformers. AI Open.
Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey,
K., Macklin, M., Hoeller, D., Rudin, N., Allshire, A.,
Handa, A., and State, G. (2021). Isaac gym: High
performance gpu-based physics simulation for robot
learning. CoRR, abs/2108.10470.
Piga, D., Pura, F., and Forgione, M. (2024). On the adap-
tation of in-context learners for system identification.
In Proceedings of the 20th IFAC Symposium on Sys-
tem Identification, Boston, MA.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D.,
Sutskever, I., et al. (2019). Language models are un-
supervised multitask learners. OpenAI blog, 1(8):9.
Schmidhuber, J., Hochreiter, S., et al. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Shahid, A. A., Piga, D., Braghin, F., and Roveda, L. (2022).
Continuous control actions learning and adaptation for
robotic manipulation through reinforcement learning.
Autonomous Robots, 46(3):483–498.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.
Yang, L., Liu, S., Meng, T., and Osher, S. J. (2023). In-
context operator learning with data prompts for differ-
ential equation problems. Proceedings of the National
Academy of Sciences, 120(39):e2310142120.