answers invariance changing the point of record of
the camera. We can observe events that are identical
in trajectory but different in semantics: in
practice, their actual meaning can be hidden in
small differences.
6.1 Methodology
To test how effectively the parameters support an
unambiguous recognition of the action, we recorded
the motion of the wrist of subjects while they were
eating, drinking and smoking. The movies were
always recorded from the same distance, using a
fixed experimental setting, so that the trajectories
were very similar to one another. The recorded
sample consists of 25 subjects.
We chose simple actions to carry out and to
study: Eat, Drink and Smoke.
In each action the movement is always performed
by the arm: it starts from the bottom and rises until
it reaches the mouth and stops there, and then it
returns until it stops again on reaching the desk.
The technical apparatus consists of two analogue
television cameras: one in a frontal position and the
other in a lateral position with respect to the
subject. There are also a desk, a chair and, finally,
crackers, water and cigarettes. First of all, the
relevant positions were fixed so that the setting
remained unchanged. We arranged the desk and the
position of the chair relative to it.
The centre of the desk was then taken as the
reference point: one camera was fixed at a distance
of 5 metres and at a shooting angle of approximately
53° for the lateral position, whilst the second was
fixed at 5 metres for the frontal position. The
starting position of the objects (cracker, water and
cigarettes) was also fixed on the table, so that the
gesture had a precise, fixed point of reference.
Finally, to obtain a metric calibration of the
digital images, that is, a correspondence between
real distances and the resolution in pixels after
digitisation, a graduated bar 30 centimetres long
was placed in the scene.
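The calibration step can be illustrated with a minimal sketch. It assumes a single scale factor for the whole image (no perspective correction), which holds only for distances measured in the plane of the bar. The function names and the pixel measurements are hypothetical; only the 30 cm bar length comes from the text.

```python
# Sketch of metric calibration from a reference bar of known length.
BAR_LENGTH_CM = 30.0  # from the text: the graduated bar is 30 cm long


def cm_per_pixel(bar_length_px: float) -> float:
    """Return the scale factor (cm per pixel) from the bar's length in pixels."""
    if bar_length_px <= 0:
        raise ValueError("bar length in pixels must be positive")
    return BAR_LENGTH_CM / bar_length_px


def pixels_to_cm(distance_px: float, scale: float) -> float:
    """Convert a distance measured in the image (pixels) to centimetres."""
    return distance_px * scale


# Hypothetical measurement: the bar spans 60 pixels in the digitised frame.
scale = cm_per_pixel(60.0)                         # 0.5 cm per pixel
wrist_displacement_cm = pixels_to_cm(84.0, scale)  # 42.0 cm
```

With such a factor, any wrist displacement measured in pixels can be expressed in centimetres before further analysis.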
During the recording, each subject wore a
bracelet with two white markers, which produced a
high brightness gradient at the point of the body
that best characterises the movement.
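The text does not describe how the markers were located in each frame. One simple possibility, shown here purely as an assumption, is to threshold the grey-scale frame and take the centroid of the bright pixels as the wrist position. The function name, the threshold value and the toy frame are all hypothetical.

```python
def marker_centroid(frame, threshold=200):
    """Return the (row, col) centroid of pixels brighter than `threshold`.

    `frame` is a grey-scale image given as a list of rows of 0-255 values.
    Returns None when no pixel exceeds the threshold.
    """
    count = row_sum = col_sum = 0
    for r, row in enumerate(frame):
        for c, value in enumerate(row):
            if value > threshold:
                count += 1
                row_sum += r
                col_sum += c
    if count == 0:
        return None
    return (row_sum / count, col_sum / count)


# Toy 4x4 frame: a bright 2x2 marker on a dark background.
frame = [
    [10, 12, 11, 10],
    [10, 250, 255, 10],
    [10, 252, 251, 10],
    [10, 11, 12, 10],
]
centroid = marker_centroid(frame)  # (1.5, 1.5)
```

Applying this to every frame of a clip would yield one (row, col) point per frame, that is, a wrist trajectory in pixel coordinates.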
To extract the data of interest from the movies,
we digitised them (acquisition card: ATI All in
Wonder 128) at 20 frames per second, with a
resolution of 320x240 pixels and a colour depth of
24 bits. We then converted them to grey-scale
levels. We compressed the video sequences with the
Run Length Encoding algorithm, which replaces each
sequence of identical pixels with the number of
times the pixel is repeated, followed by the value
of the pixel itself.
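Run Length Encoding as described above can be sketched in a few lines. This is a generic illustration on one row of grey-scale pixels, not the codec actually used for the movies; the function names are hypothetical.

```python
def rle_encode(pixels):
    """Encode a sequence of pixel values as (count, value) runs."""
    runs = []
    for value in pixels:
        if runs and runs[-1][1] == value:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, value])   # start a new run
    return [tuple(run) for run in runs]


def rle_decode(runs):
    """Rebuild the original pixel sequence from (count, value) runs."""
    pixels = []
    for count, value in runs:
        pixels.extend([value] * count)
    return pixels


row = [0, 0, 0, 255, 255, 17, 0, 0]
encoded = rle_encode(row)  # [(3, 0), (2, 255), (1, 17), (2, 0)]
decoded = rle_decode(encoded)
```

The encoding is lossless: decoding the runs returns the original row. It pays off when frames contain long uniform stretches, such as a dark background.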
Each of the 50 movies (25 for each camera) was
split into 3 shorter movies (each corresponding to a
particular action), each of which was then analysed
individually.
7 EXPERIMENTAL RESULTS
During learning, to obtain comparable results, we
fixed the same number of iteration epochs and the
same learning rate in all the experiments. In
particular, the number of iterations was fixed at
15000 epochs and the learning rate was η = 0.1.
Recalling that each of the three actions was
subdivided into a sequence of ten steps, we
determined the mean square error for each step after
15000 iterations, that is, at the end of learning.
Figure 2 shows the trend of the mean square error
after 15000 iterations: the black line refers to
learning from the three actions of the 25 subjects
in the lateral view, whilst the grey line refers to
learning from the three actions of the 25 subjects
in the frontal view.
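A per-step mean square error of this kind can be computed as in the sketch below. The data layout (a list of sequences, each a list of steps, each step a list of output values) and the function name are assumptions made for illustration; the toy data uses two steps rather than the ten of the experiments.

```python
def mse_per_step(outputs, targets):
    """Return one mean square error per step.

    `outputs` and `targets` have the same nested shape:
    sequences -> steps -> output values. The error of a step is
    averaged over all sequences and all output values at that step.
    """
    n_steps = len(targets[0])
    errors = []
    for step in range(n_steps):
        squared = [
            (o - t) ** 2
            for out_seq, tgt_seq in zip(outputs, targets)
            for o, t in zip(out_seq[step], tgt_seq[step])
        ]
        errors.append(sum(squared) / len(squared))
    return errors


# Toy data: 2 sequences, 2 steps, 1 output value per step.
targets = [[[1.0], [0.0]], [[1.0], [0.0]]]
outputs = [[[1.0], [0.5]], [[0.5], [0.5]]]
step_errors = mse_per_step(outputs, targets)  # [0.125, 0.25]
```

Plotting such a list against the step index gives a curve of the kind described for the lateral and frontal views.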
Considering the training with the 25 subjects of
the lateral view, the average error calculated over
the whole sequence is 0.04. If we divide the
sequence into two phases (going and return), the
average error of the going phase is 0.006, whilst
the average error of the return phase is 0.08.
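The lateral-view figures can be cross-checked with a small calculation. It assumes that the ten steps are split evenly into five going steps and five return steps, which the text does not state; the function name is hypothetical.

```python
def overall_from_phases(going_error, return_error,
                        going_steps=5, return_steps=5):
    """Combine two phase averages into an average over the whole sequence.

    Each phase average is weighted by its number of steps. The default
    5/5 split of the ten steps is an assumption, not taken from the text.
    """
    total_steps = going_steps + return_steps
    weighted = going_error * going_steps + return_error * return_steps
    return weighted / total_steps


# Lateral view: reported phase averages are 0.006 (going) and 0.08 (return).
lateral = overall_from_phases(0.006, 0.08)  # about 0.043
```

Under this assumption the combined value is about 0.043, which is consistent with the reported overall average of 0.04.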
Moreover, when the networks are trained with the
25 subjects of the frontal view, the average error
is 0.03. Dividing the sequence into two phases, the
average error is 0.01 for the going phase and 0.05
for the return phase. These values are
understandable if we consider the “kind” of action.
To make the actions as similar as possible, all the
subjects started the movement from a fixed point,
reached the mouth and then returned to the fixed
starting point. Initially the action is loaded with
meaning, both because of the aim linked to the
action and because of the object grasped by the
hand. The weight of a glass of water is certainly
different from that of a cigarette. So, in the going
phase the actions are quite different, whereas in
the return phase they converge in meaning: all the
movements only have to reach a fixed point. The
consequence is an increase in the average error, so
the unambiguous recognition of a particular action
is closely related to the development of the
sequence.
SIGMAP 2008 - International Conference on Signal Processing and Multimedia Applications