a set of prototypes, f_i, resulting in an extremely sparse
encoding: only the coefficient associated with the nearest prototype equals one. Non-negative Matrix Factorization (NMF) (Zhao et al., 2008) constrains basis vectors and encodings to non-negative values, avoiding cancellation of features and facilitating their interpretability. Non-negative Matrix Factorization with
Sparseness Constraints (NMF-SC) (Hoyer, 2004) is
based on NMF and additionally enforces sparseness
on the encodings and/or the basis components.
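To make the decomposition concrete, the following minimal Python sketch applies scikit-learn's NMF to a placeholder data matrix X; the matrix, the number of components, and the random values are illustrative assumptions, and scikit-learn's L1 regularization options would only be a rough stand-in for the explicit sparseness constraints of NMF-SC.

    import numpy as np
    from sklearn.decomposition import NMF

    # Placeholder data matrix: one non-negative multimodal sample per row
    # (e.g. grasps x concatenated image/posture features).
    rng = np.random.default_rng(0)
    X = rng.random((413, 2000))

    # Decompose X ~ H @ F with non-negative encodings H and basis components F.
    model = NMF(n_components=20, init="nndsvda", max_iter=500)
    H = model.fit_transform(X)   # encodings,        shape (413, 20)
    F = model.components_        # basis components, shape (20, 2000)

    X_hat = H @ F                # non-negative reconstruction of X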
3 APPLICATION TO GRASPING
We compare the presented decomposition approaches
in a grasping scenario to investigate their ability to
find local inter-modal correlations. Based on a dataset
of successful grasps applied to a set of cups, a com-
pact set of basis components is calculated. In a sub-
sequent application step, partial observations are aug-
mented by a reconstruction of the missing modalities.
To this end, encodings are computed based on exist-
ing modalities and missing ones are predicted from
the corresponding linear combination of basis com-
ponents. Finally, the best grasp can be chosen and
realized by a robot hand.
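A minimal sketch of this reconstruction step is given below, assuming a learned basis matrix F (one component per row) and a fixed column layout in which observed and missing modalities occupy known index ranges; the non-negative least-squares fit via SciPy is one possible way to compute the encoding, not necessarily the procedure used by every compared approach.

    import numpy as np
    from scipy.optimize import nnls

    def reconstruct_missing(F, x_obs, obs_idx, mis_idx):
        """Infer a non-negative encoding h from the observed feature columns
        and predict the missing columns as h times the corresponding basis part.

        F       : (k, d) matrix of basis components (rows)
        x_obs   : observed part of one sample, shape (len(obs_idx),)
        obs_idx : column indices of the observed modalities
        mis_idx : column indices of the missing modalities
        """
        # Solve min_h || F[:, obs_idx].T @ h - x_obs ||  subject to h >= 0.
        h, _ = nnls(F[:, obs_idx].T, x_obs)
        # Predict the missing modalities from the same linear combination.
        x_mis = h @ F[:, mis_idx]
        return h, x_mis

For instance, with silhouette and depth columns observed and contact-area and hand-posture columns missing, obs_idx and mis_idx would simply list the respective column ranges, and the predicted hand posture could then be scored to select the best grasp.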
3.1 Capturing of Grasping Data
To gather multimodal information about human grasping
processes, the Manual Interaction Lab was created at
CITEC, Bielefeld (Maycock et al., 2010). For the
work presented in this paper, data from three modal-
ities were captured: hand postures (motion-tracking
coordinates), color video images and depth images.
16 different cup-like objects were selected to record
grasping sequences belonging to three different grasp
types: cup grasped by handle, from above, or from the
side. 413 grasp configurations were captured, com-
prising 8-9 grasps per object and grasp type.
3.2 Preprocessing of Grasping Data
The captured raw sensor data was synchronized and
preprocessed to obtain suitable input data for the
grasp selection task.
Visual Modalities. The grasp for a particular ob-
ject is first and foremost determined by the shape of
the object. A preliminary study using color images
for decomposition resulted in basis components dom-
inated by colors and textures. Hence, we decided to
extract the object silhouette from these images, i.e.
those pixels constituting the object shape. We also re-
moved constant background pixels from all color and depth images, replacing them with zero values.

Figure 1: Modalities: (a) Color video image. (b) Object silhouette and contact areas. (c) Swiss Ranger depth image. (d) Visualization of Vicon coordinates. (e) Grasp type.

Thus,
the decomposition approaches do not need to explic-
itly model these irrelevant image parts. Contact re-
gions on the object silhouette were identified by com-
paring images before and after establishing the grasp.
All depth and color images were centered, cropped
to the foreground region, and resized for normaliza-
tion purposes. The image sizes of the sparse input
modalities were 144 × 100 for the object silhouettes
and contact regions and 61 × 46 for the depth images.
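A rough Python sketch of this kind of preprocessing is shown below; the foreground mask, the thresholded image difference used for contact areas, and the use of skimage for resizing are assumptions made for illustration, not the authors' exact pipeline.

    import numpy as np
    from skimage.transform import resize

    def preprocess_image(img, fg_mask, out_shape):
        """Zero out the constant background, crop to the foreground
        bounding box and resize to a fixed target shape (e.g. the
        silhouette or depth resolutions given above)."""
        img = np.where(fg_mask, img, 0.0)            # background -> zero
        rows, cols = np.nonzero(fg_mask)
        cropped = img[rows.min():rows.max() + 1,
                      cols.min():cols.max() + 1]     # crop to foreground
        return resize(cropped, out_shape, anti_aliasing=True)

    def contact_regions(silhouette_before, silhouette_after, thresh=0.1):
        """Illustrative heuristic: mark silhouette pixels that change
        when the hand establishes the grasp as contact areas."""
        return (np.abs(silhouette_after - silhouette_before) > thresh).astype(float)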
Hand Posture. Hand posture sequences, obtained
from tracking markers on all finger segments and sub-
sequent calculation of the associated hand posture
(Maycock et al., 2011), can be utilized in two ways: using the whole grasping trajectory or the fi-
nal grasp posture only. In preliminary studies, we
found that complete trajectories can be reconstructed
in many cases. However, different grasping speeds
and large variations of hand trajectories prior to ac-
tual grasping sometimes lead to visible dilatation ef-
fects in the reconstructed trajectories. Dynamic Time
Warping (Mühlig et al., 2009) could compensate for
asynchronous execution speeds and might in future
work allow direct “replay” on a robot hand. In this
paper, only the final hand pose is considered, adding
a 27 × 3 dimensional vector of marker positions to the
input data.
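A small sketch of this step, assuming the tracked sequence is stored as a (frames × 27 markers × 3 coordinates) array; the array layout is an assumption.

    import numpy as np

    def final_hand_pose(trajectory):
        """trajectory: array of shape (T, 27, 3) with marker positions per frame.
        Only the final grasp posture is kept and flattened into an
        81-dimensional feature vector (27 markers x 3 coordinates)."""
        return np.asarray(trajectory)[-1].reshape(-1)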
Grasp Type. To distinguish the three employed
grasp types, we could learn three individual sets of
basis components, F_i, employing appropriate subsets
of the training data. However, this strongly reduces
the number of data samples available for decomposi-
tion. Alternatively, a single decomposition could be
applied to the entire training set comprising all grasp
types, which often leads to interference between basis components corresponding to different grasp types. In order to choose a particular grasp type, we augmented all input vectors with an additional modality, employing three-dimensional unit vectors to indicate
the grasp type. Then, we can explicitly request a par-
ticular grasp type by providing the corresponding unit
vector as an additional input to the search process.
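A minimal sketch of this grasp-type modality, with illustrative label names; how the unit vector is concatenated with the remaining (assumed) feature blocks mirrors the description above.

    import numpy as np

    GRASP_TYPES = ["handle", "top", "side"]   # illustrative labels

    def grasp_type_vector(grasp_type):
        """Three-dimensional unit vector indicating the grasp type."""
        e = np.zeros(len(GRASP_TYPES))
        e[GRASP_TYPES.index(grasp_type)] = 1.0
        return e

    # Training: append the unit vector to every multimodal input vector,
    # e.g. x_aug = np.concatenate([x_multimodal, grasp_type_vector("side")]).
    # Query: provide the desired grasp type as an additional observed
    # modality so that only matching basis components are activated.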
This prevents simultaneous activation of basis com-
ponents belonging to different grasp types, thus re-
ducing co-activation of ambiguous local grasps. Fur-