of considered gestures increases. Our Topolet specified the sequence of possible Poselets (rather than postures) within a gesture. One Topolet was trained for each layer, introducing a concrete relation between the different Poselets of that layer; this led to the construction of semi-gestures (the Topolets). Together, the Topolets of all layers captured the temporal information between the different postures. The entire temporal information of any gesture therefore consisted of a small, fixed number of Topolets (eight) operating in parallel. This number remained the same for any configuration and for any number of gestures, which resolved the issue of linear growth.
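To make this parallel structure concrete, the following is a minimal sketch in Python; the names (Topolet, successors, transition_is_consistent) are illustrative assumptions, not the implementation used in our experiments. It shows a fixed set of eight per-layer Topolets, each storing which Poselet may follow which within its layer, checked in parallel for one frame transition:

```python
NUM_LAYERS = 8  # one Topolet per layer, fixed regardless of the gesture count

class Topolet:
    """Successor relation between the Poselets of a single layer."""
    def __init__(self):
        self.successors = {}  # Poselet id -> set of admissible next Poselet ids

    def train(self, poselet_sequence):
        # Record which Poselet followed which in one observed sequence.
        for cur, nxt in zip(poselet_sequence, poselet_sequence[1:]):
            self.successors.setdefault(cur, set()).add(nxt)

    def admissible(self, cur, nxt):
        return nxt in self.successors.get(cur, set())

# The eight Topolets together carry the temporal information of any gesture.
topolets = [Topolet() for _ in range(NUM_LAYERS)]

def transition_is_consistent(prev_poselets, cur_poselets):
    """Check all layers in parallel for one frame-to-frame transition."""
    return all(t.admissible(p, c)
               for t, p, c in zip(topolets, prev_poselets, cur_poselets))
```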
Additionally, our Topolet improved the structure of the training, decoding, and evaluation algorithms. For example, a conventional HMM theoretically requires an infinite number of examples of a gesture to reliably train its parameters with the complex Baum-Welch algorithm (Rabiner, 1989). Our method instead examined the topological relations of the Poselet points, so reliable relations between the postures could be established from a single example.
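Continuing the sketch above, single-example training then reduces to one pass over the observed per-layer Poselet sequences; the example data below are, of course, made up:

```python
# Hypothetical example: one recorded gesture, given as one Poselet-id
# sequence per layer. A single pass establishes the successor relations;
# no iterative Baum-Welch re-estimation is needed.
example_gesture = [[0, 2, 2, 5, 5, 7] for _ in range(NUM_LAYERS)]

for topolet, layer_sequence in zip(topolets, example_gesture):
    topolet.train(layer_sequence)
```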
Finally, within a vision-based approach (RGB camera), and when dealing with a (potentially) infinite number of (randomly created) gestures, we achieved a notable 3 fps in Exp4 and 4 fps in Exp5. We used a naive contour extraction method, which was an expensive part of our pipeline (Table 4). By employing a real-time contour extraction algorithm, e.g., a deep network (Bertasius and Torresani, 2015), a higher frame rate could be achieved on this time-consuming step.
In addition to these demonstrated advantages, our proposed method offers a number of other potential benefits. The atomic structures could be used to construct a comprehensive grammar over the Poselets, Topolets, postures, and gestures. That is, if we consider each gesture as a sentence and each posture as a word, a Poselet can be viewed as a syllable (with each DoF of the Poselet as a letter). Each trained Topolet then specifies how those Poselets may evolve as the gesture proceeds in time. The system would therefore be capable of forming a flexible, description-based specification of a hand gesture database (Wang et al., 2012).
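As an illustration only (the class names are our assumption, not an API from the paper), this analogy maps onto a simple hierarchy of types:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PoseletInstance:               # a "syllable"
    dofs: List[float]                # each DoF value acts as a "letter"

@dataclass
class Posture:                       # a "word"
    poselets: List[PoseletInstance]  # one Poselet per layer

@dataclass
class Gesture:                       # a "sentence"
    postures: List[Posture]          # temporal sequence of postures
```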
Furthermore, during gesture transitions there may be transient (false) gestures, depending on the application context, for which the system should perform no specific action. With our comprehensive temporal model of gestures, one could deactivate parts of the point cloud in each layer and acquire the Topolets on a smaller space. This feature could be used to adapt the proposed Topolet concretely to different application domains.
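A minimal sketch of that idea, assuming a hypothetical per-layer mask over the Poselet points (the helper names are not from our implementation):

```python
def restrict_points(layer_points, active_mask):
    """Keep only the activated points of one layer's cloud."""
    return [p for p, keep in zip(layer_points, active_mask) if keep]

# Deactivate the Poselet points irrelevant to the application domain,
# then acquire the layer's Topolet on the reduced space, e.g.:
#   reduced = restrict_points(layer_points, mask)
#   topolets[layer_id].train(poselet_ids_of(reduced))  # poselet_ids_of: hypothetical
```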
The evaluation of these ideas will be the topic of future work.
ACKNOWLEDGMENT
I would like to thank the Europäischer Sozialfonds für Deutschland (ESF) scholarship and the Professorship of Graphische Datenverarbeitung und Visualisierung at TU-Chemnitz, who made this research possible.
REFERENCES
Bertasius, G. and Torresani, L. (2015). DeepEdge: A Multi-Scale Bifurcated Deep Network for Top-Down Contour Detection. In CVPR.
Bourdev, L. and Malik, J. (2009). Poselets: Body Part Detectors Trained Using 3D Human Pose Annotations. In IEEE Int Conf on Comp Vis (ICCV), pages 1365–1372.
Bradski, G. (2000). The OpenCV Library. Dr. Dobb’s Jour-
nal of Software Tools.
Dadgar, A. and Brunnett, G. (2018). Multi-Forest Classification and Layered Exhaustive Search Using a Fully Hierarchical Hand Posture/Gesture Database. In VISAPP, Funchal.
MacQueen, J. B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. In Proc of the Berkeley Symp on Math, Statist, and Prob, volume 1, pages 281–297.
Meshry, M., Hussein, M. E., and Torki, M. (2016). Linear-
Time Online Action Detection from 3D Skeletal Data
using Bags of Gesturelets. IEEE WACV.
Rabiner, L. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc of the IEEE, 77(2):257–286.
Sangjun, O., Mallipeddi, R., and Lee, M. (2015). Real Time Hand Gesture Recognition Using Random Forest and Linear Discriminant Analysis. In Proc of the Int Conf on Human-Agent Interaction, pages 279–282.
Sharp, T., Keskin, C., Robertson, D., Taylor, J., Shotton, J., Kim, D., Rhemann, C., Leichter, I., Vinnikov, A., Wei, Y., Freedman, D., Kohli, P., Krupka, E., Fitzgibbon, A., and Izadi, S. (2015). Accurate, Robust, and Flexible Real-time Hand Tracking. ACM Conf on Human Factors in Comp Sys (CHI), pages 3633–3642.
Starner, T. E. and Pentland, A. (1995). Visual Recognition of American Sign Language Using Hidden Markov Models. In Proc of the Int Workshop on Automatic Face and Gesture Recognition, pages 189–194.
Wang, J., Liu, Z., Wu, Y., and Yuan, J. (2012). Mining Actionlet Ensemble for Action Recognition with Depth Cameras. In IEEE CVPR, pages 1290–1297.
Yang, J., Xu, Y., and Chen, C. S. (1994). Gesture Interface: Modeling and Learning. In IEEE Proc Int Conf on Robotics and Automation, volume 2, pages 1747–1752.
Yao, B. and Fei-Fei, L. (2010). Grouplet: A Structured Im-
age Representation for Recognizing Human and Ob-
ject Interactions. IEEE Proc CVPR, pages 9–16.
Yu, G., Liu, Z., and Yuan, J. (2015). Discriminative Orderlet
Mining for Real-Time Recognition of Human-Object
Interaction. Lec Notes in Comp Sci (AI & Bioinf),
9007:50–65.