Figures 15 and 16 demonstrate the effectiveness of our
context-aware visual processing approach. During
the speech scenarios, both the coarse-level blobs and
the refined-level blobs are tracked. Figure 15 shows
the tracking results as a participant raises his hand,
and Figure 16 shows the speaker being tracked stably
over a long time span.
Figure 15: Tracking results while the participant is raising
his right hand.
Figure 16: Tracking results of the presenter.
However, our tracking algorithm may generate
errors when two participants walk across each other
and one of their heads is briefly occluded. A
cooperative reasoning structure will be considered to
improve our approach in future work.
7 CONCLUSIONS
In this paper, a visual framework integrating
bottom-up and top-down processing is proposed for
human-centered processing in dynamic contexts.
Coarse-level visual cues concerning human presence
and states in meeting scenarios are extracted, based
on which context analysis is performed through a
Bayesian reasoning approach. Context information is
then applied to control refined visual modules in a
top-down manner. In addition, a novel
hypothesis-verification method is adopted for robust
detection and long-term stable tracking of human
subjects. Experimental results validate our approach.
Spatio-temporal analysis of the hierarchical context
model will be considered in future extensions of this
work.
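The control loop summarized above — coarse cues feeding Bayesian context reasoning, whose posterior then gates the refined modules — can be illustrated with a minimal sketch. All names here (the context states, the binary "speaker standing" cue, the hard-coded likelihoods, and the module gating) are hypothetical assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical bottom-up / top-down loop: a coarse cue updates a discrete
# context posterior, which in turn selects which refined modules to run.

CONTEXT_STATES = ["presentation", "discussion", "break"]

# Assumed likelihoods P(cue = speaker_standing | context); in practice
# these would be learned from data rather than fixed by hand.
LIKELIHOOD = {
    "presentation": 0.9,
    "discussion": 0.3,
    "break": 0.1,
}

def bayes_update(prior, cue_observed):
    """One Bayesian reasoning step over the discrete context states."""
    post = {}
    for s in CONTEXT_STATES:
        p_cue = LIKELIHOOD[s] if cue_observed else 1.0 - LIKELIHOOD[s]
        post[s] = prior[s] * p_cue
    z = sum(post.values())
    return {s: p / z for s, p in post.items()}

def select_refined_modules(posterior):
    """Top-down control: enable the refined modules suited to the
    most probable context."""
    best = max(posterior, key=posterior.get)
    if best == "presentation":
        return ["speaker_tracker", "gesture_detector"]
    if best == "discussion":
        return ["multi_face_tracker"]
    return []

# Start from a uniform prior; a standing speaker is observed.
prior = {s: 1.0 / len(CONTEXT_STATES) for s in CONTEXT_STATES}
posterior = bayes_update(prior, cue_observed=True)
modules = select_refined_modules(posterior)
```

With the assumed likelihoods, observing a standing speaker shifts the posterior toward the presentation context, so the speaker tracker and gesture detector are enabled while the multi-face tracker stays off.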
ACKNOWLEDGEMENTS
The work described in this paper is supported by
CNSF grants 60673189 and 60433030.
VISAPP 2007 - International Conference on Computer Vision Theory and Applications