could be applied to the task of input data generation; however, to concentrate on the main problem, we utilized an existing, convenient tool that simplifies this auxiliary task: Kinect.
Kinect is an input device for the Microsoft Xbox 360 game console that enables novel player interaction through video, movement, and sound. The device contains a color VGA camera and a VGA depth image sensor with 2048 possible distance values. The basic capability of Kinect is to track the skeletons of up to two players; each skeleton is represented by 20 joints covering the whole body. This attribute of Kinect was important for us, since localizing representative body parts makes it possible to solve the image segmentation problem required by the HTM network. Kinect can be connected to a standard PC as an input device and operated using the official SDK (software development kit) by Microsoft, available at http://www.microsoft.com/en-us/kinectforwindows/download/. The SDK enables acquisition of four types of data: RGB video, depth video, skeletons, and audio. For the purposes of our project, the first three data types were sufficient. Because the depth image is limited to QVGA resolution, it was necessary to warp it into the VGA resolution of the RGB video (see Section 5.3).
5 DATA GENERATION AND STORAGE
5.1 ATM Simulator and Data
Synchronization (Kinect – ATM-S)
In our project, we decided to replace a real ATM with a software model that runs on a notebook and imitates real interactions with a user being scanned by Kinect. After analyzing the solutions currently available on the Internet, we decided to implement our own software ATM simulator (ATM-S). The ATM-S is designed as a state automaton that records state changes and user inputs by means of a system of events.
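As a rough illustration of this design, an event-recording state automaton could be sketched as follows. The states, inputs, and transition table here are hypothetical examples for demonstration, not the actual ATM-S protocol:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Event:
    timestamp: float
    kind: str      # "user_input" or "state_change"
    payload: str

@dataclass
class ATMSimulator:
    state: str = "IDLE"
    events: list = field(default_factory=list)
    # Transition table: (current state, input) -> next state.
    # These names are illustrative assumptions, not the paper's.
    transitions: dict = field(default_factory=lambda: {
        ("IDLE", "insert_card"): "PIN_ENTRY",
        ("PIN_ENTRY", "pin_ok"): "MENU",
        ("MENU", "withdraw"): "DISPENSING",
        ("DISPENSING", "take_cash"): "IDLE",
    })

    def handle(self, user_input: str) -> None:
        """Record the input and, if a transition exists, the state change."""
        now = time.time()
        self.events.append(Event(now, "user_input", user_input))
        nxt = self.transitions.get((self.state, user_input))
        if nxt is not None:
            self.state = nxt
            self.events.append(Event(now, "state_change", nxt))

atm = ATMSimulator()
for action in ["insert_card", "pin_ok", "withdraw", "take_cash"]:
    atm.handle(action)
print(atm.state)        # back to "IDLE"
print(len(atm.events))  # 8 events: 4 inputs + 4 state changes
```

Every interaction thus leaves a time-stamped trace, which is what makes the later synchronization with Kinect frames possible.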
However, ATM-S events (which can be used to trigger face recognition in real-world applications), as well as the three data types generated by Kinect for each frame, occur asynchronously. To synchronize them, we created a program called Logger. This program subscribes to events generated by both Kinect and ATM-S, synchronizes them according to their time stamps, and stores them in a synchronous format.
The Logger also reduces the amount of data coming from Kinect.
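A minimal sketch of such time-stamp-based pairing is given below: each ATM-S event is matched with the Kinect frame whose time stamp is closest. The frame rate, timings, and event names are illustrative assumptions; the paper does not specify the Logger's internal format:

```python
import bisect

def synchronize(frame_times, events):
    """Pair each (t, name) event with the index of the nearest frame.

    frame_times must be sorted ascending; returns (name, frame_index) pairs.
    """
    pairs = []
    for t, name in events:
        i = bisect.bisect_left(frame_times, t)
        # Candidate frames: the one just before and the one just after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(frame_times)]
        best = min(candidates, key=lambda j: abs(frame_times[j] - t))
        pairs.append((name, best))
    return pairs

# Kinect delivers roughly 30 frames per second; ATM-S events fall in between.
frames = [0.000, 0.033, 0.066, 0.100]
events = [(0.034, "insert_card"), (0.090, "pin_ok")]
print(synchronize(frames, events))  # [('insert_card', 1), ('pin_ok', 3)]
```

Nearest-neighbor matching on time stamps is one simple policy; a real logger might instead attach each event to the first frame that follows it.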
5.2 Acquisition of Input Data
The acquisition of input data was carried out in an office environment where the ATM-S was set up; the system comprised a notebook running the ATM-S and the Logger. Among the various anomalous behaviors (listed in Section 2), we focused on acquiring images of faces. Kinect was placed so that the skeleton of a person using the ATM-S could be tracked while the RGB camera captured his/her face as well as hands. Participants in the experiments completed several trials in which they simulated normal and anomalous behavior while using the ATM-S.
5.3 Data Processing
Due to different resolutions and camera positions, pixels at the same coordinates in the depth and RGB images do not correspond to the same point in the real scene. The deviation grows as the object gets closer to the camera plane. For mapping the RGB image into the geometry of the depth image, the SDK provides a special transformation map, which, however, downgrades the RGB image to QVGA resolution. To preserve the VGA resolution of the RGB image, we instead implemented a backmapping warp transformation that maps the input depth image into the geometry of the RGB image.
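The idea of a backmapping warp can be sketched as follows: for every pixel of the VGA (RGB-geometry) output, look up the depth pixel that projects onto it. The real depth-to-RGB correspondence comes from the SDK's camera calibration; here it is faked as a fixed 2x scale plus a small offset, so the inverse mapping below is purely an assumption for demonstration:

```python
import numpy as np

QVGA = (240, 320)  # depth image resolution (rows, cols)
VGA = (480, 640)   # RGB image resolution

def backmap_depth_to_rgb(depth):
    """Warp a QVGA depth image into VGA RGB geometry (nearest neighbor)."""
    ys, xs = np.mgrid[0:VGA[0], 0:VGA[1]]
    dy = (ys - 4) // 2  # invert the fake forward map y_rgb = 2*y_d + 4
    dx = (xs - 6) // 2  # invert the fake forward map x_rgb = 2*x_d + 6
    valid = (dy >= 0) & (dy < QVGA[0]) & (dx >= 0) & (dx < QVGA[1])
    out = np.zeros(VGA, dtype=depth.dtype)
    out[valid] = depth[dy[valid], dx[valid]]
    return out

depth = np.full(QVGA, 1000, dtype=np.uint16)  # flat scene, 1000 depth units
warped = backmap_depth_to_rgb(depth)
print(warped.shape)  # (480, 640)
```

Iterating over destination pixels (rather than scattering source pixels forward) guarantees that every output pixel receives a value, which is the usual motivation for a backmapping warp.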
The next step was to segment the faces out of the RGB images using the skeleton data and the depth image. To remove most of the background, we cut out the face area based on the joint positions obtained from the skeleton. Given the properties of the RGB camera and the distance of the user, the face images had a resolution of 128×128 pixels. To suppress the remaining background, a gray-scale mask was generated from the depth image of the tracked user. Potential artifacts in the generated mask were removed by applying binary morphological closing. To prevent artificial edges from forming around the masked object, the resulting mask was then smoothed with a Gaussian filter. To remove the neck from the face image, we also applied a Blackman window mask. The final segmented image was obtained by pixel-wise multiplication of the face-area RGB image by the prepared mask. An illustrative example of this process is depicted in Fig. 2.
Since HTM currently accepts only gray-scale images, the obtained RGB images were finally converted to the gray-scale format.
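The masking steps above can be sketched on synthetic data as follows. The depth threshold, structuring-element size, Gaussian sigma, and gray-scale conversion are assumptions chosen for illustration, not the paper's actual parameters:

```python
import numpy as np
from scipy.ndimage import binary_closing, gaussian_filter

SIZE = 128  # face crop resolution reported above

def segment_face(rgb_crop, depth_crop, user_depth=1000, tol=200):
    """Mask a 128x128 RGB face crop using the corresponding depth crop."""
    # 1. Binary mask of pixels near the tracked user's depth (assumed rule).
    mask = np.abs(depth_crop.astype(int) - user_depth) < tol
    # 2. Remove small artifacts with binary morphological closing.
    mask = binary_closing(mask, structure=np.ones((5, 5)))
    # 3. Smooth the mask with a Gaussian filter to avoid artificial edges.
    soft = gaussian_filter(mask.astype(float), sigma=3)
    # 4. Blackman window (outer product) suppresses the neck and borders.
    w = np.blackman(SIZE)
    soft *= np.outer(w, w)
    # 5. Pixel-wise multiplication of the RGB crop by the prepared mask.
    masked = rgb_crop * soft[..., None]
    # 6. Convert to gray-scale for the HTM network (simple channel mean).
    return masked.mean(axis=2)

rgb = np.random.rand(SIZE, SIZE, 3)          # synthetic face crop in [0, 1]
depth = np.full((SIZE, SIZE), 1000)          # synthetic matching depth crop
gray = segment_face(rgb, depth)
print(gray.shape)  # (128, 128)
```

The separable Blackman window tapers smoothly to zero at the borders, which is what removes the neck region without introducing a hard edge.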
ICPRAM 2013 - International Conference on Pattern Recognition Applications and Methods