overlay of the instructor’s contour-extracted images on the board-stream would result in no correlation between the actual area of focus on the board and the area that the instructor appears to be pointing to in the overlayed video stream. A direct overlay would misalign the physical position of the instructor’s contour over the board; the overlayed stream in the remote classrooms would not reflect the actual area on the board that the instructor is pointing to in the classroom where he or she is physically present.
The following steps are involved in the
instructor-contour extraction and overlay process.
1) Instructor-contour Extraction
To extract the instructor-contour, we employ Microsoft’s Kinect sensor with the OpenNI2 and NITE API frameworks (Davison, Andrew, 2012). The output of this step is a sequence of frames containing only the instructor’s body profile, henceforth referred to as Instructor-Contour-Extracted frames or ICE-frames. These frames are in the Kinect camera’s perspective.
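The sketch below illustrates this step in Python with OpenCV. The actual pipeline relies on the NITE user tracker; here a plain depth threshold stands in for that user mask, and the depth and colour inputs are assumed to be frames already captured from the Kinect.

import cv2
import numpy as np

NEAR_MM, FAR_MM = 500, 2500   # assumed distance range of the instructor from the Kinect

def extract_ice_frame(depth_mm, colour_bgr):
    # Keep colour pixels only where the depth lies within the instructor's range.
    mask = ((depth_mm > NEAR_MM) & (depth_mm < FAR_MM)).astype(np.uint8) * 255
    # Remove speckle noise from the depth mask before applying it.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    return cv2.bitwise_and(colour_bgr, colour_bgr, mask=mask)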
2) Calibration of the Transformation Matrix
Since the ICE-frames and the board-frames (individual frames of the board-stream) have different perspectives, the ICE-frames cannot be overlayed on the board-frames without a transformation. The process of translating an image from one perspective to another is called Image-Registration, and it is achieved through a transformation matrix called the Homography-Matrix. Details of obtaining the Homography-Matrix are discussed in section 3.1.
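As an illustration only (the calibration procedure itself is described in section 3.1), a Homography-Matrix can be estimated once by matching features, e.g. SIFT features as in (Lowe, David G., 2004), between one Kinect-perspective frame and one board-perspective frame:

import cv2
import numpy as np

def calibrate_homography(kinect_gray, board_gray):
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(kinect_gray, None)
    kp2, des2 = sift.detectAndCompute(board_gray, None)
    # Keep only distinctive correspondences (Lowe's ratio test).
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    # Robustly fit the 3x3 matrix mapping Kinect-view points to board-view points.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H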
3) Image-registration of the Instructor-Contour-Extracted Frames
Reshaping an image so that its perspective aligns with that of another image is done through warping. The ICE-frames are warped using the Homography-Matrix, and the resulting frames correspond to the perspective of the board-frames (Wolberg, George, 1990). They are henceforth called Instructor-Contour-Extracted-Warped frames or ICEW-frames.
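A minimal sketch of this warping step, assuming the Homography-Matrix H from the calibration step and board-frames of a known size:

import cv2

def warp_to_board_perspective(ice_frame, H, board_size):
    # board_size is (width, height) of the board-frames; the result is an ICEW-frame.
    return cv2.warpPerspective(ice_frame, H, board_size)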
4) Encoding and Transmission of Board-frames and ICEW-frames
Board-frames are high-resolution frames (Lavrov, Valeri, 2004); they are encoded and transmitted as the board-stream over the network to the remote location at 3 frames per second. The ICEW-frames contain the instructor’s contour and are highly dynamic, so they are encoded and transmitted at 30 frames per second. The ICEW-frames are encoded together with the instructor’s audio and transmitted as the ICEW-stream.
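The pacing of the two streams can be sketched as follows; the capture and send functions are hypothetical placeholders for the actual encoder and network transport:

import time

FPS_ICEW = 30
BOARD_EVERY_N = 10            # 30 fps / 3 fps: one board-frame per ten ICEW-frames

def transmit(capture_icew, capture_board, send_icew, send_board):
    frame_index = 0
    while True:
        start = time.time()
        send_icew(capture_icew())            # 30 fps instructor-contour stream
        if frame_index % BOARD_EVERY_N == 0:
            send_board(capture_board())      # 3 fps high-resolution board-stream
        frame_index += 1
        time.sleep(max(0.0, 1.0 / FPS_ICEW - (time.time() - start)))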
5) Decoding and Overlay
The ICEW-stream and the board-stream are decoded at the receiving end to obtain frames. The frames are buffered to match the time stamps of the network packets before the overlay process. Ten ICEW-frames are overlayed on one board-frame; the results are henceforth called overlayed-frames. The overlayed-frames are played back at 30 frames per second at each of the remote participants’ locations.
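The overlay itself can be sketched as below, assuming the decoded frames are already time-aligned and each ICEW-frame has been warped to the board-frame’s resolution; the non-black pixels of the ICEW-frame (the instructor’s contour) are copied over the current board-frame:

import cv2

def overlay(board_frame, icew_frame):
    # Contour pixels are the non-black pixels of the warped instructor-contour frame.
    gray = cv2.cvtColor(icew_frame, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY)
    overlayed = board_frame.copy()
    overlayed[mask > 0] = icew_frame[mask > 0]
    return overlayed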
To evaluate the system, 20 participants took part in an experiment. The experiment involved an array of numbers on the board, with the instructor pointing at random numbers on the board. The remote students had to identify the numbers in the shortest time possible. The experiment was conducted over two sessions: 1) with two screens, one displaying the instructor-stream and the other displaying the board-stream, and 2) with one screen displaying the overlayed-frames (the instructor’s contour-extracted frames over the board-stream). Results were calculated based on both accuracy and the average time required for identification. Objective and subjective results are presented in section 4.
2 RELATED WORK
People detection is the first step in extracting a person’s body contour. From the depth map of the scene provided by the Kinect’s infrared sensor, the 3D contour of a body is extracted. Work by Krishnamurthy, Sundar Narayan delves into the intricacies of contour extraction (Krishnamurthy, Sundar Narayan, 2011).
Work by Lowe, David G on scale-invariant feature transforms describes methods for finding matching features for image correspondences. It is one of the seminal works in computer vision: it describes the computation of scale-invariant keypoints and their associated descriptor vectors for each image. The paper also covers feature matching between keypoints by comparing their descriptors; strong matches are retained and the algorithm determines the best-matching keypoint pairs between the images (Lowe, David G., 2004).
Ramponi, G describes methods for interpolating image pixels that facilitate warping for a better perceptual rendition of images. His technique takes a linear approach to bilinear interpolation, thus saving largely on computation costs (Ramponi, Giovanni, 1999).
3 PROPOSED SOLUTION
The proposed solution involves enhancing the
presence of the instructor in the remote classroom.