geometric transformations the system would need to
deal with. These are:
1. Pan.wmv: The camera is held with the image
plane parallel to the display and translated (in-
plane) from left to right at a constant distance of
30cm from the display.
2. Zoom.wmv: The camera is held central and par-
allel to the display and moved from a distance of
10cm to 50cm from it. This tests out-of-plane
translations (or equivalently, zooming out).
3. Rotate.wmv: The camera is held central and par-
allel to the display at a distance of 30cm, while it
is rotated (in-plane) through 180 degrees.
4. Skew.wmv: The camera starts by pointing to-
wards the centre of the display at a distance of
30cm. The camera then follows a roughly circu-
lar arc through 30 degrees, rotating about the dis-
play’s y-axis, until it is 5cm from the screen and
at a 60 degree angle to the screen.
Given an input video, our evaluation framework
computes the homography between each frame of the
video and the background image, exactly as the live
system would. For every frame processed, we record
the computed homography, the total processing time,
and the time taken by each component of the system
to execute, and save this data to disk for analysis.
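The framework's per-frame instrumentation can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `run_pipeline` helper and its stage names are hypothetical, standing in for the real system's feature extraction, feature matching and homography estimation stages.

```python
import time

def run_pipeline(frames, stages):
    """Run each frame through a sequence of named stages, timing every
    stage, and collect one record per frame (as the evaluation
    framework logs homographies and timings for later analysis)."""
    records = []
    for i, frame in enumerate(frames):
        rec = {"frame": i, "timings_s": {}}
        data = frame
        for name, fn in stages:
            t0 = time.perf_counter()
            data = fn(data)                     # output of one stage feeds the next
            rec["timings_s"][name] = time.perf_counter() - t0
        rec["result"] = data                    # e.g. the final homography
        records.append(rec)
    return records
```

In the real system the stages would be the SURF extractor, the matcher and the homography solver; here any callables can be timed, which also makes the per-component percentages reported below easy to reproduce from the saved records.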
To facilitate measuring accuracy, we manually
selected, as accurately as possible, the pixel in the
reference image corresponding to the pixel at the
centre of each frame of each test video. These
manually selected points correspond to where the
cursor should appear on the remote display if the
system were being used as a pointing system. These
points serve as the ground truth position of the cur-
sor and allow the accuracy of the point projected on
to the remote display to be measured. More specif-
ically, for any frame of a test video, upon comput-
ing the homography, the cursor position (central cam-
era pixel) projected by this homography can be deter-
mined. This can be compared to the manually deter-
mined (ground-truth) cursor position for the frame to
give an indication of accuracy. Accuracy is measured
as the Euclidean distance, in pixels, between the com-
puted and manually selected cursor positions. This
measure is computed for every frame of every video
and saved to disk for analysis.
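The error measure just described amounts to projecting the central camera pixel through the computed homography and taking the Euclidean distance to the hand-labelled point. A self-contained sketch (pure NumPy; the function names are illustrative, not from the paper's code):

```python
import numpy as np

def project_point(H, p):
    """Project pixel p = (x, y) through a 3x3 homography H:
    lift to homogeneous coordinates, apply H, then dehomogenise."""
    x, y, w = H @ np.array([p[0], p[1], 1.0])
    return np.array([x / w, y / w])

def cursor_error(H, frame_centre, ground_truth):
    """Euclidean pixel distance between the projected frame centre
    and the manually selected ground-truth cursor position."""
    return float(np.linalg.norm(project_point(H, frame_centre) - ground_truth))
```

For example, with the identity homography and a ground-truth point offset by (3, 4) pixels from the frame centre, `cursor_error` returns 5.0, the per-frame value that is saved to disk.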
We evaluated the accuracy of system 1, where
the best 50% of feature matches are used to compute
the homography, and system 2, which employs sam-
ple consensus. An overview of average performance
across all videos for the two systems is shown in fig-
ure 4. The sample consensus approach (system 2)
shows superior performance, particularly in the dif-
ficult ‘skew’ video.
Examining the accuracy in detail on a frame-
by-frame basis, we see that for (i) in-plane trans-
lation, (ii) zooming and (iii) in-plane rotation, the
error is acceptably small and always within five
pixels. However, in the ‘skew’ video sequence, as
the camera’s angle from the remote display normal
approaches 30 degrees, the errors can become large,
particularly in system 1. This is because SURF
features are not invariant to general projective
transformations, so the quality of feature matching
deteriorates.
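The two match-selection strategies can be contrasted in a self-contained sketch (pure NumPy, not the paper's implementation). System 1 keeps the best 50% of matches by descriptor distance and fits once; system 2's sample consensus corresponds to a RANSAC loop over a direct linear transform (DLT) homography fit. Function names and parameter values here are illustrative.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform: least-squares 3x3 homography mapping
    src points to dst points (null vector of the stacked constraints)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def best_half(src, dst, distances):
    """System 1's strategy: keep the 50% of matches with the smallest
    descriptor distances, then fit a single homography to them."""
    keep = np.argsort(distances)[: len(distances) // 2]
    return fit_homography(src[keep], dst[keep])

def ransac_homography(src, dst, n_iter=200, tol=3.0, rng=None):
    """System 2's strategy (sample consensus): repeatedly fit to four
    random correspondences, keep the model with the most inliers,
    then refit on that inlier set."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        proj = np.c_[src, np.ones(len(src))] @ H.T
        proj = proj[:, :2] / proj[:, 2:3]
        inliers = np.linalg.norm(proj - dst, axis=1) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_homography(src[best_inliers], dst[best_inliers]), best_inliers
```

The contrast explains the skew results: when matching deteriorates, bad matches can still have small descriptor distances and so survive system 1's 50% cut, whereas the consensus test rejects any match that is geometrically inconsistent with the majority.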
Figure 4: Mean pixel error results for the four movies.
On examining the time to process a frame, we
found no significant differences between the videos.
The mean frame rate across all frames is 0.64 frames
per second. This is clearly too slow for a direct
interaction device: the cursor would update only
every 1.56 seconds, resulting in a very low level of
usability. Within our timing results, feature
extraction took on average 17.8% of the time, feature
matching 74.4%, and computing the homography only
7.8%.
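These figures can be sanity-checked with simple arithmetic (the percentages are the measurements reported above):

```python
fps = 0.64
frame_period = 1.0 / fps  # seconds between cursor updates: about 1.56 s
shares = {"extraction": 0.178, "matching": 0.744, "homography": 0.078}
# Approximate wall-clock time per frame spent in each component.
component_s = {name: share * frame_period for name, share in shares.items()}
```

Feature matching alone accounts for roughly 1.16 s of each 1.56 s frame period, which is why the speed-up techniques in the next section target the matching stage first.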
6 USABILITY EVALUATION
Ideally, our system would be both responsive and
accurate. However, this is not currently achievable,
and some trade-off has to be made. To determine
where a good balance between speed and accuracy
lies, we conducted a user test, which was also used
to quantitatively measure perceived usability.
We evaluated three techniques to improve the sys-
tem update rate: (i) bounding the time over which fea-
ture matching can take place; (ii) reducing the sensed
PECCS 2011 - International Conference on Pervasive and Embedded Computing and Communication Systems