system. Then we discuss the image processing and
matching methods used, followed by a description of
the feedback control algorithm for guiding the user
toward the object so that she/he can grasp it. Next,
experimental results are presented and a summary is
given.
2 OUTLINE OF THE SYSTEM
The grasping support system proposed in this paper
is based on visual feedback, but unlike most visual
feedback systems proposed to date, the actuator in the
system is a person whose vision is impaired. That is,
the grasping of the desired object is carried out by the
person’s hand, and the camera providing information
about the environment is also moved by the person’s
hand. At the start of the grasping action, the human
actuator holding the camera in one hand is placed at
a certain distance from the target object, and the cam-
era captures the target scene. The system recognizes
the target object in the captured image and estimates
its location in the image. Based
on this location, a command is generated for the pur-
pose of providing auditory (voice) instructions to the
human actuator about how to make the next camera
move so that the camera approaches the target object
without losing track of it in the field of view.
The human actuator then moves
the camera according to the command, moving a finite
distance closer to the target object. The camera then
captures a new image, and
the same cycle is repeated until the camera and human
actuator are close enough for the person to grasp the
object with her/his hand. A
schematic view of this visual-auditory feedback loop
is shown in Fig. 1, where the visual feedback loop is
indicated by the sequence {light field → camera →
image analysis → audio-based command generation
→ human actuator → camera motion → } (along the
dashed line). The connection between human actua-
tor and camera is meant to indicate that the camera is
manipulated by the person’s hand. This feedback loop
includes the scene in front of the camera (i.e., the
shelf with products) because the camera motion oc-
curs within the scene and the camera senses the light
field originating in the scene.
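The cycle above can be summarized as a simple closed loop. A minimal sketch in Python is given below; the function names (capture_image, locate_target, speak, close_enough) and the pixel tolerance are illustrative placeholders and do not refer to the actual implementation.

# Minimal sketch of the visual-auditory guidance cycle (illustrative only).
TOLERANCE_PX = 40  # assumed dead band around the image centre

def guidance_loop(capture_image, locate_target, speak, close_enough):
    """Repeat capture -> locate -> voice command until the object is reachable."""
    while True:
        image = capture_image()             # image from the hand-held camera (NumPy array)
        result = locate_target(image)       # ((x, y), apparent_size) or None
        if result is None:
            speak("target lost, please step back")
            continue
        (x, y), apparent_size = result
        if close_enough(apparent_size):     # e.g. apparent size exceeds a threshold
            speak("reach out and grasp the object")
            return
        dx = x - image.shape[1] / 2         # horizontal offset from the image centre
        if dx < -TOLERANCE_PX:
            speak("move left")
        elif dx > TOLERANCE_PX:
            speak("move right")
        else:
            speak("move forward")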
There is a second loop, which consists of the
sequence {camera → compass → audio-based com-
mand generation → human actuator → } in which a
compass sensor for measuring the camera’s heading
is included. This heading sensor is mechanically at-
tached to the camera; it is used to assist in keeping
the camera’s direction stable. The second loop is also
a feedback loop including the human actuator, but it
is not based on vision. It works in parallel to the first
loop and involves only auditory feedback.
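A corresponding sketch of this heading correction, again with placeholder names and an assumed angular tolerance, could look as follows.

# Sketch of the parallel heading-stabilization loop driven by the compass
# (illustrative only; threshold and function names are assumptions).
HEADING_TOLERANCE_DEG = 10.0  # assumed acceptable drift from the reference heading

def heading_feedback(read_compass_deg, speak, reference_deg):
    """Issue a spoken correction when the camera drifts from its reference heading."""
    # signed deviation, wrapped to the range [-180, 180)
    deviation = (read_compass_deg() - reference_deg + 180.0) % 360.0 - 180.0
    if deviation > HEADING_TOLERANCE_DEG:
        speak("turn the camera slightly left")
    elif deviation < -HEADING_TOLERANCE_DEG:
        speak("turn the camera slightly right")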
Since the camera is moved by a person who is
unable to observe the camera due to his/her visual
impairment, there inevitably will be some amount of
camera position jitter. This jitter leads to image blur if
the camera’s exposure time is set to about 1/60 second
(which is not unusual under dim scene illumination
sometimes found in stores). In order to stabilize the
camera, we mount it on a monopod, which steadies it
in the vertical direction. The monopod also allows the
person to adjust the camera's height. The reason for
choosing a monopod over a tripod is that a tripod
would be too bulky to handle, whereas a monopod
is slim and yet provides sufficient, although not
absolute, stability.
The visual-auditory feedback process described
above involves two different phases: 1. the gradual
approach toward the target object from some distance,
and 2. the actual grasping of the object by the per-
son’s hand. In the following, we explain the details
of Phase 1 and Phase 2, as well as the details of the
image processing and object recognition methods that
are common to both phases.
3 IMAGE PROCESSING AND
MATCHING
In a system as outlined above, recognizing target
objects in the images and estimating their position in
the image coordinate system are crucial to the success
of the system. Of similar importance is the identifica-
tion of the user’s hand and the estimation of the hand
position during the final stage of the grasping process.
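As a minimal illustration of such a recognition and localization step, the sketch below matches SIFT features of a reference view of the target against the current camera image using OpenCV; the ratio-test and minimum-match thresholds, as well as the centroid-based position estimate, are assumptions made for illustration and not the method detailed in the following subsections.

# Illustrative sketch: locate a known target object in the camera image by
# matching SIFT features against a reference view (OpenCV).
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def find_target_position(reference_gray, scene_gray, ratio=0.75, min_matches=10):
    """Return the target's centre in scene image coordinates, or None."""
    kp_ref, des_ref = sift.detectAndCompute(reference_gray, None)
    kp_scene, des_scene = sift.detectAndCompute(scene_gray, None)
    if des_ref is None or des_scene is None:
        return None
    good = []
    for pair in matcher.knnMatch(des_ref, des_scene, k=2):
        # Lowe's ratio test keeps only distinctive correspondences
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    if len(good) < min_matches:
        return None
    pts = np.float32([kp_scene[m.trainIdx].pt for m in good])
    return pts.mean(axis=0)  # rough estimate: centroid of matched keypoints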
3.1 Recognition of Target Object and
Position Estimation
For the recognition of objects, stable and descriptive
features of the object have to be extracted from the
image. We use the SIFT feature vectors (Lowe, 2004)
for this purpose because these features are relatively
robust to scale and orientation changes of the pro-
jected object images. As the user moves the cam-
era closer to the target object, the object's size in the
image increases, and there are also camera direction
changes within a certain angular range because the
camera is mounted on a monopod held by the user’s
hand, which cannot keep the camera completely sta-
ble. SIFT feature vectors are designed to cope with
this situation, although the time needed for computing