system. Then we discuss the image processing and
matching methods used, followed by a description of
the feedback control algorithm for guiding the user
toward the object so that she/he can grasp it. Next,
experimental results are presented and a summary is
given.
2 OUTLINE OF THE SYSTEM
The grasping support system proposed in this paper
is based on visual feedback, but unlike most visual
feedback systems proposed to date, the actuator in the
system is a person whose vision is impaired. That is,
the grasping of the desired object is carried out by the
person’s hand, and the camera providing information
about the environment is also moved by the person’s
hand. At the start of the grasping action, the human
actuator holding the camera in one hand is placed at
a certain distance from the target object, and the cam-
era captures the target scene. The system recognizes
the target object in the captured image and estimates
its location in the image. Based
on this location, a command is generated for the pur-
pose of providing auditory (voice) instructions to the
human actuator about how to make the next camera
move so that the camera approaches the target object
without losing track of it in the field of view.
The human actuator then moves
the camera according to the command, moving a finite
distance closer to the target object. The camera then
captures a new image, and
the same cycle is repeated until the camera and human
actuator are close enough for the person to grasp the
object with her/his hand. A
schematic view of this visual-auditory feedback loop
is shown in Fig. 1, where the visual feedback loop is
indicated by the sequence {light field → camera →
image analysis → audio-based command generation
→ human actuator → camera motion → } (along the
dashed line). The connection between human actua-
tor and camera is meant to indicate that the camera is
manipulated by the person’s hand. This feedback loop
includes the scene in front of the camera (i.e., the
shelf with products) because the camera motion oc-
curs within the scene and the camera senses the light
field originating in the scene.
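The cycle above can be summarized as a simple closed loop. A minimal sketch in Python is given below; the function names (capture_image, locate_target, speak, close_enough) and the pixel tolerance are illustrative placeholders and do not refer to the actual implementation.

# Minimal sketch of the visual-auditory guidance cycle (illustrative only).
TOLERANCE_PX = 40  # assumed dead band around the image centre

def guidance_loop(capture_image, locate_target, speak, close_enough):
    """Repeat capture -> locate -> voice command until the object is reachable."""
    while True:
        image = capture_image()             # image from the hand-held camera (NumPy array)
        result = locate_target(image)       # ((x, y), apparent_size) or None
        if result is None:
            speak("target lost, please step back")
            continue
        (x, y), apparent_size = result
        if close_enough(apparent_size):     # e.g. apparent size exceeds a threshold
            speak("reach out and grasp the object")
            return
        dx = x - image.shape[1] / 2         # horizontal offset from the image centre
        if dx < -TOLERANCE_PX:
            speak("move left")
        elif dx > TOLERANCE_PX:
            speak("move right")
        else:
            speak("move forward")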
There is a second loop, which consists of the
sequence {camera → compass → audio-based com-
mand generation → human actuator → } in which a
compass sensor for measuring the camera’s heading
is included. This heading sensor is mechanically at-
tached to the camera; it is used to assist in keeping
the camera’s direction stable. The second loop is also
a feedback loop including the human actuator, but it
is not based on vision. It works in parallel to the first
loop and involves only auditory feedback.
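A corresponding sketch of this heading correction, again with placeholder names and an assumed angular tolerance, could look as follows.

# Sketch of the parallel heading-stabilization loop driven by the compass
# (illustrative only; threshold and function names are assumptions).
HEADING_TOLERANCE_DEG = 10.0  # assumed acceptable drift from the reference heading

def heading_feedback(read_compass_deg, speak, reference_deg):
    """Issue a spoken correction when the camera drifts from its reference heading."""
    # signed deviation, wrapped to the range [-180, 180)
    deviation = (read_compass_deg() - reference_deg + 180.0) % 360.0 - 180.0
    if deviation > HEADING_TOLERANCE_DEG:
        speak("turn the camera slightly left")
    elif deviation < -HEADING_TOLERANCE_DEG:
        speak("turn the camera slightly right")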
Since the camera is moved by a person who is
unable to observe the camera due to his/her visual
impairment, there inevitably will be some amount of
camera position jitter. This jitter leads to image blur if
the camera’s exposure time is set to about 1/60 second
(which is not unusual under dim scene illumination
sometimes found in stores). In order to stabilize the
camera, we mount it on a monopod, which steadies it
in the vertical direction. The monopod also allows the
person to adjust the camera's height. The reason for
choosing a monopod over a tripod is that a tripod
would be too bulky to handle, whereas a monopod
is slim and yet provides sufficient, although not
absolute, stability.
The visual-auditory feedback process described
above involves two different phases: 1. the gradual
approach toward the target object from some distance,
and 2. the actual grasping of the object by the per-
son’s hand. In the following, we explain the details
of Phase 1 and Phase 2, as well as the details of the
image processing and object recognition methods that
are common to both phases.
3 IMAGE PROCESSING AND
MATCHING
In a system as outlined above, recognizing target
objects in the images and estimating their position in
the image coordinate system are crucial to the success
of the system. Of similar importance is the identifica-
tion of the user’s hand and the estimation of the hand
position during the final stage of the grasping process.
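As a minimal illustration of such a recognition and localization step, the sketch below matches SIFT features of a reference view of the target against the current camera image using OpenCV; the ratio-test and minimum-match thresholds, as well as the centroid-based position estimate, are assumptions made for illustration and not the method detailed in the following subsections.

# Illustrative sketch: locate a known target object in the camera image by
# matching SIFT features against a reference view (OpenCV).
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def find_target_position(reference_gray, scene_gray, ratio=0.75, min_matches=10):
    """Return the target's centre in scene image coordinates, or None."""
    kp_ref, des_ref = sift.detectAndCompute(reference_gray, None)
    kp_scene, des_scene = sift.detectAndCompute(scene_gray, None)
    if des_ref is None or des_scene is None:
        return None
    good = []
    for pair in matcher.knnMatch(des_ref, des_scene, k=2):
        # Lowe's ratio test keeps only distinctive correspondences
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    if len(good) < min_matches:
        return None
    pts = np.float32([kp_scene[m.trainIdx].pt for m in good])
    return pts.mean(axis=0)  # rough estimate: centroid of matched keypoints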
3.1 Recognition of Target Object and
Position Estimation
For the recognition of objects, stable and descriptive
features of the object have to be extracted from the
image. We use the SIFT feature vectors (Lowe, 2004)
for this purpose because these features are relatively
robust to scale and orientation changes of the pro-
jected object images. As the user moves the cam-
era closer to the target object, the object's size in the
image increases, and there are also camera direction
changes within a certain angular range because the
camera is mounted on a monopod held by the user’s
hand, which cannot keep the camera completely sta-
ble. SIFT feature vectors are designed to cope with
this situation, although the time needed for computing