Image Transformation of Eye Areas for Synthesizing Eye-contacts
in Video Conferencing
Takuya Inoue¹, Tomokazu Takahashi¹, Takatsugu Hirayama¹, Yasutomo Kawanishi¹, Daisuke Deguchi², Ichiro Ide¹, Hiroshi Murase¹, Takayuki Kurozumi³ and Kunio Kashino³
¹Graduate School of Information Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya-shi, Aichi, Japan
²Information Strategy Office, Nagoya University, Furo-cho, Chikusa-ku, Nagoya-shi, Aichi, Japan
³NTT Communication Science Laboratories, NTT Corporation, 3-1, Morinosato-Wakamiya, Atsugi-shi, Kanagawa, Japan
Keywords: Video Conferencing, Eye Contact, Gaze Classification.
Abstract: Recently, the spread of Web cameras has facilitated video-conferencing. Since a Web camera is usually located outside the display while the user looks at his/her partner in the display, the two sides cannot establish eye contact with each other. Various methods have been proposed to solve this problem, but most of them require specific sensors. In this paper, we propose a method that transforms the eye areas to synthesize eye contact using a single camera of the kind commonly built into laptop computers and mobile phones. Concretely, we implemented a system which transforms the user's eye areas in an image into his/her eye image with a straight gaze to the camera, only when the user's gaze falls within the range in which the partner would perceive eye contact.
1 INTRODUCTION
Recently, the spread of Web cameras has facilitated video-conferencing. Many users find communication unnatural when they cannot establish eye contact with each other. This is because the camera cannot be positioned at the same location as the eyes of the partner. Since the importance of eye contact in video conferencing has been pointed out (Muhlbach et al., 1985), eye contact should somehow be synthesized to enable natural communication.
There are software/hardware solutions to achieve
eye contact in video-conferencing. As a hardware so-
lution, Kollarits et al. have proposed a method which
uses a half-mirror screen (Kollarits et al., 1995).
However, this hardware is quite large and takes time to install. As software solutions, there are two approaches: those using multiple cameras and those using a single camera.
Yang and Zhang applied View Morphing to
synthesize the face images captured by two cam-
eras (Yang and Zhang, 2004). It requires robust
and accurate feature extraction for various appearance
changes to densely associate facial feature points of
the images captured by the two cameras. Kuster et
al. also proposed a method that makes use of an RGB
camera and a depth camera (Kuster et al., 2012). It
synthesizes an image which establishes eye contact by
performing an appropriate 3D transformation of the
head geometry. Since this method synthesizes the im-
age in accordance with the position of the chin, which
is actually difficult to locate accurately, the size of the
forehead often becomes inappropriate.
On the other hand, methods using a single camera
have been proposed. Giger et al. proposed a method
which utilizes a 3D facial model (Giger et al., 2014).
It also requires a depth camera for generating a 3D fa-
cial model. Yip proposed a method that utilizes affine
transformation and an eye model to rectify the face
and the eyes to establish eye contact (Yip, 2005). It
utilizes only one camera, but it requires that the user
put the camera in front of the display in the setup
phase. These methods require the user to use an addi-
tional camera or move the camera to a specific posi-
tion for video conferencing. Therefore, they are difficult to use with laptops or mobile phones. In con-
trast, Solina and Ravnik proposed a method that ro-
tates an image around the horizontal axis to establish
eye contact with only one camera (Solina and Ravnik,
2011). Since it rotates the whole image without con-
sidering the 3D structure, the face image becomes distorted, and it cannot physically produce a natural face at an angle at which the partner would perceive eye contact.

Figure 1: Example of an image pair before/after image transformation by the proposed method ((a) original, (b) transformed).
Figure 2: Setting of the proposed system.
In recent years, video conferencing with laptop
computers or mobile phones has become common.
However, these devices are usually equipped with a
single frontal RGB camera, and it is difficult to make
use of additional sensors. Therefore, we propose a
system for synthesizing eye contact using only a sin-
gle frontal RGB camera. It is known that humans are sensitive to another person's gaze when it is directed toward the vicinity of their own eyes, and less sensitive to other gaze directions. Based
on this characteristic, eye contact can be achieved by
transforming only the eye areas as shown in Fig. 1.
We name the range of gaze directions for which the partner perceives eye contact the perceptual range of eye contact. The proposed system transforms the image of the user's eye areas into his/her eye image with a straight gaze to the camera only when the user is looking within this range.
According to Uono and Hietanen, this range is approximately four degrees (Uono and Hietanen, 2015). According to Anstis et al., whenever another person's gaze direction is ten degrees or more away from the observer, the observer perceives the gaze angle as larger than the actual angle, assuming that the partner's head does not rotate (Anstis et al., 1969). Meanwhile, it is known that whenever the gaze direction is within four degrees, the angle is perceived as smaller than the actual angle. Therefore, the proposed video-conferencing system transforms the eye areas only when the user is looking within the perceptual range of eye contact; otherwise, it outputs the original image.
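For intuition (this worked example is ours, not from the paper), the four-degree threshold can be converted into a radius on the display for a given viewing distance; the 50 cm distance below is only an assumed example.

```python
import math

def perceptual_range_radius_cm(viewing_distance_cm, half_angle_deg=4.0):
    """Radius on the display (cm) subtending the given visual angle,
    i.e., a rough screen-space size of the perceptual range of eye contact."""
    return viewing_distance_cm * math.tan(math.radians(half_angle_deg))

# At a 50 cm viewing distance, 4 degrees corresponds to roughly a 3.5 cm radius.
print(round(perceptual_range_radius_cm(50.0), 2))  # 3.5
```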
As shown in Fig. 2, by sending an image with the
transformed eye areas to both sides of a video confer-
ence, eye contact is realized.
Our contributions to realize the system are as fol-
lows:
1. Gaze classification: Technique for detecting from a face image whether or not a user is looking within the perceptual range of eye contact.
2. Image transformation: Technique for generating a
face image to establish eye contact by transform-
ing the eye areas.
The rest of the paper describes our solution to each of these in detail in Section 2, reports evaluation results in Section 3, discusses them in Section 4, and concludes the paper in Section 5.
2 IMAGE TRANSFORMATION
OF EYE AREAS
Fig. 3 shows the process flow of the proposed system. First, the system extracts the feature points in the original image. Next, the system detects from the original image whether or not the user is looking within the perceptual range of eye contact. If the user is considered to be looking within the range, the system replaces the user's eye areas with the eye areas of the reference image by image transformation. The system needs to capture beforehand a reference image in which the user looks directly at the camera. The system also needs to capture some training images of the user for the gaze classification.
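The following sketch summarizes the per-frame control flow of Fig. 3. It is only an illustration of the glue logic: the three callables stand in for the components of Sections 2.1, 2.2, and 2.4, and the causal majority vote is a simplification of the smoothing in Section 2.3.

```python
from typing import Callable, List

def process_frame(frame,
                  reference_image,
                  extract_points: Callable,    # Sec. 2.1: feature points extraction
                  is_eye_contact: Callable,    # Sec. 2.2: gaze classification
                  transform_eyes: Callable,    # Sec. 2.4: eye areas transformation
                  recent_decisions: List[bool],
                  window: int = 11):
    """One iteration of the pipeline in Fig. 3 (illustrative skeleton only)."""
    points = extract_points(frame)
    recent_decisions.append(is_eye_contact(frame, points))
    recent = recent_decisions[-window:]
    # Majority vote over recent decisions (causal variant of Sec. 2.3).
    if sum(recent) * 2 > len(recent):
        return transform_eyes(frame, reference_image, points)
    return frame
```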
2.1 Feature Points Extraction
The system extracts six feature points from the contour of each of the left and right eye areas using a state-of-the-art face tracker (Saragih et al., 2011), as shown in Fig. 4. Since the face tracker is not perfectly accurate, the feature points extracted in adjacent frames are located at slightly different positions. This jitter affects the image transformation of the eye areas, so the transformed image sequence will suffer from unnatural motion around the eyes. To avoid this problem, the sum of squared distances d of the feature points between adjacent frames is defined as
$d = \sum_{i=1}^{6} \left\| \boldsymbol{x}_i^{(t-1)} - \boldsymbol{x}_i^{(t)} \right\|^2$,  (1)

where $\boldsymbol{x}_i^{(t)}$ denotes the position of feature point $i$ in the $t$-th frame.
Figure 3: Process flow of the proposed method (feature points extraction, gaze classification, temporal smoothing, and image transformation using the reference image when the gaze is within the perceptual range of eye contact; otherwise the original image is output).
Figure 4: Extracted feature points and triangular patch segments.
Figure 5: Example of an eye areas image.
If the distance is less than a threshold,
we suppose that the feature points have not moved
from the previous frame, so the positions of the fea-
ture points in the previous frame are used instead of
the detected ones in the current frame.
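A minimal sketch of this jitter suppression, assuming the six feature points of one eye area are given as NumPy arrays of shape (6, 2); the threshold value is an assumption, not a value reported in the paper.

```python
import numpy as np

def stabilize_points(prev_points: np.ndarray,
                     curr_points: np.ndarray,
                     threshold: float = 4.0) -> np.ndarray:
    """Return the previous frame's points when the sum of squared
    displacements d of Eq. (1) stays below a threshold."""
    d = np.sum((prev_points - curr_points) ** 2)
    return prev_points.copy() if d < threshold else curr_points

# A sub-pixel jitter of 0.3 pixels per coordinate gives d = 6 * 2 * 0.09 = 1.08,
# so the previous positions are kept.
prev = np.full((6, 2), 20.0)
print(np.array_equal(stabilize_points(prev, prev + 0.3), prev))  # True
```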
2.2 Gaze Classification
The gaze classification is the process of determining whether or not the user is gazing within the perceptual range of eye contact. If the gaze falls within the range, the proposed system outputs the transformed image to synthesize eye contact; otherwise, it outputs the original image.
Our proposed gaze classification consists of two phases: a training phase, which builds a classifier using training images, and a classification phase, which determines whether or not the user is looking within the perceptual range of eye contact. The training images must be collected beforehand for each user.
2.2.1 Training Phase
In the training phase, the system segments the eye ar-
eas in each of the training images, extracts the image
feature and constructs a classifier.
Eye Areas Image Segmentation
Firstly, images in which the user is either looking within the perceptual range of eye contact (positives) or not (negatives) are collected. Considering actual use, this task should take at most one minute.
Then rectangles bounding the six feature points
are segmented and combined into an eye areas im-
age as shown in Fig. 5. Data augmentation is per-
formed by applying translation and aspect ratio
normalization to the segmented images. Finally,
all images in the training dataset are normalized
to their average size.
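A hedged sketch of the segmentation step just described: the bounding rectangle of each eye's six feature points is cropped and the two crops are concatenated into one eye areas image (cf. Fig. 5). The margin, the use of OpenCV and NumPy, and the resizing details are our assumptions.

```python
import cv2          # assumed dependency, used here only for resizing
import numpy as np

def crop_eye(image: np.ndarray, eye_points: np.ndarray, margin: int = 4) -> np.ndarray:
    """Crop the bounding rectangle of one eye's six (x, y) feature points."""
    x0, y0 = np.floor(eye_points.min(axis=0)).astype(int) - margin
    x1, y1 = np.ceil(eye_points.max(axis=0)).astype(int) + margin
    return image[max(y0, 0):y1, max(x0, 0):x1]

def eye_areas_image(image, left_points, right_points) -> np.ndarray:
    """Combine the two eye crops side by side into one eye areas image."""
    left, right = crop_eye(image, left_points), crop_eye(image, right_points)
    h = min(left.shape[0], right.shape[0])
    left = cv2.resize(left, (left.shape[1], h))
    right = cv2.resize(right, (right.shape[1], h))
    return np.hstack([left, right])
```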
Feature Extraction and Classifier Training
Since most visual characteristics appear along the contour of the iris, we extract an edge-based feature, Histograms of Oriented Gradients (HOG)
(Dalal and Triggs, 2005) from the eye areas im-
age. To make the feature robust to illumination
variation, histogram equalization is performed be-
fore the feature extraction. As a classifier, we
make use of a Support Vector Machine (SVM)
classifier (Cortes and Vapnik, 1995) which pro-
vides high performance for binary classification
tasks.
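A minimal sketch of the feature extraction and classifier training, assuming scikit-image for HOG, OpenCV for histogram equalization, and scikit-learn for the SVM; the HOG parameters and the linear kernel are our assumptions, not values reported in the paper.

```python
import cv2
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def extract_feature(eye_areas_image: np.ndarray) -> np.ndarray:
    """Histogram equalization followed by HOG (all inputs are assumed to
    have been normalized to the same size beforehand, cf. Sec. 2.2.1)."""
    gray = cv2.cvtColor(eye_areas_image, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)
    return hog(gray, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def train_gaze_classifier(images, labels) -> SVC:
    """Binary SVM: 1 = gaze within the perceptual range, 0 = outside."""
    features = np.array([extract_feature(im) for im in images])
    classifier = SVC(kernel="linear")
    classifier.fit(features, labels)
    return classifier
```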
2.2.2 Classification Phase
Similar to the training phase, the system extracts the features and determines, using the classifier, whether or not the user is looking within the perceptual range of eye contact. When the eyes are not fully open, such as while blinking, we assume that the user is not looking at a specific location. Thus, to reject such situations before classification, eye openness is calculated from the distance between the feature points at the top and the bottom of the eye areas.
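A hedged sketch of the blink rejection: eye openness is approximated from the vertical distance between the top and bottom feature points, normalized here by the eye width; both the point ordering and the threshold are hypothetical.

```python
import numpy as np

def eye_openness(eye_points: np.ndarray) -> float:
    """Height-to-width ratio from the six contour points of one eye.
    Assumed (hypothetical) ordering: 0 = outer corner, 3 = inner corner,
    1 = upper-lid point, 5 = lower-lid point."""
    width = np.linalg.norm(eye_points[0] - eye_points[3])
    height = np.linalg.norm(eye_points[1] - eye_points[5])
    return height / max(width, 1e-6)

def is_blinking(left_points, right_points, threshold: float = 0.15) -> bool:
    """Reject frames in which either eye is not fully open."""
    return min(eye_openness(left_points), eye_openness(right_points)) < threshold
```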
2.3 Temporal Smoothing
The system performs the gaze classification in each frame independently. Since the classification results may be unstable over time, temporal smoothing is applied to the binary decision sequence. This is realized by a majority vote over the five sequential frames before and after the current frame.
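A minimal sketch of this majority vote. Because the vote looks at frames both before and after the current one, the smoothed decision is emitted with a delay of half the window; the buffering details are our assumption.

```python
from collections import deque

class TemporalSmoother:
    """Majority vote over 2 * half_window + 1 consecutive frame decisions."""

    def __init__(self, half_window: int = 5):
        self.buffer = deque(maxlen=2 * half_window + 1)

    def update(self, decision: bool) -> bool:
        """Append the newest per-frame decision and return the majority vote
        over the current buffer (corresponding to the frame at its center,
        i.e., delayed by up to half_window frames)."""
        self.buffer.append(decision)
        return sum(self.buffer) * 2 > len(self.buffer)

# An isolated misclassification (False) is smoothed away.
smoother = TemporalSmoother(half_window=2)
print([smoother.update(d) for d in [True, True, False, True, True]])
# [True, True, True, True, True]
```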
2.4 Image Transformation
The system synthesizes the eye areas of the reference image onto the input image using triangulation and affine transformation.
The triangulation refers to the six feature points shown in Fig. 4. First, the eye areas in the reference image and the input image are segmented as shown in Fig. 4. Next, the triangles in the reference image are deformed by affine transformation. Finally, the deformed triangular patches are synthesized into their corresponding patches in the input image. Alpha blending is applied to make the synthesized image look more natural.
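A hedged OpenCV sketch of the per-triangle synthesis: each triangle of the reference eye area is affine-warped onto its corresponding triangle in the input image and alpha-blended in. The blending weight and the code structure are our assumptions; the triangle vertex lists are taken from the feature points of Fig. 4.

```python
import cv2
import numpy as np

def blend_triangle(ref_img, out_img, ref_tri, in_tri, alpha=0.8):
    """Warp one reference triangle onto out_img (modified in place)."""
    ref_tri, in_tri = np.float32(ref_tri), np.float32(in_tri)
    x, y, w, h = cv2.boundingRect(in_tri)
    local_tri = (in_tri - [x, y]).astype(np.float32)
    # Affine transform from reference coordinates to local patch coordinates.
    matrix = cv2.getAffineTransform(ref_tri, local_tri)
    warped = cv2.warpAffine(ref_img, matrix, (w, h),
                            flags=cv2.INTER_LINEAR,
                            borderMode=cv2.BORDER_REFLECT_101)
    # Blend only inside the triangle.
    mask = np.zeros((h, w), dtype=np.uint8)
    cv2.fillConvexPoly(mask, np.int32(local_tri), 1)
    weight = alpha * mask.astype(np.float32)[..., None]
    roi = out_img[y:y + h, x:x + w].astype(np.float32)
    out_img[y:y + h, x:x + w] = (roi * (1 - weight) +
                                 warped.astype(np.float32) * weight).astype(out_img.dtype)
```

Calling this for every triangle defined over the six feature points of each eye would yield transformed eye areas of the kind shown in Fig. 1(b).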
3 EXPERIMENTS
In Section 3.1, we evaluate through a subjective experiment the quality of eye contact perceived while watching videos transformed by the proposed method and a comparative method. In Section 3.2, we investigate the timing at which the subjects perceive eye contact while carefully observing four different image sequences frame-by-frame.

Table 1: Quality of eye contact.
            Original   Proposed method   Comparative method
Mean          1.60          3.65               1.70
Variance      1.17          1.00               1.09
3.1 Qualitative Evaluation
We evaluated whether evaluators perceived eye contact with the subjects in the videos transformed by the proposed method. We captured videos of five subjects looking either within the perceptual range of eye contact or elsewhere. The proposed method and the comparative method (Solina and Ravnik, 2011) were then applied to these videos. With regard to the comparative method,
the angle of rotation around the horizontal axis was
set to 20 degrees. Twelve subjects evaluated three dif-
ferent videos, which were the original video and the
videos transformed by applying the proposed method
and the comparative method. Then they graded the
quality of eye contact with the subject in the video on
a scale of 1 for “no eye contact” to 5 for “eye contact”.
Table 1 shows the quality of eye contact graded by
the evaluators. The proposed method obtained a higher score than both the original video and the comparative method.
3.2 Timing of Eye Contact
We investigated the timing when the subjects per-
ceived eye contact while watching four different im-
age sequences frame-by-frame. We installed two syn-
chronized cameras; one camera on top of the display,
and the other at the center of the display. The latter was intended to capture face images in the ideal situation of real-world face-to-face communication. We captured a subject looking at the center
camera or elsewhere. The proposed method and the comparative method were then applied to the image sequence captured by the camera installed on top of the display. Twelve subjects evaluated a set of four kinds of image sequences (Fig. 6): the original images captured from the common camera position, i.e., on top of the display, the images transformed by applying the proposed method and the comparative method,
and the original images captured from the ideal cam-
era position, i.e., the center of the display. They were
then asked to evaluate whether or not they perceived
eye contact with the subject in the image sequences
frame-by-frame.
Fig. 7 shows the percentage of the subjects who perceived eye contact for each frame.
Figure 6: Examples of the four kinds of images prepared for the experiment: (a) original image (top; normal camera setting), (b) transformed image (proposed method), (c) transformed image (comparative method), (d) original image (front; ideal camera setting).
Figure 7: Percentage of subjects who perceived eye contact for each frame, for the original image (top), the transformed image (proposed method), the transformed image (comparative method), and the original image (front).
Taking the percentage obtained for the images captured from the front (the ideal camera setting) as the ground truth, the average difference from the ground truth was 43.9% for the original images captured from the top (normal camera setting), 14.6% for the images transformed by applying the proposed method, and 39.3% for the images transformed by applying the comparative method.
4 DISCUSSIONS
4.1 Qualitative Evaluation
As can be seen in Table 1, the comparative method
could not improve the quality. This was because it
rotated the whole image without considering the 3D
structure. In contrast, the proposed method could im-
prove the quality because it transformed only the eye
areas in the original image.
4.2 Timing of Eye Contact
In Fig. 7, the subjects hardly perceived eye contact
while watching the original images captured from the
normal camera setting even when they did so while
watching the original images captured from the ideal
camera setting. Also, for the images transformed by
applying the comparative method, the subjects per-
ceived eye contact more than for the original images
captured from the normal camera setting, but still,
most subjects did not perceive eye contact. In con-
trast, for the images transformed by applying the pro-
posed method, the subjects perceived eye contact at
approximately the same timings as for the images
captured from the ideal camera setting. Thus, we confirmed that a user could establish eye contact with the partner in a video conference that makes use of the proposed method, as naturally as in real-world face-to-face communication.
The proposed method failed to establish eye contact around the 5th frame and the 50th frame. The gaze classification in the proposed method judged that the user was not looking within the perceptual range of eye contact although the user actually was. There-
fore, the system did not transform the eye areas. To
achieve more natural communication, it is necessary
to develop a method to improve the accuracy of gaze
classification.
4.3 Gaze Classification Performance
We evaluated the accuracy of the gaze classification by conducting an experiment that compares the method using HOG with a baseline method using intensity as a feature.
We used the same system settings as in the qualitative evaluation. We set a camera on top of a 24-
inch display with a resolution of 1,920 × 1,200 pixels.
The resolution of the camera was 1,280 × 980 pixels
and the frame rate was 20 fps. Training images of five
subjects were captured by the camera at a distance of
50 cm from the display. We showed the subjects 486
white points shown in Fig. 8 one-by-one and asked
them to look at each of the points and captured a face
image at each point. Fig. 9 shows examples of im-
ages in the dataset. We defined the perceptual range of eye contact as the area surrounded by the red rectangle indicated in Fig. 8, following the finding by Uono and Hietanen that the range spans approximately four degrees from the center (Uono and Hietanen, 2015). We trained a clas-
sifier that determines whether the gaze of the subject
fell in the rectangle or not.
The HOG feature is a vector consisting of gradient histograms, while the intensity feature is a vector consisting of raw pixel values. For the evaluation, we performed ten-fold cross-validation for each subject.

Figure 8: Target points set for collecting the dataset.
Figure 9: Examples of images in the dataset.
Figure 10: Precision-recall curves of the gaze classification for the proposed method (HOG) and the baseline method (intensity).
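As a rough indication of how such a per-subject evaluation could be reproduced, the sketch below uses scikit-learn; the ten folds match the paper, but the stratified splitting, the linear SVM, and the use of decision scores for the precision-recall curve are our assumptions.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def precision_recall_for_subject(features: np.ndarray, labels: np.ndarray):
    """Ten-fold cross-validated SVM scores, then one precision-recall curve."""
    scores = np.zeros(len(labels), dtype=float)
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(features, labels):
        clf = SVC(kernel="linear").fit(features[train_idx], labels[train_idx])
        scores[test_idx] = clf.decision_function(features[test_idx])
    precision, recall, _ = precision_recall_curve(labels, scores)
    return precision, recall
```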
Fig. 10 shows the precision-recall curve of the
gaze classification. The proposed method achieved
higher accuracy than the baseline method. The base-
line method failed when a slight misalignment of
the segmentation of the eye area occurred. In con-
trast, the proposed method succeeded even when such a slight misalignment occurred, because the HOG feature can be extracted robustly under slight translation or rotation.
5 CONCLUSIONS
Since a Web camera is usually located outside the display while the user looks at his/her partner in the display, the two sides cannot establish eye contact with each other.
In this paper, we proposed a system for synthesizing eye contact using a single camera. The proposed system transforms the eye areas of a user only when the user's gaze falls within the range in which the partner would perceive eye contact.
The training phase may impose a troublesome task on the users. To alleviate this, the training images could be captured with an online gaze calibration method that uses click events during daily computer mouse use, as in (Sugano et al., 2015).
Our system runs at 5 fps for an input video with
a resolution of 1,280 × 960 pixels on a standard
consumer computer equipped with an Intel Core i7 3.59 GHz CPU and 8 GB of RAM. However, the system could be made faster by shrinking the input video size or parallelizing the process.
Our current system is not adapted for users wear-
ing glasses. Future work includes improving the gaze classification by introducing other features and implementing the proposed method on an actual video conferencing system.
ACKNOWLEDGEMENTS
Parts of this research were supported by MEXT,
Grant-in-Aid for Scientific Research.
REFERENCES
Anstis, S., Mayhew, J., and Morley, T. (1969). The percep-
tion of where a face or television ‘portrait’ is looking.
American J. of Psychology, 82(4):474–489.
Cortes, C. and Vapnik, V. N. (1995). Support-vector networks. Machine Learning, 20(3):273–297.
Dalal, N. and Triggs, W. (2005). Histograms of oriented
gradients for human detection. In Proc. of the 2005
IEEE Computer Society Conf. on Computer Vision
and Pattern Recognition, volume 1, pages 886–893.
Giger, D., Bazin, J., Kuster, C., Popa, T., and Gross, M.
(2014). Gaze correction with a single webcam. In
Proc. of the 2014 IEEE Int. Conf. on Multimedia and
Expo, pages 68–72.
Kollarits, R., Woodworth, C., and Ribera, J. (1995). An eye-
contact cameras/display system for videophone appli-
cations using a conventional direct-view LCD. In Di-
gest of 1995 SID Int. Symposium, pages 765–768.
Kuster, C., Popa, T., Bazin, J., Gotsman, C., and Gross, M.
(2012). Gaze correction for home video conferencing.
ACM Trans. on Graphics, 31(6):174:1–174:6.
Muhlbach, L., Kellner, B., Prussog, A., and Romahn, G.
(1985). The importance of eye contact in videotele-
phone service. In Proc. of the 11th Int. Symposium on
Human Factors in Telecommunications, number O-4,
pages 1–8.
Saragih, J., Lucey, S., and Cohn, J. (2011). Deformable
model fitting by regularized landmark mean-shift. Int.
J. of Computer Vision, 91(3):200–215.
Solina, F. and Ravnik, R. (2011). Fixing missing eye-
contact in video conferencing systems. In Proc. of the
33rd Int. Conf. on Information Technology Interfaces,
pages 233–236.
Sugano, Y., Matsushita, Y., Sato, Y., and Koike, H. (2015).
Appearance-based gaze estimation with online cal-
ibration from mouse operations. IEEE Trans. on
Human-Machine Systems, 45(6):750–760.
Uono, S. and Hietanen, J. (2015). Eye contact perception in
the West and East: A cross-cultural study. PLoS ONE,
10(2):e0118094.
Yang, R. and Zhang, Z. (2004). Eye gaze correction
with stereovision for video-teleconferencing. IEEE
Trans. on Pattern Analysis and Machine Intelligence,
26(6):956–960.
Yip, B. (2005). Face and eye rectification in video confer-
ence using affine transform. In Proc. of the 2005 IEEE
Int. Conf. on Image Processing, volume 3, pages 513–
516.