Digital Assisted Communication
Paula Escudeiro 1,2, Nuno Escudeiro 1,3, Marcelo Norberto 1,2, Jorge Lopes 1,2 and Fernando Soares 1,2
1 Departamento de Engenharia Informática - Instituto Superior de Engenharia do Porto, Porto, Portugal
2 GILT – Games, Interaction and Learning Technologies, Porto, Portugal
3 INESC TEC - Laboratory of Artificial Intelligence and Decision Support, Porto, Portugal
Keywords: Sign Language, Blind, Deaf, Kinect, Sensor Gloves, Translator.
Abstract: Communication with the deaf community can prove very challenging without the use of sign language. There is a considerable difference between sign and written language, as they differ in both syntax and semantics. The work described in this paper addresses the development of a bidirectional translator between several sign languages and their respective written forms, as well as the evaluation methods and results of those tools. A multiplayer game that uses the translator is also described in this paper. The translator from sign language to text employs two devices, the Microsoft Kinect and 5DT Sensor Gloves, in order to gather data about the motion and shape of the hands. This translator is being adapted to allow communication with the blind as well. The Quantitative Evaluation Framework (QEF) and ten-fold cross-validation were used to evaluate the project and show promising results. The product also goes through a validation process by sign language experts and deaf users, who provide their feedback by answering a questionnaire. The translator exhibits a precision higher than 90%, and the project's overall quality rating is close to 90% according to the QEF.
1 INTRODUCTION
Promoting equal opportunities and the social inclusion of people with disabilities is one of the main concerns of modern society and a key topic on the agenda of European Higher Education.
The emergence of new technologies, combined with the commitment and dedication of many teachers, researchers and the deaf community, is allowing the creation of tools that improve social inclusion and simplify communication between hearing-impaired people and the rest of society.
Despite all these efforts, there is still much to be improved in this regard. For example, in public services it is not unusual for a deaf citizen to need assistance to communicate with an employee. In such circumstances it can be quite difficult to establish communication. Another critical area is education. Deaf children have significant difficulties in reading due to difficulties in understanding the meaning of the vocabulary and the sentences. This fact, together with the lack of communication via sign language in schools, severely compromises the development of linguistic, emotional and social skills in deaf students.
The VirtualSign project intends to reduce the
linguistic barriers between the deaf community and
those not suffering from hearing disabilities.
The project is part of ACE (Assisted Communication for Education), which aims to improve the accessibility of communication for people with speech or hearing disabilities and also for the blind. ACE also encourages and supports the learning of sign language.
Sign language, like any other living language, is constantly evolving, effectively becoming a contact language with hearing people and increasingly being seen as a language of instruction and learning in different areas, a playful language in times of leisure, and a professional language in several areas of work (Morgado and Martins, 2009).
2 LINGUISTIC ASPECTS
Sign language involves a set of components that make it a rich and hard-to-decode communication channel. Although it is not as formal or as structured as written text, it offers a far more complex means of expression. When performing sign language, we must take into account a series of
parameters that define the manual and non-manual
components. The manual component includes:
Configuration of the hand. In Portuguese Sign Language there are a total of 57 identified hand configurations.
Orientation of the palm of the hand. Some pairs
of configurations differ only in the palm’s
orientation.
Location of articulation (gestural space).
Movement of the hands.
The non-manual component comprises:
Body movement. The body movement is
responsible for introducing a temporal context.
Facial expressions. The facial expressions add
a sense of emotion to the speech.
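To make these parameters concrete, the sketch below shows one possible way to represent the manual and non-manual components of a sign as a data structure. It is only an illustration in Python; the field names and values are assumptions and do not come from the VirtualSign implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ManualComponent:
    """Manual parameters of a sign (hypothetical field names)."""
    hand_configuration: int        # one of the 57 PSL hand configurations
    palm_orientation: str          # e.g. "up", "down", "towards signer"
    articulation_location: str     # region of the gestural space
    movement: List[str]            # sequence of movement primitives

@dataclass
class NonManualComponent:
    """Non-manual parameters of a sign (hypothetical field names)."""
    body_movement: Optional[str] = None      # e.g. a lean that adds temporal context
    facial_expression: Optional[str] = None  # e.g. "question", "negation"

@dataclass
class Sign:
    gloss: str
    dominant_hand: ManualComponent
    support_hand: Optional[ManualComponent] = None
    non_manual: NonManualComponent = field(default_factory=NonManualComponent)
```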
3 RELATED WORK
In the last two decades, a significant number of works have been published focusing on the development of techniques to automate the translation of sign languages, with a greater incidence on American Sign Language (Morrissey and Way, 2005), and on the introduction of serious games in the education of people with speech and/or hearing disabilities (Gameiro et al., 2014).
Several of the methods proposed to represent and recognize sign language gestures apply some of the main state-of-the-art techniques, involving segmentation, tracking and feature extraction, as well as the use of specific hardware such as depth sensors and data gloves. In one of these approaches, the collected data is classified by applying a random forests algorithm (Biau, 2012), yielding an average accuracy rate of 49.5%.
Cooper et al. (2011) use linguistic concepts to identify the constituent features of the gesture, describing the motion, location and shape of the hand. These elements are combined using HMMs for gesture recognition. The recognition rates of the gestures are in the order of 71.4%.
The project CopyCat (
Brashear et al., 2010) is an
interactive adventure and educational game with ASL
recognition. Colorful gloves equipped with
accelerometers are used in order to simplify the
segmentation of the hands and allow the estimation of
motion acceleration, direction and the rotation of the
hands. The data is classified using HMM, yielding an
accuracy of 85%.
ProDeaf is an application that translates Portuguese text or voice to Brazilian Sign Language (ProDeaf, 2016). This project is very similar to one of the main components used in the VirtualSign game, namely the text-to-gesture translation. The objective of ProDeaf is to ease communication between mute and deaf people by making digital content accessible in Brazilian Sign Language. The translation is performed by a 3D avatar that performs the gestures. ProDeaf already has over 130,000 users.
Showleap is a recent Spanish Sign Language translator (Showleap, 2016) that claims to translate sign language to voice and voice into sign language. Showleap uses the Leap Motion, a piece of hardware capable of detecting hands through two monochromatic IR cameras and three infrared LEDs, and also the Myo armband. This armband can detect arm motion, rotation and some hand gestures through electromyographic sensors that pick up electrical signals from the muscles of the arm. So far Showleap has published no precise translation results, and its creators claim that the product is 90% complete (Showleap, 2015).
Motionsavvy Uni is another sign language translator that makes use of the Leap Motion (Motionsavvy, 2016). This translator converts gestures into text and voice, and voice into text; text and voice are not converted into sign language with Uni. The translator has been designed to be built into a tablet. Uni claims to support 2,000 signs at launch and allows users to create their own signs.
Two undergraduate students at the University of Washington won the Lemelson-MIT Student Prize for creating a prototype of a glove that can translate sign language into speech or text (University of Washington, 2016). The gloves have sensors on the hands and wrists, from which information about hand movement and rotation is retrieved. There are no clear results yet, as the project is a recent prototype.
4 VirtualSign TRANSLATOR
VirtualSign aims to contribute to a greater social inclusion of the deaf through the creation of a bidirectional translator between sign language and text. In addition, a serious game was developed to assist in the process of learning sign language.
The project bundles three interlinked modules:
Translator of Sign Language to Text:
module responsible for the capture,
interpretation and translation of sign language
gestures to text. A pair of sensor gloves (5DT Data Gloves) provides input about the configuration of the hands, while the Microsoft
Kinect provides information about the
orientation and movement of the hands. Figure 1 shows its interface.
Figure 1: Sign to text translator interface.
Translator of Text to Sign Language (Figure 2): module responsible for the translation of text to sign language. The gestures are performed by an avatar based on a defined set of parameters created with the VirtualSign Studio (VSS). The VSS provides an interface for users to create the gestures; that information is stored on the VirtualSign server and reused for the translations.
Figure 2: Text to sign translator interface.
Serious Game: module responsible for the didactic aspects, which integrates the two modules described above into a serious game.
The system architecture as a whole has two
main components. The main component is the
game client that includes the game module and
the VirtualSign translator. Then there is the
Web Server component that hosts all the web
services needed for the game. Those web
services have access to the server database
where the players’ information is kept. The
game clients communicate with each other
using Unity network commands and with the
web server through HTTP requests.
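As an illustration of the client-server exchange described above, the following minimal sketch shows the kind of HTTP request a game client could issue to a player-information web service. It is written in Python for brevity (the actual client is a Unity application), and the endpoint and field names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://example.org/virtualsign/api"  # hypothetical server address

def save_player_score(player_id: str, level: int, score: int) -> dict:
    """POST a player's score to a (hypothetical) web service backed by the server database."""
    payload = json.dumps({"player": player_id, "level": level, "score": score}).encode()
    request = urllib.request.Request(
        f"{BASE_URL}/scores",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def get_player_info(player_id: str) -> dict:
    """GET the stored information for a player."""
    with urllib.request.urlopen(f"{BASE_URL}/players/{player_id}") as response:
        return json.load(response)
```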
5 SIGN LANGUAGE TO TEXT
The translation process between gesture and text is
done by combining the data received through the
Kinect and the data received from the 5DT gloves.
To simplify, we consider that a word corresponds
to a gesture in sign language.
A gesture comprises a sequence of configurations of the dominant hand, each possibly associated with a configuration of the support hand, together with a motion and orientation of both hands. Each element
of the sequence is defined as an atom of the gesture.
The beginning of a gesture is marked by the adoption
of a configuration by the dominant hand. In the case
of a configuration change, two scenarios may arise:
the newly identified configuration is an atom of the
sequence of the gesture in progress or the acquired
atom closes the sequence of the gesture in progress
and signals the beginning of a new gesture that will
start with the following atom.
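A simplified sketch of this segmentation logic is given below. The lexicon, the atom representation (here reduced to a dominant-hand configuration id) and the greedy prefix matching are assumptions made for illustration; they are not the VirtualSign implementation.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical gesture lexicon: word -> sequence of atoms. The real system also
# stores support-hand configuration, motion and orientation for each atom.
LEXICON: Dict[str, Tuple[int, ...]] = {
    "ola": (12, 12, 7),
    "adeus": (5, 19),
    "sol": (33,),
}

def is_prefix(seq: Tuple[int, ...]) -> bool:
    """True if some lexicon entry starts with `seq`."""
    return any(entry[: len(seq)] == seq for entry in LEXICON.values())

def lookup(seq: Tuple[int, ...]) -> Optional[str]:
    """Return the word whose atom sequence equals `seq`, if any."""
    for word, entry in LEXICON.items():
        if entry == seq:
            return word
    return None

def segment(atom_stream: List[int]) -> List[str]:
    """Greedy segmentation of an atom stream into gestures (simplified)."""
    words: List[str] = []
    current: Tuple[int, ...] = ()
    for atom in atom_stream:
        extended = current + (atom,)
        if is_prefix(extended):
            # The new atom continues (or closes) the gesture in progress.
            current = extended
        else:
            # The gesture in progress is complete; the new atom opens a new one.
            word = lookup(current)
            if word is not None:
                words.append(word)
            current = (atom,)
    word = lookup(current)
    if word is not None:
        words.append(word)
    return words

print(segment([12, 12, 7, 5, 19]))  # -> ['ola', 'adeus']
```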
5.1 Hand Configuration
When we speak of sign language, we must note that each country, and sometimes each region, has its own sign language. In Portuguese Sign Language (PSL) there are a total of 57 hand configurations, reduced to 42 for classification since 15 pairs differ only in orientation, as is the case with the letters M and W.
The configuration assumed by the hand is identified through classification, a machine learning setting in which one (possibly more) category from a set of pre-defined categories is assigned to a given object. A classification model is learned from a set of labelled samples. Then, this model is used to classify new samples in real time as they are acquired.
5.1.1 Hand Configuration Inputs
In order to obtain the necessary information to identify the configuration of each hand, 5DT data gloves (5DT, 2011) are used. Each glove has 14 sensors placed at specific locations over the joints of the hand, and data can be obtained at a rate of 100 samples per second.
To increase the robustness of the readings and reduce the weight of sensor noise, a set of sensor data is only kept (for further classification) if it remains stable for a pre-defined period of time after a significant change has been detected.
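A minimal sketch of this kind of stability filter is shown below; the window length, tolerance and averaging step are illustrative assumptions, not the values used by VirtualSign.

```python
from collections import deque
from typing import Deque, List, Optional

SAMPLE_RATE_HZ = 100     # the 5DT gloves deliver about 100 samples per second
STABLE_WINDOW_S = 0.3    # assumed minimum stability period
MAX_VARIATION = 0.05     # assumed per-sensor tolerance (sensor values in [0, 1])

class StabilityFilter:
    """Keep a 14-sensor glove reading only if it stays stable for a period."""

    def __init__(self) -> None:
        size = int(SAMPLE_RATE_HZ * STABLE_WINDOW_S)
        self.window: Deque[List[float]] = deque(maxlen=size)

    def push(self, sample: List[float]) -> Optional[List[float]]:
        """Add a raw sample; return a stable reading, or None if still unstable."""
        self.window.append(sample)
        if len(self.window) < self.window.maxlen:
            return None
        # Stable only if every sensor varied less than MAX_VARIATION over the window.
        for i in range(len(sample)):
            values = [s[i] for s in self.window]
            if max(values) - min(values) > MAX_VARIATION:
                return None
        # Average over the window to further reduce noise.
        return [sum(s[i] for s in self.window) / len(self.window)
                for i in range(len(sample))]
```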
5.1.2 Classification
The classifier is trained from labelled samples and then used to classify new samples in real time; once a sample is obtained, the data is passed through the classifier.
To improve the results of this classification process, the 42 hand configurations were divided into different groups. The differentiating factor between groups is the set of fingers with the greatest relevance for that group of configurations.
As seen in Figure 3, group 3 gathers the configurations similar to an open hand. Figure 4 shows group 2, which gathers the configurations related to a raised thumb.
Figure 3: Group 3 - Open hand related configurations.
Figure 4: Group 2 - Thumb raising hand configurations.
Three individuals, named A, B and C, performed the tests, each providing a dataset with 10 samples for each existing configuration, for a total of 1260 samples. These datasets were then crossed, using two of the individuals for training and the remaining one for testing, with a k-nearest neighbours (KNN) classifier (Zhang et al., 2006). To reduce the variance of our estimates we used 10-fold cross-validation.
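The sketch below shows how such an evaluation could be set up with scikit-learn. The feature matrix is simulated and the number of neighbours is an assumption, so it illustrates the procedure rather than reproducing the reported results.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Simulated glove data with the same shape as the real datasets:
# 3 signers x 42 configurations x 10 samples = 1260 rows of 14 sensor values.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(42), 30)                      # configuration labels
X = rng.normal(loc=y[:, None] / 42.0, scale=0.02, size=(1260, 14))

knn = KNeighborsClassifier(n_neighbors=3)             # assumed k, not the paper's setting

# 10-fold cross-validation to reduce the variance of the accuracy estimate.
scores = cross_val_score(knn, X, y, cv=10)
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```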
The following tables show the results of the tests performed.
Table 1: Correct prediction rate in the group classifier test.
A point to take into consideration is that intermediate (fake) configurations, which constitute only noise, may occur during the transition between two distinct configurations.
As an example, Figure 5 shows the transition from the configuration corresponding to the letter "S" to the configuration corresponding to the letter "B", where we obtain as noise an intermediate configuration that matches the hand configuration for the number "5" in PSL.
Figure 5: Transition from configuration S to configuration
B, through the intermediate configuration (noise) 5.
Intermediate configurations differ from the others in the time component, i.e., intermediate configurations have a shorter steady time, a consistent feature that may be used to distinguish between a valid configuration and a noisy, intermediate one. Thus, we use information about the dwell time of each configuration as a discriminating element, setting a minimum execution (steady) time below which configurations are considered invalid.
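The dwell-time criterion can be sketched as a filter over the stream of classified configurations, as below; the threshold value and the (label, timestamp) representation are assumptions for illustration.

```python
from typing import Iterable, Iterator, Tuple

MIN_DWELL_S = 0.25  # assumed minimum steady time for a valid configuration

def filter_intermediate(stream: Iterable[Tuple[int, float]]) -> Iterator[int]:
    """Yield configuration labels whose dwell time exceeds MIN_DWELL_S.

    `stream` is a sequence of (label, timestamp) pairs produced by the
    classifier; consecutive identical labels are merged, and short-lived
    (intermediate) configurations are discarded as noise.
    """
    current_label = None
    start_time = None
    last_time = None
    for label, t in stream:
        if label != current_label:
            if current_label is not None and (last_time - start_time) >= MIN_DWELL_S:
                yield current_label
            current_label, start_time = label, t
        last_time = t
    if current_label is not None and (last_time - start_time) >= MIN_DWELL_S:
        yield current_label

# Example: configuration "S" held, a brief "5" during the transition, then "B" held.
events = [(18, 0.0), (18, 0.4), (5, 0.45), (5, 0.5), (2, 0.55), (2, 1.0)]
print(list(filter_intermediate(events)))  # -> [18, 2]; the fleeting 5 is dropped
```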
5.2 Hand Motion and Orientation
To obtain information that allows characterizing the
movement and orientation of the hands we use the
Microsoft Kinect.
The skeleton feature allows tracking up to four
people at the same time, with the extraction of
characteristics from 20 skeletal points in a human
body, at a maximum rate of 30 frames per second.
Of the 20 points available only 6 are used, in
particular the points corresponding to the hands,
elbows, hip and head.
The information about the motion is only saved
when a significant movement happens, i.e. when the
difference between the position of the dominant hand
(or both hands), and the last stored position is greater
than a predefined threshold.
Therefore, when a significant movement is detected we save an 8-dimensional vector corresponding to the normalized coordinates of each hand (x_n, y_n, z_n) and the angle that characterizes its orientation. If the gesture is performed with the dominant hand only, the coordinates and angle of the support hand assume the value zero. The coordinates are normalized by subtracting the vector that defines the central hip position (x_a, y_a, z_a) from the vector that represents the hand position (x_m, y_m, z_m):

(x_n, y_n, z_n) = (x_m - x_a, y_m - y_a, z_m - z_a)
To obtain the orientation, the angular coefficient of the straight line passing through the hand and the elbow is computed.
In summary, for each configuration assumed by the dominant hand, a set of vectors characterizing the motion (position and orientation) of the hands is recorded.
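A compact sketch of how such a motion vector could be assembled from the Kinect skeleton points is given below. The movement threshold and the use of atan2 over the x-y plane for the hand-elbow slope are assumptions made for illustration, not the VirtualSign implementation.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]
MOVE_THRESHOLD = 0.05  # assumed minimum displacement required to record motion

def normalize(hand: Point, hip: Point) -> Point:
    """Normalized hand coordinates: hand position minus central hip position."""
    return (hand[0] - hip[0], hand[1] - hip[1], hand[2] - hip[2])

def orientation(hand: Point, elbow: Point) -> float:
    """Angle of the straight line through elbow and hand (taken in the x-y plane)."""
    return math.atan2(hand[1] - elbow[1], hand[0] - elbow[0])

def significant_move(current: Point, last_stored: Point) -> bool:
    """Record motion only when displacement exceeds the threshold."""
    return math.dist(current, last_stored) > MOVE_THRESHOLD

def motion_vector(dominant_hand: Point, dominant_elbow: Point, hip: Point,
                  support_hand: Optional[Point] = None,
                  support_elbow: Optional[Point] = None) -> Sequence[float]:
    """Build the 8-dimensional vector: (x, y, z, angle) for each hand."""
    xd, yd, zd = normalize(dominant_hand, hip)
    ad = orientation(dominant_hand, dominant_elbow)
    if support_hand is None or support_elbow is None:
        # One-handed gesture: the support-hand components are set to zero.
        return (xd, yd, zd, ad, 0.0, 0.0, 0.0, 0.0)
    xs, ys, zs = normalize(support_hand, hip)
    a_s = orientation(support_hand, support_elbow)
    return (xd, yd, zd, ad, xs, ys, zs, a_s)
```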
5.3 Evaluation
The main performance focus is the capacity of the model to correctly predict the words being represented in sign language.
This evaluation uses 15 words with a total of 2260 samples, of which only 750 were used to train the SVM (Steinwart and Christmann, 2008). Each of the 15 words has 50 training examples; the same number was used for all of them to ensure a uniform distribution of the classes.
These gestures were obtained from different users. Recall measures the ability to retrieve all instances of a particular class, and precision is the percentage of correct predictions for that class.
The classes (words) used are “Olá”(1), “Adeus”(2),
“Sorrir”(3), “Segredo”(4), “Floresta”(5), “Sol”(6),
“Flor”(7), “Aluno”(8), “Escola”(9), “Casa”(10),
“Aulas”(11), “Desenho”(12), “Amigo”(13),
“Pais”(14) and “Desporto”(15).
The confusion matrix shown in Table 2 clearly demonstrates the effectiveness of the classifier, which achieves an average precision above 99% for this test scenario.
Table 2: Confusion Matrix.
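A sketch of how a comparable test could be run with scikit-learn is shown below. The gesture features are simulated and the SVM hyperparameters are assumptions, so the numbers it prints do not reproduce the paper's results; it only illustrates the training split and the per-class precision and recall computation.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Simulated gesture feature vectors: 15 classes (words), evenly distributed.
rng = np.random.default_rng(1)
n_classes, n_per_class, n_features = 15, 150, 32
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(n_per_class, n_features))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

# 50 training examples per class (750 in total); the rest is used for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=50 * n_classes, stratify=y, random_state=0)

clf = SVC(kernel="rbf", C=1.0)  # assumed kernel and regularization
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=3))  # per-class precision/recall
```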
6 TEXT TO SIGN LANGUAGE
The translation of any text to sign language is quite a demanding task due to specific linguistic aspects.
Like any other language, sign language has its own grammatical aspects that must be taken into consideration. Those details are taken into account in the VSS.
A server-based syntax and semantics analyser is being developed to improve the accuracy of the translation. Deaf people usually have a hard time reading and writing the written language. The avatar used for the translations, as well as its hand animations, was created with Autodesk Maya.
Figure 2 shows the avatar body, which has proportions identical to a human one in order to perform the gestures as accurately as possible.
Other aspects were also taken into consideration, such as the background, which must contrast with the avatar. That contrast is needed so that all the gestures and movements can be easily understood by the deaf.
6.1 Structure
The text to sign language translator module is divided into several parts. In order to improve the distribution and efficiency of all the project components, their interconnections were carefully planned.
The connection to the Kinect and the data gloves is based on sockets. This protocol is also used by the PowerPoint add-in: the text from the presentation is sent to the translator, and the avatar translates it into sign language. The add-in sends each word on the slide, highlighting it and waiting for a reply before continuing, so that the user knows what is being translated at each moment.
The database contains all the parameters and the corresponding text. During the translation process the application searches the database for the input word. When the word is found, the parameters containing the data required for the avatar to perform the translation are sent to the application. If there is no match in the database, the avatar translates the word letter by letter.
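The lookup-with-fallback behaviour can be sketched as follows, using a hypothetical in-memory dictionary in place of the VirtualSign database; the structure of the stored parameters is an assumption made for illustration.

```python
from typing import Dict, List

# Hypothetical gesture database: word -> animation parameters for the avatar
# (placeholders stand in for the parameters created with VirtualSign Studio).
GESTURE_DB: Dict[str, dict] = {
    "ola": {"animation": "gesture-ola"},
    "escola": {"animation": "gesture-escola"},
}

# Fingerspelling parameters, one entry per letter of the manual alphabet.
LETTER_DB: Dict[str, dict] = {
    letter: {"animation": f"letter-{letter}"} for letter in "abcdefghijklmnopqrstuvwxyz"
}

def translate_word(word: str) -> List[dict]:
    """Return the animation parameters the avatar should perform for one word."""
    key = word.lower()
    if key in GESTURE_DB:
        return [GESTURE_DB[key]]                          # known sign: one gesture
    return [LETTER_DB[c] for c in key if c in LETTER_DB]  # no match: spell letter by letter

def translate_text(text: str) -> List[dict]:
    """Translate a sentence word by word, as the PowerPoint add-in sends it."""
    result: List[dict] = []
    for word in text.split():
        result.extend(translate_word(word))
    return result

print(translate_text("ola escola sol"))  # "sol" is spelled out in this toy lexicon
```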
6.1.1 Architecture
Figure 6 shows the two main modules, which carry out the steps needed in the translation process.
The first module (text recognition) converts the written text into signs, which are represented by an animated character.
The second module translates the gestures of sign
language into text. In this process we used two
devices: The Kinect for motion recognition and the
gloves for the recognition of static hand
configurations.
Figure 6: Translator architecture.
6.2 VirtualSign Studio
The VSS is the tool with which users can create the gestures for each word. The VSS main interface is shown in Figure 7.
Figure 7: VirtualSign Studio interface.
In order to use this application, the user only needs to enter a username and password to log in. There are two types of users: editors and validators.
The editor is capable of creating words to be introduced into the database. To create the words, the editor has access to all the linguistic aspects represented by the avatar. It is also possible to import a word that has already been created.
The validator can validate words: the validator imports words that need to be validated and can then use the preview system to check whether a word is correct. If it is, the validator only needs to click the validate button, and from that moment on the word is used to translate texts.
The validator can also set a word to a review state so that the editor can fix it.
6.3 Text to Sign Results
Anonymous questionnaires were used to obtain information about the quality of the translation between gesture and text in both directions. The text-to-gesture module already has a large database with around 600 words.
Figure 8: Text to sign results.
Figure 8 presents the results obtained for the precision of the 600 gestures, which were analysed by four PSL specialists. The ages of the specialists who answered these questionnaires ranged from 39 to 64 years.
7 SERIOUS GAME
Digital games provide a remarkable opportunity to overcome the lack of educational digital content available for the hearing-impaired community. Playing a game, as the name suggests, has a strong leisure component that cannot be found in conventional educational means. Educational game researcher James Gee (Gee, 2003) shows how good game designers manage to get new players to learn their long, complex, and difficult games.
Two games were created for the project.
The first is a single-player game in which the player controls a character that interacts with various objects and non-player characters, with the aim of collecting several gestures from Portuguese Sign Language (Escudeiro et al., 2014). It is played in a first-person view in which the player controls a character on the map. Each map represents a level, and each level has several objects representing signs scattered across the map for the player to interact with.
The second game is a multiplayer, first-person puzzle game that requires two players to cooperate in order to get through the game.
7.1 Game Concept
VirtualSign games aim to be an educational and social integration tool that improves users' sign language skills and allows them to communicate with
others using those skills.
The game is a first-person puzzle game in which the puzzles are based on simple objects such as cubes and spheres. The player is motivated to solve the puzzles using the surrounding environment. The objects can be moved by the user, and there are also interactable objects such as buttons and switches. However, some of those objects are only accessible to one of the players, creating the need to cooperate with each other in order to finish each level.
After the players complete a level, their completion time is registered in an in-game worldwide ranking table alongside other players' times, to encourage competition. There is also a personal score achieved by each player, so there is competition between the two players as well.
Communication between the players is shown as gestures in real time, and there is also a text chat where previous messages can be accessed. Figure 9 shows the input application and the avatar translating after receiving a message.
Figure 9: VirtualSign Cooperation game interface during
gameplay.
7.2 Game Results
Ten testers (8 male and 2 female), aged between 23 and 65, tested the game and answered surveys. However, due to the need to use the VirtualSign translator, only 5 of those testers were able to try that feature and answer the corresponding question. A link with the form and instructions was given to the testers, and the forms were answered anonymously. The survey for the beta test phase had 20 questions, of which the last was an optional written answer about possible improvements to the game.
The questions graded the game between 1 and 5
with 1 being the worst and 5 being the best. The final
average was 4.5 out of 5 at the end of the testing
phase.
8 VISUALLY IMPAIRED INCLUSION
Besides the social inclusion of the deaf with the
VirtualSign translator, the ACE project also aims to
reach visually impaired people. By using voice as input and output for the translator, ACE aims to provide the means for a deaf person to talk with a blind one and vice versa.
So far, a game for the blind is under development within ACE, which gives the player feedback through sound. The voice features are also being integrated into VirtualSign in order to achieve ACE's final goal.
9 CONCLUSIONS
The VirtualSign system is prepared to work with several distinct sign languages, making it possible for deaf people from different countries to understand each other.
The machine learning techniques used to process the inputs from the Kinect and the 5DT Gloves are able to identify the signs being represented with high accuracy.
VirtualSign can be applied in places with public attendance to facilitate communication between deaf and non-deaf people. It is naturally accepted that
having assistance to understand sign language in
places like fire departments, police stations,
restaurants, museums and airports, among many
others, will be of clear added value in the promotion
of equal opportunities and social inclusion of the deaf
and hearing impaired.
The VirtualSign translator was tested by several
users. The estimated accuracy of the conversion from
gestures to text reaches values of 97%.
As future work on the gesture-to-text translator, a system with the same precision and accuracy that does not require gloves is being planned.
An intermediate semantic layer between sign language and written text is also being created. Semantics and syntax are very important because the translation between the two languages is not direct.
Voice recognition and speech synthesis for communication with the blind are also being developed, as mentioned before, and a game for the blind is under development. We also aim to create a way for both the deaf and the blind to create digital arts through an application using all the progress made so far.
ACKNOWLEDGEMENTS
This work was supported by FCT - ACE - Assisted Communication for Education (ref: PTDC/IVC-COM/5869/2014) and by the European Union - Erasmus+ I-ACE - International Assisted Communication for Education (2016-1-PT01-KA201-022812).
REFERENCES
Morgado, M., Martins, M., 2009. Língua Gestual Portuguesa. Comunicar através de gestos que falam é o nosso desafio, neste 25.º número da Diversidades, rumo à descoberta de circunstâncias propiciadoras nas quais todos se tornem protagonistas e interlocutores de um diálogo universal, 7.
Morrissey, S., Way, A., 2005. An example-based approach
to translating sign language.
Gameiro, J., Cardoso, T., Rybarczyk, Y., 2014. Kinect-
Sign, Teaching sign language to “listeners” through a
game. Procedia Technology, 17, 384-391.
Biau, G., 2012. Analysis of a random forests
model. Journal of Machine Learning
Research, 13(Apr), 1063-1095.
Cooper, H., Pugeault, N., & Bowden, R., 2011. Reading the
signs: A video based sign dictionary. In Computer
Vision Workshops (ICCV Workshops), 2011 IEEE
International Conference on (pp. 914-919). IEEE.
Steinwart, I., & Christmann, A., 2008. Support vector
machines. Springer Science & Business Media.
Brashear, H., Zafrulla, Z., Starner, T., Hamilton, H., Presti,
P., & Lee, S., 2010. CopyCat: A Corpus for Verifying
American Sign Language During Game Play by Deaf
Children. In 4th Workshop on the Representation and
Processing of Sign Languages: Corpora and Sign
Language Technologies.
ProDeaf, 2016. Solutions. [Online] Available at:
http://www.prodeaf.net/en-us/Solucoes [Accessed
December 2016].
Showleap, 2015. Showleap blog. [Online] Available at: http://blog.showleap.com/2015/12/2016-nuestro-ano/ [Accessed December 2016].
Showleap, 2016. Showleap. [Online] Available at: http://www.showleap.com/ [Accessed December 2016].
Motionsavvy, 2016. Motionsavvy Uni. [Online] Available
at: http://www.motionsavvy.com/ [Accessed December
2016].
University of Washington, 2016. UW undergraduate team
wins $10,000 Lemelson-MIT Student Prize for gloves
that translate sign language. [Online] Available at:
http://www.washington.edu/news/2016/04/12/uw-
undergraduate-team-wins-10000-lemelson-mit-
student-prize-for-gloves-that-translate-sign-language/
[Accessed December 2016].
5DT, 2011. Fifth Dimension Technologies data gloves. [Online] Available at: http://www.5dt.com/?page_id=34.
Zhang, H., Berg, A. C., Maire, M., & Malik, J., 2006. SVM-
KNN: Discriminative nearest neighbor classification
for visual category recognition. In Computer Vision
and Pattern Recognition, 2006 IEEE Computer Society
Conference on (Vol. 2, pp. 2126-2136).
IEEE.
Gee, J. P., 2003. What video games have to teach us about
learning and literacy. Computers in Entertainment
(CIE), 1(1), 20-20.
Escudeiro, P., Escudeiro, N., Reis, R., Barbosa, M., Bidarra, J., Baltasar, A. B., ... & Norberto, M., 2014.
Virtual sign game learning sign language. In A.
Zaharim, & K. Sopian (Eds.), Computers and
Technology in Modern Education, ser. Proceedings of
the 5th International Conference on Education and
Educational technologies, Malaysia.