SOLUTIONS FOR SPEEDING-UP ON-LINE DYNAMIC SIGNATURE AUTHENTICATION

Valentin Andrei, Sorin Mircea Rusu and Ştefan Diaconescu

Research & Development Department, SOFTWIN S.R.L., Sos. Pipera-Tunari, Bucharest, Romania

Keywords: On-line Authentication, Biometrics, Dynamic Signature, FPGA Processing, GPU Computing, Levenshtein Algorithm, Distributed Computing.

Abstract: This article presents a study, with experimental results, of methods for speeding up dynamic handwritten signature authentication in an on-line authentication system. We describe 3 solutions based on parallel computing: a 16 processor server, an FPGA development board and a graphics card built on nVidia CUDA technology. For each solution, we detail how it can be integrated into an authentication provider system and we specify its advantages and disadvantages.

1 INTRODUCTION

The evolution of biometry in the last few years was predictable, given the increasing amount of sensitive data, such as bank account information or documentation of new technologies, which needed to be stored in databases, safe from any attack attempt. The Internet held more and more valuable information, so the need for high security kept increasing.

Biometry appeared relatively recently and uses unique characteristics of a person in order to secure and authenticate his actions. Some of these characteristics are the iris, the fingerprint, the signature, the voice and the face anatomy. One of the least intrusive methods of biometric authentication is based on the dynamic handwritten signature. The term does not refer only to the image drawn on a piece of paper, but mainly to the movement of the owner's hand. The image can be copied, but the movement of the hand is almost impossible to reproduce. This characteristic belongs to behavioural biometrics because a person changes his way of signing over the years. Due to this particularity, the offered security level is very high.

In the last 2 years, we have focused our research on using the dynamic signature for on-line authentication (Marcu, 2009). We have built a web-service capable of securing internet applications that need authentication. It can be used as an extra security layer in any web-application that provides authentication services. The problem that appears is providing a short response time when the server is overloaded with numerous requests. In this article we describe 3 solutions that can help solve this problem and offer a short authentication time for a reasonable number of clients accessing the service simultaneously.

2 AUTHENTICATION

In order to verify whether a signature is genuine, some processing needs to be done. First of all, the user needs to input 5 signatures, which will be considered specimens; every new signature will be compared to them.

An electronic device is used to capture the signature. This device is an intelligent pen that contains 2 MEMS accelerometers and an optical navigation system (ONS), and is able to extract the user's hand movement and to transmit it via a USB port. At a sampling rate of 1000 Hz, the pen transmits the hand acceleration values on 2 axes and the data extracted by the ONS. On the PC, the raw signals can be stored in CSV files for analysis.

These files contain a fair amount of redundant data; therefore, in order to make the process more efficient, we need to apply a compression method.

Andrei V., Rusu S. and Diaconescu Ş. (2010).
SOLUTIONS FOR SPEEDING-UP ON-LINE DYNAMIC SIGNATURE AUTHENTICATION.
In Proceedings of the 12th International Conference on Enterprise Information Systems - Databases and Information Systems Integration, pages 121-126
DOI: 10.5220/0002870901210126
SciTePress

For this reason we have developed a set of signature recognition algorithms (SRA), which build a set of invariants from the raw signals. The following picture represents the signature processing. The electronic pen captures the hand movement and transmits the raw signals through the driver to the SRA block.

Figure 1: Signature processing steps.

In order to decide whether a signature is authentic, we must compute a distance between the invariants extracted from the given signature and the ones extracted from the user's specimens. The distance is computed by using the Levenshtein algorithm. The resulting distance is then passed to a classifier that gives the final answer on the authenticity of the signature (Marcu, 2009). This classifier can be a threshold based decision system, a neural network, etc. The next picture shows the dynamic signature authentication process.

Figure 2: Signature authentication process.

As mentioned before, we have built a web-service capable of providing biometric authentication based on the dynamic handwritten signature. The acquired signature is sent to the web-service, where it is compared with the stored specimens. The service only sends back the response, telling whether the given signature is genuine or forged. If the signature is genuine, the user receives access to his account. Given this context, a proportion of the authentication time belongs to the transfer operation between the client and the web-service. However, we are interested in accelerating the most time consuming operation of the whole system.

The following diagram shows the main operations performed in order to offer the client access to his account data based on his dynamic signature.

Figure 3: On-line signature authentication operations.

We have measured the time consumed by each operation in order to find the component that needs to be optimized. The following table shows the percentage of the authentication time consumed by each system block.

Table 1: Time consumed by each operation of the authentication process.

| Operation | Time (% of total) |
| --- | --- |
| Signature acquisition | < 1% |
| Transfer time | < 2% |
| Invariants computation | < 5% |
| Distances computation | > 92% |

As we can see from the previous table, distance computation takes by far the greatest share of the time. Therefore we need to compute the distances as fast as we can, in order to provide the client with a reasonable authentication time.
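To give a feeling for how much the whole authentication can gain when only the distance block is accelerated, the following sketch applies Amdahl's law to the figures of Table 1. It is an illustration only: the function name is hypothetical, and the accelerated share is assumed to be exactly 92%, whereas the table only states that it is above 92%.

```python
def overall_speedup(accelerated_fraction, factor):
    """Amdahl's law: speedup of the whole process when one part becomes `factor` times faster."""
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / factor)


# Assumption: distance computation is exactly 92% of the authentication time.
# Table 1 gives "> 92%", so the real overall gain can only be larger.
for factor in (2, 10, 100):
    print(f"distance block {factor}x faster -> whole process "
          f"{overall_speedup(0.92, factor):.2f}x faster")
```

Under this assumption the remaining 8% of the time bounds the overall gain, which is why the other operations of Table 1 are left untouched in this study.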

The Levenshtein algorithm is based on the following recurrence:

D[i, j] = Minimum(D[i-1, j] + Deletion Cost, D[i, j-1] + Insertion Cost, D[i-1, j-1] + Substitution Cost)    (1)

It uses a matrix of (N + 1) x (M + 1) cells, where N and M are the lengths of the strings being compared; row 0 and column 0 hold the cost of transforming an empty prefix. The distance between the 2 strings is found in the cell D[N, M].
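As an illustration, the standard Levenshtein recurrence can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used in our system: the function name and the cost parameters are hypothetical, unit costs are assumed by default, and the substitution cost is assumed to be zero when the two symbols are equal.

```python
def levenshtein(a, b, deletion_cost=1, insertion_cost=1, substitution_cost=1):
    """Distance between two symbol sequences using the full (N + 1) x (M + 1) matrix."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]

    # Row 0 and column 0: cost of building or erasing a prefix from nothing.
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + deletion_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + insertion_cost

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumption: equal symbols are substituted at no cost.
            sub = 0 if a[i - 1] == b[j - 1] else substitution_cost
            d[i][j] = min(d[i - 1][j] + deletion_cost,   # delete a[i-1]
                          d[i][j - 1] + insertion_cost,  # insert b[j-1]
                          d[i - 1][j - 1] + sub)         # substitute or match
    return d[n][m]


print(levenshtein("kitten", "sitting"))
```

The sequences can be strings or lists of integers, so the same function also accepts strings of 32 bit symbols.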

ICEIS 2010 - 12th International Conference on Enterprise Information Systems

A single instance of the algorithm offers few opportunities for parallelization. Parallelizing it is possible by using systolic arrays (Hoang, 1993), but at the cost of increased implementation difficulty. If the strings are long, parallelizing the algorithm itself is worth implementing. If they are short, computing several distances in parallel on several processing units is more advantageous.
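The limited parallelism inside one instance comes from the data dependencies: a cell needs its upper, left and upper-left neighbours, so only the cells lying on the same anti-diagonal of the matrix can be computed at the same time. The sketch below walks the matrix in that order. It is a sequential Python illustration with unit costs, not the systolic design of (Hoang, 1993), and the function name is hypothetical.

```python
def levenshtein_wavefront(a, b):
    """Levenshtein distance computed anti-diagonal by anti-diagonal (unit costs)."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j

    # All cells with i + j == k form one anti-diagonal. They depend only on
    # anti-diagonals k - 1 and k - 2, so the inner loop has no dependencies
    # between its iterations and could run on parallel processing elements.
    for k in range(2, n + m + 1):
        first_row = max(1, k - m)
        last_row = min(n, k - 1)
        for i in range(first_row, last_row + 1):
            j = k - i
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[n][m]


print(levenshtein_wavefront("flaw", "lawn"))
```

An anti-diagonal never holds more than min(N, M) cells, and the first and last ones hold far fewer, which is one reason why this form of parallelism pays off mainly for long strings.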

We have chosen to distribute the distance computing tasks over several processing units. We have studied the possibility of using the following:

- a 16 processor server;
- an FPGA development board;
- a video card using nVidia CUDA technology.

3 ACCELERATING

As stated before, we need to start several instances of the Levenshtein algorithm on multiple processing units. Because our work focuses only on speeding up the distance computing block, we have built a block that generates strings of symbols to be passed as input data to the processing units. We have measured the time it took for the whole block of data to be processed and we compared the 3 solutions. The following figure gives an overview of the evaluation procedure.

Figure 4: Time evaluation procedure.

The strings generator builds a queue of string pairs. Each pair serves as input for one instance of the Levenshtein algorithm. This queue is a symbol matrix of (2 x N) x M cells, where N represents the number of distances that need to be computed and M represents the length of the strings being compared. In the real system we provide a queuing mechanism for the authentication requests. The request queue is very similar to the matrix generated by the strings generator, so the estimations based on the results of the 3 solutions should be close to the real behaviour.
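A possible shape of such a generator is sketched below. The layout is an assumption made for the example: rows 2k and 2k + 1 of the matrix are taken to form pair k, the symbols are random 32 bit integers, and all names are hypothetical.

```python
import random


def generate_pairs_matrix(n_distances, length, symbol_bits=32, seed=0):
    """Build a (2 x N) x M matrix of random symbols; a fixed seed keeps runs repeatable."""
    rng = random.Random(seed)
    return [[rng.getrandbits(symbol_bits) for _ in range(length)]
            for _ in range(2 * n_distances)]


def pair(matrix, k):
    """Return the two strings of pair k (assumed to be rows 2k and 2k + 1)."""
    return matrix[2 * k], matrix[2 * k + 1]


# The workload used in the measurements: 2048 pairs of 750 symbols each.
matrix = generate_pairs_matrix(2048, 750)
first, second = pair(matrix, 0)
print(len(matrix), len(first), len(second))
```

Keeping the two strings of a pair in adjacent rows lets every processing unit locate its input from the pair index alone.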

3.1 Speeding-up the Authentication Process using a 16 Processor Server

The goal is to use the server's hardware capabilities at full power in order to achieve the best time. We have done this by starting a number of threads in our main program and by assigning them a high priority among the tasks of the operating system.

First of all, we needed to find the optimum number of threads in order to minimize the overhead. For that, we have computed 2048 distances between strings of 750 symbols of 32 bits each. We have distributed these tasks to a variable number of threads and measured the computing time. If the number of threads is too small, the computing power of the server is not fully used, which results in a longer computing time. If the number of threads is too high, the generated overhead also leads to a significant time increase. The average computing time for each number of threads is summarized in the following table.

Table 2: Choosing the optimum thread number.

| Number of threads | Duration (milliseconds) |
| --- | --- |
| 8 | 2281 |
| 14 | 1328 |
| 16 | 1412 |
| 32 | 1515 |

This behaviour was predictable because the server has 16 processors, so the optimum number of threads should be near 16. The number of threads that gives the best results is 14. In this case 14 of the system's processors are used at full power and the remaining 2 processors are left for the vital tasks of the operating system.
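The measurement procedure can be sketched as follows: the same list of string pairs is handed to a pool with a configurable number of worker threads and the elapsed time is recorded. This is only a Python illustration with hypothetical names and a much smaller workload than the one measured above. Note that CPython threads share one interpreter lock, so this sketch shows the structure of the experiment and will not reproduce the timings of Table 2.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def levenshtein(a, b):
    """Unit-cost Levenshtein distance keeping only two rows of the matrix."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def time_with_threads(pairs, n_threads):
    """Compute all distances on n_threads workers; return (distances, elapsed ms)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        distances = list(pool.map(lambda p: levenshtein(p[0], p[1]), pairs))
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return distances, elapsed_ms


# Reduced workload (assumption): 32 pairs of 60 symbols instead of 2048 x 750.
rng = random.Random(0)
pairs = [([rng.getrandbits(32) for _ in range(60)],
          [rng.getrandbits(32) for _ in range(60)]) for _ in range(32)]
for n_threads in (1, 2, 4, 8):
    _, ms = time_with_threads(pairs, n_threads)
    print(f"{n_threads} threads: {ms:.1f} ms")
```

Whatever the number of workers, the returned distances are identical; only the elapsed time changes, which is the quantity compared in Table 2.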

In order to obtain a short response time, the processor and memory frequencies should be as high as possible. If the memory frequency is too low, reading from and writing to memory takes a long time, seriously slowing down the process. Each processor has a caching mechanism, but the generated data cannot all fit into the cache memory, which is why the RAM frequency is an important element. The on-line authentication system using the first solution is drawn in the following image:

Figure 5: Authentication system using a 16 processor server.

The system hosts a web-service that receives all the clients' requests, and all the distance computations are done inside the server. The web-service can also be hosted on a different station. The disadvantage of such a system is that a powerful server is very expensive and is intended for general usage, rather than exclusively for authentication purposes.

Therefore we need a solution that is cheaper and designed specifically to accelerate the authentication process.

3.2 Accelerating the Authentication Process using an FPGA Development Board

FPGA development boards have become more and more widely used lately, especially where computation power is needed, in systems with an increased computational complexity. Devices composed of several FPGA boards are used, for example, in biotechnology for sequence alignment (Hoang, 1993), protein matching and docking, and in networking devices (Mohd, 2008).

The acronym FPGA stands for Field Programmable Gate Array: a chip that contains a matrix of elementary electronic circuits, which can be combined to build a complex function. A digital circuit designer can implement inside an FPGA a processing unit that performs one specific algorithm. He designs the system's blocks and describes them in a hardware description language. After that, a synthesis tool, such as the ones provided by Xilinx, is used to implement the new system inside the FPGA chip.

One key element for a system working on an FPGA is the achieved clock frequency. If it is high, the system will be efficient. To increase the frequency, the delay from a combinational circuit's input to its output should be as small as possible. A combinational circuit is one that does not use a clock input and performs its function asynchronously.

One feature the FPGA board should have is block-RAM or, if possible, external RAM blocks. The block-RAM consists of the RAM cells located inside the FPGA chip, and its access frequency is the same as the system's working frequency. To use external memory chips, frequency adapters are needed and these blocks cannot always be used at full speed.

The number of elementary circuits inside the FPGA is also crucial. If the chip has a large number of gates, we can implement several processing units inside a single chip, working in parallel. The following picture describes the general architecture of a system composed of several Levenshtein processing units implemented on an FPGA chip.

Figure 6: General architecture of a system performing multiple Levenshtein comparisons on an FPGA board.

The system can compute multiple Levenshtein distances in parallel on a single chip. If we need more speed, we can build a custom hardware device composed of several FPGA boards that work independently, controlled by a processor. However, this solution raises serious implementation problems, so it is preferable to simply use FPGAs connected to a single computer that distributes processing tasks to them.

The following table presents the time needed for a number of comparisons, depending on the system's frequency. We assumed that our FPGA board hosts 10 processing units. We calculated the time needed to compute 2050 distances between strings of 750 symbols of 32 bits each. We intend to compare the 3 proposed solutions, so the number of comparisons should be the same for all systems; since 2050 is close to 2048, we treat the two workloads as equivalent.

Table 3: Computing duration using an FPGA board.

| System frequency | Bus | Duration (ms) |
| --- | --- | --- |
| 50 MHz | USB 2.0 | 22700 |
| 100 MHz | USB 2.0 | 11448 |
| 150 MHz | USB 2.0 | 7700 |
| 200 MHz | USB 2.0 | 5800 |
| 300 MHz | USB 2.0 | 4500 |
| 300 MHz | PCI Express | 4200 |

A frequency of 200 MHz is achievable even with a low cost FPGA, such as a Spartan 3. Higher performance devices such as the Virtex 5 can achieve an even higher frequency. However, to reach this goal many optimizations need to be done, such as adding pipeline stages in the combinational circuits that limit the frequency.

We can also see that the bus used does not considerably influence the response time. An on-line authentication system using FPGA boards to compute the distances is described in the following figure. Besides the Levenshtein processors, a bus controller must be implemented inside the FPGA, in order to synchronize the system with the computer's bus and to perform the data transfer.

Figure 7: Authentication system using multiple FPGA boards.

3.3 Accelerating the Authentication Process using CUDA Enabled Video Cards

A solution proposed in the last few years for compute-intensive processes is the use of graphics cards. nVidia launched the CUDA architecture together with a software development kit, which allows programmers to use the graphics card's capabilities at full power (nVidia 2008). For example, a developer familiar with the C language can easily write C code for CUDA by respecting a number of conventions, and that code will be executed by the graphics card. CUDA is well suited to algorithms that can be parallelized, such as matrix multiplication. In the proposed solution we use the graphics card by launching multiple instances of the Levenshtein algorithm that compute in parallel.

The CUDA architecture is described in the following image (nVidia 2008). Each graphics chip is composed of a number of multiprocessors, each containing 8 streaming processors. Fast shared on-chip memory is available, and the board also offers a large amount of externally connected RAM, called device memory.

Figure 8: nVidia CUDA architecture.

The device memory is the slowest RAM available. The streaming processors can read from and write to it, but at a great time cost. The shared memory is the best choice when time is critical. Each streaming processor also has a number of fast registers that can be used for optimisations.

We used the same evaluation method as for the dedicated server solution, by generating a matrix of input strings. We have copied the matrix into the device memory and then launched threads on each streaming processor of each multiprocessor. The operations needed to perform the comparisons are listed below:

- Start Timer
- Copy strings matrix into device memory
- Fill the shared memory of each processor
- Launch Levenshtein algorithm on several threads
- Copy distances into device memory
- Copy distances from the device memory into the system's RAM
- Stop Timer
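The sequence of operations above can be mimicked on the CPU to show how each thread finds its own pair of strings. The sketch below is a plain Python emulation, not CUDA code: the loops over blocks and threads stand in for the kernel launch, list copies stand in for the transfers to and from the device memory, the shared memory step is omitted, and all names as well as the row layout (pair k in rows 2k and 2k + 1) are assumptions.

```python
import time


def levenshtein(a, b):
    """Unit-cost Levenshtein distance keeping only two rows of the matrix."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def kernel(block_idx, thread_idx, block_dim, matrix, distances):
    """Work of one emulated thread: one Levenshtein instance on one pair."""
    tid = block_idx * block_dim + thread_idx   # global thread index
    if tid >= len(distances):                  # surplus threads of the last block
        return
    distances[tid] = levenshtein(matrix[2 * tid], matrix[2 * tid + 1])


def run_batch(matrix, block_dim=8):
    n_pairs = len(matrix) // 2
    start = time.perf_counter()                        # Start Timer
    device_matrix = [list(row) for row in matrix]      # copy matrix "to the device"
    device_distances = [0] * n_pairs
    grid_dim = (n_pairs + block_dim - 1) // block_dim  # blocks needed to cover all pairs
    for block_idx in range(grid_dim):                  # on a GPU these loops run in parallel
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, device_matrix, device_distances)
    distances = list(device_distances)                 # copy distances "back to the host"
    elapsed_ms = (time.perf_counter() - start) * 1000.0  # Stop Timer
    return distances, elapsed_ms


distances, elapsed_ms = run_batch(["kitten", "sitting", "flaw", "lawn"], block_dim=2)
print(distances, f"{elapsed_ms:.2f} ms")
```

The bounds check on the thread index matters whenever the number of pairs is not a multiple of the block size, since the last block then contains threads with no pair to process.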

The results obtained with this solution are presented in the following table. We have used 3 CUDA enabled graphics cards with different computing capabilities. We ran the comparison of 2048 pairs of strings of 750 symbols of 32 bits each. This amount of data is generated by around 40 clients accessing the system simultaneously. We can see that, on the fastest card, the last client receives a response to his request in less than 2 seconds. If the number of available boards increases, the incoming data will be processed proportionally faster.

Table 4: Computing duration using nVidia CUDA enabled video cards.

| Video Card | Capabilities | Duration (ms) |
| --- | --- | --- |
| nVidia Quadro NV 135 M | 2 Multiprocessors | 23625 |
| nVidia GeForce 9500 | 4 Multiprocessors | 7266 |
| nVidia GeForce GTX275 | 30 Multiprocessors | 1390 |

An on-line authentication system using several graphics cards to process the distances between invariant strings has the architecture presented below.

Figure 9: Authentication system using multiple nVidia CUDA enabled video cards.

The communication through the PCI Express bus is transparent to the developer because it is handled by the provided CUDA driver (nVidia 2008). The implementation problems are similar to the ones that appear when implementing the server solution. The main advantage of this solution is that video cards are relatively cheap and their applicability area is rather large. The system is also scalable, because adding extra graphics cards is relatively easy and requires no major code modifications.

4 CONCLUSIONS

We have presented 3 solutions for accelerating on-line authentication based on the dynamic handwritten signature. We have described the signature processing that is performed and we have shown 3 methods of speeding up the most computationally intensive block. The following table summarizes the results we have obtained.

We have considered a system using a 16 processor server, one using 3 USB 2.0 FPGA boards with 10 Levenshtein processors each, working at 300 MHz, and a system using a single nVidia GeForce GTX275 video card.

Table 5: Comparison between the proposed acceleration solutions.

| Solution | Price | Duration (ms) |
| --- | --- | --- |
| 16 Processor Server | ~8000 USD | 1328 |
| 3 FPGA Boards on a PC | ~2000 USD | 1400 |
| 1 nVidia GeForce GTX275 video card | ~300 USD | 1390 |

As we can see from the table above, the graphics card delivers a duration comparable to the other two solutions at a small fraction of their price, so it is the solution that should be used. Motherboards built using nVidia SLI technology allow up to 3 video cards in a single system, so the achieved speed can be greatly improved at minimal cost.

Given this context, an on-line system that authenticates users by their dynamic signature can respond to around 100 requests per second when using 3 video cards, which makes it a high security feature worth considering. With the proposed architectures the system is very scalable: if the number of requests increases, more computing power can be added at a small price.
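The order of magnitude of this request rate can be checked with a short back-of-the-envelope computation based on the figures reported above. The sketch assumes that throughput grows linearly with the number of cards and that one batch of 2048 distances corresponds to exactly 40 clients; the function name is hypothetical.

```python
def requests_per_second(clients_per_batch, batch_duration_ms, n_cards=1):
    """Requests served per second, assuming perfectly linear scaling over the cards."""
    return n_cards * clients_per_batch / (batch_duration_ms / 1000.0)


# Figures from the measurements: about 40 clients generate the 2048-pair batch,
# which one GeForce GTX275 processes in 1390 ms.
single_card = requests_per_second(40, 1390)
three_cards = requests_per_second(40, 1390, n_cards=3)
print(f"1 card: {single_card:.1f} requests/s, 3 cards: {three_cards:.1f} requests/s")
```

Under these assumptions the estimate comes out at roughly 86 requests per second for 3 cards, which is of the same order as the rounded value quoted above.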

REFERENCES

Marcu, E., 2009. Method of combining the degrees of similarity in handwritten signature authentication, using neural networks. In AI-2009, The Twenty-ninth SGAI International Conference, Cambridge, UK. Springer.

Marcu, E., 2009. Self-built grid. In IDC'2009, 3rd International Symposium on Intelligent Distributed Computing. Springer.

Hoang, D. T., Lopresti, D., 1993. FPGA Implementation of Systolic Sequence Alignment. In International Workshop on Field Programmable Logic and Applications. Springer Berlin.

Mohd, E. T., Mohd, Y. I. I., Tee, H. H., Madhihah, S., 2008. Hardware based SPAM/UCE Filter Design with Levenshtein Distance Algorithm: A Framework. In Proceedings of Internet Convergence Conference, 11-13 March 2008, Kuala Lumpur.

Manavski, A. S., Valle, G., 2008. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. In BMC Bioinformatics 2008. BMC Bioinformatics.

nVidia CUDA Zone Examples of GPU Processing, http://www.nvidia.co.uk/object/cuda_home_uk.html

nVidia CUDA Programming Guide, 2008. http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf
