Table 1: A part of the list of tokens and their indexes.

token            index
...              ...
WHERE            7
...              ...
FROM string      28
...              ...
SELECT string    36
...              ...
string=number    47
...              ...
INSERT INTO      54
a)
SELECT user_password FROM nuke_users WHERE user_id = 2
   token 36            token 28       token 7  token 47

b)
vector 1   000000000000000000000000000000000001000000000000000000
vector 2   000000000000000000000000000100000001000000000000000000
vector 3   000000100000000000000000000100000001000000000000000000
vector 4   000000100000000000000000000100000001000000000010000000

Figure 1: Preparation of input data for a neural network: analysis of a statement in terms of tokens (a), input neural network data corresponding to the statement (b).
Next, the error of each neuron is calculated. These steps are repeated until the last token has been presented to the network. Then all weights are updated and the activation of the context-layer neurons is set to 0. For each input vector, the teaching data are shifted one token forward in time with respect to the input. Training a network in this way ensures that it possesses prediction capability.
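This per-statement procedure can be sketched as follows for a simple Elman network trained with plain gradient descent. This is a minimal sketch, not the authors' implementation: the hidden-layer size, the learning rate, and the use of one-step (truncated) backpropagation through the context layer are our assumptions; inputs and targets stand for the k binary input vectors and teaching vectors of one query, as described below.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    N_IN, N_HID, N_OUT = 54, 30, 55           # hidden size 30 is an assumption
    w = {"in":  rng.normal(0, 0.1, (N_HID, N_IN)),
         "ctx": rng.normal(0, 0.1, (N_HID, N_HID)),
         "out": rng.normal(0, 0.1, (N_OUT, N_HID))}

    def train_statement(w, inputs, targets, lr=0.1):
        # Present the k vectors of one SQL query, accumulate the error of
        # each neuron, update the weights once after the last token, and
        # leave the context layer reset to 0 for the next query.
        context = np.zeros(N_HID)
        g_in, g_ctx, g_out = (np.zeros_like(w[k]) for k in ("in", "ctx", "out"))
        for x, t in zip(inputs, targets):
            hidden = sigmoid(w["in"] @ x + w["ctx"] @ context)
            out = sigmoid(w["out"] @ hidden)
            d_out = (out - t) * out * (1 - out)        # output-layer error
            d_hid = (w["out"].T @ d_out) * hidden * (1 - hidden)
            g_out += np.outer(d_out, hidden)           # accumulate gradients
            g_in  += np.outer(d_hid, x)
            g_ctx += np.outer(d_hid, context)
            context = hidden                           # Elman context update
        for k, g in (("in", g_in), ("ctx", g_ctx), ("out", g_out)):
            w[k] -= lr * g                             # update after last token

Resetting the context between queries makes each SQL statement an independent training sequence.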
The training data consist of 276 SQL queries without repetition. The following tokens are considered: keywords of the SQL language, numbers, strings, and combinations of these elements. We used the collection of SQL statements to define 54 distinct tokens, each with a unique index. Table 1 shows selected tokens and their indexes. The indexes are used to prepare the input data for the neural networks. For example, the index of the keyword WHERE is 7. Index 28 denotes the combination of the keyword FROM with any string, and the token with index 36 represents the grammatical link between SELECT and any string. Finally, when any string is compared to any number within a SQL query, the token index is 47. Figure 1 presents an example SQL statement, its representation as a sequence of tokens, and the four corresponding binary input vectors of a network. A SQL statement is encoded as k vectors, where k is the number of tokens constituting the statement (see Figure 1). The number of neurons in the input layer equals the number of defined tokens. The networks have 55 neurons in the output layer: 54 neurons correspond to the tokens, as in the input layer, while neuron 55 indicates that the input vector currently being processed is the last one within a SQL query. The teaching data, which are compared to the output of the network, take the value either 0.1 or 0.9. If output neuron n has the small value, the next token cannot have index n; on the other hand, if output neuron n has the value 0.9, the next token in the sequence should have index n.
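As a sketch of this output coding (assuming the 1-based token indexes of Table 1, and that only the end-of-query neuron is raised at the last step):

    import numpy as np

    N_TOKENS = 54          # one output neuron per token; neuron 55 ends a query

    def teaching_vectors(token_indexes):
        # The teaching vector for step i marks the index of token i+1 with
        # 0.9 (all other neurons 0.1); the last step marks neuron 55 instead.
        targets = []
        for step in range(len(token_indexes)):
            t = np.full(N_TOKENS + 1, 0.1)
            if step + 1 < len(token_indexes):
                t[token_indexes[step + 1] - 1] = 0.9
            else:
                t[N_TOKENS] = 0.9
            targets.append(t)
        return targets

    # For the statement of Figure 1 (tokens 36, 28, 7, 47) the teaching
    # vectors point at 28, 7, 47 and finally the end-of-query neuron:
    print([int(np.argmax(t)) + 1 for t in teaching_vectors([36, 28, 7, 47])])
    # prints [28, 7, 47, 55]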
At the beginning, the SQL statement is divided into tokens; their indexes are 36, 28, 7, and 47. Each row in Figure 1b is an input vector for the RNN. In Figure 1 the first token to appear is 36. Consequently, in the first step of training the output signal of every neuron in the input layer is 0 except neuron 36, which has the value 1. Subsequent input vectors indicate the index of the current token together with the indexes of the tokens already processed by the RNN. The next token in the sequence has index 28, so only neurons 36 and 28 have an output signal equal to 1. The next token index is 7, which means that neurons 36, 28, and 7 send 1 while all remaining neurons send 0. Finally, neurons 36, 28, 7, and 47 have an activation signal equal to 1. At that moment the weights of the RNN are updated and the next SQL statement is considered.
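A few lines suffice to reproduce this cumulative input coding (a sketch; indexes are 1-based as in Table 1):

    import numpy as np

    N_TOKENS = 54

    def encode_statement(token_indexes):
        # Input vector i marks the i-th token and every earlier token of
        # the query; compare the four rows of Figure 1b.
        vectors, x = [], np.zeros(N_TOKENS, dtype=int)
        for idx in token_indexes:
            x = x.copy()
            x[idx - 1] = 1
            vectors.append(x)
        return vectors

    for v in encode_statement([36, 28, 7, 47]):
        print("".join(map(str, v)))     # reproduces the vectors of Figure 1b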
4 TRAINING AND TESTING DATA
We evaluated our system using data collected from the PHP Nuke portal (phpnuke.org). It is a well-known application with many security holes in older versions. Similarly to (Valeur et al., 2005), we installed version 7.5 of this portal, which is susceptible to several SQL injection attacks. Data without attacks were gathered by visiting the Web sites with a browser. Each time a Web site is loaded by the browser, whether a link is clicked or a filled-in form is submitted, the resulting SQL queries are sent to the database and simultaneously logged to a file. During operation of the portal we collected nearly 100,000 SQL statements. Next, based on this collection, we defined the tokens, which are SQL keywords and data types. The set of all SQL queries was divided into 12 subsets, each containing SQL statements of a different length. 80% of each subset was used for training and the remaining data for examining generalization. The teaching data are shifted one token forward in time. The attack data are the same as those reported in (Valeur et al., 2005).
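A sketch of this split (the paper does not state how the 80% of each subset was selected, so the random shuffle below is an assumption):

    from collections import defaultdict
    import random

    def split_by_length(tokenized_queries, train_fraction=0.8, seed=0):
        # Group queries by their length in tokens (the 12 subsets), then
        # split each group into training and generalization data.
        groups = defaultdict(list)
        for q in tokenized_queries:
            groups[len(q)].append(q)
        rng = random.Random(seed)
        train, test = [], []
        for group in groups.values():
            rng.shuffle(group)
            cut = int(train_fraction * len(group))
            train.extend(group[:cut])
            test.extend(group[cut:])
        return train, test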