Table 1: A part of the list of tokens and their indexes.

token            index
...              ...
WHERE            7
...              ...
FROM string      28
...              ...
SELECT string    36
...              ...
string=number    47
...              ...
INSERT INTO      54
a)
SELECT user_password FROM nuke_users WHERE user_id = 2
   token 36            token 28       token 7  token 47

b)
vector 1   000000000000000000000000000000000001000000000000000000
vector 2   000000000000000000000000000100000001000000000000000000
vector 3   000000100000000000000000000100000001000000000000000000
vector 4   000000100000000000000000000100000001000000000010000000

Figure 1: Preparation of input data for a neural network: analysis of a statement in terms of tokens (a), input neural network data corresponding to the statement (b).
Next, the error of each neuron is calculated. These steps are repeated until the last token has been presented to the network. Then all weights are updated and the activation of the context-layer neurons is set to 0. For each input vector, the teaching data are shifted one token forward in time with respect to the input. Training a network in this way ensures that it possesses prediction capability.
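This per-statement procedure can be sketched as follows for a simple Elman network trained with plain gradient descent. This is a minimal sketch, not the authors' implementation: the hidden-layer size, the learning rate, and the use of one-step (truncated) backpropagation through the context layer are our assumptions; inputs and targets stand for the k binary input vectors and teaching vectors of one query, as described below.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    N_IN, N_HID, N_OUT = 54, 30, 55           # hidden size 30 is an assumption
    w = {"in":  rng.normal(0, 0.1, (N_HID, N_IN)),
         "ctx": rng.normal(0, 0.1, (N_HID, N_HID)),
         "out": rng.normal(0, 0.1, (N_OUT, N_HID))}

    def train_statement(w, inputs, targets, lr=0.1):
        # Present the k vectors of one SQL query, accumulate the error of
        # each neuron, update the weights once after the last token, and
        # leave the context layer reset to 0 for the next query.
        context = np.zeros(N_HID)
        g_in, g_ctx, g_out = (np.zeros_like(w[k]) for k in ("in", "ctx", "out"))
        for x, t in zip(inputs, targets):
            hidden = sigmoid(w["in"] @ x + w["ctx"] @ context)
            out = sigmoid(w["out"] @ hidden)
            d_out = (out - t) * out * (1 - out)        # output-layer error
            d_hid = (w["out"].T @ d_out) * hidden * (1 - hidden)
            g_out += np.outer(d_out, hidden)           # accumulate gradients
            g_in  += np.outer(d_hid, x)
            g_ctx += np.outer(d_hid, context)
            context = hidden                           # Elman context update
        for k, g in (("in", g_in), ("ctx", g_ctx), ("out", g_out)):
            w[k] -= lr * g                             # update after last token

Resetting the context between queries makes each SQL statement an independent training sequence.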
The training data consist of 276 SQL queries without repetition. The following tokens are considered: keywords of the SQL language, numbers, strings, and combinations of these elements. We used the collection of SQL statements to define 54 distinct tokens, each with a unique index. Table 1 shows selected tokens and their indexes. The indexes are used to prepare the input data for the neural networks. For example, the index of the keyword WHERE is 7. Index 28 denotes the combination of the keyword FROM with any string, and the token with index 36 represents the grammatical link between SELECT and any string. Finally, when any string is compared to any number within a SQL query, the token index is 47. Figure 1 presents an example SQL statement, its representation as a sequence of tokens, and the four corresponding binary input vectors of a network. A SQL statement is encoded as k vectors, where k is the number of tokens constituting the statement (see Figure 1). The number of neurons in the input layer equals the number of defined tokens. The networks have 55 neurons in the output layer: 54 neurons correspond to the tokens, as in the input layer, while neuron 55 indicates that the input vector currently being processed is the last one within a SQL query. The teaching data, which are compared to the output of the network, take the value either 0.1 or 0.9. If output neuron n has the small value, the next token cannot have index n; on the other hand, if output neuron n has the value 0.9, the next token in the sequence should have index n.
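As a sketch of this output coding (assuming the 1-based token indexes of Table 1, and that only the end-of-query neuron is raised at the last step):

    import numpy as np

    N_TOKENS = 54          # one output neuron per token; neuron 55 ends a query

    def teaching_vectors(token_indexes):
        # The teaching vector for step i marks the index of token i+1 with
        # 0.9 (all other neurons 0.1); the last step marks neuron 55 instead.
        targets = []
        for step in range(len(token_indexes)):
            t = np.full(N_TOKENS + 1, 0.1)
            if step + 1 < len(token_indexes):
                t[token_indexes[step + 1] - 1] = 0.9
            else:
                t[N_TOKENS] = 0.9
            targets.append(t)
        return targets

    # For the statement of Figure 1 (tokens 36, 28, 7, 47) the teaching
    # vectors point at 28, 7, 47 and finally the end-of-query neuron:
    print([int(np.argmax(t)) + 1 for t in teaching_vectors([36, 28, 7, 47])])
    # prints [28, 7, 47, 55]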
At the beginning, the SQL statement is divided into tokens; their indexes are 36, 28, 7, and 47. Each row in Figure 1b is an input vector for the RNN. In Figure 1 the first token to appear is 36. Consequently, in the first step of training the output signal of every neuron in the input layer is 0 except neuron 36, which has the value 1. Subsequent input vectors indicate the index of the current token together with the indexes of the tokens already processed by the RNN. The next token in the sequence has index 28, so only neurons 36 and 28 have an output signal equal to 1. The next token index is 7, which means that neurons 36, 28, and 7 send 1 while all remaining neurons send 0. Finally, neurons 36, 28, 7, and 47 have an activation signal equal to 1. At that moment the weights of the RNN are updated and the next SQL statement is considered.
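A few lines suffice to reproduce this cumulative input coding (a sketch; indexes are 1-based as in Table 1):

    import numpy as np

    N_TOKENS = 54

    def encode_statement(token_indexes):
        # Input vector i marks the i-th token and every earlier token of
        # the query; compare the four rows of Figure 1b.
        vectors, x = [], np.zeros(N_TOKENS, dtype=int)
        for idx in token_indexes:
            x = x.copy()
            x[idx - 1] = 1
            vectors.append(x)
        return vectors

    for v in encode_statement([36, 28, 7, 47]):
        print("".join(map(str, v)))     # reproduces the vectors of Figure 1b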
4 TRAINING AND TESTING DATA
We evaluated our system using data collected from the PHP Nuke portal (phpnuke.org). It is a well-known application with many security holes in older versions. Similarly to (Valeur et al., 2005), we installed version 7.5 of this portal, which is susceptible to several SQL injection attacks. Data without attacks were gathered by visiting the Web sites with a browser. Each time a Web site is loaded by the browser, whether a link is clicked or a filled-in form is submitted, the resulting SQL queries are sent to the database and simultaneously logged to a file. During operation of the portal we collected nearly 100,000 SQL statements. Next, based on this collection, we defined the tokens, which are SQL keywords and data types. The set of all SQL queries was divided into 12 subsets, each containing SQL statements of a different length. 80% of each subset was used for training and the remaining data for examining generalization. The teaching data are shifted one token forward in time. The attack data are the same as those reported in (Valeur et al., 2005).
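A sketch of this split (the paper does not state how the 80% of each subset was selected, so the random shuffle below is an assumption):

    from collections import defaultdict
    import random

    def split_by_length(tokenized_queries, train_fraction=0.8, seed=0):
        # Group queries by their length in tokens (the 12 subsets), then
        # split each group into training and generalization data.
        groups = defaultdict(list)
        for q in tokenized_queries:
            groups[len(q)].append(q)
        rng = random.Random(seed)
        train, test = [], []
        for group in groups.values():
            rng.shuffle(group)
            cut = int(train_fraction * len(group))
            train.extend(group[:cut])
            test.extend(group[cut:])
        return train, test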