3.2 Data Pre-processing
To obtain, for each modality, a fixed-size real vector
(required by the protection scheme), the collected data
are converted to real vectors and then concatenated.
Because the distance between two vectors can be
dominated by extreme values, the vectors are normalized.
3.2.1 Browser
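The localkey (an n-bit key) is converted into an n-bit
vector. For example, the 16-bit localkey "0x0123" is
converted into [0,0,0,0, 1,0,0,0, 0,1,0,0, 1,1,0,0] (in
this example, the four bits of each hexadecimal digit
appear least significant bit first).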
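The conversion can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `localkey_to_bits` is hypothetical, and the bit order (least significant bit first within each hexadecimal digit) is inferred from the "0x0123" example above.

```python
from typing import List


def localkey_to_bits(hex_key: str) -> List[int]:
    """Convert a hexadecimal key such as "0x0123" into a list of bits.

    Assumption: each hex digit yields 4 bits, least significant bit first,
    which is the order that reproduces the example given in the text.
    """
    digits = hex_key[2:] if hex_key.lower().startswith("0x") else hex_key
    bits: List[int] = []
    for digit in digits:
        value = int(digit, 16)
        # Emit the 4 bits of this digit, starting with the lowest one.
        bits.extend((value >> i) & 1 for i in range(4))
    return bits


print(localkey_to_bits("0x0123"))
# [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0]
```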
3.2.2 Localisation
An IP address is converted into a vector composed of:
• a vector of the IP address bits, each divided by
$2^{32-p-1}$, where p is the weight of the bit (p = 31
for the most significant bit, p = 0 for the least
significant one);
• a vector of the first $128/2^{k}$ bits of the MD5 hash
of the locality name, with k=1 for "country", k=2 for
"region", k=3 for "county", and k=4 for "town";
• a vector of 3 angles in [−90; +90] representing the
latitude (lat) of the GPS location and its longitude l
(lng1, lng2); lng1 and lng2 are equal to:
$\mathrm{sign}(\alpha) \cdot \bigl|\, |\alpha| - (|\alpha| > 90) \cdot 180 \,\bigr|$
with $\alpha = l$ for lng1 and
$\alpha = \mathrm{rot90}(l) = (l - 90)\,\%\,360 - 180$ for lng2.
These angles, in degrees, are normalized by the
following formula:
$\mathrm{angle}^{*} = (\mathrm{angle} + 90)/180$
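The locality component of the list above could be computed along the following lines. This is only a sketch under stated assumptions: the text does not specify the character encoding of the locality name or the order in which the hash bits are read, so UTF-8 and most-significant-bit-first order are assumed here. The names `LEVELS` and `locality_to_bits` and the sample locality "France" are hypothetical.

```python
import hashlib
from typing import List

# k values from the text: the vector keeps the first 128 / 2**k bits.
LEVELS = {"country": 1, "region": 2, "county": 3, "town": 4}


def locality_to_bits(name: str, level: str) -> List[int]:
    """Return the first 128 / 2**k bits of the MD5 hash of a locality name.

    Assumptions (not stated in the text): the name is UTF-8 encoded and the
    bits of each digest byte are read most significant bit first.
    """
    k = LEVELS[level]
    n_bits = 128 // 2 ** k
    digest = hashlib.md5(name.encode("utf-8")).digest()  # 16 bytes = 128 bits
    bits: List[int] = []
    for byte in digest:
        for i in range(7, -1, -1):
            bits.append((byte >> i) & 1)
    return bits[:n_bits]


for level in LEVELS:
    print(level, len(locality_to_bits("France", level)))
```

The vector length shrinks from 64 bits for a country to 8 bits for a town, so finer-grained localities contribute fewer components.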
For example, the IP address "127.0.0.1" is converted
into the 32-element vector [0, 0.5, 0.25, 0.125, 0.0625,
0.03125, 0.015625, 0.0078125,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
$4.6566 \times 10^{-10}$].
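The bit weighting of an IPv4 address can be sketched as follows. It is a minimal illustration, assuming the bits are listed from the most significant (p = 31) to the least significant (p = 0); the function name `ip_to_weighted_bits` is hypothetical.

```python
import ipaddress
from typing import List


def ip_to_weighted_bits(ip: str) -> List[float]:
    """Map each bit of an IPv4 address to bit / 2**(32 - p - 1).

    p is the weight of the bit; the vector is listed from p = 31 down to
    p = 0 (an assumption consistent with the "127.0.0.1" example).
    """
    value = int(ipaddress.IPv4Address(ip))
    vector: List[float] = []
    for p in range(31, -1, -1):
        bit = (value >> p) & 1
        vector.append(bit / 2 ** (32 - p - 1))
    return vector


vec = ip_to_weighted_bits("127.0.0.1")
print(vec[:8])   # [0.0, 0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125]
print(vec[31])   # 4.656612873077393e-10
```

With this weighting, two addresses that differ only in their low-order bits stay close to each other, whereas a difference in the leading bits dominates the distance.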
The GPS location with longitude 135 and latitude 0,
written (135, 0), is converted into [0.5, 0.75, 0.25].
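The angle computation can be checked with the short sketch below. The helper names (`sign`, `fold`, `rot90`, `normalize`, `gps_to_vector`) are hypothetical, and the `%` operator is assumed to be a modulo that returns a non-negative result, as in Python; a truncating modulo, as in C, would give different values for l < 90.

```python
from typing import List


def sign(x: float) -> int:
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def fold(alpha: float) -> float:
    """Compute sign(a) * | |a| - (|a| > 90) * 180 |."""
    return sign(alpha) * abs(abs(alpha) - (abs(alpha) > 90) * 180)


def rot90(lng: float) -> float:
    """rot90(l) = (l - 90) % 360 - 180, with a non-negative modulo."""
    return (lng - 90) % 360 - 180


def normalize(angle: float) -> float:
    """Map an angle in [-90, +90] to [0, 1]."""
    return (angle + 90) / 180


def gps_to_vector(lat: float, lng: float) -> List[float]:
    """Return [lat*, lng1*, lng2*] for a latitude and a longitude in degrees."""
    lng1 = fold(lng)
    lng2 = fold(rot90(lng))
    return [normalize(lat), normalize(lng1), normalize(lng2)]


print(gps_to_vector(lat=0, lng=135))  # [0.5, 0.75, 0.25]
```

Note that the function takes the latitude first, whereas the example in the text lists the longitude first.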
3.2.3 Network Data
Referer, User-Agent, Connection and Cookie are converted
into histograms, i.e. vectors giving the number of
occurrences of each character. Only the printable ASCII
characters from 0x20 to 0x7E, i.e. 95 characters, are
considered. Accept, Accept-Encoding, and Accept-Language
are converted into vectors giving the preference for each
format, encoding, and language from a predefined list.
An additional value indicates the presence of spaces
after commas in the field. DNT and
Upgrade-Insecure-Requests are each converted into a
1-integer vector, equal to 1 if set and 0 otherwise. The
predefined lists are:
• Accept: "text/html", "application/xhtml+xml",
"application/xml", "image/webp", "image/jxr";
• Accept-Encoding: "gzip", "deflate", "br", "sdch";
• Accept-Language: "fr", "fr-FR", "en-US", "en".
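A preference vector over one of these predefined lists could be built as sketched below. Several points are assumptions rather than statements from the text: entries are matched by exact, case-sensitive name, an entry without an explicit q-value gets preference 1, an absent entry gets 0, and the trailing flag is 1 when a comma followed by a space occurs in the field. The names `ACCEPT_LANGUAGE` and `preference_vector` are hypothetical.

```python
from typing import Dict, List

ACCEPT_LANGUAGE = ["fr", "fr-FR", "en-US", "en"]


def preference_vector(field: str, known: List[str]) -> List[float]:
    """Return the preference of each known entry, plus a space-after-comma flag."""
    prefs: Dict[str, float] = {}
    for item in field.split(","):
        parts = item.strip().split(";")
        name = parts[0].strip()
        q = 1.0  # assumed default when no q-value is given
        for param in parts[1:]:
            key, _, val = param.strip().partition("=")
            if key == "q":
                q = float(val)
        prefs[name] = q
    vector = [prefs.get(name, 0.0) for name in known]
    # Extra component: 1 if the field contains a comma followed by a space.
    vector.append(1.0 if ", " in field else 0.0)
    return vector


print(preference_vector("fr;q=0.8, fr-FR;q=0.5, en-US", ACCEPT_LANGUAGE))
# [0.8, 0.5, 1.0, 0.0, 1.0]
```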
For example, the User-Agent value "Browser/1.0
(Operating System; rv:1.0) Engine/20170701 Browser/1.0"
is converted, when only the characters in [a-z] are
considered, into [1, 0, 0, 0, 5, 0, 2, 0, 2, 0, 0, 0, 1,
3, 2, 1, 0, 6, 3, 2, 0, 1, 2, 0, 1, 0]. The
Accept-Language value "fr;q=0.8, fr-FR;q=0.5, en-US" is
described by [0.8, 0.5, 1, 0, 1]. The DNT value "1" is
converted into [1].
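The character histogram can be sketched as follows. The function name `char_histogram` and its `first`/`last` parameters are hypothetical; by default it covers the 95 printable ASCII characters, and restricting it to [a-z] reproduces the User-Agent example above.

```python
from typing import List


def char_histogram(value: str, first: int = 0x20, last: int = 0x7E) -> List[int]:
    """Count the occurrences of each character whose code is in [first, last]."""
    counts = [0] * (last - first + 1)
    for ch in value:
        code = ord(ch)
        if first <= code <= last:
            counts[code - first] += 1
    return counts


ua = "Browser/1.0 (Operating System; rv:1.0) Engine/20170701 Browser/1.0"
print(len(char_histogram(ua)))                   # 95
print(char_histogram(ua, ord("a"), ord("z")))
# [1, 0, 0, 0, 5, 0, 2, 0, 2, 0, 0, 0, 1, 3, 2, 1, 0, 6, 3, 2, 0, 1, 2, 0, 1, 0]
```

The counting is case-sensitive: the upper-case "B", "O", "S" and "E" of the example do not contribute to the [a-z] histogram.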
3.2.4 Biometric Data
The collected durations are converted into a vector
giving, for each considered digram, the means of
the 6 durations. These average values are converted
to milliseconds, capped at 1000, then divided by
1000. Figure 5 presents the signature values after
pre-processing (here 1218 values). This step already
protects the semantic content of the signature; we
propose to strengthen this protection with a dedicated
process presented in the next section.
Figure 5: Example of raw values after pre-processing (1218
real values).
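The normalization of the biometric durations described in Section 3.2.4 can be sketched as follows. This is an illustration under assumptions: each occurrence of a digram is taken to provide 6 durations already expressed in milliseconds, and each of the 6 durations is averaged separately over the occurrences. The function name `digram_features` and the sample values are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def digram_features(samples_ms: Sequence[Sequence[float]]) -> List[float]:
    """Average each of the 6 durations, cap at 1000 ms, scale to [0, 1].

    samples_ms holds one row per occurrence of the digram, each row being
    the 6 durations in milliseconds (assumed layout).
    """
    features: List[float] = []
    for column in zip(*samples_ms):
        avg = mean(column)
        features.append(min(avg, 1000) / 1000)
    return features


occurrences = [
    [100, 200, 300, 400, 500, 2000],
    [300, 200, 100, 600, 500, 4000],
]
print(digram_features(occurrences))  # [0.2, 0.2, 0.2, 0.5, 0.5, 1.0]
```

Capping at 1000 ms bounds the influence of long pauses, so every component lies in [0, 1] like the other modalities.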
3.3 Data Protection
The issue we address in this work is how to serve the
needs of Internet service applications (such as
authentication and attack detection) while preserving
user privacy. From the personal information collected,
we aim to generate a binary signature that acts as a
dynamic user characteristic but has lost its semantic
description. The service can then exploit this signature
without knowing the information used to generate it.
Biohashing is a well-known algorithm in biometrics.
It transforms biometric data represented as a fixed-size
real vector and generates a binary model, called a
BioCode, having a size