3.3 Data Distribution
The predicted popularities are used to decide where
each dataset should be kept: on type 1 storage or on
type 2 storage. For this purpose we use risk analysis:
a risk matrix is used to calculate the total risk of
each storage decision. For example, the following
matrix is used for a hybrid data storage system with
two storage types:
\[
M = \begin{pmatrix} C_{00} & C_{10} \\ C_{01} & C_{11} \end{pmatrix} \qquad (1)
\]
$C_{00}$ - the fine for the decision to keep a dataset on
type 1 storage when it should be kept on type 1 storage,
$C_{01}$ - the fine for the decision to keep a dataset on
type 1 storage when it should be kept on type 2 storage,
$C_{10}$ - the fine for the decision to keep a dataset on
type 2 storage when it should be kept on type 1 storage,
$C_{11}$ - the fine for the decision to keep a dataset on
type 2 storage when it should be kept on type 2 storage.
The predicted probabilities are the probabilities that
a dataset should be stored on type 1 storage (rarely
used data) or on type 2 storage (often used data).
Therefore, multiplying the predicted probabilities by
the risk matrix gives the total risk of each decision:
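\[
R = \begin{pmatrix} P_0 & P_1 \end{pmatrix}
    \begin{pmatrix} C_{00} & C_{10} \\ C_{01} & C_{11} \end{pmatrix}
  = \begin{pmatrix} R_0 & R_1 \end{pmatrix} \qquad (2)
\]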
$P_0$ - the predicted probability that a dataset will be
rarely used,
$P_1$ - the predicted probability that the dataset will
be often used,
$R_0$ - the total risk for the decision to keep the
dataset on type 1 storage,
$R_1$ - the total risk for the decision to keep the
dataset on type 2 storage.
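Written out, Eq. (2) gives R_0 = P_0 C_00 + P_1 C_01 and R_1 = P_0 C_10 + P_1 C_11. The following is a minimal sketch of this computation, not the library's actual code. The function names, the numeric fines in the usage note and the rule of choosing the storage type with the smaller total risk are assumptions made for illustration.

```python
def total_risks(p0, p1, m):
    """Return (R0, R1) as in Eq. (2).

    m follows the layout of Eq. (1): m = [[C00, C10], [C01, C11]],
    so m[t][d] is the fine for decision d when the true type is t.
    """
    r0 = p0 * m[0][0] + p1 * m[1][0]  # risk of keeping on type 1 storage
    r1 = p0 * m[0][1] + p1 * m[1][1]  # risk of keeping on type 2 storage
    return r0, r1


def choose_storage(p0, p1, m):
    """Pick the storage type (1 or 2) with the smaller total risk.

    Assumption: the decision rule is "minimize total risk"; ties go
    to type 1 storage.
    """
    r0, r1 = total_risks(p0, p1, m)
    return 1 if r0 <= r1 else 2
```

For instance, with hypothetical fines C_00 = C_11 = 0, C_10 = 1 and C_01 = 5, calling `choose_storage(0.8, 0.2, [[0.0, 1.0], [5.0, 0.0]])` compares R_0 = 1.0 with R_1 = 0.8 and selects type 2 storage, because wrongly keeping an often used dataset on type 1 storage is fined heavily.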
3.4 Optimal Number of Replicas
As described in (Hushchyn, 2015), we use the
Nadaraya-Watson kernel smoothing algorithm, with the
leave-one-out method for optimizing the smoothing
window width, to predict the future number of accesses
to a dataset. For some time series a Hidden Markov
Model (HMM) demonstrates better prediction results.
Therefore, our system will also provide an HMM
algorithm for predicting the number of accesses; its
implementation is currently under development.
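The smoothing step can be illustrated with a minimal sketch of Nadaraya-Watson kernel smoothing with leave-one-out selection of the window width. This is not the implementation from (Hushchyn, 2015): the Gaussian kernel, the squared-error criterion, the candidate widths and all function names are assumptions made for illustration.

```python
import math


def gaussian_kernel(u):
    """Unnormalized Gaussian kernel (the kernel choice is an assumption)."""
    return math.exp(-0.5 * u * u)


def nw_estimate(xs, ys, x0, h, skip=None):
    """Nadaraya-Watson estimate at x0 with window width h.

    If skip is an index, that observation is left out of the average.
    """
    num = 0.0
    den = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        w = gaussian_kernel((x0 - x) / h)
        num += w * y
        den += w
    # Guard against all weights underflowing to zero.
    return num / den if den > 0.0 else 0.0


def loo_error(xs, ys, h):
    """Mean squared leave-one-out error for window width h."""
    total = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        pred = nw_estimate(xs, ys, x, h, skip=i)
        total += (y - pred) ** 2
    return total / len(xs)


def best_window(xs, ys, candidates):
    """Return the candidate window width with the smallest LOO error."""
    return min(candidates, key=lambda h: loo_error(xs, ys, h))
```

A possible use is `h = best_window(weeks, accesses, [0.5, 1.0, 2.0, 4.0])` followed by `nw_estimate(weeks, accesses, weeks[-1], h)`, where `weeks` and `accesses` are hypothetical lists holding the week index and the number of accesses per week. How the smoothed series is turned into a forecast is not specified in this section.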
Then the predicted number of accesses is used to
estimate the optimal number of replicas for the
datasets:
\[
Rp_{opt} = F(I) \qquad (3)
\]
$Rp_{opt}$ - the optimal number of replicas for a
dataset,
$I$ - the predicted number of accesses for the
dataset,
$F()$ - the function for the optimal number of replicas.
Linear, quadratic or exponential functions are
examples of the function $F()$.
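As an illustration of Eq. (3), the sketch below shows how such candidate functions could be written and plugged into a common helper. The parameter names `a` and `b`, their default values, the rounding to the nearest integer and the lower bound of one replica are all assumptions, not settings taken from the paper.

```python
import math


def linear(i, a=0.1, b=1.0):
    """Linear candidate for F(): a * I + b (hypothetical parameters)."""
    return a * i + b


def quadratic(i, a=0.01, b=1.0):
    """Quadratic candidate for F(): a * I^2 + b (hypothetical parameters)."""
    return a * i * i + b


def exponential(i, a=1.0, b=0.05):
    """Exponential candidate for F(): a * exp(b * I) (hypothetical parameters)."""
    return a * math.exp(b * i)


def replica_count(f, predicted_accesses, min_replicas=1):
    """Apply F() and round to a whole number of replicas.

    Assumption: at least min_replicas copies are always kept.
    """
    return max(min_replicas, round(f(predicted_accesses)))
```

Passing the function as an argument, as in `replica_count(linear, 10)`, keeps the choice of F() separate from the rounding policy, so different candidates can be compared on the same access predictions.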
The predicted number of accesses and the optimal
number of replicas help to use the storage system
efficiently. Moreover, the predicted number of
accesses can be used to detect datasets for which the
number of replicas should be reduced to free
additional space on the storage.
For LHCb the following function is used to estimate
the optimal number of replicas:
\[
Rp_{opt} = \sqrt{\alpha I} \qquad (4)
\]
$\alpha$ - the free parameter.
Figure 3 shows how the optimal number of replicas for
a dataset depends on its predicted number of accesses
and on the value of $\alpha$. For example, suppose the
predicted number of accesses for a dataset is $I = 10$
accesses per week and $\alpha = 0.5$. Then
$Rp_{opt} = \sqrt{\alpha I} = \sqrt{0.5 \cdot 10} \approx 2.24 \approx 2$
replicas.
Figure 3: Dependence of the optimal number of dataset replicas
(Rp) on its predicted number of accesses (I) and α.
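The worked example for Eq. (4) can be reproduced with a few lines of Python. This is a minimal sketch: the function name and the input validation are assumptions, and rounding to the nearest integer simply follows the example in the text.

```python
import math


def optimal_replicas(predicted_accesses, alpha):
    """Eq. (4): square root of alpha times the predicted number of accesses."""
    if predicted_accesses < 0 or alpha < 0:
        raise ValueError("predicted_accesses and alpha must be non-negative")
    return math.sqrt(alpha * predicted_accesses)


# Example from the text: I = 10 accesses per week, alpha = 0.5.
raw = optimal_replicas(10, 0.5)  # sqrt(5), about 2.24
replicas = round(raw)            # 2 replicas
```

Because the dependence is a square root, the replica count grows slowly with popularity: quadrupling the predicted number of accesses only doubles the number of replicas for a fixed α.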
4 RESULTS
Based on the method described above, we have created
a Python library. This library contains tools for
probability prediction, optimal data distribution and
estimation of the optimal number of replicas.
Moreover, the library makes it possible to simulate
the operation of our method based on the data usage
history.
Also, it will be possible to use our method as a web
service with Docker (Docker).
We are now further developing the probability
prediction method described above to achieve a higher
classification quality.
Special dataset metadata, the dataset access history
and several features calculated from the access
history were used as inputs in (Hushchyn, 2015).
Distributed Data Replication and Access Optimization for LHCb Storage System - A Position Paper