Certainty for one restriction vector (presented for
freshness).
In our extended version, we demonstrate that the
intersection of two restriction vectors is a restriction
vector, by construction. This allows obtaining the
probabilities of the intersections that appear in the
formula.
3.3 Probability Distribution of DIS Quality
This model gives the probabilities of the different
quality values that the DIS may provide in the data
targets. For simplicity, we calculate it for only one
data target, considering the sources involved.
The mechanism for obtaining this model consists
of calculating the DIS quality values that result from
all the possible combinations of source quality values,
and the probabilities that the sources satisfy them.
This can be done if and only if the set of possible
quality values in each source is finite.
We define a quality-values vector as a
combination of source quality values:
$vv = \langle qv_{S_1}, \ldots, qv_{S_n} \rangle$, where $qv_{S_i}$ is the
quality value associated with source $S_i$.
For each DIS quality value there may be several
quality-values vectors that correspond to it.
To calculate the probability of each DIS quality
value, we calculate the probability that one of the
corresponding quality-values vectors is satisfied
by the sources. For this, we sum the probabilities of
those quality-values vectors, since they constitute
disjoint events (considering the same random
experiment as the one used for accuracy in the
previous section).
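The sum described above can be written compactly. In the sketch below, $V$ denotes the set of all quality-values vectors, $f$ the function that maps a quality-values vector to the DIS quality value obtained in the target, and $Q_T$ the quality delivered in target $T$. The symbols $f$ and $Q_T$ are introduced here only for illustration and are not part of the original notation.

```latex
P(Q_T = q) \;=\; \sum_{\substack{vv \in V \\ f(vv) = q}} P(vv)
```

Because every quality-values vector maps to exactly one DIS quality value, these probabilities sum to 1 over all values of $q$ whenever the source models are themselves proper distributions.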
The following are the steps to calculate the
probability distribution of DIS quality, for target $T$
and set of sources $\{S_1, \ldots, S_n\}$:
1- Generate all the possible quality-values vectors
for $S_1, \ldots, S_n$, obtaining a set $V = \{vv_1, \ldots, vv_k\}$,
where $vv_i = \langle q_{i1}, \ldots, q_{in} \rangle$.
2- For each $vv_i \in V$, calculate the quality value
provided in $T$, $qv_i$, obtaining
$DISValues1 = \{qv_1, \ldots, qv_k\}$.
3- Eliminate duplicate values from $DISValues1$,
obtaining $DISValues2 = \{qv_{i_1}, \ldots, qv_{i_m}\}$, $1 \le i_j \le k$.
4- For each $qv_{i_j} \in DISValues2$, sum the
probabilities of the vectors of $V$ that generate
$qv_{i_j}$, obtaining the probability that one of the
vectors is satisfied by the sources.
Note: The probability of a quality-values vector
$vv = \langle qv_{S_1}, \ldots, qv_{S_n} \rangle$ is calculated as:
$P(vv) = P(qv_{S_1}) \cdots P(qv_{S_n})$, where $P(qv_{S_i})$ is given by
the source model.
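The four steps and the note can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `dis_quality_distribution`, the representation of a source model as a dictionary from quality value to probability, and the use of `max` as the function that combines source values into a DIS value are all hypothetical choices, not taken from the paper. The product used for $P(vv)$ follows the note above and treats the source quality values as independent.

```python
from collections import defaultdict
from itertools import product
from math import prod


def dis_quality_distribution(source_models, combine):
    """Return {DIS quality value: probability} for one data target.

    source_models: one dict per source, mapping each possible quality
                   value of that source to its probability (assumed format).
    combine:       hypothetical function mapping a quality-values vector
                   (a tuple with one value per source) to the DIS quality value.
    """
    distribution = defaultdict(float)
    # Step 1: enumerate every quality-values vector (finite by assumption).
    for vv in product(*(model.keys() for model in source_models)):
        # Note: P(vv) is the product of the per-source probabilities.
        p_vv = prod(model[q] for model, q in zip(source_models, vv))
        # Step 2: DIS quality value generated by this vector.
        qv = combine(vv)
        # Steps 3-4: vectors yielding the same value are merged by summing.
        distribution[qv] += p_vv
    return dict(distribution)


# Illustrative source models (made-up numbers, not experimental data).
s1 = {1: 0.5, 2: 0.5}
s2 = {1: 0.25, 3: 0.75}

# Assumed combination rule: the DIS value is the largest source value.
dist = dis_quality_distribution([s1, s2], max)
```

Keying the accumulator by DIS quality value handles duplicate elimination (step 3) and the summation (step 4) in a single pass. The number of vectors enumerated is the product of the sizes of the source value sets, so this direct enumeration is practical only for small models.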
4 CONCLUSIONS
This work proposes an approach to maintaining quality
in DIS and dealing with source quality changes. To
achieve this, we propose building and maintaining
probabilistic quality behaviour models of the sources
and the DIS, so that they can support the detection
and management of DIS quality changes.
The paper focuses on the presentation of the
quality models and the techniques for constructing
them. A source quality model gives the probability
distribution of the possible quality values at the
source. DIS quality models give the probability of
satisfying quality requirements, the probability
distribution of DIS quality values, and the
satisfaction of quality requirements such as
“average” and “most frequent value”.
The main contribution of this work is a novel
approach to quality maintenance in DIS, based on
quality behaviour models. We believe that the
probabilistic approach proposed here is a step
forward in building quality change-tolerant DIS.
We have worked on the whole mechanism of
quality maintenance, which basically consists of
detecting relevant quality changes and repairing DIS
quality, but it is not presented here for space reasons.
We have carried out some experimentation applying
our proposal to the social networks domain,
constructing all the proposed quality models. It
showed the usefulness and feasibility of the proposal.