
The same criteria can be used for dependent
variables, namely those involved in a referential
integrity dependencies. In these cases, it is necessary
to look at the join operation as a unique independent
set prior to dealing with the selection operation
itself.
4 TESTS AND RESULTS
The theory presented in this paper has been
implemented and tested in industrial environments,
during the auditing and migration stages of decision
support solutions, under the edge of national funded
project Karma (ADI P060-P31B-09/97).
Table 1 refers to the auditing results in a
Enterprise Resource Planning System of a small
company, regarding its sales information.
In this database, although the relational model
was hard-coded in the application, the engine didn’t
implement the concept of referential integrity.
The referential integrity between the Orders
table and the OrderDetails table, as well as the
referential integrity between Orders and Customers
tables and between OrderDetails and Products
tables, have been tested using sampling auditing and
full auditing processes. To evaluate the samples’
behaviour when dealing with independent variables,
the mean value of a purchasing order as well as the
number of items in regular order were calculated for
the samples. These values were also compared with
real observations in the entire tables. From the final
results some conclusions were taken:
a) The validation of referential integrity over
samples using a classification of the population
presents poor results when dealing with small
samples, with estimations above the real values.
b) The validation of existential integrity (for
example the uniqueness), under the same
circumstances, presents poor results when
dealing with small samples, with estimations
bellow the real values.
c) Mean values and distribution results are also
influenced by the scope of the sample, and must
be transformed by the sample size ratio.
For the referential integrity cases, this is an
expected result since the set of records in the
referring table (say T
1
) is much larger than the strict
domain of the referred table (say T
2
). The error of
the estimator must be affected by the percentage of
records involved in the sample. Let:
• t
1
be the sample of the referring table T
1
;
• t
2
be the sample of the referred table T
2
;
• α
2
be the percentage of records of T
2
selected for
the sample (#t
2
/#T
2
);
• β(T
1
,T
2
) be the percentage of records in T
1
that
validates the R.I. in T
2
.
It would be expected that E[β(t
1
,t
2
)] = β(T
1
,T
2
),
but when dealing with samples in the referred table
(T
2
) the expected value will match E[β(t
1
,t
2
)] =
β(T
1
,T
2
) * α
2
. The estimated error is given by ε = 1-
β and therefore E[ε(t
1
,t
2
)] = 1-[1-ε(T
1
,T
2
)] * α
2
.
Table 1 and figures 1, 2 and 3 show the referential
integrity problems detected
ε(T
1
,T
2
), the sampling
error
ε(t
1
,t
2
) and the expected error value for each
case
E[ε(t
1
,t
2
)].
It is possible to establish the same corrective
parameter when dealing with existential integrity,
frequencies or distributions.
Several other tests were made in medium size
and large size databases, corroborating the results
presented above (Cortes, 2002).
Table 1: Referential integrity tests on a ERP database
(OD) OrderDetails, (O) Orders
(P) Products and(C) Customers tables
R.I.
ε(T
1
,T
2
) α
2
ε(t
1
,t
2
) E[ε(t
1
,t
2
)]
Sample I (90% confidence, 5% accuracy)
ODÆO 5.8% 25.3% 72.1% 76.1%
ODÆP 12.9% 77.3% 30.4% 32.6%
OÆC 4.7% 74.4% 27.2% 29.1%
Sample II (95% confidence, 5% accuracy)
ODÆO 5.8% 32.5% 66.6% 69.3%
ODÆP 12.9% 82.6% 23.0% 28.0%
OÆC 4.7% 81.1% 22.0% 22.8%
Sample III (98% confidence, 5% accuracy)
ODÆO 5.8% 40.4% 58.3% 61.9%
ODÆP 12.9% 86.6% 22.0% 24.5%
OÆC 4.7% 85.5% 19.0% 18.5%
Sample IV (99% confidence, 2% accuracy)
ODÆO 5.8% 83.7% 19.5% 21.1%
ODÆP 12.9% 97.3% 14.5% 15.2%
OÆC 4.7% 97.7% 7.5% 6.9%
Sample V (99.5% confidence, 1% accuracy)
ODÆO 5.8% 96.0% 9.4% 9.5%
ODÆP 12.9% 98.8% 14.4% 14.1%
OÆC 4.7% 98.8% 5.5% 5.8%
ICEIS 2004 - DATABASES AND INFORMATION SYSTEMS INTEGRATION
380