5 CONCLUSIONS
To deal with inconsistencies in data and to evaluate
the data quality level of a dataset, we proposed an
approach based on rules expressed in an XML-based
syntax, and we developed a tool to deploy such rules
on a dataset. The proposed approach allows a user to
easily define different kinds of rules in the same
environment, without having to manipulate XML trees
directly. In fact, to express data quality rules, a
user only needs to know the database schema of the
dataset and the types of rules managed by the tool.
The developed tool manages a predefined number of
rule types, but it can easily be extended to deal
with other types of rules.
As future work, we plan to assess the proposed
approach on larger, and possibly different, datasets
in order to validate it in other application domains.
ACKNOWLEDGEMENTS
The authors acknowledge the contribution of the JRC
colleagues involved in the DCF activities, while being
solely responsible for possible incomplete or
erroneous statements.
Disclaimer. The content of this work reflects only
the opinion of the authors and may not be regarded as
stating an official position of the European
Commission.
Data Quality Evaluation of Scientific Datasets - A Case Study in a Policy Support Context