
interpret q_a. To address the first threat, we implemented the classification models with scikit-learn (Pedregosa et al., 2011), a widely used library. To limit the second threat, we used a grid search to set the hyper-parameters of the classification models on the datasets without any injected errors and then reused these settings for the rest of the experiments, as sketched below. To limit the third threat, we selected three widely used datasets, along with the 114 datasets obtained by injecting controlled percentages of errors.
The four external threats are the choice of the datasets used for the evaluation, the choice of the classification models, the generation of errors, and the combination of errors. We limited the first threat by choosing datasets that are widely used, cover various applications, and span different dimensions. We addressed the second threat by selecting a wide range of classification approaches. To limit the third one, we generated errors randomly, following a uniform distribution (see the sketch after this paragraph). Finally, we limited the fourth threat by studying each error type separately; however, we plan to extend this work to error combinations in future work.
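A possible implementation of this uniform error injection is sketched below; the helper name and the choice of missing values as the injected error type are assumptions made for illustration.

    import numpy as np

    def inject_errors(X, rate, seed=None):
        # Corrupt a uniformly random fraction `rate` of the cells of X.
        # Here an 'error' is a missing value (NaN); other error types
        # can be injected by changing the replacement rule.
        rng = np.random.default_rng(seed)
        X = X.astype(float)  # copies the data and allows NaN entries
        n_errors = int(rate * X.size)
        # Draw the corrupted cells uniformly at random, without replacement.
        flat = rng.choice(X.size, size=n_errors, replace=False)
        rows, cols = np.unravel_index(flat, X.shape)
        X[rows, cols] = np.nan
        return X

    # Example: corrupt 10% of the cells of a toy 4x5 matrix.
    X_dirty = inject_errors(np.arange(20.0).reshape(4, 5), rate=0.10, seed=0)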
4 CONCLUSION
In this paper, we have introduced a novel metric to measure data quality. The main advantage of the proposed metric is that it is independent of learning models and expert knowledge. Furthermore, it does not require external reference data. As a consequence, it makes it possible to compare different datasets. We have extensively tested and evaluated the proposed metric and have shown that it characterizes data quality correctly.
REFERENCES
Ataccama (2023). AtaccamaONE. https://www.ataccama.com/platform.
Batini, C., Cappiello, C., Francalanci, C., and Maurino, A. (2009). Methodologies for data quality assessment and improvement. ACM Computing Surveys.
Batini, C., Scannapieco, M., et al. (2016). Data and Information Quality. Springer.
Bors, C., Gschwandtner, T., Kriglstein, S., Miksch, S., and Pohl, M. (2018). Visual interactive creation, customization, and analysis of data quality metrics. Journal of Data and Information Quality (JDIQ), ACM.
Cichy, C. and Rass, S. (2019). An overview of data quality frameworks. IEEE Access.
DataCleaner (2023). DataCleaner. https://datacleaner.github.io/.
Datamartist (2023). Datamartist. http://www.datamartist.com/.
Ehrlinger, L. and Wöß, W. (2022). A survey of data quality measurement and monitoring tools. Frontiers in Big Data.
Experian (2023). User manual version 5.9. https://www.edq.com/globalassets/documentation/pandora/pandora_manual_590.pdf.
Apache Foundation (2023). Apache Griffin user guide. https://github.com/apache/griffin/blob/master/griffin-doc/ui/user-guide.md.
Gudivada, V., Apon, A., and Ding, J. (2017). Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. International Journal on Advances in Software.
IBM (2023). IBM Data Quality for AI API. https://developer.ibm.com/apis/catalog/dataquality4ai--data-quality-for-ai/Introduction.
Informatica (2023). What is data quality? https://www.informatica.com/resources/articles/what-is-data-quality.html.
InfoZoom (2023). InfoZoom & IZDQ. https://www.infozoom.com/en/products/infozoom-data-quality/.
Jouseau, R., Salva, S., and Samir, C. (2022). On studying the effect of data quality on classification performances. In 23rd International Conference on Intelligent Data Engineering and Automated Learning (IDEAL). Springer.
Jouseau, R., Salva, S., and Samir, C. (2023a). Additional resources for the reproducibility of the experiment. https://gitlab.com/roxane.jouseau/measuring-data-quality-for-classification-tasks.
Jouseau, R., Salva, S., and Samir, C. (2023b). A novel metric for measuring data quality in classification applications (extended version). https://arxiv.org/abs/2312.08066.
Kelly, M., Longjohn, R., and Nottingham, K. (1999). The UCI Machine Learning Repository. https://archive.ics.uci.edu.
OpenRefine (2023). OpenRefine. https://github.com/OpenRefine/OpenRefine.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research.
Pipino, L. L., Lee, Y. W., and Wang, R. Y. (2002). Data quality assessment. Communications of the ACM.
Rolland, A. (2023). MobyDQ. https://ubisoft.github.io/mobydq.
SAS (2023). DataFlux Data Management Studio 2.7: User guide. http://support.sas.com/documentation/onlinedoc/dfdmstudio/2.7/dmpdmsug/dfUnity.html.
Talend (2023). Talend Open Studio for Data Quality – user guide 7.0.1m2. http://download-mirror1.talend.com/top/user-guide-download/V552/TalendOpenStudio_DQ_UG_5.5.2_EN.pdf.