et al., 2015). In our work, the user is asked to clean
only the dirty data for which no repair could be
generated automatically. User intervention in this case
guarantees the consistency of the data, since it resolves
all the remaining violations that have no proposed fixes.
AI techniques have been widely used for data cleaning
(Rekatsinas et al., 2017), (Konda et al., 2016),
(Yakout et al., 2011), (Krishnan et al., 2017),
(Volkovs et al., 2014). Some of them use machine
learning to generate automatic repairs (Yakout et al.,
2011), (Mayfield et al., 2010) and/or leverage user
feedback through active or reinforcement learning
(Yakout et al., 2011), (Berti-Equille, 2019), (Gokhale
et al., 2014). Others learn from probabilities extracted
from the data to predict repairs (Rekatsinas et al., 2017),
(Yakout et al., 2011). In our work, we exploit AI
techniques by formulating our problem as a CSP. We
leverage the possible repairs returned automatically by a
repair algorithm to choose values that guarantee data
consistency.
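As a rough illustration of this formulation, the sketch below treats each ambiguous cell as a CSP variable whose domain is its set of candidate repairs, and uses a single functional dependency as the constraint. The toy table, the rule zip -> city, and every name in the code (`rows`, `domains`, `satisfies_fd`, `consistent_assignments`) are hypothetical assumptions made for illustration only; they are not taken from the actual implementation, and the exhaustive enumeration is used only because the example is tiny.

```python
from itertools import product

# Hypothetical toy table; cells are addressed as (row index, attribute).
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "60601", "city": "Chicago"},
]

# CSP variables are the ambiguous cells. Each domain holds the candidate
# repairs that an upstream repair algorithm is assumed to have produced.
domains = {
    (0, "city"): ["New York", "NYC"],
    (1, "city"): ["New York", "NYC"],
}


def satisfies_fd(table, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds on table."""
    seen = {}
    for row in table:
        # setdefault stores the first rhs value seen for this lhs value.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


def consistent_assignments(rows, domains, lhs, rhs):
    """Enumerate every assignment of candidate repairs that satisfies the FD."""
    cells = list(domains)
    result = []
    for values in product(*(domains[c] for c in cells)):
        repaired = [dict(r) for r in rows]  # copy so the input stays untouched
        for (i, attr), value in zip(cells, values):
            repaired[i][attr] = value
        if satisfies_fd(repaired, lhs, rhs):
            result.append(dict(zip(cells, values)))
    return result


solutions = consistent_assignments(rows, domains, "zip", "city")
```

Here only the assignments in which both rows with zip 10001 receive the same city survive, which is the sense in which choosing among candidate repairs can guarantee consistency.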
7 CONCLUSION
In this work, we proposed a new data cleaning solution
that uses the strength of the CSP formulation to ensure
data consistency and accuracy in a fully automatic way,
while also allowing human intervention when necessary.
For high-quality repairs, we used QDflows, as it leverages
knowledge bases to perform automatic repairs when
possible and to generate possible repairs otherwise. To
ensure consistency in this step, we enable user
intervention to manually repair violations with no
possible fixes. We also allow multiple cleaning iterations
to repair any newly introduced violations. To handle
ambiguous repair cases, we annotate the involved cells and
collect their previously generated possible repairs; a CSP
solving algorithm is then used for a holistic repair. To
optimize the repair search, we propose a new variable
selection technique that allows us to reach the solution
quickly and avoid dead ends. Our experiments show
promising results in improving repair accuracy and data
consistency, achieving an F1 score higher than 99% while
minimizing human effort (less than 0.1% for a 10% error
rate). They also show that our optimizations and the
proposed variable ordering technique improve the
efficiency of the backtracking search by more than 99.9%
and allow it to repair data in linear time.
Future work may focus on proposing a new data cleaning
approach that provides automatic and consistent repairs
before using the CSPBasedRepair, handling larger datasets
by using a Big Data processing tool, and automatically
discovering quality rules from dirty data when they are
not available.
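To make the role of variable ordering in backtracking search concrete, the following is a minimal sketch. It does not reproduce the variable selection technique proposed in this paper; it substitutes the standard minimum-remaining-values (MRV) heuristic as a stand-in. The cell names, domains, and the `same` constraint are hypothetical.

```python
def consistent(var, value, assignment, constraints):
    """Check value for var against constraints whose other end is assigned."""
    for a, b, pred in constraints:
        if a == var and b in assignment and not pred(value, assignment[b]):
            return False
        if b == var and a in assignment and not pred(assignment[a], value):
            return False
    return True


def select_variable(domains, assignment, constraints):
    """MRV stand-in: pick the unassigned variable with the fewest viable values."""
    unassigned = [v for v in domains if v not in assignment]
    return min(
        unassigned,
        key=lambda v: sum(
            consistent(v, x, assignment, constraints) for x in domains[v]
        ),
    )


def backtrack(domains, constraints, assignment=None):
    """Return one complete consistent assignment, or None if none exists."""
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(domains):
        return dict(assignment)
    var = select_variable(domains, assignment, constraints)
    for value in domains[var]:
        if consistent(var, value, assignment, constraints):
            assignment[var] = value
            solution = backtrack(domains, constraints, assignment)
            if solution is not None:
                return solution
            del assignment[var]  # undo and try the next candidate repair
    return None


# Hypothetical example: three cells that a rule forces to agree pairwise.
def same(x, y):
    return x == y


domains = {
    "t1.city": ["New York", "NYC"],
    "t2.city": ["New York"],
    "t3.city": ["NYC", "New York"],
}
constraints = [("t1.city", "t2.city", same), ("t2.city", "t3.city", same)]
solution = backtrack(domains, constraints)
```

In this example the most constrained cell, `t2.city`, is assigned first, so the inconsistent candidate "NYC" is rejected for the other cells without any backtracking. This is the general effect a good variable ordering aims for.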
REFERENCES
Abdellaoui, S., Nader, F., and Chalal, R. (2017). QDflows:
A System Driven by Knowledge Bases for Designing
Quality-Aware Data flows. Journal of Data and
Information Quality, 8(3-4):1–39.
Berti-Equille, L. (2019). Reinforcement Learning for Data
Preparation with Active Reward Learning.
Bohannon, P., Fan, W., Rastogi, R., and Flaster, M. (2005).
A Cost-Based Model and Effective Heuristic for
Repairing Constraints by Value Modification.
Chiang, F. and Miller, R. J. (2011). A Unified Model for
Data and Constraint Repair.
Chu, X., Morcos, J., Ilyas, I. F., Ouzzani, M., Papotti,
P., Tang, N., and Ye, Y. (2015). KATARA: A Data
Cleaning System Powered by Knowledge Bases and
Crowdsourcing. In Proceedings of the 2015 ACM
SIGMOD International Conference on Management
of Data, pages 1247–1261, Melbourne Victoria
Australia. ACM.
Cong, G., Fan, W., Geerts, F., Jia, X., and Ma, S. (2007).
Improving Data Quality: Consistency and Accuracy.
Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid,
A., Ilyas, I. F., Ouzzani, M., and Tang, N. (2013).
NADEEF: a commodity data cleaning system. In
Proceedings of the 2013 international conference on
Management of data - SIGMOD ’13, page 541, New
York, New York, USA. ACM Press.
Fan, W., Li, J., Ma, S., Tang, N., and Yu, W. (2012).
Towards certain fixes with editing rules and master data.
The VLDB Journal, 21(2):213–238.
Geerts, F., Mecca, G., Papotti, P., and Santoro, D. (2013).
The LLUNATIC data-cleaning framework.
Proceedings of the VLDB Endowment, 6(9):625–636.
Gokhale, C., Das, S., Doan, A., Naughton, J. F., Rampalli,
N., Shavlik, J., and Zhu, X. (2014). Corleone:
hands-off crowdsourcing for entity matching. In Proceedings
of the 2014 ACM SIGMOD International Conference
on Management of Data, SIGMOD ’14, pages
601–612, New York, NY, USA. Association for Computing
Machinery.
Ilyas, I. F. and Chu, X. (2015). Trends in Cleaning
Relational Data: Consistency and Deduplication.
Foundations and Trends® in Databases, 5(4):281–393.
Khayyat, Z., Ilyas, I. F., Jindal, A., Madden, S.,
Ouzzani, M., Papotti, P., Quiané-Ruiz, J.-A., Tang, N.,
and Yin, S. (2015). BigDansing: A System for Big
Data Cleansing. In Proceedings of the 2015 ACM
SIGMOD International Conference on Management
of Data, pages 1215–1230, Melbourne Victoria
Australia. ACM.
Konda, P., Das, S., Ardalan, A., Ballard, J. R., Li, H.,
Panahi, F., Zhang, H., Naughton, J., Prasad, S.,
Krishnan, G., Deep, R., and Raghavendra, V. (2016).
Magellan: Toward Building Entity Matching Management
Systems.
CSP-DC: Data Cleaning via Constraint Satisfaction Problem Solving