2014). From a business perspective, arranging such
data is very complicated because each company
has different business rules; to overcome this, the
rules in the data input process must be analyzed and
modeled in order to improve data quality (Boselli
et al., 2014).
Efforts related to data quality include Data Quality
Management (DQM), one of the functions available
in a data governance framework. Therefore, this
paper discusses data cleansing techniques as one of
the DQM processes. The algorithm used is applied
to the grouping of text and pattern data based on the
business rules given by one of the Indonesian
government bodies. The tool used in this study is an
open source tool, namely Pentaho Data Integration
(PDI). PDI provides components that can support the
data cleansing process.
2 LITERATURE REVIEW
2.1 Cleansing Method
According to research conducted by Zuhair et al.
(Khayyat et al., 2015), BigDansing is a Big Data
cleansing system built to overcome problems of
efficiency, scalability, and ease of use in data
cleansing; BigDansing itself is implemented on top
of distributed processing frameworks such as Spark.
The research conducted by Anandary (Riezka, 2011)
discusses a data cleansing process carried out using
the Multi-Pass Neighborhood method, with the main
focus on finding data duplication. In the research
conducted by Weije et al. (Wei et al., 2007), data
cleansing is performed according to rules that are
adjusted to business rules. In addition, Kollayut
et al. (Kaewbuadee et al., 2003) developed a data
cleansing approach using functional dependency (FD)
discovery with data mining techniques and used an
element of query optimization called "Selective
Values" to increase the number of FDs found.
2.2 Data Cleansing Algorithm
An algorithm is a sequence of steps used to find
a solution to a problem in a systematic and logical
way (Sitorus, 2015). In data cleansing, supporting
tools are needed so that the process becomes faster
and more efficient. Several studies have proposed
algorithms for data cleansing. The study conducted
by Saleh et al. (Alenazi and Ahmad, 2017) addresses
duplicate detection with five algorithms: DYSNI
(Dynamic Sorted Neighborhood), PSNM (Progressive
Sorted Neighborhood Method), Dedup, InnWin
(Innovative Windows), and DCS++ (Duplicate Count
Strategy++). Two benchmark datasets were used for
the experiments, namely Restaurant and Cora; of
these algorithms, DYSNI provided high accuracy on
the datasets used.
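Several of these algorithms (e.g., PSNM and DCS++) build on the sorted-neighborhood idea: records are sorted by a blocking key and only records that fall within a sliding window of the sorted order are compared. The following is a minimal, generic sketch of that idea, not the implementation evaluated by Alenazi and Ahmad; the record values, key, and window size are purely illustrative.

```python
def sorted_neighborhood_candidates(records, key, window=3):
    """Sort records by a blocking key and return the pairs of records
    that fall within a sliding window of the sorted order."""
    ordered = sorted(records, key=key)
    candidates = []
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + window, len(ordered))):
            candidates.append((ordered[i], ordered[j]))
    return candidates


# Illustrative records: the key brings near-duplicates next to each other.
names = ["Budi Santoso", "Budi  Santoso", "Siti Rahma", "Siti Rahmah"]
pairs = sorted_neighborhood_candidates(names, key=lambda s: s.replace(" ", "").lower())
# Each candidate pair would then be checked with a string-similarity measure.
print(pairs)
```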
Analyzing the interrelationships between data in
columns, tables, and databases can be done with
clustering techniques. One of the clustering
techniques used in this study is the text clustering
method. Text clustering refers to the operation of
finding groups of different values that might be
alternative representations of the same thing. In the
clustering method, the more variants of the same
value that can be grouped together, the better the
resulting data. According to Stephens (Stephens,
2018), clustering can be done with the fingerprint
method, which includes the following processes (a
minimal sketch of a fingerprint keying function is
given after the list):
• Symbol normalization removes symbol and punctua-
tion characters so that the string pattern is easier
to read.
• Space normalization removes whitespace characters
and converts the string to uppercase. Because
whitespace and letter case are the least important
attributes in differentiating meaning, yet tend to
be the most varied parts of a string, removing
them gives a big advantage.
• Character normalization converts characters to
their ASCII equivalents so that minor encoding
differences do not lead to errors in fingerprint
detection.
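The steps above can be summarized as a keying function: two values that produce the same key are placed in the same cluster. The following is a minimal sketch of such a fingerprint key, assuming a Python environment; the example strings are illustrative and not taken from the study's dataset.

```python
import string
import unicodedata


def fingerprint(value: str) -> str:
    """Build a fingerprint key; values with the same key form one cluster."""
    # Symbol normalization: drop punctuation and symbol characters.
    value = value.translate(str.maketrans("", "", string.punctuation))
    # Character normalization: map accented characters to plain ASCII.
    value = (unicodedata.normalize("NFKD", value)
             .encode("ascii", "ignore")
             .decode("ascii"))
    # Space normalization: trim, uppercase, and collapse internal whitespace;
    # sorting and de-duplicating the tokens makes word order irrelevant.
    tokens = value.strip().upper().split()
    return " ".join(sorted(set(tokens)))


# Illustrative values only: all three variants map to the same key.
values = ["Jakarta Pusat", "JAKARTA  PUSAT.", "Pusat, Jakarta"]
clusters = {}
for v in values:
    clusters.setdefault(fingerprint(v), []).append(v)
print(clusters)
```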
3 METHOD
The method used to build the data cleansing
algorithm and implement it in an open source tool
can be seen in Figure 1. The research method is
divided into four stages: data profiling, determining
the data cleansing algorithm, mapping the algorithm
to components in PDI, and finally evaluation. The
first stage is data profiling, which is done to identify
the data problem objects that are in focus. The
second stage is to determine the data cleansing
algorithm in accordance with the business rules
owned by the company. The third stage is to map the
cleansing algorithm to components in PDI and to
implement the algorithm in those PDI components
according to the needs of the business rules. The last
stage is evaluation and testing using a case study
with Pentaho Data Integration (PDI) and the
OpenRefine application.
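Before each stage is discussed in detail, the following is a minimal sketch of the kind of checks the profiling stage performs; it assumes a Python/pandas environment and a hypothetical input file named source_data.csv, independent of the PDI components used in this study.

```python
import pandas as pd

# Hypothetical input file; the study's actual dataset is not reproduced here.
df = pd.read_csv("source_data.csv")

# Simple per-column profile: missing values, cardinality, and a sample value.
profile = pd.DataFrame({
    "null_count": df.isna().sum(),
    "distinct_values": df.nunique(),
    "example_value": df.apply(
        lambda c: c.dropna().iloc[0] if c.notna().any() else None
    ),
})
print(profile)

# Columns with many nulls or unexpected cardinality become the data problem
# objects that the cleansing rules in the next stage must address.
```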
The first step is to carry out the data profiling
function to identify the data problem objects that are
in focus and to group the data streams, which are divided into sev-