2014). From a business perspective, arranging such
data is very complicated because each company
has different business rules; to overcome this, the
rules in the data input process must be analyzed and
modeled in order to improve data quality (Boselli
et al., 2014).
Efforts related to data quality include Data Quality
Management (DQM), one of the functions available
in a data governance framework. Therefore, this
paper discusses data cleansing techniques as one of
the DQM processes. The algorithm used is applied
to the grouping of text and pattern data based on the
business rules given by one of the Indonesian
government bodies. The tool used in this study is an
open source tool, namely Pentaho Data Integration
(PDI). PDI provides components that can support the
data cleansing process.
2 LITERATURE REVIEW
2.1 Cleansing Method
According to research conducted by Zuhair et al.
(Khayyat et al., 2015), BigDansing is a Big Data
cleansing system built to overcome problems of
efficiency, scalability, and ease of use in data
cleansing; BigDansing itself is implemented on top
of distributed processing frameworks such as Spark.
The research conducted by Anandary (Riezka, 2011)
discusses a data cleansing process carried out using
the Multi-Pass Neighborhood method, with the main
focus on finding data duplication. In the research
conducted by Weije et al. (Wei et al., 2007), data
cleansing is performed according to rules that are
adjusted to business rules. In addition, Kollayut
et al. (Kaewbuadee et al., 2003) developed a data
cleansing approach using functional dependency (FD)
discovery with data mining techniques and used an
element of query optimization called "Selective
Values" to increase the number of FDs found.
2.2 Data Cleansing Algorithm
An algorithm is a sequence of steps used to find
a solution to a problem in a systematic and logical
way (Sitorus, 2015). In data cleansing, supporting
tools are needed so that the process becomes faster
and more efficient. Several studies have proposed
algorithms for data cleansing. The study conducted
by Saleh et al. (Alenazi and Ahmad, 2017) addresses
duplicate detection with five algorithms: DYSNI
(Dynamic Sorted Neighborhood), PSNM (Progressive
Sorted Neighborhood Method), Dedup, InnWin
(Innovative Windows), and DCS++ (Duplicate Count
Strategy++). Two benchmark datasets were used for
the experiments, namely Restaurant and Cora; of
these algorithms, DYSNI provided high accuracy on
the datasets used.
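Several of these algorithms (e.g., PSNM and DCS++) build on the sorted-neighborhood idea: records are sorted by a blocking key and only records that fall within a sliding window of the sorted order are compared. The following is a minimal, generic sketch of that idea, not the implementation evaluated by Alenazi and Ahmad; the record values, key, and window size are purely illustrative.

```python
def sorted_neighborhood_candidates(records, key, window=3):
    """Sort records by a blocking key and return the pairs of records
    that fall within a sliding window of the sorted order."""
    ordered = sorted(records, key=key)
    candidates = []
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + window, len(ordered))):
            candidates.append((ordered[i], ordered[j]))
    return candidates


# Illustrative records: the key brings near-duplicates next to each other.
names = ["Budi Santoso", "Budi  Santoso", "Siti Rahma", "Siti Rahmah"]
pairs = sorted_neighborhood_candidates(names, key=lambda s: s.replace(" ", "").lower())
# Each candidate pair would then be checked with a string-similarity measure.
print(pairs)
```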
Analyzing the interrelationships between data in
columns, tables, and databases can be done with
clustering techniques. One of the clustering
techniques used in this study is the text clustering
method. Text clustering refers to the operation of
finding groups of different values that might be
alternative representations of the same thing. In the
clustering method, the more variants of the same
value that can be grouped together, the better the
resulting data. According to Stephens (Stephens,
2018), clustering can be done with the fingerprint
method, which includes the following processes (a
minimal sketch of a fingerprint keying function is
given after the list):
• Symbol normalization removes symbol and punctua-
tion characters so that the string pattern is easier
to read.
• Space normalization removes whitespace characters
and converts the string to uppercase. Because
whitespace and letter case are the least important
attributes in differentiating meaning, yet tend to
be the most varied parts of a string, removing
them gives a big advantage.
• Character normalization converts characters to
their ASCII equivalents so that minor encoding
differences do not lead to errors in fingerprint
detection.
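The steps above can be summarized as a keying function: two values that produce the same key are placed in the same cluster. The following is a minimal sketch of such a fingerprint key, assuming a Python environment; the example strings are illustrative and not taken from the study's dataset.

```python
import string
import unicodedata


def fingerprint(value: str) -> str:
    """Build a fingerprint key; values with the same key form one cluster."""
    # Symbol normalization: drop punctuation and symbol characters.
    value = value.translate(str.maketrans("", "", string.punctuation))
    # Character normalization: map accented characters to plain ASCII.
    value = (unicodedata.normalize("NFKD", value)
             .encode("ascii", "ignore")
             .decode("ascii"))
    # Space normalization: trim, uppercase, and collapse internal whitespace;
    # sorting and de-duplicating the tokens makes word order irrelevant.
    tokens = value.strip().upper().split()
    return " ".join(sorted(set(tokens)))


# Illustrative values only: all three variants map to the same key.
values = ["Jakarta Pusat", "JAKARTA  PUSAT.", "Pusat, Jakarta"]
clusters = {}
for v in values:
    clusters.setdefault(fingerprint(v), []).append(v)
print(clusters)
```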
3 METHOD
The method used to build the data cleansing
algorithm and implement it in an open source tool
can be seen in Figure 1. The research method is
divided into four stages: data profiling, determining
the data cleansing algorithm, mapping the algorithm
to components in PDI, and finally evaluation. The
first stage is data profiling, which is done to identify
the data problem objects that are in focus. The
second stage is to determine the data cleansing
algorithm in accordance with the business rules
owned by the company. The third stage is to map the
cleansing algorithm to components in PDI and to
implement the algorithm in those PDI components
according to the needs of the business rules. The last
stage is evaluation and testing using a case study
with Pentaho Data Integration (PDI) and the
OpenRefine application.
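Before each stage is discussed in detail, the following is a minimal sketch of the kind of checks the profiling stage performs; it assumes a Python/pandas environment and a hypothetical input file named source_data.csv, independent of the PDI components used in this study.

```python
import pandas as pd

# Hypothetical input file; the study's actual dataset is not reproduced here.
df = pd.read_csv("source_data.csv")

# Simple per-column profile: missing values, cardinality, and a sample value.
profile = pd.DataFrame({
    "null_count": df.isna().sum(),
    "distinct_values": df.nunique(),
    "example_value": df.apply(
        lambda c: c.dropna().iloc[0] if c.notna().any() else None
    ),
})
print(profile)

# Columns with many nulls or unexpected cardinality become the data problem
# objects that the cleansing rules in the next stage must address.
```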
The first step is to carry out the data profiling
function to identify the data problem objects that are
in focus and to group the data streams, which are divided into sev-