to provide information for data cleaning. Analysing
the structure of the data involves checking their con-
sistency and format. Computing descriptive statistical
information like minima, maxima and percentages as
well as determining data types and lengths falls un-
der this category. Profiling the data content identi-
fies specific properties concerning missing values or
other errors. A crucial part of data profiling, especially
with regard to subsequent feature engineering tasks in
machine learning projects, is to discover how different
parts of the data are related to each other. Identifying
embedded value dependencies and functional
dependencies between attributes or tables as well as
potential (foreign) keys is part of this data profiling
aspect. An excellent overview of the topic can be
found in (Abedjan et al., 2015). Data profiling is also
related to data exploration. A survey on data explo-
ration techniques can be found in (Guido et al., 2015;
Di Blas et al., 2014). Both commercial and open-
source software tools are available for data profiling
and cleaning. Vendors of commercial ERP systems often
also have corresponding data quality tools in
their portfolios. SAP sells its Information Steward,
Oracle its Enterprise Data Quality, and IBM offers
InfoSphere Information Server. Informatica, a company
specialized in business analytics, sells Informatica
Data Profiling, while SAS offers its DataFlux Management
Studio. Apart from their comprehensive scope,
the major advantage of these tools is their seamless
integration into the respective ERP systems of their
companies, which avoids migration and transformation
problems. Their disadvantage is that they are
generally not optimized for the profiling of product data.
Besides these commercially available products, a
plethora of open-source tools exists. A few mature
software packages are Metanome (Papenbrock et al.,
2015), developed and managed by the University of
Potsdam, Germany, Profiler (Kandel et al., 2012) and
Talend Open Studio. The drawbacks of free data profiling
software usually lie in its narrow focus on
specific applications, especially for tools of
scientific origin. In contrast to these tools, we propose a
data profiling tool and reference process that specifically
take into account domain knowledge about product
data.
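The profiling tasks described in this section can be illustrated with a minimal Python sketch. It is not the tool proposed in this paper; the function names `profile_column` and `holds_fd`, the sample rows and the attribute names are assumptions chosen purely for illustration. The first function computes basic structural statistics (data types, minimum, maximum, maximum length, percentage of missing values) for a single column, the second checks whether a functional dependency between attributes holds in a set of rows.

```python
def profile_column(values):
    """Basic structural profile of one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in present]
    missing = len(values) - len(present)
    return {
        "types": sorted({type(v).__name__ for v in present}),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "max_length": max(lengths) if lengths else 0,
        "missing_pct": 100.0 * missing / len(values) if values else 0.0,
    }


def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}  # maps each lhs value combination to the first rhs values seen
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False  # same lhs values, different rhs values
    return True


# Hypothetical sample data, not taken from the paper.
rows = [
    {"product_id": 1, "category": "wine", "unit": "bottle", "volume": 0.75},
    {"product_id": 2, "category": "wine", "unit": "bottle", "volume": None},
    {"product_id": 3, "category": "flour", "unit": "bag", "volume": 1.0},
]

volume_profile = profile_column([r["volume"] for r in rows])
category_determines_unit = holds_fd(rows, ["category"], ["unit"])
```

In this sample, `category -> unit` holds, whereas `unit -> volume` does not, because two rows with the same unit carry different volume entries. The pairwise check is only meant to show the idea; discovering all dependencies in a large table requires dedicated algorithms such as those surveyed in (Abedjan et al., 2015).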
3 PRODUCT DATA
3.1 General Description
Product data are of utmost importance in retail com-
panies. For technical and business reasons, products
are classified into categories arranged in a multi-level
merchandise category hierarchy. Moreover, there is
a variety of product types such as simple product
items, grouped products (e.g. displays, sets)
and generic products and their variants. Product
attributes are referenced by many processes in purchasing,
material requirements planning, inventory management
and sales. Table 1 shows some attributes of
product data in standard ERP software. We can
distinguish between descriptive and process-related
attributes.
The values of descriptive attributes are the same
for all locations (e.g. stores, warehouses) of a com-
pany whereas process attributes may be maintained
differently. For example, the safety stock may be 10
pieces in one store and 15 pieces in another. Process-
related attributes control the processes in a company
and are decisive for process automation. Usually, the
product attributes provided in standard ERP software
do not meet all requirements of a company.
Therefore, they are supplemented by custom-specific
attributes, which can be either descriptive or process-
related.
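The distinction between descriptive and process-related attributes can be sketched in Python as follows. The sketch assumes, purely for illustration, that product data are available as one record per product and location; the function name `attributes_varying_by_location` and the field names are hypothetical. It reports every attribute whose value differs between locations of the same product. For a descriptive attribute this indicates an inconsistency, whereas for a process-related attribute such as the safety stock it is legitimate.

```python
from collections import defaultdict


def attributes_varying_by_location(records, key="product_id", location="location"):
    """Map each attribute to the products whose values differ across locations."""
    distinct = defaultdict(set)  # (product, attribute) -> distinct values seen
    for rec in records:
        for attr, val in rec.items():
            if attr not in (key, location):
                distinct[(rec[key], attr)].add(val)

    varying = defaultdict(set)
    for (product, attr), vals in distinct.items():
        if len(vals) > 1:
            varying[attr].add(product)
    return dict(varying)


# Hypothetical records mirroring the safety stock example in the text.
records = [
    {"product_id": "P1", "location": "store_A",
     "description": "Red wine 0.75 l", "safety_stock": 10},
    {"product_id": "P1", "location": "store_B",
     "description": "Red wine 0.75 l", "safety_stock": 15},
]

varying = attributes_varying_by_location(records)
```

Here only `safety_stock` is reported for product `P1`; the description is identical in both stores, as expected for a descriptive attribute.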
In most cases, product attributes are also classified
thematically in the ERP software. Typical classes are basic
data, logistic data, purchasing data etc. Basic data
include the product description, packaging units, di-
mensions, volumes, gross and net weights, hazardous
substance codes etc. Logistic data include safety
stock, service level, delivery time, goods receipt pro-
cessing time etc. Purchasing and sales data include
prices, minimum order quantities, delivery periods
etc.
The attributes that have to be maintained for a spe-
cific product item depend on its category and type.
Consequently, the product categories and types have
to be taken into account in the data profiling process.
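How category and type can enter a completeness check is sketched below in Python. The rule table `REQUIRED`, the category and type names, and the function `missing_attributes` are assumptions made for illustration only; in practice such rules would be derived from domain knowledge about the product data at hand.

```python
# Hypothetical rules: required attributes per (category, product type).
REQUIRED = {
    ("beverages", "simple"): {"description", "volume", "alcoholic_strength"},
    ("beverages", "display"): {"description"},
}


def missing_attributes(product, required=REQUIRED):
    """List required attributes that are absent or empty for this product."""
    rule = required.get((product["category"], product["type"]), set())
    return sorted(attr for attr in rule if product.get(attr) is None)


beer = {"category": "beverages", "type": "simple",
        "description": "Beer", "volume": 0.5}
display = {"category": "beverages", "type": "display",
           "description": "Beer display"}

beer_gaps = missing_attributes(beer)
display_gaps = missing_attributes(display)
```

The same missing attribute is therefore reported as a gap for the simple product item but not for the display, which is the behaviour a profiling process aware of categories and types would need.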
Table 1: High-level classification of product attributes in an
ERP system.
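                | descriptive                  | process-related
----------------+------------------------------+---------------------------------
standard        | - product description        | - reorder point
                | - category                   | - safety stock
                | - volume                     | - goods receipt processing time
                | - unit of measure            | ...
                | ...                          |
----------------+------------------------------+---------------------------------
custom-specific | - alcoholic strength         | - age limit for sales
                | - description of ingredients | - flags for logistic processes
                | ...                          | ...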
DATA 2019 - 8th International Conference on Data Science, Technology and Applications