Metadata is an aspect common to all forms of data sources: relational, graph, columnar and so on all share a common representation of, and idea about, metadata. The crudest yet most accurate definition of metadata is "data about data". Alternatively, metadata can be regarded as information that elaborates on the content of a data source or dataset. Metadata takes many representations and forms; the most common types are descriptive, structural and administrative, each serving its own purpose with its own complexity. Descriptive metadata, for instance, deals with information that identifies an entity, object or document, e.g. titles, keywords, document abstracts, summaries, and the authors or creators of a project. Metadata is a powerful set of data that can be enhanced, enriched and exploited to develop useful insights into the data itself. If properly curated and deployed over the data sources, metadata provides an opportunity to resolve the data integration problem to a certain extent: it can be used to draw data-based context summaries before the actual integration of the sources. The main hurdle is the availability of good-quality metadata. Well-structured metadata can provide more useful insights into the data and can also address the problem of conflict resolution through the learning curve of metadata repositories.
Metadata has many forms and complexities associated with it. For the scope of this research, the metadata type has been restricted to textual content; all metadata analysed, acquired and created is text based, and no formal numerical representations have been considered for the initial research phase. The second major concern regarding metadata management is the use of techniques, formulas or algorithms that can devise a learning and classification path for the metadata under consideration. Eurostat metadata is very detailed and rich in contextual information regarding various European statistics. This metadata provides the opportunity to analyse and locate similar information between different sources for integration, processing, analysis and so on.
Text processing is the primary research aspect for analysing textual metadata. The classification of keywords and the meaning of different headers, titles, sources, dates of publishing, etc. all represent valuable and useful information. The text can be analysed using various algorithms and assigned to different categories using predictive and classification algorithms.
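As a minimal illustration of this classification step (the category names and toy metadata snippets below are illustrative assumptions, not actual Eurostat content), a multinomial Naïve Bayes classifier over keyword counts could be sketched as follows:

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and split metadata text into keyword tokens.
    return text.lower().split()

class NaiveBayesClassifier:
    """Multinomial Naive Bayes over keyword counts (Laplace-smoothed)."""

    def __init__(self):
        self.class_docs = defaultdict(int)       # documents seen per class
        self.class_words = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        self.total_docs = 0

    def train(self, text, label):
        self.class_docs[label] += 1
        self.total_docs += 1
        for w in tokenize(text):
            self.class_words[label][w] += 1
            self.vocab.add(w)

    def classify(self, text):
        best, best_score = None, float("-inf")
        for label in self.class_docs:
            # Log prior of the class.
            score = math.log(self.class_docs[label] / self.total_docs)
            total = sum(self.class_words[label].values())
            for w in tokenize(text):
                # Laplace-smoothed log likelihood of each keyword.
                count = self.class_words[label][w]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best, best_score = label, score
        return best

# Toy training data: hypothetical metadata headers and categories.
nb = NaiveBayesClassifier()
nb.train("population census annual estimates", "demography")
nb.train("births deaths migration population", "demography")
nb.train("gdp quarterly national accounts", "economy")
nb.train("inflation prices consumer index", "economy")

print(nb.classify("population migration statistics"))  # prints "demography"
```

The same interface would apply to real metadata summaries; only the tokenization and training corpus would need to grow.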
In this paper we propose a fuzzy based approach to developing metadata integration strategies, combined with a Naïve Bayes representation for text analysis and classification. The new process differs from previous approaches in two major aspects. First, unstructured metadata documents can also be exploited through the fuzzy logic representation, i.e. various classes have been designed for unstructured information to fall into, defining a more open context for metadata to associate with. For example, a simple 20-line table summary without a title or any other description can be classified into "open fuzzy classes": the text of the description is analysed, and a probabilistic match based on the extracted keywords is sent through the framework pipeline to obtain "possible matches" and "possible classes". Secondly, besides the set of keywords, a set of synonyms is derived from a dictionary and kept in a repository. This repository can then be accessed to find similar-meaning words in the summaries of other data sources.
The repository also maintains key-link, or historical, information, which is regarded as the learning curve for future work. In each iteration, the classification accesses the words, possible keywords, possible matches and repository synonyms. Once a match is successfully classified, the original word "OW", the associated keyword "OKW" and the derived or accessed synonym "DSA" are kept as a link in the historical table. This table will be used as a feedback generator in the next phase of the research, which is concerned with the learning curve of metadata repositories.
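The synonym repository and historical link table described above might be sketched as follows. The class name, the small synonym dictionary and the exact (OW, OKW, DSA) record layout are illustrative assumptions, not the paper's actual implementation:

```python
# Toy stand-in for a dictionary-derived synonym resource.
SYNONYMS = {
    "population": ["inhabitants", "residents"],
    "income": ["earnings", "revenue"],
}

class MetadataRepository:
    def __init__(self, synonyms):
        self.synonyms = synonyms
        self.history = []  # historical table of (OW, OKW, DSA) links

    def match_keyword(self, original_word, candidate_keywords):
        """Try to link a word from one summary to a keyword in another
        summary, either directly or via a derived synonym (DSA)."""
        if original_word in candidate_keywords:
            # Direct match: no synonym was needed.
            self.history.append((original_word, original_word, None))
            return original_word
        for syn in self.synonyms.get(original_word, []):
            if syn in candidate_keywords:
                # Record original word, associated keyword, derived synonym.
                self.history.append((original_word, syn, syn))
                return syn
        return None

repo = MetadataRepository(SYNONYMS)
other_summary_keywords = {"residents", "earnings", "region"}
print(repo.match_keyword("population", other_summary_keywords))  # prints "residents"
print(repo.history)  # the recorded (OW, OKW, DSA) link
```

In a fuller implementation the history list would be a persistent table, so later iterations can replay successful links as the feedback signal described above.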
In summary, the contributions of this research paper include the following:
- We propose and formulate a new methodology for utilizing metadata for more accurate and enhanced data integration.
- We deploy Fuzzy logic and Naïve Bayes as ground algorithms for proof of concept and algorithm efficiency monitoring.
- We conduct our experiments using the standardized and relevant Eurostat data and metadata, which provides the groundwork for analytical analysis and validated experimentation.
The remainder of the paper is organized as follows. The problem definition is formulated in Section 2.1. Section 2.2 provides an overview of the designed Fuzzy metadata framework, with details of its inputs, processes and outputs; the study variables and framework components are also discussed in this section. Section 2.3 discusses the proposed methodology from the start to the end of framework and problem execution. Section 3 discusses the concepts and literature studied and utilized in the construction of this research, reviewing the existing work conducted in this field. Section 4
DATA 2018 - 7th International Conference on Data Science, Technology and Applications