First, we observe that data are often noisy, erroneous, or missing. For example, jargon, misspelled words, and grammatical errors pose significant technical challenges for linguistic analysis. Moreover, data captured by mobile and wearable devices and by sensors can be noisy.
Second, we point out that, as data grow exponentially, it becomes increasingly difficult for companies to ensure that their data sources, and the information they carry, are trustworthy.
The veracity of Big Data, which is an issue of data validity, is a bigger challenge than volume, velocity, and variety in BDA. It is estimated that approximately 20–25% of online consumer review texts are fake (Qiao, 2017; Hayek, 2020). In our project, data validity is more than an issue of veracity: it is a parameter that depends on data quality at the input point.
Given these premises, data cleaning, filtering, and selection techniques that can automatically detect and remove noise and anomalies from the data become essential.
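As an illustration only (the pandas-based pipeline and the column name `text` are our assumptions, not project specifics), a minimal cleaning step might look as follows:

```python
import pandas as pd

def clean_records(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete records, normalize free text, remove duplicates."""
    df = df.dropna(subset=["text"]).copy()               # discard records with a missing text field
    df["text"] = (
        df["text"]
        .str.lower()                                     # normalize case
        .str.replace(r"\s+", " ", regex=True)            # collapse runs of whitespace
        .str.strip()
    )
    df = df[df["text"].str.len() > 3]                    # filter near-empty, noise-only entries
    return df.drop_duplicates(subset=["text"])           # remove exact duplicates

# usage with toy records
raw = pd.DataFrame({"text": ["Great  product ", "great product", None, "!!"]})
print(clean_records(raw))
```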
In any knowledge extraction process, the value of the extracted knowledge is tied to the quality of the underlying data. Big Data, generated by the massive growth in data scale observed in recent years, is subject to the same issue. A common problem affecting the quality of data, and of Big Data in particular, is the presence of noise. This is especially true in classification problems, where label noise, i.e., the incorrect labeling of training instances, is known to be a very disruptive feature of data (Garcia-Gil, 2019). In our approach, data are initially labelled by human operators and gradually passed on as a training set to the supervised ML algorithms. This procedure avoids accumulating large amounts of partially labelled or unlabelled data.
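A minimal sketch of this kind of procedure, assuming scikit-learn and a text classification task (the helper name and the confidence threshold are illustrative assumptions, not our actual pipeline):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def grow_training_set(seed_texts, seed_labels, unlabelled_texts, threshold=0.9):
    """Fit on the human-labelled seed set, then promote only
    high-confidence predictions into the training set."""
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(seed_texts, seed_labels)                      # human labels bootstrap the model
    probs = clf.predict_proba(unlabelled_texts)
    keep = probs.max(axis=1) >= threshold                 # confidence gate against label noise
    new_texts = [t for t, ok in zip(unlabelled_texts, keep) if ok]
    new_labels = list(clf.classes_[probs.argmax(axis=1)][keep])
    return seed_texts + new_texts, seed_labels + new_labels
```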
Our research focuses on the development of uniform data quality standards and metrics for BDA that address some of the main data quality dimensions (e.g., accuracy, accessibility, credibility, consistency, completeness, integrity, auditability, interpretability, and timeliness). In particular, it focuses on the setup of a panel of indicators for analysing the performance and quality aspects of the BDA process.
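As a sketch of what such indicators could look like (the metric definitions below are simplified assumptions, not the project's official ones):

```python
import pandas as pd

def quality_indicators(df: pd.DataFrame, ts_col: str = "timestamp",
                       max_age_days: int = 30) -> dict:
    """A few simple dataset-level quality indicators."""
    age = pd.Timestamp.now() - pd.to_datetime(df[ts_col])
    return {
        "completeness": float(df.notna().to_numpy().mean()),  # share of non-missing cells
        "consistency": float(1.0 - df.duplicated().mean()),   # share of non-duplicated records
        "timeliness": float(                                   # share of records inside the freshness window
            (age <= pd.Timedelta(days=max_age_days)).mean()
        ),
    }
```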
To this aim, the most significant techniques and technologies considered in SIBDA belong to the following three areas.
1. Data ingestion. Among the scenarios that characterize the acquisition and initial processing of Big Data, one of the most relevant concerns data coming from the IoT, together with the enabling middleware and event processing techniques that support its effective integration (Marjani, 2017). For text analytics, the research questions regard ECM, namely how advanced content management applications are shaped by the growing significance of information extraction in the enterprise environment, and how Big Data storage tools, in particular document-oriented databases, are spreading in this domain.
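A minimal sketch of the validation that could take place at this ingestion boundary, with the middleware abstracted away and the event schema assumed purely for illustration:

```python
import json
from typing import Iterable, Iterator

REQUIRED_FIELDS = {"device_id", "timestamp", "value"}  # assumed minimal event schema

def ingest(raw_payloads: Iterable[bytes]) -> Iterator[dict]:
    """Parse and validate raw IoT payloads at the ingestion boundary."""
    for payload in raw_payloads:
        try:
            event = json.loads(payload)
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue                                   # drop malformed messages early
        if isinstance(event, dict) and REQUIRED_FIELDS <= event.keys():
            yield event                                # only well-formed events reach storage

# usage: payloads as they might arrive from the messaging middleware
for event in ingest([b'{"device_id": "s1", "timestamp": 1, "value": 20.5}', b"garbage"]):
    print(event)
```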
2. Data storage. For BDA, we identify two research questions: i) how to store large volumes of data, including whole documents; ii) how to archive unstructured and variable data in a way that makes them automatically understandable via an ML approach. The solution lies in adopting NoSQL, or document-oriented, databases.
Another storage issue regards how to limit the complexity and cost of the hardware infrastructure when scaling up volumes.
Different storage models have been proposed to handle large volumes while still delivering adequate performance, particularly in response times. These new types of databases include NoSQL and NewSQL databases (Meier, 2019). An interesting model is the one proposed for document-oriented databases: values consisting of semi-structured data are represented in a standard format (e.g., XML, JSON (JavaScript Object Notation), or BSON (Binary JSON)) and are organized as attributes, or name-value pairs, where one column can contain hundreds of attributes. The most common examples are CouchDB (JSON) and MongoDB (BSON). This way of organizing information is well suited to managing textual content.
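As a sketch of this model, assuming a MongoDB instance accessed via pymongo (the connection string and the database, collection, and field names are illustrative):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
docs = client["sibda"]["documents"]

# a document is a set of name-value pairs; fields may vary from record to record
doc = {
    "title": "Quarterly report",
    "body": "Full text of the document ...",
    "tags": ["finance", "2020"],
    "source": {"system": "ECM", "ingested": "2020-06-01T10:00:00Z"},
}
doc_id = docs.insert_one(doc).inserted_id

# any attribute, including nested ones, remains directly queryable
match = docs.find_one({"tags": "finance"}, {"title": 1})
print(doc_id, match["title"])
```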
3. Data analysis. Analytics applications are the core of the Big Data phenomenon, as they are in charge of extracting significant value and knowledge from the data acquired and archived with the techniques above. This result can be achieved either through Business Intelligence techniques or through more exploratory techniques, referred to as Advanced Analytics (Chawda, 2016; Elragal, 2019). The generated knowledge must be made available to users and shared effectively among all the actors of the business processes that can benefit from it. In this area, we mention the techniques of Big Data Intelligence, Advanced Analytics, Content Analytics, and Enterprise Search (or Information Discovery) (Hariri, 2019).
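A minimal exploratory-analytics sketch in this spirit, using scikit-learn on a toy corpus (the documents and the number of clusters are assumptions made for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [                      # toy documents standing in for archived content
    "invoice payment overdue supplier reminder",
    "payment received invoice closed",
    "sensor temperature reading anomaly detected",
    "temperature sensor calibration drift",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(corpus)                       # TF-IDF content representation
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# report the most characteristic terms of each discovered topic
terms = vec.get_feature_names_out()
for c in range(km.n_clusters):
    top = np.argsort(km.cluster_centers_[c])[::-1][:3]
    print(f"cluster {c}:", [terms[i] for i in top])
```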
An overall view of the above themes is given in Figure 1, which shows the component schema adopted for our company's model.