First, we observe that data are often noisy, erroneous, or missing. For example, jargon, misspelled words, and grammatical errors pose significant technical challenges for linguistic analysis. Moreover, data captured by mobile and wearable devices and by sensors can be noisy.
Second, we point out that, as data grow exponentially, it becomes increasingly difficult for companies to ensure that their data sources, and the information they carry, are trustworthy.
The veracity of Big Data, which is an issue of data validity, is a bigger challenge than volume, velocity, and variety in BDA. It is estimated that approximately 20–25% of online consumer review texts are fake (Qiao, 2017; Hayek, 2020). In our project, data validity is more than an issue of veracity: it is a parameter that depends on data quality at the input point.
Given these premises, data cleaning, filtering, and selection techniques that can automatically detect and remove noise and anomalies from the data become essential.
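As an illustration only (the pandas-based pipeline and the column name `text` are our assumptions, not project specifics), a minimal cleaning step might look as follows:

```python
import pandas as pd

def clean_records(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete records, normalize free text, remove duplicates."""
    df = df.dropna(subset=["text"]).copy()               # discard records with a missing text field
    df["text"] = (
        df["text"]
        .str.lower()                                     # normalize case
        .str.replace(r"\s+", " ", regex=True)            # collapse runs of whitespace
        .str.strip()
    )
    df = df[df["text"].str.len() > 3]                    # filter near-empty, noise-only entries
    return df.drop_duplicates(subset=["text"])           # remove exact duplicates

# usage with toy records
raw = pd.DataFrame({"text": ["Great  product ", "great product", None, "!!"]})
print(clean_records(raw))
```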
In any knowledge extraction process, the value of the extracted knowledge is tied to the quality of the underlying data. Big Data, generated by the massive growth in data scale observed in recent years, is subject to the same issue. A common problem affecting the quality of data, and of Big Data in particular, is the presence of noise. This is especially true in classification problems, where label noise, i.e., the incorrect labeling of training instances, is known to be a very disruptive feature of data (Garcia-Gil, 2019). In our approach, data are initially labelled by human operators and gradually passed on as a training set to the supervised ML algorithms. This procedure avoids accumulating large amounts of partially labelled or unlabelled data.
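A minimal sketch of this kind of procedure, assuming scikit-learn and a text classification task (the helper name and the confidence threshold are illustrative assumptions, not our actual pipeline):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def grow_training_set(seed_texts, seed_labels, unlabelled_texts, threshold=0.9):
    """Fit on the human-labelled seed set, then promote only
    high-confidence predictions into the training set."""
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(seed_texts, seed_labels)                      # human labels bootstrap the model
    probs = clf.predict_proba(unlabelled_texts)
    keep = probs.max(axis=1) >= threshold                 # confidence gate against label noise
    new_texts = [t for t, ok in zip(unlabelled_texts, keep) if ok]
    new_labels = list(clf.classes_[probs.argmax(axis=1)][keep])
    return seed_texts + new_texts, seed_labels + new_labels
```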
Our research focuses on the development of uniform data quality standards and metrics for BDA that address some of the main data quality dimensions (e.g., accuracy, accessibility, credibility, consistency, completeness, integrity, auditability, interpretability, and timeliness). In particular, it focuses on the setup of a panel of indicators for analysing the performance and quality aspects of the BDA process.
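As a sketch of what such indicators could look like (the metric definitions below are simplified assumptions, not the project's official ones):

```python
import pandas as pd

def quality_indicators(df: pd.DataFrame, ts_col: str = "timestamp",
                       max_age_days: int = 30) -> dict:
    """A few simple dataset-level quality indicators."""
    age = pd.Timestamp.now() - pd.to_datetime(df[ts_col])
    return {
        "completeness": float(df.notna().to_numpy().mean()),  # share of non-missing cells
        "consistency": float(1.0 - df.duplicated().mean()),   # share of non-duplicated records
        "timeliness": float(                                   # share of records inside the freshness window
            (age <= pd.Timedelta(days=max_age_days)).mean()
        ),
    }
```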
To this aim, the most significant techniques and technologies considered in SIBDA belong to the following three areas.
1. Data ingestion. Among the scenarios that characterize the acquisition and initial processing of Big Data, one of the most relevant concerns data coming from the IoT, together with the enabling middleware and event processing techniques that support its effective integration (Marjani, 2017). For text analytics, the research questions regard ECM, namely how advanced content management applications are shaped by the growing significance of information extraction in the enterprise environment, and how Big Data storage tools, in particular document-oriented databases, are spreading in this domain.
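A minimal sketch of the validation that could take place at this ingestion boundary, with the middleware abstracted away and the event schema assumed purely for illustration:

```python
import json
from typing import Iterable, Iterator

REQUIRED_FIELDS = {"device_id", "timestamp", "value"}  # assumed minimal event schema

def ingest(raw_payloads: Iterable[bytes]) -> Iterator[dict]:
    """Parse and validate raw IoT payloads at the ingestion boundary."""
    for payload in raw_payloads:
        try:
            event = json.loads(payload)
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue                                   # drop malformed messages early
        if isinstance(event, dict) and REQUIRED_FIELDS <= event.keys():
            yield event                                # only well-formed events reach storage

# usage: payloads as they might arrive from the messaging middleware
for event in ingest([b'{"device_id": "s1", "timestamp": 1, "value": 20.5}', b"garbage"]):
    print(event)
```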
2. Data storage. For BDA, we identify two research questions: i) how to store large volumes of data, including whole documents; ii) how to archive unstructured and variable data in a way that makes them automatically understandable via an ML approach. The solution lies in adopting NoSQL, or document-oriented, databases.
Another storage issue regards how to limit the complexity and cost of the hardware infrastructure when scaling up volumes.
Different storage models have been proposed to handle large volumes while still delivering adequate performance, particularly in response times. These new types of databases include NoSQL and NewSQL databases (Meier, 2019). An interesting model is the one proposed for document-oriented databases: values consisting of semi-structured data are represented in a standard format (e.g., XML, JSON (JavaScript Object Notation), or BSON (Binary JSON)) and are organized as attributes, or name-value pairs, where one column can contain hundreds of attributes. The most common examples are CouchDB (JSON) and MongoDB (BSON). This way of organizing information is well suited to managing textual content.
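As a sketch of this model, assuming a MongoDB instance accessed via pymongo (the connection string and the database, collection, and field names are illustrative):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
docs = client["sibda"]["documents"]

# a document is a set of name-value pairs; fields may vary from record to record
doc = {
    "title": "Quarterly report",
    "body": "Full text of the document ...",
    "tags": ["finance", "2020"],
    "source": {"system": "ECM", "ingested": "2020-06-01T10:00:00Z"},
}
doc_id = docs.insert_one(doc).inserted_id

# any attribute, including nested ones, remains directly queryable
match = docs.find_one({"tags": "finance"}, {"title": 1})
print(doc_id, match["title"])
```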
3. Data analysis. Analytics applications are the core of the Big Data phenomenon, as they are in charge of extracting significant value and knowledge from the data acquired and archived with the techniques above. This result can be achieved either through Business Intelligence techniques or through more exploratory techniques, referred to as Advanced Analytics (Chawda, 2016; Elragal, 2019). The generated knowledge must be made available to users and shared effectively among all the actors of the business processes that can benefit from it. In this area, we mention the techniques of Big Data Intelligence, Advanced Analytics, Content Analytics, and Enterprise Search (or Information Discovery) (Hariri, 2019).
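A minimal exploratory-analytics sketch in this spirit, using scikit-learn on a toy corpus (the documents and the number of clusters are assumptions made for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [                      # toy documents standing in for archived content
    "invoice payment overdue supplier reminder",
    "payment received invoice closed",
    "sensor temperature reading anomaly detected",
    "temperature sensor calibration drift",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(corpus)                       # TF-IDF content representation
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# report the most characteristic terms of each discovered topic
terms = vec.get_feature_names_out()
for c in range(km.n_clusters):
    top = np.argsort(km.cluster_centers_[c])[::-1][:3]
    print(f"cluster {c}:", [terms[i] for i in top])
```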
An overall view of the above themes is given in Figure 1, which shows the component schema adopted for our company's model.