propriate granulation of the text into m-grams, along with a neural word embedding representation of the single tokens, makes it possible to build a real-valued embedding of documents (the SH approach). This procedure yields a supervised learning problem that can be solved by a standard machine learning algorithm, such as ν-SVM. The appeal of the present work is
twofold. On the one hand, it is possible to measure the dissimilarity between word-vector representations of m-grams of different lengths by means of a custom dissimilarity belonging to the family of edit distances. On the other hand, the entire processing pipeline builds a gray-box model that enables users to understand how the core classifier makes its decisions, outputting a series of meaningful symbols, such as short sequences of words related to the class label. An evolutionary strategy along
with the tuning of the classifier hyper-parameters is used for wrapper-like feature selection, where feature weights (genes of the overall chromosome), originally cast as real-valued vectors, are binarized in two different ways, namely in an online and an offline fashion. The first approach clearly outperforms the second in the conducted experiments. The
satisfactory recognition performance, together with the remarkable possibility of obtaining additional information for knowledge discovery tasks, makes us confident about further developments of the described system. As
concerns the GrC model, it is also possible to adopt an external text corpus (e.g., Wikipedia), eliciting a kind of focused transfer learning procedure. The “minDist” decision rule tested here can be replaced with other suitable rules, making the construction of the SH more robust and hence improving the synthesis of the alphabet symbols. Finally, as concerns the dissimilarity measure between information granules (i.e., m-grams), other dissimilarity measures can be investigated in order to provide a sound semantic background to the system, such as the plain Euclidean distance between equal-sized m-grams or more general edit distances such as multidimensional Dynamic Time Warping. In the latter case, longer m-grams can be used, pushing the boundary towards more explainable AI systems.
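To make the notion of an edit distance between m-grams of different lengths concrete, the following is a minimal sketch (not the code used in this work) of a Levenshtein-style distance between token sequences, in which the substitution cost is the Euclidean distance between the word vectors of the two tokens and insertions/deletions pay a fixed gap cost. The toy two-dimensional embedding and the gap parameter are illustrative assumptions only.

```python
# Sketch of a weighted edit distance between m-grams: substitution cost is
# the Euclidean distance between word vectors; insert/delete cost a fixed gap.
import numpy as np

def mgram_edit_distance(a, b, embed, gap=1.0):
    """Weighted edit distance between token sequences a and b.

    a, b  : lists of tokens (m-grams, possibly of different lengths)
    embed : dict mapping token -> np.ndarray word vector
    gap   : cost of inserting or deleting a single token
    """
    n, m = len(a), len(b)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = gap * np.arange(n + 1)   # delete every token of a's prefix
    D[0, :] = gap * np.arange(m + 1)   # insert every token of b's prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = np.linalg.norm(embed[a[i - 1]] - embed[b[j - 1]])
            D[i, j] = min(D[i - 1, j] + gap,      # deletion
                          D[i, j - 1] + gap,      # insertion
                          D[i - 1, j - 1] + sub)  # substitution
    return D[n, m]

# Toy embedding: semantically close words get close vectors.
emb = {"free": np.array([1.0, 0.0]),
       "gift": np.array([0.9, 0.1]),
       "call": np.array([0.0, 1.0]),
       "now":  np.array([0.1, 0.9])}

# An m-gram whose tokens are near-synonyms of another's scores a small
# distance; a different, longer m-gram scores a larger one.
d_close = mgram_edit_distance(["free", "call"], ["gift", "call"], emb)
d_far = mgram_edit_distance(["free", "call"], ["call", "now", "free"], emb)
```

The same dynamic-programming skeleton can be adapted towards a multidimensional Dynamic Time Warping scheme, along the lines suggested above, by letting all three moves accumulate the local vector distance instead of a fixed gap cost.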
Mining M-Grams by a Granular Computing Approach for Text Classification