propriate granulation of the text into m-grams, along with a neural word embedding representation of the single tokens, makes it possible to build a real-valued embedding of documents (the SH approach). This procedure yields a supervised learning problem that can be solved by a standard machine learning algorithm, such as ν-SVM. The appeal of the present work is
twofold. On the one hand, it is possible to measure the dissimilarity between word-vector representations of m-grams of different lengths by means of a custom dissimilarity belonging to the family of edit distances. On the other hand, the entire processing pipeline builds a gray-box model that enables users to understand how the core classifier makes its decisions, outputting a series of meaningful symbols, such as short sequences of words related to the class label. An evolutionary strategy along
with the tuning of the classifier hyper-parameters is used for wrapper-like feature selection, where feature weights (genes of the overall chromosome), originally cast as real-valued vectors, are binarized in two different ways, namely in an online and an offline fashion. The first approach clearly outperforms the second in the conducted experiments. The
satisfactory recognition performance, together with the remarkable possibility of obtaining additional information for knowledge discovery tasks, makes us confident about further developments of the described system. As
concerns the GrC model, it is also possible to adopt an external text corpus (e.g., Wikipedia), eliciting a kind of focused transfer learning procedure. The “minDist” decision rule tested here can be replaced with other suitable rules, making the construction of the SH more robust and hence improving the synthesis of the alphabet symbols. Finally, as concerns the dissimilarity measure between information granules (i.e., m-grams), other dissimilarity measures can be investigated in order to provide a sound semantic background to the system, such as the plain Euclidean distance between equal-sized m-grams or more general edit distances such as multidimensional Dynamic Time Warping. In the latter case, longer m-grams can be used, pushing the boundary towards more explainable AI systems.
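To make the notion of an edit distance between m-grams of different lengths concrete, the following is a minimal sketch (not the code used in this work) of a Levenshtein-style distance between token sequences, in which the substitution cost is the Euclidean distance between the word vectors of the two tokens and insertions/deletions pay a fixed gap cost. The toy two-dimensional embedding and the gap parameter are illustrative assumptions only.

```python
# Sketch of a weighted edit distance between m-grams: substitution cost is
# the Euclidean distance between word vectors; insert/delete cost a fixed gap.
import numpy as np

def mgram_edit_distance(a, b, embed, gap=1.0):
    """Weighted edit distance between token sequences a and b.

    a, b  : lists of tokens (m-grams, possibly of different lengths)
    embed : dict mapping token -> np.ndarray word vector
    gap   : cost of inserting or deleting a single token
    """
    n, m = len(a), len(b)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = gap * np.arange(n + 1)   # delete every token of a's prefix
    D[0, :] = gap * np.arange(m + 1)   # insert every token of b's prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = np.linalg.norm(embed[a[i - 1]] - embed[b[j - 1]])
            D[i, j] = min(D[i - 1, j] + gap,      # deletion
                          D[i, j - 1] + gap,      # insertion
                          D[i - 1, j - 1] + sub)  # substitution
    return D[n, m]

# Toy embedding: semantically close words get close vectors.
emb = {"free": np.array([1.0, 0.0]),
       "gift": np.array([0.9, 0.1]),
       "call": np.array([0.0, 1.0]),
       "now":  np.array([0.1, 0.9])}

# An m-gram whose tokens are near-synonyms of another's scores a small
# distance; a different, longer m-gram scores a larger one.
d_close = mgram_edit_distance(["free", "call"], ["gift", "call"], emb)
d_far = mgram_edit_distance(["free", "call"], ["call", "now", "free"], emb)
```

The same dynamic-programming skeleton can be adapted towards a multidimensional Dynamic Time Warping scheme, along the lines suggested above, by letting all three moves accumulate the local vector distance instead of a fixed gap cost.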
Mining M-Grams by a Granular Computing Approach for Text Classification