cepts are retrieved, (2) optimizing the parameters of
our Smart Feature Selection technique to maximize
dataset coverage with minimal features, as well as
comparing it to other well-known feature selection
techniques, like Singular Value Decomposition, (3)
calculating additional performance metrics to compare
ATCCs, such as recall and precision, (4) turning the
ATCC Performance Cube into a comprehensive tool
that allows interactive analysis of ATCCs for business
decision-makers.
ACKNOWLEDGEMENTS
We would like to thank the Graduate School of Excel-
lence advanced Manufacturing Engineering (GSaME)
for supporting the broader research project from
which this paper is developed.
Exploring Text Classification Configurations - A Bottom-up Approach to Customize Text Classifiers based on the Visualization of
Performance