under consideration. The second variant uses a mixture of the previous idea and the relevance frequency rf, so as to also take into account the number of documents belonging to the category in which the term occurs.
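To make the two ingredients of this mixture concrete, the following sketch computes the relevance frequency rf of Lan et al. (2009) together with an idf factor restricted to documents outside the category under consideration. The idfec formulation shown here, as well as the function names and the example counts, are illustrative assumptions; the exact definitions of idfec and of the combined tf.idfec-based weight are those given in the preceding sections.

import math

def rf(a, c):
    # Relevance frequency (Lan et al., 2009): a is the number of documents
    # of the category that contain the term, c the number of documents
    # outside the category that contain it.
    return math.log2(2 + a / max(1, c))

def idfec(n_neg, c):
    # Illustrative "idf excluding category": only the c documents outside
    # the category that contain the term count against it, so a term that
    # is frequent inside the category is not penalized. This exact
    # formulation is an assumption, not the paper's definition.
    return math.log((n_neg + 1) / (c + 1))

# Example: a term with raw frequency 3, occurring in 12 of the category's
# documents and in 4 of the 100 documents outside the category.
w_rf = 3 * rf(12, 4)         # tf.rf weight
w_idfec = 3 * idfec(100, 4)  # tf.idfec weight (illustrative)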
We performed extensive experimental studies on two datasets, namely the Reuters corpus (with either 10 or 52 categories) and 20 Newsgroups, and with three different classification methods, namely SVM classifiers with linear and RBF kernel functions and Random Forest. The results show that the tf.idfec-based method, which combines idfec and rf, generally achieves top results on all datasets and with all classifiers. Through statistical significance tests, we showed that the proposed scheme always ranks among the most effective ones and is never significantly worse than the other methods. The results highlight a close competition between our tf.idfec-based scheme and tf.rf; in particular, the best results obtained with the different datasets and algorithms, while varying the degree of feature selection, are very similar, with some differences: tf.rf seems to be more stable when the number of features is high, while our tf.idfec-based scheme gives excellent results with few features and shows some decay (less than 4%) as the number of features increases.
As future work, we plan to apply this idea to larger datasets and to hierarchical text corpora, using a variation of idfec able to take the taxonomy of categories into account. We also plan to test the use of this scheme for feature selection, in addition to term weighting. Moreover, we are going to investigate the effectiveness of this variant of tf.idf in other fields where weighting schemes can be employed with possible effectiveness improvements, such as sentiment analysis and opinion mining.
REFERENCES
Bloehdorn, S. and Hotho, A. (2006). Boosting for text
classification with semantic features. In Mobasher,
B., Nasraoui, O., Liu, B., and Masand, B., editors,
Advances in Web Mining and Web Usage Analysis,
volume 3932 of Lecture Notes in Computer Science,
pages 149–166. Springer Berlin Heidelberg.
Breiman, L. (2001). Random forests. Machine Learning,
45(1):5–32.
Carmel, D., Mejer, A., Pinter, Y., and Szpektor, I. (2014).
Improving term weighting for community question
answering search using syntactic analysis. In Pro-
ceedings of the 23rd ACM International Conference
on Conference on Information and Knowledge Man-
agement, CIKM ’14, pages 351–360, New York, NY,
USA. ACM.
Debole, F. and Sebastiani, F. (2003). Supervised term weighting for automated text categorization. In Proceedings of SAC-03, 18th ACM Symposium on Applied Computing, pages 784–788. ACM Press.
Deisy, C., Gowri, M., Baskar, S., Kalaiarasi, S., and Ramraj,
N. (2010). A novel term weighting scheme MIDF for
text categorization. Journal of Engineering Science
and Technology, 5(1):94–107.
Deng, Z.-H., Luo, K.-H., and Yu, H.-L. (2014). A study of
supervised term weighting scheme for sentiment anal-
ysis. Expert Systems with Applications, 41(7):3506–
3513.
Deng, Z.-H., Tang, S.-W., Yang, D.-Q., Zhang, M., Li, L.-Y., and
Xie, K.-Q. (2004). A comparative study on feature
weight in text categorization. In Advanced Web Tech-
nologies and Applications, pages 588–597. Springer.
Dietterich, T. G. (1998). Approximate statistical tests
for comparing supervised classification learning algo-
rithms. Neural Comput., 10(7):1895–1923.
Domeniconi, G., Moro, G., Pasolini, R., and Sartori, C.
(2014). Cross-domain text classification through it-
erative refining of target categories representations. In
Proceedings of the 6th International Conference on
Knowledge Discovery and Information Retrieval.
Galavotti, L., Sebastiani, F., and Simi, M. (2000). Experi-
ments on the use of feature selection and negative ev-
idence in automated text categorization. In Research
and Advanced Technology for Digital Libraries, pages
59–68. Springer.
Hassan, S. and Banea, C. (2006). Random-walk term weighting for improved text classification. In Proceedings of TextGraphs: the 2nd Workshop on Graph-Based Methods for Natural Language Processing, pages 53–60. ACL.
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (ECML'98), pages 137–142. Springer.
Lan, M., Sung, S.-Y., Low, H.-B., and Tan, C.-L. (2005). A
comparative study on term weighting schemes for text
categorization. In Neural Networks, 2005. IJCNN’05.
Proceedings. 2005 IEEE International Joint Confer-
ence on, volume 1, pages 546–551. IEEE.
Lan, M., Tan, C. L., Su, J., and Lu, Y. (2009). Supervised
and traditional term weighting methods for automatic
text categorization. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 31(4):721–735.
Largeron, C., Moulin, C., and Géry, M. (2011). Entropy
based feature selection for text categorization. In Pro-
ceedings of the 2011 ACM Symposium on Applied
Computing, SAC ’11, pages 924–928, New York, NY,
USA. ACM.
Leopold, E. and Kindermann, J. (2002). Text categorization with support vector machines. How to represent texts in input space? Mach. Learn., 46(1-3):423–444.
Lewis, D. D. (1995). Evaluating and optimizing au-
tonomous text classification systems. In Proceedings
of the 18th annual international ACM SIGIR confer-
ence on Research and development in information re-
trieval, SIGIR ’95, pages 246–254, New York, NY,
USA. ACM.
Liu, Y., Loh, H. T., and Sun, A. (2009). Imbalanced text
classification: A term weighting approach. Expert
Syst. Appl., 36(1):690–701.