EXPLORING BASQUE DOCUMENT CATEGORIZATION FOR EDUCATIONAL PURPOSES USING LSI
A. Zelaia, I. Alegria, O. Arregi, A. Arruarte, A. Díaz de Ilarraza, J. A. Elorriaga, B. Sierra
2009
Abstract
In the process of preparing learning material for Computer Supported Learning Systems (CSLSs), one of the first steps involves finding documents relevant to the topics and to the students. This requires documents to be categorized according to some criteria. In this paper we analyze the behaviour of classification techniques such as Naïve Bayes, Winnow, SVMs and k-NN, together with lemmatization and noun selection, in the categorization of documents written in Basque. In a second experiment, we study the effect of applying the Singular Value Decomposition (SVD) dimensionality reduction technique before using these classification techniques. The results show that the approach combining SVD and k-NN on a lemmatized corpus gives the best categorization of all, by a remarkable margin. The final aim of this project is to facilitate the semiautomatic construction of the domain module of a CSLS.
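The combination highlighted in the abstract, SVD-based dimensionality reduction followed by k-NN, can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny term-document matrix, the category labels, the rank of 2, the use of cosine similarity, and the function names `fit_lsi` and `knn_predict` are all assumptions made for illustration, and real input would be a lemmatized Basque corpus rather than toy counts.

```python
from collections import Counter

import numpy as np


def fit_lsi(term_doc, rank):
    """Truncated SVD of a (terms x documents) matrix.

    Returns the top-`rank` left singular vectors and the training
    documents projected into the reduced space (one row per document).
    """
    U, _, _ = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :rank]
    doc_vecs = (U_k.T @ term_doc).T
    return U_k, doc_vecs


def knn_predict(doc_vecs, labels, query_vec, k=3):
    """Majority vote among the k training documents most similar
    to the query (cosine similarity, an assumed choice)."""
    denom = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    sims = (doc_vecs @ query_vec) / np.where(denom == 0, 1.0, denom)
    nearest = np.argsort(-sims)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


# Hypothetical toy data. Rows are terms: cell, gene, protein, planet, orbit, star.
# Columns are six training documents.
term_doc = np.array([
    [2, 1, 0, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 0, 1, 2],
], dtype=float)
labels = ["biology", "biology", "biology",
          "astronomy", "astronomy", "astronomy"]

U_k, doc_vecs = fit_lsi(term_doc, rank=2)

# A new document mentioning "cell" and "gene", projected with the same U_k.
query = np.array([1, 1, 0, 0, 0, 0], dtype=float)
predicted = knn_predict(doc_vecs, labels, U_k.T @ query, k=3)
print(predicted)
```

Projecting both the training documents and the new document with the same matrix `U_k` keeps them in one reduced space, so the nearest-neighbour comparison is made on the latent dimensions rather than on raw term counts.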
Paper Citation
in Harvard Style
Zelaia A., Alegria I., Arregi O., Arruarte A., Díaz de Ilarraza A., Elorriaga J. and Sierra B. (2009). EXPLORING BASQUE DOCUMENT CATEGORIZATION FOR EDUCATIONAL PURPOSES USING LSI. In Proceedings of the First International Conference on Computer Supported Education - Volume 1: CSEDU, ISBN 978-989-8111-82-1, pages 5-9. DOI: 10.5220/0001834300050009
in Bibtex Style
@conference{csedu09,
author={A. Zelaia and I. Alegria and O. Arregi and A. Arruarte and A. Díaz de Ilarraza and J. A. Elorriaga and B. Sierra},
title={EXPLORING BASQUE DOCUMENT CATEGORIZATION FOR EDUCATIONAL PURPOSES USING LSI},
booktitle={Proceedings of the First International Conference on Computer Supported Education - Volume 1: CSEDU},
year={2009},
pages={5-9},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001834300050009},
isbn={978-989-8111-82-1},
}
in EndNote Style
TY - CONF
JO - Proceedings of the First International Conference on Computer Supported Education - Volume 1: CSEDU
TI - EXPLORING BASQUE DOCUMENT CATEGORIZATION FOR EDUCATIONAL PURPOSES USING LSI
SN - 978-989-8111-82-1
AU - Zelaia A.
AU - Alegria I.
AU - Arregi O.
AU - Arruarte A.
AU - Díaz de Ilarraza A.
AU - Elorriaga J.
AU - Sierra B.
PY - 2009
SP - 5
EP - 9
DO - 10.5220/0001834300050009