essary that a document contains terms of the query. In terms of recall, the Boolean representation obtained the better result. This is because BR returned a large number of documents (more than 300) for each query.
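This recall/precision trade-off can be made concrete with a minimal sketch; the toy corpus, queries, and function names below are invented for illustration and are not the paper's experimental setup.

```python
from collections import Counter
from math import sqrt

# Toy corpus (hypothetical, for illustration only).
docs = {
    "d1": "topic model for document indexing",
    "d2": "semantic clustering of similar words",
    "d3": "indexing documents with latent topics",
}

def boolean_retrieve(query, docs):
    """OR-semantics Boolean retrieval: return every document containing at
    least one query term. Recall is high, but many loosely related
    documents are returned, which hurts precision."""
    terms = set(query.split())
    return {d for d, text in docs.items() if terms & set(text.split())}

def cosine_rank(query, docs):
    """Vector space model: rank documents by cosine similarity of raw
    term-frequency vectors, so partially matching documents are ordered."""
    q = Counter(query.split())
    scores = {}
    for d, text in docs.items():
        v = Counter(text.split())
        dot = sum(q[t] * v[t] for t in q)
        norm = (sqrt(sum(c * c for c in q.values()))
                * sqrt(sum(c * c for c in v.values())))
        scores[d] = dot / norm if norm else 0.0
    return sorted(scores, key=scores.get, reverse=True)
```

For the query "indexing topics", the Boolean model returns every document sharing any term, while the cosine ranking places the document matching both terms first.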
5 CONCLUSIONS
An unsupervised approach for indexing documents is proposed in this paper. The proposal combines text mining and natural language processing to obtain a document-topic matrix representation for a set of documents. First, verb-noun relationships are obtained using a POS tagger. Then the Clustering by Committee algorithm is used to group terms according to these verb-noun relations. After that, Latent Dirichlet Allocation is applied to obtain the most relevant terms. The parameters for LDA are obtained without human intervention. According to the experiments, the proposal generally performs better in topic-based semantic searches than the traditional models (Boolean and vector space). Future work includes applying semantic processing to web search and to the analysis of tweets/posts.
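The final stage of the pipeline, LDA, can be sketched with a minimal collapsed Gibbs sampler. This is a generic illustration, not the paper's implementation: the toy input and the fixed hyperparameters alpha and beta are assumptions here, whereas the paper obtains the LDA parameters without human intervention.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.
    Returns per-document topic counts and per-topic word counts."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})            # vocabulary size
    # Random initial topic assignment for every token.
    z = [[rng.randrange(num_topics) for _ in d] for d in docs]
    ndk = [[0] * num_topics for _ in docs]           # doc-topic counts
    nkw = [defaultdict(int) for _ in range(num_topics)]  # topic-word counts
    nk = [0] * num_topics                            # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                          # remove current token
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Full conditional p(z = t | rest) up to a constant.
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(num_topics)]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k                          # reassign token
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw
```

The returned doc-topic counts, once normalized, give the document-topic matrix used for indexing; the topic-word counts identify the most relevant terms per topic.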
ACKNOWLEDGEMENTS
This research was partially funded by project number 165474 from the “Fondo Mixto Conacyt-Gobierno del Estado de Tamaulipas”.
REFERENCES
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent
Dirichlet Allocation. Journal of Machine Learning
Research, 3:993–1022.
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas,
G. W., and Harshman, R. A. (1990). Indexing by latent
semantic analysis. Journal of the American Society of
Information Science, 41:391–407.
Fischer, H. (2011). Conclusion: The central limit theorem
as a link between classical and modern probability
theory. In A History of the Central Limit Theorem,
Sources and Studies in the History of Mathematics
and Physical Sciences, pages 353–362. Springer New
York.
Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235.
Klein, D. and Manning, C. (2003). Accurate unlexicalized
parsing. In Proceedings of the 41st Meeting of the
Association for Computational Linguistics.
Konietzny, S. G. A., Dietz, L., and McHardy, A. C. (2011).
Inferring functional modules of protein families with
probabilistic topic models. BMC Bioinformatics,
12:141.
Lafferty, J. D. and Zhai, C. (2001). Document language
models, query models, and risk minimization for in-
formation retrieval. In Croft, W. B., Harper, D. J.,
Kraft, D. H., and Zobel, J., editors, SIGIR, pages 111–
119. ACM.
Lin, D. (1998). Automatic retrieval and clustering of similar
words. In Proceedings of the 17th international con-
ference on Computational linguistics, pages 768–774,
Morristown, NJ, USA. Association for Computational
Linguistics.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Pantel, P. A. (2003). Clustering by committee. PhD the-
sis, University of Alberta Edmonton. Adviser-Dekang
Lin.
Ponte, J. and Croft, B. (1998). A language modeling ap-
proach to information retrieval. In Proceedings of the
21st International Conference on Research and De-
velopment in Information Retrieval.
Robertson, S. E. and Jones, K. S. (1976). Relevance weight-
ing of search terms. Journal of the American Society
for Information Science, 27:129–146.
Salton, G., Wong, A., and Yang, C.-S. (1975). A vector
space model for automatic indexing. Communications
of the ACM, 18(11):613–620.
Sánchez, D. (2009). Domain ontology learning from the web. The Knowledge Engineering Review, 24(04):413–413.