# TOPIC EXTRACTION FROM DIVIDED DOCUMENT SETS

### Takeru Yokoi, Hidekazu Yanagimoto

#### Abstract

We propose a method for extracting topics from a large document set by extracting topics from its divisions and then combining them. To extract topics, we apply Sparse Non-negative Matrix Factorization that imposes a sparseness constraint only on the basis matrix, which we call SNMF/L, to each document set. Combining topics from several small document sets is useful because, when the number of documents is large, extracting topics directly from the whole corpus with SNMF/L takes a long time. In this paper, we shorten the processing time of topic extraction from a large document set by combining the topics extracted from each divided document set. We evaluate the proposed method in two ways: by the correspondence between the combined topics and the topics extracted directly from the large document set with SNMF/L, and by the processing times of SNMF/L.
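The divide-and-combine idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The factorization below uses simple multiplicative updates with an L1 penalty on the basis matrix, standing in for the ANLS-based SNMF/L solver. The abstract also does not say how topics are combined, so the combination step here (stacking the basis vectors from every division and factorizing the stacked matrix again) is an assumption. The function names `sparse_nmf_basis` and `topics_from_divisions` and the parameters `sparsity` and `ridge` are hypothetical.

```python
import numpy as np


def sparse_nmf_basis(X, k, sparsity=0.1, ridge=0.1, iters=200, seed=0):
    """Approximate X (terms x docs) as W @ H, penalizing only the basis W.

    Simplified stand-in for SNMF/L: multiplicative updates with an L1
    penalty on W and a small ridge term on H to keep the scale bounded.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_docs = X.shape
    W = rng.random((n_terms, k)) + 1e-3
    H = rng.random((k, n_docs)) + 1e-3
    eps = 1e-9  # guards against division by zero
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + ridge * H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + sparsity + eps)
    return W, H


def topics_from_divisions(X, n_divisions, k, **kwargs):
    """Extract k topics per division, then combine them into k topics.

    Assumed combination rule: stack all per-division basis vectors as
    columns and factorize the stacked matrix once more.
    """
    parts = np.array_split(X, n_divisions, axis=1)  # split by documents
    bases = [sparse_nmf_basis(part, k, **kwargs)[0] for part in parts]
    stacked = np.hstack(bases)  # terms x (n_divisions * k)
    combined, _ = sparse_nmf_basis(stacked, k, **kwargs)
    return combined


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.random((30, 40))  # toy term-document matrix
    topics = topics_from_divisions(X, n_divisions=4, k=3)
    print(topics.shape)
```

Each per-division factorization works on a fraction of the documents, which is where a time saving would come from. The combined basis could then be compared with a basis extracted directly from `X`, for example by cosine similarity between columns.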

#### Paper Citation

#### in Harvard Style

Yokoi T. and Yanagimoto H. (2009). **TOPIC EXTRACTION FROM DIVIDED DOCUMENT SETS**. In *Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST*, ISBN 978-989-8111-81-4, pages 654-659. DOI: 10.5220/0001822106540659

#### in Bibtex Style

@conference{webist09,

author={Takeru Yokoi and Hidekazu Yanagimoto},

title={TOPIC EXTRACTION FROM DIVIDED DOCUMENT SETS},

booktitle={Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},

year={2009},

pages={654-659},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0001822106540659},

isbn={978-989-8111-81-4},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST

TI - TOPIC EXTRACTION FROM DIVIDED DOCUMENT SETS

SN - 978-989-8111-81-4

AU - Yokoi T.

AU - Yanagimoto H.

PY - 2009

SP - 654

EP - 659

DO - 10.5220/0001822106540659