MAPPING KNOWLEDGE DOMAINS - Combining Symbolic Relations with Graph Theory

Eric SanJuan

Abstract

We present a symbolic and graph-based approach to mapping knowledge domains. The symbolic component relies on shallow linguistic processing of texts to extract multi-word terms and cluster them according to lexico-syntactic relations. The clusters are then subjected to graph decomposition based on inherent graph-theoretic properties of association graphs of items (authors-terms, documents-authors, etc.). This includes searching for complete minimal separators, which decompose the graphs into a central atom (core topics) and peripheral atoms. The methodology is implemented in the TermWatch system and can be used for several text mining tasks. We also mined frequent itemsets as a means of revealing dependencies between formal concepts in the corpus. Comparing the frequent itemsets extracted from each dataset with the structure of the central atom shows an interesting overlap. The main strength of our approach lies in combining state-of-the-art techniques from Natural Language Processing (NLP), clustering and graph theory into a system and methodology adapted to uncovering hidden sub-structures in texts.
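The notion of a complete minimal separator can be illustrated on a toy association graph. The sketch below is not the TermWatch implementation: the graph, the vertex names and the helper functions `components`, `is_complete_minimal_separator` and `split_on` are all hypothetical. It assumes the standard definitions: a separator is complete when its vertices are pairwise adjacent, and minimal when at least two of the components left after its removal are adjacent to every separator vertex.

```python
from itertools import combinations

def components(adj, removed):
    """Connected components of the graph after deleting the `removed` vertices."""
    seen, comps = set(removed), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:  # iterative depth-first search
            v = stack.pop()
            comp.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps

def is_complete_minimal_separator(adj, sep):
    """True if `sep` is a clique and a minimal separator of the graph."""
    sep = set(sep)
    # Complete: every pair of separator vertices is adjacent.
    if any(b not in adj[a] for a, b in combinations(sep, 2)):
        return False
    # Minimal: at least two components are "full", meaning every
    # separator vertex has a neighbour inside the component.
    full = [c for c in components(adj, sep) if all(adj[s] & c for s in sep)]
    return len(full) >= 2

def split_on(adj, sep):
    """Pieces obtained by copying the separator into each component."""
    return [c | set(sep) for c in components(adj, sep)]

# Hypothetical toy graph: two triangles sharing the edge b-c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}
pieces = split_on(adj, {"b", "c"})
```

Here `{"b", "c"}` passes the check, while `{"b"}` alone does not, because removing it leaves the graph connected. A full decomposition into atoms would repeat the split on each piece until no complete minimal separator remains; this sketch performs a single split only.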



Paper Citation


in Harvard Style

SanJuan E. (2011). MAPPING KNOWLEDGE DOMAINS - Combining Symbolic Relations with Graph Theory. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2011) ISBN 978-989-8425-79-9, pages 519-528. DOI: 10.5220/0003721105270536


in Bibtex Style

@conference{sstm11,
author={Eric SanJuan},
title={MAPPING KNOWLEDGE DOMAINS - Combining Symbolic Relations with Graph Theory},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2011)},
year={2011},
pages={519-528},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003721105270536},
isbn={978-989-8425-79-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2011)
TI - MAPPING KNOWLEDGE DOMAINS - Combining Symbolic Relations with Graph Theory
SN - 978-989-8425-79-9
AU - SanJuan E.
PY - 2011
SP - 519
EP - 528
DO - 10.5220/0003721105270536