Creating Facets Hierarchy for Unstructured Arabic Documents

Khaled Nagi, Dalia Halim

Abstract

Faceted search is becoming the standard search method on modern websites. Implementing a faceted search system requires a well-defined metadata structure for the searched items. Unfortunately, online text documents are usually plain text without any metadata describing their content. Several methods that take advantage of external lexical hierarchies have recently been introduced for extracting plain and hierarchical facets from textual content. Meanwhile, the volume of Arabic documents accessible online is increasing every day. However, the Arabic language is not as well established on the web as English. In this work, we introduce a faceted search system for unstructured Arabic text. Since Arabic processing tools are not as mature as their English counterparts, we try two methods for building the facet hierarchy for Arabic terms. We then combine these methods into a hybrid one to get the best of both approaches. We assess the three methods using our prototype by searching real-life articles extracted from two sources: the BBC Arabic edition website and the Arab Sciencepedia website.



Paper Citation


in Harvard Style

Nagi K. and Halim D. (2013). Creating Facets Hierarchy for Unstructured Arabic Documents. In Proceedings of the International Conference on Knowledge Engineering and Ontology Development - Volume 1: KEOD, (IC3K 2013) ISBN 978-989-8565-81-5, pages 109-119. DOI: 10.5220/0004621201090119


in Bibtex Style

@conference{keod13,
author={Khaled Nagi and Dalia Halim},
title={Creating Facets Hierarchy for Unstructured Arabic Documents},
booktitle={Proceedings of the International Conference on Knowledge Engineering and Ontology Development - Volume 1: KEOD, (IC3K 2013)},
year={2013},
pages={109-119},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004621201090119},
isbn={978-989-8565-81-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Engineering and Ontology Development - Volume 1: KEOD, (IC3K 2013)
TI - Creating Facets Hierarchy for Unstructured Arabic Documents
SN - 978-989-8565-81-5
AU - Nagi K.
AU - Halim D.
PY - 2013
SP - 109
EP - 119
DO - 10.5220/0004621201090119