Creating Facets Hierarchy for Unstructured Arabic Documents
Khaled Nagi and Dalia Halim
Dept. of Computer and Systems Engineering, Faculty of Engineering, Alexandria University, Alexandria, Egypt
Keywords: Faceted Search, Arabic Content on the Internet, Indexing Arabic Content.
Abstract: Faceted search is becoming the standard search method on modern web sites. To implement a faceted search system, a well-defined metadata structure for the searched items must exist. Unfortunately, online text documents are simple plain text, usually without any metadata describing their content. Taking advantage of external lexical hierarchies, a variety of methods for extracting plain and hierarchical facets from textual content have recently been introduced. Meanwhile, the size of Arabic documents that can be accessed online is increasing every day. However, the Arabic language is not as established as the English language on the web. In our work, we introduce a faceted search system for unstructured Arabic text. Since the maturity of Arabic processing tools is not as high as that of their English counterparts, we try two methods for building the facets hierarchy for the Arabic terms. We then combine these methods into a hybrid one to get the best out of both approaches. We assess the three methods using our prototype by searching in real-life articles extracted from two sources: the BBC Arabic edition website and the Arab Sciencepedia website.
1 INTRODUCTION
Faceted search is simply looking at a search result-set from many different perspectives at the same time. The first perspective is the whole result-set; each facet then describes a subset of the result-set from a different perspective. Facets can be either flat or hierarchical. A good hierarchical facet structure is not deep: it has a reasonable number of siblings for each node, and the expanded path from root to leaf generally fits on one screen without scrolling.
In order to implement a faceted search system, a well-defined metadata structure for the searched items must exist. Structured data, such as product descriptions in an online shop, are very good candidates for use in a faceted search environment. Introducing faceted search for unstructured data, e.g., text documents, adds another degree of complexity. Text documents are plain text, usually without any metadata describing their content. Taking advantage of term extraction tools and external lexical hierarchies, a variety of methods for extracting hierarchical facets from plain text have recently been introduced.
However, term extraction tools and external lexical hierarchies are very language specific. The Arabic language is not as established as the English language on the web. Meanwhile, the size of Arabic documents that can be accessed online is increasing every day. There is little to no research on choosing the correct constellation of emerging Arabic-specific term extraction tools and external lexical hierarchies to create facets.
In our work, we introduce an automatically generated faceted search system for unstructured Arabic text. The developed prototype receives the user query and returns a Google-like result-set. Additionally, it returns a set of hierarchical facet terms to help the user filter the returned result-set and navigate to the required document. For the facet creation process, we rely on the LingPipe Named Entity Recognizer (LingPipe, n.d.). In order to create the facets hierarchy from the resulting terms, we use two methods. The first method uses the Arabic Wikipedia hierarchy (Arabic Wikipedia Categorization, n.d.) to build the facet hierarchy for the result-set. The second method uses an English tool to compensate for the relatively poor Arabic corpora in online tools: the facets are translated to English, the hierarchy is built using the English WordNet (WordNet, n.d.) IS-A hypernym structure, and the whole facets hierarchy is then translated back to Arabic. This workaround is used until an Arabic version of WordNet with a full-fledged corpus is developed by linguistic researchers (Arabic WordNet, n.d.). In a later stage of our work, we investigate a combination of both methods by
intelligently merging the best subsets of each of the
previously built hierarchies.
We assess the three methods that we built in our prototype using two sets of articles: one extracted from the BBC Arabic edition website (BBC Arabic, n.d.) and the other from the Arab Sciencepedia website (ArabSciencepedia, n.d.). We compare the resulting facet hierarchies in terms of relevance, the shape of the facet hierarchy, and the creation time.
We are aware that the comparison is the result of a clear competition between a wiki-based approach – organically evolving, with a huge community of contributors – and a well-structured approach driven by linguistic science. In our work, we train our system using existing corpora in some knowledge domains. It is thus a nice scientific challenge to assess the proposed facet creation mechanisms and to study the effect of the available training corpus and its linguistic structure on the quality of the facets.
The rest of the paper is organized as follows. Section 2 provides an overview of related work. Our proposed system is presented in Section 3. In Section 4, the developed prototype is explained. Section 5 contains an assessment of the generated facet hierarchies, while Section 6 concludes the paper.
2 RELATED WORK
2.1 Evolution of Information Retrieval
to Faceted Search
Information Retrieval is finding material - usually documents - of an unstructured nature - usually text - that satisfies an information need from within large collections (Manning et al., 2009). The earliest information retrieval systems use a set retrieval model (Manning et al., 2009). Set retrieval systems return unordered document sets. The query expressions require Boolean operations, and the exact query keywords are required to exist, or to never exist, in the searched documents.
Seeking an alternative to the set retrieval model and the complicated query form, the ranked retrieval model was introduced. A free-text query is submitted, and the system returns a large result-set of documents, ordered by their relevance to the user query. One famous example of the ranked retrieval model is the Vector Space Model (Manning et al., 2009).
The directory navigation model was introduced next. This model offers users an advantage over free-text search by organizing content in a taxonomy and providing users with guidance toward interesting subsets of a document collection. The web directory that Yahoo! built in the mid-1990s (Yahoo History, n.d.) is a nice example. Yet, taxonomies have a main problem: information seekers need to discover a path to a piece of information in the same way that the taxonomist created it, as each piece of information is categorized under exactly one taxonomy leaf. In fact, if a group of taxonomists builds a taxonomy for several terms, they will disagree among themselves on how the information should be organized. This shortcoming of directory navigation is known as the vocabulary problem (Tunkelang, 2009).
Faceted search appeared to be a convenient way to solve the vocabulary problem. Facets refer to categories which are used to characterize items in a collection (Hearst, 2008). In a collection of searched items, a faceted search engine associates each item with all the facet labels that best describe it. This makes each item accessible via many facet paths, thus decreasing the probability of falling into the vocabulary problem. Faceted search assumes that a collection of documents is organized based on a well-defined faceted classification system, or that the underlying data is saved in a database with a predefined metadata structure. Unfortunately, this is not the case for collections of unstructured text documents.
2.2 Facet Creation for Unstructured
Documents
Data clustering is a class of solutions for creating subject hierarchies for unstructured documents. The advantage of clustering is that it is fully automated, but its main disadvantages are its lack of predictability and the difficulty of labeling the groups. Some automatic clustering techniques generate clusters that are typically labeled using a set of keywords (Hearst and Pedersen, 1996). Such a set of titles gives a good indication of the contents of a collection of documents, but its presentation is very hard for end users to use in a navigational interface.
Another technique for enriching unstructured text with metadata is the subsumption algorithm (Sanderson and Croft, 1999). For two terms x and y, x is said to subsume y if: P(x|y) ≥ 0.8 and P(y|x) < 1 (Stoica et al., 2007). That is, x subsumes y and is a parent of y if the documents which contain y are largely a subset of the documents which contain x.
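To make the test concrete, here is a minimal Java sketch (our own illustration; docsX and docsY stand for the sets of document ids containing x and y):

import java.util.Set;

public final class Subsumption {
    // x subsumes y (and becomes its parent) if P(x|y) >= 0.8 and P(y|x) < 1.
    public static boolean subsumes(Set<Integer> docsX, Set<Integer> docsY) {
        long common = docsY.stream().filter(docsX::contains).count();
        double pXGivenY = (double) common / docsY.size(); // P(x|y)
        double pYGivenX = (double) common / docsX.size(); // P(y|x)
        return pXGivenY >= 0.8 && pYGivenX < 1.0;
    }
}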
Another class of solutions makes use of existing lexical hierarchies to build facet/category hierarchies. Examples of existing lexical hierarchies are the WordNet hypernym and hyponym structures
KEOD2013-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment
110
(WordNet, n.d.) and the Wikipedia categorization structure (Wikipedia Categorization, n.d.). Previous work on faceted search interface creation for unstructured documents includes Castanet (Stoica and Hearst, 2004; Stoica et al., 2007), facets for text databases (Dakka and Ipeirotis, 2008), Facetedpedia (Yan et al., 2010), and, as an example of facet creation in non-Latin languages, facets for Korean blog-posts (Lim et al., 2011).
2.3 Facet Creation Research Projects
The Castanet algorithm (Stoica and Hearst, 2004; Stoica et al., 2007) assumes that there is a text description associated with each item in the collection. E.g., if the collection is a set of images, each image has an unstructured text document describing it. The textual descriptions are used to build the facet hierarchies and then to assign documents to facets.
The target terms set is a subset of the terms that best describes the set of documents. The selection criterion is the term distribution: terms with a term distribution greater than or equal to a specified threshold are retained as target terms. The target terms set is divided into two categories:
Ambiguous terms: having more than one meaning in the English WordNet.
Unambiguous terms: having only one meaning in the English WordNet.
The core hierarchy is first built for the unambiguous terms using the WordNet IS-A hypernym structure (WordNet, n.d.). Then, the ambiguous terms are checked against WordNet Domains (WordNet Domains, n.d.), a tool assigning domains to each WordNet synonym set. The tool counts the occurrences of each domain for the unambiguous target terms, resulting in a list of the most represented domains in the set of documents. If an ambiguous term has only one common domain for all its senses in WordNet, it is considered unambiguous. The core hierarchy is next augmented with the IS-A hypernym paths of the disambiguated terms. A refinement step then compresses the final hierarchy: nodes with fewer children than a threshold and nodes whose names appear in their parent node's name are eliminated. Finally, in order to create a set of sub-hierarchies, the top levels (e.g., 4 levels) are pruned. Thus, the final facets hierarchy is created.
The algorithm for creating facets for text databases is presented in (Dakka and Ipeirotis, 2008). It is built on the observation that the terms for the useful facets do not usually appear in the document contents themselves. Thus, the target terms list is created using two sets of terms.
The first set includes the significant terms extracted from the document body text using extraction tools.
The second set is created by expanding the first set with other relevant terms using WordNet hypernyms, Wikipedia contents, and the terms that tend to co-occur with the first set of terms when queried against the Google search engine.
The term frequency is used in the original and the expanded terms sets to identify the final candidate facets. The infrequent terms from the first set together with the frequent terms from the second, expanded set form the final set of facets.
Facetedpedia (Yan et al., 2010) is a project that dynamically generates a query-dependent faceted interface for searched Wikipedia articles. The following definitions establish the main concepts used in the algorithm.
Target articles: the articles in the returned result-set of the user query.
Attribute articles: each Wikipedia article that is hyperlinked from a target article.
Category hierarchy: the Wikipedia category hierarchy, a connected, rooted, directed acyclic graph.
One large hierarchy is built for the target articles as follows. Each target article is connected to all its attribute articles. Then, a category hierarchy is built for each attribute article. The hierarchies of all attribute articles are merged until one common root category is found. Then, the most appropriate set of sub-categories is chosen from within the built hierarchy using a cost measurement and a similarity measurement developed by the authors.
The work in (Lim et al., 2011) is an example of facet creation for non-Latin languages. Non-Latin languages such as Korean or Arabic do not have linguistic tools as powerful as the English WordNet. Workarounds are found by researchers working with these languages. In (Lim et al., 2011), the system generates a flat facets interface for Korean blog-posts. Given a search query keyword, blog posts are searched using the Korean search engine "Naver Open API" (Naver, n.d.). For the initial keyword, a set of blog posts is constructed, where each post and its body text are extracted from the blog. The facet generation process is done in five steps. The system collects Wikipedia articles that include the user query and extracts the titles of these Wikipedia pages to use them as facet candidates for the blog-posts. After constructing the candidate facet terms set, only the
CreatingFacetsHierarchyforUnstructuredArabicDocuments
111
facet terms that appear more frequently are retained as actual facet terms. Given a Wikipedia entry, terms that are closely related to it are automatically extracted; typical closely related terms include bold-faced terms, anchor texts of hyperlinks, the title of a redirect, etc. Then, from each blog post, a term frequency vector is generated. The similarity of a Wikipedia entry and a blog post is defined as the inner product of the inverse document frequency vector of the Wikipedia entry and the term frequency vector of the blog post. To each blog post, the system assigns the facet that maximizes the similarity between the facet and the blog post. One important shortcoming of this work is that no hierarchy is generated.
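Stated as a formula (the notation is ours, not the authors'), the facet assigned to a blog post b is

\[ \hat{f}(b) = \arg\max_{w \in W} \sum_{t} \mathrm{idf}_w(t) \cdot \mathrm{tf}_b(t) \]

where W is the set of candidate Wikipedia entries, idf_w is the inverse document frequency vector of entry w, and tf_b is the term frequency vector of post b.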
3 PROPOSED SYSTEM
While Arabic content on the Internet increases day by day, tools for accessing and manipulating Arabic content are not increasing at the same pace. Thus, implementing a hierarchical faceted search interface for Arabic unstructured documents motivates our idea of trying two different ways of implementing the hierarchy: one using Arabic tools and the other using English tools.
In our implemented system, a hierarchical faceted interface is created in two main steps: the facets extraction phase and the facets hierarchy creation phase. The facets extraction phase results in a set of target terms that best describes the list of searched documents. This target terms set is the input to the next phase, where two parallel processes use this set. The first one produces a facets hierarchy using Arabic tools. The other one translates the set of terms into English, builds the facets hierarchy using English tools, and translates the hierarchy back to Arabic.
3.1 Facets Extraction Phase
The user query is sent to the Google search engine using the Google Custom Search API (Google API, n.d.). The resulting set of documents is processed in order to extract the most significant terms. The significant terms extraction is done using the LingPipe tool (LingPipe, n.d.).
The LingPipe tool provides three different types of significant terms extraction:
a rule-based extraction model,
a dictionary-based model, and
a statistical model.
The rule-based extraction model needs a regular expression as input; then, for each processed document, all matches of the regular expression are extracted by LingPipe. The dictionary-based model uses a list of terms (single- or multi-word) as a dictionary. Each match of the dictionary in the processed document is considered a significant term. Finally, the statistical extraction model requires training data in the CoNLL format. The training data should match the contents of the documents on which the extraction will take place. In our system, we use all three models.
For every regular expression, term in a dictionary, or term within a training corpus, a tag value is provided. This tag value gives a hint about the term, which helps in the term manipulation in the next step of the algorithm.
Since dates play a major role in creating facets, they are handled separately. For example, we handle an extended format found in some Arabic documents, where the date includes the month name in the old Syriac language, then its equivalent in the Gregorian calendar, ended by the year, e.g., "حزيران (يونيو) 2012" for "June 2012". The rule-based extraction model is used in our algorithm for extracting the date from Arabic documents: a regular expression for the date format is formulated and fed to the LingPipe extractor. A date extracted using this regular expression is associated with the "Date" tag and is modeled in the standard ISO date format.
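A minimal sketch of this rule-based extraction with LingPipe's RegExChunker follows; the regular expression is a simplified stand-in for the one actually used:

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunking;
import com.aliasi.chunk.RegExChunker;

public final class DateExtractionSketch {
    // Illustrative pattern: a run of Arabic letters (the month name), an
    // optional parenthesized Gregorian equivalent, then a four-digit year.
    private static final String DATE_REGEX =
        "[\\u0621-\\u064A]+\\s*(\\([\\u0621-\\u064A]+\\))?\\s*\\d{4}";

    public static void main(String[] args) {
        // Every regex match is emitted as a chunk tagged "Date".
        RegExChunker chunker = new RegExChunker(DATE_REGEX, "Date", 1.0);
        Chunking chunking = chunker.chunk("... حزيران (يونيو) 2012 ...");
        for (Chunk c : chunking.chunkSet()) {
            String match = chunking.charSequence()
                                   .subSequence(c.start(), c.end()).toString();
            System.out.println(c.type() + ": " + match);
        }
    }
}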
Today, the Arabic Wikipedia includes more than 200,000 articles (Arabic Wikipedia, n.d.) covering a very wide range of terms and expressions for Arabic and international concepts. These concepts include historical and contemporary events, person names, locations, scientific terms, and many other concepts. Therefore, the Arabic Wikipedia article titles are used as dictionary entries for the LingPipe dictionary-based extraction model. The tag "Wiki" is associated with each term during the dictionary build process.
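The dictionary-based model can be sketched as follows, assuming the Wikipedia titles have already been loaded (the two sample titles and the tokenizer choice are our assumptions; the paper does not state which tokenizer factory is used):

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunking;
import com.aliasi.dict.DictionaryEntry;
import com.aliasi.dict.ExactDictionaryChunker;
import com.aliasi.dict.MapDictionary;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

public final class WikiDictionarySketch {
    public static void main(String[] args) {
        MapDictionary<String> dict = new MapDictionary<String>();
        // In the real system: one entry per Arabic Wikipedia article title,
        // all tagged "Wiki". Two sample titles stand in here.
        dict.addEntry(new DictionaryEntry<String>("القاهرة", "Wiki", 1.0));
        dict.addEntry(new DictionaryEntry<String>("جامعة الإسكندرية", "Wiki", 1.0));

        ExactDictionaryChunker chunker = new ExactDictionaryChunker(
            dict, IndoEuropeanTokenizerFactory.INSTANCE,
            false /* returnAllMatches */, false /* caseSensitive */);

        Chunking chunking = chunker.chunk("درس في جامعة الإسكندرية عام 1990");
        for (Chunk c : chunking.chunkSet())
            System.out.println(c.type() + " @ [" + c.start() + "," + c.end() + ")");
    }
}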
The LingPipe statistical extraction model needs training data covering the same domains as the documents being processed. Training data is not easily available for all domains. Therefore, this model must not be a key element of the algorithm, but its usage gives richer results. Our implementation is adjustable to include such models when available, and to skip them otherwise. Two data sets are used for testing: the first is news articles from the Arabic BBC website, and the second is scientific articles from the Arab Sciencepedia website. Benajiba's Arabic Named Entity Recognition training corpus (BinAjiba, n.d.) is CoNLL-formatted Arabic training data, which includes terms with location
KEOD2013-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment
112
("Loc"), organization ("Org"), and person ("Per") tag values. This corpus is created from the articles of several news agencies, which makes it appropriate for training LingPipe for use in the algorithm while searching the Arabic BBC website. The facets for the scientific articles are created without the statistical extraction model. If the statistical extraction model is used, a set of predefined facets is created. This set of facets is the mapping of the tag values appearing in the training data set. For the Arabic news articles training corpus, the predefined facets are "Locations", "Organizations", and "Persons".
3.1.1 Term Reduction Steps
On either data set, the terms extraction process generates more than 6,000 terms for an average of 50 searched documents. Duplicate and synonym terms are merged, reducing the list to about 3,000 terms. The synonym terms detection is done using the Wikipedia redirect feature (Wikipedia Redirect, n.d.): if two extracted terms are marked as redirecting to one another or to one common article title, then these terms are synonyms and are unified into one term. The terms list is further reduced by omitting terms with low term distribution. The final target terms list contains about 300 terms. This list is the input for the facets hierarchy creation phase. The term reduction steps are illustrated in Figure 1.
Figure 1: Term reduction steps.
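A minimal sketch of the redirect-based merge, assuming a precomputed map from each title to its redirect target (the map and its loading from the Wikipedia redirect data are our assumptions):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class SynonymMergeSketch {
    // Collapses terms that redirect to one another or to a common title.
    static Map<String, List<String>> merge(List<String> terms,
                                           Map<String, String> redirects) {
        Map<String, List<String>> canonical = new LinkedHashMap<>();
        for (String t : terms) {
            // A term with no redirect entry is its own canonical form.
            String target = redirects.getOrDefault(t, t);
            canonical.computeIfAbsent(target, k -> new ArrayList<>()).add(t);
        }
        return canonical; // one entry per unified term
    }
}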
3.2 Facets Hierarchy Creation Phase
During our research, we generate two different facets hierarchies by using two external hierarchies: the Arabic Wikipedia categorization hierarchy (Arabic Wikipedia Categorization, n.d.) and the English WordNet IS-A hypernym structure (WordNet, n.d.). Using one Arabic hierarchy and one English hierarchy tests the feasibility of creating a facets hierarchy for a set of documents in a given language using an external hierarchy from another, foreign language. For the latter case, we use the English WordNet IS-A hypernym structure, since it has shown relevant success in building facets hierarchies for English documents (Stoica et al., 2007).
3.2.1 Using Arabic Wikipedia
Categorization Structure
The documents are classified under a huge categorization graph, the "Wikipedia contents" category (Arabic Wikipedia Categorization, n.d.). The Wikipedia sub-category of interest for our work is "The Main Categorization". This main categorization hierarchy is used to build the facets hierarchy for the target terms extracted from the document body text.
Each target term is checked against the Wikipedia article titles. Once found, the Wikipedia hierarchy for the term is built up to a depth of three levels only. Three levels are chosen because the Wikipedia categorization is very deep, so a shortcut is taken by building only three levels of the hierarchy.
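A sketch of this depth-limited, bottom-up category walk is given below; the CategoryStore lookup is a hypothetical helper standing in for a query against the locally stored Wikipedia category table:

import java.util.List;
import java.util.Set;

public final class WikiHierarchySketch {
    interface CategoryStore {
        // Hypothetical helper: the parent categories of a term or category.
        List<String> categoriesOf(String titleOrCategory);
    }

    // Adds the term's ancestor categories to the hierarchy, up to 3 levels.
    static void addAncestors(String node, int depth, CategoryStore store,
                             Set<String> edges) {
        if (depth == 3) return; // the depth cut-off described above
        for (String parent : store.categoriesOf(node)) {
            if (edges.add(parent + " -> " + node)) { // skip revisited edges
                addAncestors(parent, depth + 1, store, edges);
            }
        }
    }
}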
If a training corpus covering the searched data domain exists, then the statistical terms extraction model is used. The term tag of each extracted term is checked against the predefined facet terms, and the term is classified under its corresponding predefined facet if applicable. The predefined facets in the Wikipedia hierarchy are always plain facets. Merging the predefined facets with their corresponding facets in the Wikipedia categorization is not applicable, since the categorizations do not match. For example, a country name in the Wikipedia categorization can be an upper node of the name of a scientist born in that country. In this case, a predefined facet such as location would be a parent node of the country name followed by a person name, which is not sensible.
As the Wikipedia categorization contains some cycles, a depth-first traversal is done in order to remove any existing loops and convert the final graph into a hierarchy. Finally, a tree minimization process takes place to enhance the final structure of the hierarchy presented to the end user. The tree minimization is done using three different methods, as follows; a code sketch of the first method is given after the list.
Removing single-child vertices: as illustrated in Figure 2, any hierarchy vertex with only one child vertex is omitted, and its direct child takes the place of its parent.
Figure 2: Removing single child vertex.
CreatingFacetsHierarchyforUnstructuredArabicDocuments
113
Removing redundant sub-trees: as illustrated in Figure 3, sub-trees occurring more than once at different depth levels are omitted from the deeper level.
Figure 3: Removing redundant sub-trees.
Removing facets with little information: the children of the root node are the facets presented to the end user. If a facet leads to fewer than three articles, it delivers low-quality information and is thus removed from the root's children.
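A minimal sketch of the first minimization method, on a simple tree node type of our own:

import java.util.ArrayList;
import java.util.List;

public final class TreeMinimizationSketch {
    static final class Node {
        String label;
        List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
    }

    // Splices out every vertex with exactly one child: the child takes
    // the place of its parent, as in Figure 2.
    static Node removeSingleChildVertices(Node n) {
        while (n.children.size() == 1) n = n.children.get(0);
        List<Node> compacted = new ArrayList<>();
        for (Node c : n.children) compacted.add(removeSingleChildVertices(c));
        n.children = compacted;
        return n;
    }
}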
3.2.2 Using WordNet IS-A Hypernym
Structure
This phase undergoes five steps, as illustrated in Figure 4:
term translation,
hierarchy construction,
term disambiguation,
hierarchy minimization, and
hierarchy translation back to Arabic.
During the term translation step, the Bing Translator API (Bing Translator, n.d.) is used. Bing uses the statistical machine translation (SMT) paradigm (Tripathi and Sarkhel, 2010), which means that a document is translated according to the probability distribution p(e|f) that a string e in the target language is the translation of a string f in the source language.
In order to avoid double translation, from Arabic to English and then back to Arabic, the Arabic terms are temporarily stored in memory with their translated English meanings.
During the hierarchy construction step, each term tag is checked against the predefined facets if the LingPipe statistical terms extraction model is used. If the tag matches any of them, the term is classified under its corresponding facet. In the WordNet IS-A structure, the predefined facets are merged with their corresponding WordNet entities.
The WordNet Domains tool (WordNet Domains, n.d.) is used to disambiguate terms that have several senses within WordNet. A "Document-Domains" index, containing each document together with all domains appearing within it, is created in parallel with the next step.
Figure 4: Five-step hierarchy construction using WordNet IS-A.
Each translated term is checked against the English WordNet corpus; three different results are possible:
The term is not found within the English WordNet: the term is dropped unless it falls under a predefined facet.
The term has only one meaning within WordNet: the IS-A hierarchy is built for this term up to the WordNet root (a sketch of this step follows the list). Then, WordNet Domains is checked for this term, and all of the term's domains are fetched and inserted into the Document-Domains index for all documents containing the term.
The term has several meanings within WordNet: e.g., the word capital in the English WordNet can be found with the meanings assets, uppercase, or city. In this case, the term is put on an ambiguous terms list in order to be disambiguated in the next step.
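For the second case, the hypernym path can be read off WordNet with any Java WordNet library; here is a sketch using MIT JWI (the library choice and the installation path are our assumptions; the paper does not name one):

import java.io.File;
import java.util.List;
import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;
import edu.mit.jwi.item.ISynset;
import edu.mit.jwi.item.ISynsetID;
import edu.mit.jwi.item.POS;
import edu.mit.jwi.item.Pointer;

public final class HypernymPathSketch {
    public static void main(String[] args) throws Exception {
        IDictionary dict = new Dictionary(new File("/path/to/WordNet/dict"));
        dict.open();
        IIndexWord idx = dict.getIndexWord("capital", POS.NOUN);
        // Take the first sense here; the real system disambiguates first.
        ISynset synset = dict.getWord(idx.getWordIDs().get(0)).getSynset();
        // Follow IS-A (hypernym) links up to the WordNet root.
        while (true) {
            System.out.println(synset.getWords().get(0).getLemma());
            List<ISynsetID> up = synset.getRelatedSynsets(Pointer.HYPERNYM);
            if (up.isEmpty()) break;
            synset = dict.getSynset(up.get(0));
        }
    }
}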
By now, the hierarchy is built for the unambiguous terms, and the Document-Domains index is created. During the term disambiguation step, the following is done for each ambiguous term:
If all the term's senses belong to one domain from the WordNet Domains tool, the first sense is chosen.
Otherwise, for each document containing the term, a sense with a domain already appearing
KEOD2013-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment
114
in the document is chosen. E.g., the term "apple" can have both the "Food" and "Technology" domains; meanwhile, a document that talks about nutrition most probably contains the domain "Food".
If none of the above applies, we use the first sense in WordNet, since it is usually the most common one (Stoica and Hearst, 2006).
After the disambiguation step, the built hierarchy is enlarged with the disambiguated terms: for each disambiguated term, its WordNet IS-A hypernym path is added to the hierarchy of unambiguous terms.
During the hierarchy minimization step, the system converts the constructed hierarchy into moderate facet sub-hierarchies, each with a reasonable depth and breadth at every level. This is done by removing single-child vertices, removing redundant sub-trees, and removing facets with little information, as described for the Wikipedia hierarchy. Additionally, the first three levels are removed in the compression step, since the highest levels of the WordNet tree contain very abstract terms, such as objects, entities, etc., which are not very useful as facets.
In the last step, only the inner nodes in the hierarchy are translated from English to Arabic.
3.2.3 Using a Hybrid Hierarchy
A natural extension of the work is creating the facets by combining both approaches of Sections 3.2.1 and 3.2.2. Each hierarchy is created in a separate thread, and a merging step is performed as soon as both hierarchies are ready. During the merging step, the predefined facets from the WordNet hierarchy are merged with the remaining facets from the Wikipedia hierarchy. This step takes place only when a training corpus is available for use by the LingPipe statistical terms extraction model.
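The two-thread construction can be sketched with an ExecutorService; the FacetTree type and the three build/merge methods are hypothetical placeholders for the builders of Sections 3.2.1 and 3.2.2:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public final class HybridBuildSketch {
    static final class FacetTree { /* hypothetical facet hierarchy type */ }

    static FacetTree buildWikipediaTree() { return new FacetTree(); } // placeholder
    static FacetTree buildWordNetTree()   { return new FacetTree(); } // placeholder
    static FacetTree mergePredefinedFacets(FacetTree wiki, FacetTree wn) {
        return wiki; // placeholder for the merging step
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Callable<FacetTree> wikiTask = HybridBuildSketch::buildWikipediaTree;
        Callable<FacetTree> wnTask   = HybridBuildSketch::buildWordNetTree;
        // Both hierarchies are built concurrently, then merged once ready.
        Future<FacetTree> wiki = pool.submit(wikiTask);
        Future<FacetTree> wn   = pool.submit(wnTask);
        FacetTree hybrid = mergePredefinedFacets(wiki.get(), wn.get());
        pool.shutdown();
    }
}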
4 THE PROTOTYPE
We develop a hierarchical faceted search system for Arabic documents. The user enters the query and waits for the result set to be returned in a faceted interface, as illustrated in Figure 5.
The system is implemented in Java as a web application under the Tomcat server. MySQL is used as the database management system. The database is initialized with all data from the Arabic Wikipedia and the English WordNet after preparation and tuning for performance.
Figure 5: Query entry screen.
When the user enters a search query, this query is sent to the Google search engine using the Google Custom Search API (Google API, n.d.). The Google Custom Search API returns a JSON-encoded (JSON, n.d.) result-set, which is converted into Java objects using the Gson API (GSON, n.d.). Each result item is a complete web page. The Jsoup API (JSOUP, n.d.) is used to extract the main text document from the whole HTML file. The most significant terms appearing in the searched documents are then extracted using the LingPipe tool (LingPipe, n.d.).
Facets created using the three methods are displayed on separate tabs, as illustrated in Figure 6.
Figure 6: Result-set with facets.
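The fetch-and-extract pipeline can be sketched as follows; the API key and engine id are placeholders, the JSON field names follow the public Google Custom Search API, and error handling is omitted:

import com.google.gson.Gson;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import org.jsoup.Jsoup;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Scanner;

public final class SearchPipelineSketch {
    public static void main(String[] args) throws Exception {
        // API_KEY and ENGINE_ID stand in for real credentials.
        String endpoint = "https://www.googleapis.com/customsearch/v1"
            + "?key=API_KEY&cx=ENGINE_ID&q=" + URLEncoder.encode(args[0], "UTF-8");
        String json = new Scanner(new URL(endpoint).openStream(), "UTF-8")
            .useDelimiter("\\A").next();

        // Gson: pull the result items out of the JSON envelope.
        JsonObject root = new Gson().fromJson(json, JsonObject.class);
        for (JsonElement item : root.getAsJsonArray("items")) {
            String link = item.getAsJsonObject().get("link").getAsString();
            // Jsoup: fetch the page and keep only its visible text.
            String bodyText = Jsoup.connect(link).get().text();
            System.out.println(link + " -> " + bodyText.length() + " chars");
        }
    }
}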
5 THE VALIDATION
The system prototype is assessed by searching in real-life articles extracted from two sources: the BBC Arabic edition website and the Arab Sciencepedia website. The main differences between the two sources are the diversity of the news versus the concise nature of the scientific articles, and the availability of a training corpus for the news articles. We compare the resulting facet hierarchies in terms of relevance, structure of the hierarchies (number of facets and depth of the hierarchy), and creation time.
CreatingFacetsHierarchyforUnstructuredArabicDocuments
115
5.1 Relevance of Terms, Articles, and
Facets
5.1.1 Terms Coverage
The set of candidate terms is a subset of the set of terms extracted from the text articles. These terms are used in building the facets hierarchies. The final set of terms that appears in the hierarchies is usually only a subset of the candidate terms. It depends on the ability of the external resource – Wikipedia and/or WordNet – to recognize the terms, and on the hierarchy compression step. We define the terms coverage to be the number of terms in the facets hierarchy divided by the number of candidate terms. Any loss in the terms coverage is undesirable.
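Written out (in our notation):

\[ \text{terms coverage} = \frac{|\{\text{terms appearing in the facets hierarchy}\}|}{|\{\text{candidate terms}\}|} \]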
As shown in Figure 7, the Wikipedia hierarchy covers 80% of the target terms, while the loss of terms in the WordNet hierarchy exceeds 50%. The poor performance of WordNet on the scientific articles is attributed to the lack of a training corpus.
Figure 7: Terms coverage.
5.1.2 Articles Coverage
The search operation made at the beginning of the process returns a set of articles. The created facets are used to navigate subsets of these articles. Depending on the terms coverage and the minimization step of the built hierarchy, links to some articles can be removed. Articles coverage is a very important measure, because it shows the reliability of the built hierarchy. We define the articles coverage to be the number of articles accessible through the facets hierarchy divided by the total number of articles.
As illustrated in Figure 8, the Wikipedia hierarchy has nearly 100% articles coverage, but the WordNet hierarchy has a significantly lower value. In the WordNet hierarchy, the scientific articles have a higher value than the news articles, because scientific expressions related to the user query are usually available in the returned articles, as scientific contents are generally more concise than news contents. Nevertheless, the Wikipedia hierarchy achieves better articles coverage.
Figure 8: Articles coverage.
5.1.3 Accuracy of Facets
In this assessment, each facet is inspected manually. A value of 0 is assigned to a facet that is not related to its assigned articles, and a value of 1 to a facet that accurately describes its assigned articles. This measure indicates whether the facet terms extracted from the articles accurately describe the text articles or not. We define the facet accuracy to be the number of facets closely related to their text topics divided by the total number of facets assigned to text topics. As shown in Figure 9, the hierarchies have 87-97% accurate facet terms for both methods and both data sets.
Figure 9: Accuracy of facets.
5.1.4 Significance of the Facet Tree
This assessment is also performed manually. We check whether a facet term is correctly placed in the hierarchy or not. A facet is correctly placed if the parent facet accurately describes its children and the child facet is placed under the correct parent.
We define the significance factor to be the number of significant facets in the hierarchy divided by
KEOD2013-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment
116
the total number of facets in the hierarchy. Again, all hierarchies achieve 86-96% facet significance.
We are also interested in the amount of information lost during the translation from Arabic to English and vice versa in the WordNet approach. Again, each term in the hierarchy is manually assigned a 0 or 1 value for the translation factor: 0 indicates a bad translation that affected the meaning of the term, and 1 indicates a correct translation. We define the translation factor to be the number of facet terms with exact translations divided by the total number of facets in the hierarchy. The translation factor is about 95%, which means a very low level of loss of meaning within the facet terms.
5.2 Comparison of the Facet
Hierarchies
In this section, we compare the shapes of the facet hierarchies. A hierarchical facet structure is said to be good if it has a reasonable number of siblings for each node and the expanded path from root to leaf fits in one screen without scrolling.
5.2.1 Number of Facets
As shown in Figure 10, the Wikipedia hierarchy has an average of 27 facets for the BBC news articles, while WordNet has only 9. For the Arab Sciencepedia articles, both the Wikipedia and the WordNet hierarchies have an average of 15 facets. In the Wikipedia hierarchies, the number of facets decreases from 27 for the news articles to 15 for the scientific articles, because most of the extracted terms are related to a small number of topics with more concentrated categories than those extracted from the BBC news articles; this is typical of the focused nature of related scientific articles as compared to the broad nature of news articles. The WordNet number of facets increases from an average of 9 for the BBC news articles to 15 for the Arab Sciencepedia articles, because WordNet understands most of the scientific expressions, so better results are obtained.
5.2.2 Average Tree Depth
The average tree depth is computed as follows:
\[
\text{average\_depth} = \frac{\sum_{t \,\in\, \text{final facets list}} \text{max\_depth}(t)}{\text{number of facets in the final facets list}}
\]
The average tree depth is computed three times for each hierarchy: once for the LingPipe facets, once for the non-LingPipe facets, and once for the whole hierarchy.
Figure 10: Average number of facets.
The predefined facets of the Wikipedia hierarchy are flat, while the predefined facets of WordNet are hierarchical, which makes the average depth of the WordNet hierarchy better than that of the Wikipedia hierarchy. The average depth is illustrated by the graph in Figure 11. This is one of the rare cases where WordNet yields better results than Wikipedia.
Figure 11: Average depth of facet trees.
5.3 Creation Time
The total time needed for building each of the hier-
archies is illustrated in Figure 12. The system con-
figured only to use WordNet hierarchy is always
slower than the Wikipedia hierarchy. In the hybrid
hierarchy, both hierarchies are created in separate
threads. Then a fast merging step is done. So, it is
clear that the creation time for the hybrid model is
slightly higher than the maximum of each measure-
ment.
CreatingFacetsHierarchyforUnstructuredArabicDocuments
117
Figure 12: Creation time in seconds.
6 CONCLUSIONS AND FUTURE
WORK
In our work, we develop a hierarchical faceted search system for Arabic documents. Building the facets involves two steps: creating the set of facets, then building the facets hierarchy. The set of facets is created by extracting the most significant terms from the searched documents. Then, two different paradigms are used in the hierarchy building process in order to check whether it is better to use Arabic tools and content or English ones. English tools, such as WordNet, have already shown success in creating faceted interfaces for English content. Using English tools implies a translation step from Arabic to English and vice versa.
The hierarchy built using the Wikipedia categorization is constructed in a bottom-up manner. Each term in the candidate terms list is checked against the Wikipedia categorization content and, if found, all its parents are inserted into the hierarchy. This step is done recursively for each parent until a three-level hierarchy is obtained.
To use the WordNet IS-A hypernym structure, the candidate terms are translated into English. Then, each term is checked against the WordNet structure. Once found, its whole parent hierarchy is inserted into the hierarchy under construction. If a term is not found in the WordNet structure but has a tag value equal to one of the predefined facets, it is directly placed under this facet. Any ambiguous term is resolved using the WordNet Domains tool. Finally, the terms inserted in the hierarchy are translated to Arabic.
One major drawback appeared while using the English WordNet with Arabic text. Arabic cultural expressions, even if well translated into English, are not covered by the English WordNet content. This results in dropping a significant number of candidate terms, and building the facets hierarchy with only a subset of the candidate terms results in low terms and articles coverage. Building the facets hierarchy using the Arabic Wikipedia categorization structure, in contrast, shows better performance for both news articles and scientific articles.
One note must be considered when using the faceted search interface framework with different document contents. If the LingPipe statistical extraction model is used, a training corpus related to the document contents must be used in order to get acceptable results. The LingPipe statistical model tags can be checked against the English WordNet corpus; once found, a hybrid facets hierarchy is created by merging the two hierarchies, giving an enhanced facet hierarchy.
Finally, we are aware that the comparison is the result of a clear competition between a wiki-based approach – organically evolving, with a huge community of contributors – and a well-structured approach driven by linguistic science. This benchmarking should be periodically revisited, as the results will change with the development of better linguistic tools for the Arabic language. An interesting extension would be to use the Arabic WordNet within a few years, in the same way it was used to improve question answering for Arabic (Abouenour et al., 2008).
REFERENCES
Abouenour L., Bouzoubaa K., Rosso P., 2008. Improving
Q/A Using Arabic Wordnet. In the International Arab
Conference on Information Technology (ACIT'2008),
Tunisia.
Arabic Wikipedia, n.d. [Online] Available at:
<http://ar.wikipedia.org/> [Accessed 13 8 2012].
Arabic Wikipedia Categorization, n.d. The Arabic
Wikipedia Categorization root category, [Online]
Available at: <http://ar.wikipedia.org/wiki/
_:> [Accessed 20 6
2012].
Arabic WordNet, n.d. [Online] Available at: <http://
www.globalwordnet.org/AWN/> [Accessed 7 6 2012].
ArabSciencepedia, n.d. [Online] Available at: http://
www.arabsciencepedia.org/ [Accessed 7 7 2012].
BBC Arabic, n.d. [Online] Available at: <http://
www.bbc.co.uk/arabic/> [Accessed 7 7 2012].
BinAjiba, Y., n.d. Arabic NER Corpus and Documents.
[Online] Available at:
<http://www1.ccls.columbia.edu/~ybenajiba/download
s.html> [Accessed 12 2 2013].
Bing Translator, n.d. The Bing translator API [Online]
Available at: <http://www.microsoft.com/web/post/
using-the-free-bing-translation-apis> [Accessed 4 9
2012].
KEOD2013-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment
118
Dakka, W. & Ipeirotis, P. G., 2008. Automatic extraction
of useful facet hierarchies from text databases. In
IEEE 24th International Conference on Data
Engineering (ICDE), pp.466-475.
Google API, n.d. Google Custom Search API [Online] Available at: <http://code.google.com/apis/customsearch/> [Accessed 23 2 2013].
GSON, n.d. Google GSON API [Online] Available at:
<https://code.google.com/p/google-gson/>
[Accessed 13 2 2013].
Hearst, M. A., 2008. UIs for faceted navigation: recent advances and remaining open problems. In Workshop on Human-Computer Interaction and Information Retrieval (HCIR), Redmond, WA.
Hearst, M. A. & Pedersen, J. O., 1996. Reexamining the cluster hypothesis. In Proceedings of the 19th Annual International ACM/SIGIR Conference, Zurich.
JSON, n.d. The Json file format [Online] Available at:
<http://www.json.org/> [Accessed 14 3 2013].
JSOUP, n.d. Java HTML parser API [Online] Available
at: <http://jsoup.org/> [Accessed 23 2 2013].
Lim, D. et al., 2011. Utilizing Wikipedia as a knowledge source in categorizing topic-related Korean blogs into facets. In Japanese Society for Artificial Intelligence (JSAI), Takamatsu.
LingPipe, n.d. LingPipe Named Entity Recognizer,
[Online] Available at: http://alias-i.com/lingpipe/
demos/tutorial/ne/read-me.html [Accessed 4 9 2012].
Manning, C. D., Raghavan, P. & Schütze, H., 2009. Introduction to Information Retrieval. Cambridge University Press.
Sanderson, M. & Croft, B., 1999. Deriving concept hierarchies from text. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Naver, n.d. Naver Open API [Online] Available at: <http://
dev.naver.com/openapi/> [Accessed 15 2 2013].
Wikipedia Redirect, n.d. [Online] Available at:
<http://en.wikipedia.org/wiki/Help:Redirect>
[Accessed 15 2 2013].
Stoica, E. & Hearst, M. A., 2006. Demonstration: Using WordNet to build hierarchical networks. In the ACM SIGIR Workshop on Faceted Search.
Stoica, E. & Hearst, M., 2004. Nearly-Automated
Metadata hierarchy creation. In Human Language
Technologies: The Annual Conference of the North
American Chapter of the Association for
Computational Linguistics (NAACL-HLT), Boston.
Stoica, E., Hearst, M. & Richardson, M., 2007.
Automating creation of hierarchical Faceted metadata
structures. In Human Language Technologies: The
Annual Conference of the North American Chapter of
the Association for Computational Linguistics
(NAACL-HLT),
Rochester NY, USA.
Tripathi, S. & Sarkhel, J. K., 2010. Approaches to machine translation. Annals of Library and Information Studies (ALIS), Vol. 57, pp. 388-393.
Tunkelang, D., 2009. Faceted Search. Morgan &
Claypool.
Wikipedia Categorization, n.d. [Online] Available at:
<http://en.wikipedia.org/wiki/Wikipedia:Categorizatio
n> [Accessed 12 5 2013].
WordNet Domains, n.d. [Online] Available at:
<http://wndomains.fbk.eu/> [Accessed 13 2 2013].
WordNet, n.d. [Online] Available at: <http://
wordnet.princeton.edu/> [Accessed 3 2 2013].
Yahoo History, n.d. [Online] Available at: <http://
docs.yahoo.com/info/misc/history.html>
[Accessed 9 8 2012].
Yan, N. et al., 2010. Facetedpedia: dynamic generation of query-dependent faceted interface for Wikipedia. In Proceedings of the 19th International Conference on World Wide Web, pp. 651-660.
CreatingFacetsHierarchyforUnstructuredArabicDocuments
119