(WordNet, n.d.) and the Wikipedia categorization
structure (Wikipedia Categorization, n.d.). Previous
work on creating faceted search interfaces for
unstructured documents includes Castanet (Stoica and
Hearst, 2004; Stoica et al., 2007), facets for text
databases (Dakka and Ipeirotis, 2008), Facetedpedia
(Yan et al., 2010), and facets for Korean blog posts
(Lim et al., 2011), the last being an example of facet
creation for non-Latin languages.
2.3 Facet Creation Research Projects
The Castanet algorithm (Stoica and Hearst, 2004;
Stoica et al., 2007) assumes that a text description
is associated with each item in the collection. For
example, if the collection is a set of images, each
image has an unstructured text document describing
it. The textual descriptions are used to build the
facet hierarchies and then to assign documents to
facets.
The target term set is the subset of terms that
best describes the set of documents. Terms are
selected by their term distribution: those whose
distribution is greater than or equal to a specified
threshold are retained as target terms. The target
term set is divided into two categories:
Ambiguous terms: terms having more than one
meaning in the English WordNet.
Unambiguous terms: terms with only one meaning in
the English WordNet.
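The selection and partitioning steps above can be sketched as follows. This is only an illustrative sketch: it assumes that "term distribution" means the number of item descriptions containing a term, and it replaces the WordNet lookup with a small hypothetical dictionary of sense counts (`sense_counts`). The function names, the sample descriptions, and the threshold value are all invented for the example.

```python
from collections import Counter

def select_target_terms(docs, min_distribution):
    """Keep terms that occur in at least `min_distribution` documents."""
    distribution = Counter()
    for text in docs:
        # Assumption: each term is counted once per document.
        distribution.update(set(text.lower().split()))
    return {t for t, n in distribution.items() if n >= min_distribution}

def split_by_ambiguity(terms, sense_counts):
    """Partition terms by their number of senses in a sense inventory.

    Terms missing from the inventory (zero senses) land in neither set.
    """
    ambiguous = {t for t in terms if sense_counts.get(t, 0) > 1}
    unambiguous = {t for t in terms if sense_counts.get(t, 0) == 1}
    return ambiguous, unambiguous

# Hypothetical item descriptions.
docs = [
    "fresh apple pie",
    "apple orchard",
    "river bank",
    "bank loan",
    "river delta",
]
# Hypothetical sense counts standing in for a WordNet lookup.
sense_counts = {"apple": 2, "bank": 10, "river": 1}

targets = select_target_terms(docs, min_distribution=2)
ambiguous, unambiguous = split_by_ambiguity(targets, sense_counts)
```

With this toy input, only terms appearing in at least two descriptions survive the threshold, and they are then split by the number of senses recorded in the stand-in inventory.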
The core hierarchy is first built for the
unambiguous terms using the WordNet IS-A hypernym
structure (WordNet, n.d.). The ambiguous terms are
then checked against WordNet Domains (WordNet
Domains, n.d.), a tool that assigns domains to each
WordNet synonym set. The tool counts the occurrences
of each domain among the unambiguous target terms,
yielding a list of the domains most represented in
the set of documents. If all WordNet senses of an
ambiguous term share a single common domain, the
term is considered unambiguous. The core hierarchy
is then augmented with the IS-A hypernym paths of
these disambiguated terms. Next, a refinement step
compresses the resulting hierarchy: nodes with fewer
children than a threshold, and nodes whose names
appear in their parent node's name, are eliminated.
Finally, to create a set of sub-hierarchies, the top
levels (e.g., 4 levels) are pruned, which yields the
final facet hierarchy.
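The hierarchy construction, compression, and pruning steps can be illustrated with a minimal sketch. It is not the Castanet implementation: the hypernym paths are hypothetical stand-ins for WordNet output, the child-count rule is applied to internal nodes only (an assumption, since leaves have no children), the name rule uses plain substring matching, and eliminated nodes pass their children up to the parent. The names `build_tree`, `compress`, and `prune_top` are invented for the example.

```python
def merge_into(dst, src):
    """Recursively merge the nested-dict tree `src` into `dst`."""
    for name, kids in src.items():
        merge_into(dst.setdefault(name, {}), kids)

def build_tree(paths):
    """Merge root-to-term hypernym paths into one nested-dict tree."""
    tree = {}
    for path in paths:
        node = tree
        for name in path:
            node = node.setdefault(name, {})
    return tree

def compress(tree, min_children, parent_name=""):
    """Eliminate weak nodes bottom-up, promoting their children."""
    result = {}
    for name, children in tree.items():
        kids = compress(children, min_children, name)
        # Assumption: the child-count rule applies to internal nodes only.
        too_few = bool(children) and len(kids) < min_children
        # Crude substring test for "name appears in the parent's name".
        redundant = bool(parent_name) and name in parent_name
        if too_few or redundant:
            merge_into(result, kids)
        else:
            merge_into(result.setdefault(name, {}), kids)
    return result

def prune_top(tree, levels):
    """Drop the top `levels` levels; exposed subtrees become the facets."""
    for _ in range(levels):
        next_level = {}
        for kids in tree.values():
            merge_into(next_level, kids)
        tree = next_level
    return tree

# Hypothetical hypernym paths, written root first.
paths = [
    ["entity", "food", "dessert", "frozen dessert", "sundae"],
    ["entity", "food", "dessert", "frozen dessert", "sherbet"],
    ["entity", "food", "dessert", "pudding"],
    ["entity", "food", "produce", "fruit", "apple"],
    ["entity", "food", "produce", "fruit", "pear"],
    ["entity", "artifact", "tool", "hammer"],
]
core = build_tree(paths)
compressed = compress(core, min_children=2)
facets = prune_top(compressed, levels=2)
name_rule = compress(
    {"sweet dessert": {"dessert": {"sundae": {}, "pudding": {}}}}, 2
)
```

In this toy run, the single-child chains ("produce", "artifact", "tool") are spliced out during compression, and pruning the top two levels leaves "dessert" and "fruit" as separate facet hierarchies. Note that a leaf sitting inside the pruned levels disappears in this sketch; a real system would need a policy for such terms.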
The algorithm for creating facets for text
databases is presented in (Dakka and Ipeirotis, 2008).
It is built on the observation that the terms for
useful facets do not usually appear in the documents'
contents. Thus, the target term list is created from
two sets of terms.
The first set includes the significant terms
extracted from the document body text using
extraction tools.
The second set is created by expanding the first
set with other relevant terms using WordNet
hypernyms, Wikipedia contents, and the terms
that tend to co-occur with the first set of terms
when queried against the Google search engine.
Term frequency in both the original and the
expanded term sets is used to identify the final
candidate facets. Infrequent terms from the first
set, together with the frequent terms from the
expanded second set, form the final set of facets.
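The frequency comparison can be sketched under one possible reading of the criterion above: a term qualifies as a candidate facet when it is infrequent in the original term sets yet frequent in the expanded ones. The document-frequency measure, both thresholds, the function names, and the sample terms are assumptions made for illustration only.

```python
from collections import Counter

def doc_frequency(docs):
    """Fraction of documents whose term list contains each term."""
    counts = Counter()
    for terms in docs:
        counts.update(set(terms))
    return {t: n / len(docs) for t, n in counts.items()}

def candidate_facets(original, expanded, max_original, min_expanded):
    """Terms rare in the original sets but common in the expanded sets."""
    df_orig = doc_frequency(original)
    df_exp = doc_frequency(expanded)
    return sorted(
        t for t, f in df_exp.items()
        if f >= min_expanded and df_orig.get(t, 0.0) <= max_original
    )

# Hypothetical significant terms per document (first set).
original = [
    ["chirac", "paris"],
    ["merkel", "berlin"],
    ["sarkozy", "paris"],
]
# The same documents after a hypothetical expansion step (second set).
expanded = [
    ["chirac", "paris", "politician", "france"],
    ["merkel", "berlin", "politician", "germany"],
    ["sarkozy", "paris", "politician", "france"],
]

facets = candidate_facets(
    original, expanded, max_original=0.2, min_expanded=0.6
)
```

Here "paris" is excluded because it is already frequent in the original sets, while "politician" and "france" appear only after expansion and are frequent there, which matches the observation that useful facet terms are often absent from the documents themselves.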
Facetedpedia (Yan et al., 2010) is a project that
dynamically generates a query-dependent faceted
interface for Wikipedia articles returned by a search.
The following definitions introduce the main concepts
used in the algorithm.
Target Articles: the articles in the result set
returned for the user query.
Attribute Articles: the Wikipedia articles that are
hyperlinked from a target article.
Category Hierarchy: the Wikipedia category
hierarchy, a connected, rooted directed acyclic
graph.
One large hierarchy is built for the target articles
as follows. Each target article is connected to all of
its attribute articles. Then, a category hierarchy is
built for each attribute article. The hierarchies of
all attribute articles are merged until one common
root category is found. Finally, the most appropriate
set of sub-categories is chosen from within the
resulting hierarchy using a cost measure and a
similarity measure developed by the authors.
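The construction of the merged hierarchy can be sketched as a union of child-to-parent edges. This is a simplified illustration rather than the Facetedpedia implementation: the link table, the article-to-category table, and the category parent table are hypothetical in-memory stand-ins for Wikipedia data, and the final cost- and similarity-based selection of sub-categories is omitted because its details are not given here.

```python
def ancestor_edges(start_categories, parents):
    """Collect (child, parent) category edges reachable upward."""
    edges, seen = set(), set()
    stack = list(start_categories)
    while stack:
        cat = stack.pop()
        if cat in seen:
            continue
        seen.add(cat)
        for parent in parents.get(cat, []):
            edges.add((cat, parent))
            stack.append(parent)
    return edges

def build_hierarchy(targets, links, article_cats, parents):
    """Merge the category hierarchies of all attribute articles."""
    edges = set()
    for target in targets:
        for attr in links.get(target, []):
            # Target article -> attribute article it hyperlinks to.
            edges.add((target, attr))
            cats = article_cats.get(attr, [])
            edges.update((attr, c) for c in cats)
            # A set union merges hierarchies that share categories.
            edges |= ancestor_edges(cats, parents)
    return edges

# Hypothetical data standing in for Wikipedia links and categories.
links = {
    "Film A": ["Director X", "Actor Y"],
    "Film B": ["Actor Y", "Director Z"],
}
article_cats = {
    "Director X": ["Film directors"],
    "Director Z": ["Film directors"],
    "Actor Y": ["Actors"],
}
parents = {
    "Film directors": ["People"],
    "Actors": ["People"],
    "People": ["Root"],
}

edges = build_hierarchy(["Film A", "Film B"], links, article_cats, parents)
# Nodes with no outgoing (child -> parent) edge are roots.
roots = {p for _, p in edges} - {c for c, _ in edges}
```

Because every edge points from a child to its parent, the common root is simply the one node that never appears as a child.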
The work in (Lim et al., 2011) is an example of
facet creation for non-Latin languages. Non-Latin
languages such as Korean or Arabic lack linguistic
tools as powerful as the English WordNet, so
researchers working with these languages rely on
workarounds. In (Lim et al., 2011), the system
generates a flat facet interface for Korean blog
posts. Given a search query keyword, blog posts are
retrieved using the “Naver Open API” search engine
for Korean (Naver, n.d.). For the initial keyword, a
set of blog posts is constructed, with the body text
of each post extracted from the blog post. The facet
generation process is done in five steps. The system
collects Wikipedia articles that include the user
query and extracts the titles of these Wikipedia
pages to use them as facet candidates for the blog
posts. After constructing the candidate facet term
set, only the