determined keywords in a specific domain, a pass fil-
ter passes a page if the page title matches or contains
one of these keywords, and blocks it otherwise. Given
a set of pre-determined keywords not in the given do-
main, a block filter blocks a page if the title of the
page matches or contains one of the keywords, and
passes it otherwise. Constructing either type of filter, however, is challenging, for it is difficult to predict which keywords will or will not appear in the title of a page, and an outgoing link could point to a completely different and unexpected category. In particular, an overly strict pass filter may fail to include pages that should be included, while an insufficiently strict block filter may include pages that should not be included. Page tracing has another major drawback: it may fail to discover pages that are not reachable from other pages. Such “hidden” pages may be newly created pages that no existing page links to yet, pages on fringe topics known to only a small number of people, or pages about rarely used organic compounds in chemistry.
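To make the two filter types concrete, the following minimal sketch (with hypothetical keyword sets and function names not taken from the paper) shows how a title-based pass filter and block filter could be implemented.

```python
# Minimal sketch of title-based pass/block filters.
# The keyword sets and function names are hypothetical, not taken from the paper.

PASS_KEYWORDS = {"chemistry", "chemical", "acid", "compound"}   # in-domain keywords
BLOCK_KEYWORDS = {"people", "birth", "prize", "facility"}       # out-of-domain keywords

def passes_pass_filter(title: str) -> bool:
    """A pass filter passes a page whose title contains an in-domain keyword."""
    t = title.lower()
    return any(kw in t for kw in PASS_KEYWORDS)

def passes_block_filter(title: str) -> bool:
    """A block filter blocks a page whose title contains an out-of-domain keyword."""
    t = title.lower()
    return not any(kw in t for kw in BLOCK_KEYWORDS)

print(passes_pass_filter("Sulfuric acid"))         # True: "acid" is an in-domain keyword
print(passes_block_filter("Nobel Prize winners"))  # False: "prize" is a block keyword
```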
To overcome these drawbacks, we devise an effec-
tive method to extract a domain knowledge network
from Wikipedia based on page titles and page cate-
gories. The page category, a feature of Wikipedia,
is used to classify pages. Each page belongs to one
or more categories and each category contains several
pages or sub-categories. The categories of a page are listed at the end of the page, and each category is itself a special page: although category pages are not typically reached while browsing, they can be loaded explicitly using the “Category:” prefix. We use page categories both to generate new pages and to check whether a page should be included in the network. Our extraction mechanism consists of the following three steps.
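As an aside, the category information used throughout these steps can be obtained programmatically, for example through the public MediaWiki API; the snippet below is one possible way to fetch a page's categories and is not necessarily how the original extraction was implemented.

```python
# One possible way to obtain a page's category information programmatically,
# using the public MediaWiki API (the paper does not specify this mechanism).
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def get_categories(title: str) -> list[str]:
    """Return the category titles (with the "Category:" prefix) listed on a page."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": title,
        "cllimit": "max",
        "format": "json",
    }
    data = requests.get(API_URL, params=params).json()
    categories = []
    for page in data["query"]["pages"].values():
        for cat in page.get("categories", []):
            categories.append(cat["title"])
    return categories

# Example: get_categories("Sulfuric acid") should return a list of
# chemistry-related category titles (the exact names change over time).
```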
Step 1: Extract Domain Category Hierarchy.
Each category may contain a number of pages or sub-categories; thus the categories themselves form a hierarchy that represents a framework of the domain knowledge. In this step we use page categories to generate new pages.
An ideal domain hierarchy should be a directed acyclic graph containing exactly the nodes in the domain, but the Wikipedia category hierarchies are far from ideal. In particular, a category hierarchy may contain two types of loops: local loops and out-domain loops. A local loop occurs when two or more closely related categories (sometimes essentially the same category with slightly different descriptions) contain each other. This type of loop has no significant effect on the domain knowledge generation. An out-domain loop is a loop that contains a node in a different domain. Out-domain loops could be catastrophic if not handled properly, for they might lead to a super-category that contains chemistry as a subcategory. Take the following out-domain loop as an example:
Chemistry → ··· → Silicon → ··· → Memory →
Knowledge → Science → ··· → Chemistry
where the ambiguity of “Memory” leads it to be interpreted as human memory instead of computer memory as intended, which in turn leads to the super-category “Knowledge”, as in human knowledge. If such an out-domain loop is not handled properly, all categories under “Knowledge” would be included. Fortunately, such out-domain loops are rare: category misclassification errors are reported by users every day, and Wikipedia editors correct them promptly.
Deploying a keyword-based block filter with a properly chosen set of keywords avoids such misinterpretations. We repeat the process of generating the domain category hierarchy several times, updating the block filter each time, until a satisfactory quality is obtained.
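One way to realize this step is a breadth-first traversal of the subcategory relation starting from the domain's root category, skipping any category that hits the block filter and marking visited categories so that neither local nor out-domain loops can be followed. The sketch below assumes a hypothetical get_subcategories() helper and an example keyword set.

```python
# Sketch of Step 1: build the domain category hierarchy by a breadth-first
# traversal of subcategories.  get_subcategories() is a hypothetical helper
# (e.g. backed by the MediaWiki API or a Wikipedia dump).
from collections import deque

BLOCK_KEYWORDS = {"people", "knowledge", "memory"}  # refined over several runs

def blocked(category: str) -> bool:
    t = category.lower()
    return any(kw in t for kw in BLOCK_KEYWORDS)

def get_subcategories(category: str) -> list[str]:
    raise NotImplementedError  # list the subcategories of "Category:<category>"

def extract_hierarchy(root: str) -> dict[str, list[str]]:
    """Return the filtered hierarchy as a parent -> children adjacency map."""
    hierarchy: dict[str, list[str]] = {}
    visited = {root}
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        children = []
        for sub in get_subcategories(cat):
            if blocked(sub):
                continue            # the block filter keeps out-domain branches out
            children.append(sub)
            if sub not in visited:  # the visited set prevents local and
                visited.add(sub)    # out-domain loops from being followed
                queue.append(sub)
        hierarchy[cat] = children
    return hierarchy

# The block keyword list is updated and the extraction re-run until the
# resulting hierarchy is of satisfactory quality.
```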
Step 2: Extract Pages. With the hierarchy of do-
main categories at hand, we are ready to extract pages
that belong to the hierarchy. We note that some pages
listed under a category in the hierarchy may not be-
long to the domain of interest and should not be in-
cluded. For example, a biography of a chemist might
be listed in the hierarchy and should not be included
in the Wikipedia Chemistry network. The category information of a page can be used to check for inclusion: since a page may belong to several different categories, a page that should not be included usually contains a block keyword in the concatenated titles of its categories. For instance, chemists also belong to the category “People”. To avoid adding chemists to our knowledge network, we can simply add the keyword “People” to the block filter. Other similar block keywords would be “Birth”, “Prize”, and “Facility”, to name only a few.
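Concretely, the inclusion test can be phrased as a check of a page's concatenated category titles against the block keywords, as in the following sketch (the keyword set is illustrative only).

```python
# Sketch of the Step 2 inclusion check: a page is kept only if none of the
# block keywords occurs in the concatenation of its category titles.
BLOCK_KEYWORDS = {"people", "birth", "prize", "facility"}

def should_include(category_titles: list[str]) -> bool:
    """Keep a page unless a block keyword appears among its category titles."""
    concatenated = " ".join(category_titles).lower()
    return not any(kw in concatenated for kw in BLOCK_KEYWORDS)

# A chemist's biography typically carries categories such as
# "Category:American chemists" and "Category:1919 births"; the keyword
# "birth" therefore excludes it from the Chemistry network.
```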
After extracting the pages that pass the block filter, we scan each page for links and build links within the current set of pages. No new pages are added at this point.
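A minimal sketch of this link-building pass, assuming a hypothetical get_outgoing_links() helper, restricts every edge to the already-extracted page set.

```python
# Sketch of the link-building pass: edges are added only between pages that
# are already in the extracted set, so no new page is introduced.
# get_outgoing_links() is a hypothetical helper returning the wiki links of a page.
def get_outgoing_links(page: str) -> list[str]:
    raise NotImplementedError

def build_links(pages: set[str]) -> dict[str, set[str]]:
    """Return an adjacency map restricted to the extracted page set."""
    graph: dict[str, set[str]] = {p: set() for p in pages}
    for page in pages:
        for target in get_outgoing_links(page):
            if target in pages:     # ignore links that leave the extracted set
                graph[page].add(target)
    return graph
```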
Step 3: Trim Disconnected Components. Note that certain pages in the network after the previous two steps might belong to a cluster that is disconnected from the rest of the network. This can be used