the domain are about financial news and comments,
and their contents are professional and limited to the
specific field. To adapt to the characteristics of the
finance domain, we need to add domain knowledge to
our model. Apart from that, the calculation of web
page authorities is the same as that introduced in
Section 3.
Our experiment is based on web pages crawled from
the Internet. After crawling, the pages are processed
to extract the information needed for the experiment,
and our authority model is then applied to them. To
better evaluate the experimental results, we partition
the authorities of web pages into different ranks and
use a manually annotated set for evaluation. A
detailed description and analysis are presented below.
4.1 Adding the Domain Knowledge
In our experiment, the authority model is applied to
the finance domain, so domain knowledge is needed to
judge the authority of web pages. We add domain
knowledge to our model mainly by adjusting the
importance of sources according to the features of
the domain. In Section 3.1 we introduced our method
of obtaining importance scores from Alexa, which
provides general rankings based on daily visits to
websites. However, the finance domain has its own
characteristics, which cannot be captured by Alexa
alone. For example, China Stock is a well-known
professional finance website in China, but it is not
ranked highly in Alexa. Because financial websites
are specialized and limited in scope, they usually do
not receive many visits, and their visitors are people
who are interested in finance and have the relevant
background knowledge, rather than ordinary Internet
users. Hence, such professional websites may be
ranked low in Alexa, and these rankings do not
reflect their real standing within the domain.
Therefore, adding domain knowledge to our
previous rankings is necessary for calculating the
authorities of financial web pages. We collected
several resources on the rankings of Chinese finance
newspapers, periodicals and websites. Based on these
resources and the opinions of domain experts, the
importance of some sources is adjusted: the scores of
professional and important financial websites are
increased, less important financial websites are
ranked lower, and the scores of some well-known
portals are decreased, since finance is not their main
scope. Through this adjustment we build a database
of source rankings. More sources will be added to
the database as our model is used, so the database
will contain more and more information about sources
in the domain. This knowledge is useful for the
authority calculation and can be reused in the future,
which makes the adjustment effort worthwhile. After
the adaptation to the finance domain, the importance
of sources agrees better with the real situation within
the domain, which allows us to obtain more accurate
results in our experiment.
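The adjustment described above can be illustrated with a minimal sketch.
It assumes that the general importance scores are already normalized to
[0, 1] and that each expert judgement is expressed as a multiplicative
factor; the factor scheme, the function name adjust_importance and all
source names and numbers are hypothetical and are not taken from our
experiment.

```python
from typing import Dict


def adjust_importance(general: Dict[str, float],
                      factors: Dict[str, float]) -> Dict[str, float]:
    """Scale each normalized score by its expert factor, clamped to [0, 1].

    Sources without an expert factor keep their general score.
    The multiplicative scheme is an assumption made for illustration.
    """
    adjusted = {}
    for source, score in general.items():
        factor = factors.get(source, 1.0)
        adjusted[source] = min(1.0, max(0.0, score * factor))
    return adjusted


# Hypothetical normalized general scores (not data from the experiment).
general_scores = {
    "general-portal.example": 0.9,
    "finance-specialist.example": 0.2,
    "finance-minor.example": 0.5,
}

# Hypothetical expert factors: > 1 raises a source, < 1 lowers it.
expert_factors = {
    "general-portal.example": 0.5,
    "finance-specialist.example": 4.0,
    "finance-minor.example": 0.5,
}

# The adjusted scores play the role of the source ranking database.
source_db = adjust_importance(general_scores, expert_factors)
ranking = sorted(source_db, key=source_db.get, reverse=True)
```

In this toy setting the specialist finance site moves ahead of the
general portal after adjustment, which is the effect the adjustment is
intended to have.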
4.2 Data Collection and Preprocessing
The process of data collection and preprocessing
obtains the necessary information for our authority
model, which includes link structure, source
information and related information.
The process of obtaining the link structure includes
web page crawling, hyperlink extraction, hyperlink
filtering and link relationship establishment. The
web pages used in our experiments are crawled from
Sina Finance (http://finance.sina.com.cn/), which
contains thousands of domestic and international
financial news articles. These pages form the original
set for our experiment. After the pages are
downloaded, their contents are analyzed and the
hyperlinks in them are extracted. To limit the web
pages to the finance domain and to study the
relationships among financial pages, a filtering step
follows hyperlink extraction: it restricts the
hyperlinks to Sina Finance and removes advertisement
and navigation hyperlinks. In this way we make sure
that all the remaining web pages are about finance.
Using the hyperlink lists of the original set, the
pages targeted by the out-links are added. The
in-links that point to the original set are also taken
into consideration. These in-links are extracted from
Yahoo! Site Explorer
(http://siteexplorer.search.yahoo.com/), and during
the extraction the number of in-links for each web
page is limited to 50. For the newly added in-link
and out-link pages, the link relationships among them
are also established, so that the link structure is
complete for all the web pages.
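The extraction, filtering and link establishment steps can be sketched
as follows. This is only an illustration under stated assumptions: the
path fragments used to recognize advertisement and navigation
hyperlinks are hypothetical, the in-link lists are assumed to have been
retrieved beforehand and passed in as plain lists, and the host filter
is applied to out-links only. The names LinkExtractor, keep_link and
build_link_structure are ours and do not refer to an existing tool.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

ALLOWED_HOST = "finance.sina.com.cn"  # restrict out-links to Sina Finance
MAX_IN_LINKS = 50                     # in-link limit per page, as in the text


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def keep_link(url, blocked_fragments=("/ad/", "/nav/")):
    """Keep links on ALLOWED_HOST whose path contains no blocked fragment.

    The blocked fragments are hypothetical stand-ins for the rules that
    remove advertisement and navigation hyperlinks.
    """
    parsed = urlparse(url)
    if parsed.hostname != ALLOWED_HOST:
        return False
    return not any(frag in parsed.path for frag in blocked_fragments)


def build_link_structure(pages, in_links):
    """Return a mapping from each page URL to the set of pages it links to.

    pages:    dict mapping page URL -> HTML text (the original set)
    in_links: dict mapping page URL -> list of URLs that point to it
    """
    out_edges = defaultdict(set)
    for url, html in pages.items():
        for target in extract_links(url, html):
            if keep_link(target) and target != url:
                out_edges[url].add(target)
    for target, sources in in_links.items():
        for source in sources[:MAX_IN_LINKS]:  # cap in-links per page
            out_edges[source].add(target)
    return out_edges
```

The result is a single directed graph in which the original pages, the
added out-link pages and the added in-link pages all appear as nodes.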
In addition, the source information of each web
page is extracted from the page itself, and the
importance scores are obtained from Alexa and then
normalized. The source importance is then adjusted
to incorporate the domain knowledge. Related
hyperlinks in the web pages are also collected, and
the corresponding relationships are
established. The process of data collection and
A DOMAIN-RELATED AUTHORITY MODEL FOR WEB PAGES BASED ON SOURCE AND RELATED
INFORMATION