Google search engines through their APIs. The
results are extracted in JSON format and are
analyzed in order to extract the returned URLs.
Subsequently, each result is evaluated against PA,
DA, mR, sR (simple Ranking) and the Visibility
Score – VS (combined ranking), and the top λ results
are retained. Next, webpage analysis is performed;
the body of the content, along with the anchor text
and metadata, is extracted, and stemming and
cleaning are performed, during which regular-expression
filtering is applied and stop words are removed.
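A minimal sketch of this cleaning step is given
below, in Python, assuming NLTK for stemming and
stop-word removal and BeautifulSoup for extraction;
the actual libraries and regular expressions used by
LDArank are not specified, so both are illustrative
assumptions.

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_page(html):
    # Extract the body content, anchor text and metadata of a retained page.
    soup = BeautifulSoup(html, "html.parser")
    anchors = " ".join(a.get_text() for a in soup.find_all("a"))
    metas = " ".join(m.get("content", "") for m in soup.find_all("meta"))
    text = " ".join((soup.get_text(" "), anchors, metas))
    # Regular-expression cleaning: keep alphabetic tokens only
    # (the pattern is an illustrative assumption).
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words and stem what remains.
    return [stemmer.stem(t) for t in tokens if t not in STOP]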
Semantic analysis via LDA is performed on the
text of the retained results for a given query, in order
to recognize the dominant words that compose the
dominant content for this query. The output of the
semantic analysis is a list that contains the most
probable words for the given queries. Based on the
most dominant words of this list, a set of new queries
is designed. All possible combinations of words are
formed, but only the l most powerful combinations
are retained; this is based on the observation that a
user query typically contains a finite number of
words. The similarity of the new queries to the
original query is calculated by means of the
Normalized Google Distance – NGD (Cilibrasi and
Vitanyi, 2007), and the top K queries are selected.
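A compact sketch of this step follows, using gensim
for the LDA pass. The paper does not define how the
"most powerful" combinations are scored, nor the
source of the frequency counts f(x) behind NGD, so
the summed-probability ranking and the page_hits()
helper below are illustrative assumptions.

from itertools import combinations
from math import log
from gensim import corpora, models

def dominant_words(token_lists, num_topics, top_n, prob_threshold):
    # token_lists: cleaned token lists of the top-lambda retained pages;
    # num_topics maps to tau, top_n to alpha, prob_threshold to xi.
    dictionary = corpora.Dictionary(token_lists)
    corpus = [dictionary.doc2bow(doc) for doc in token_lists]
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    words = {}
    for topic in range(num_topics):
        for word, prob in lda.show_topic(topic, topn=top_n):
            if prob >= prob_threshold:
                words[word] = max(prob, words.get(word, 0.0))
    return words

def candidate_queries(words, max_itemset, l):
    # Form all word combinations up to the maximum itemset size, then
    # keep the l combinations with the highest summed probability
    # (an assumed proxy for "most powerful").
    cands = []
    for size in range(1, max_itemset + 1):
        for combo in combinations(words, size):
            cands.append((sum(words[w] for w in combo), " ".join(combo)))
    return [q for _, q in sorted(cands, reverse=True)[:l]]

def page_hits(query):
    # Hypothetical wrapper around a search engine hit-count API.
    raise NotImplementedError

def ngd(x, y, n_pages):
    # Normalized Google Distance (Cilibrasi and Vitanyi, 2007),
    # assuming nonzero hit counts:
    #   NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y))
    #               / (log N - min(log f(x), log f(y)))
    fx, fy, fxy = page_hits(x), page_hits(y), page_hits(x + " " + y)
    return ((max(log(fx), log(fy)) - log(fxy)) /
            (log(n_pages) - min(log(fx), log(fy))))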
The above process (of creating queries and
identifying dominant words) is repeated until the list
of words from the current round of semantic analysis
contains at least β% of words in common with the
previous round.
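The convergence test thus reduces to a simple
overlap check between successive word lists; a
minimal sketch follows, where measuring the overlap
relative to the previous round's list is our assumption.

def converged(current_words, previous_words, beta):
    # Stop iterating once at least beta% of the previous round's
    # dominant words reappear in the current round.
    if not previous_words:
        return False
    common = set(current_words) & set(previous_words)
    return 100.0 * len(common) / len(previous_words) >= beta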
All the mechanism parameters are defined
through a configuration file, which is parsed during
the initiation phase of the mechanism. Table 1
depicts the configuration parameters:
Table 1: LDArank configuration parameters.
Input Parameters
‐ User query (q1, q2, …, qn)
‐ Search engine results threshold (λ)
‐ LDArank topics of analysis threshold (τ)
‐ LDArank beta parameter (β)
‐ LDArank number of iterations (M)
‐ LDArank number of top words/topic (α)
‐ LDArank probability threshold (ξ)
‐ NGD threshold (K)
‐ Maximum words itemset (l)
‐ Convergence limit (cl)
‐ Performance limit (pl)
‐ Type of SE employed (Google, Bing, Yahoo!, all)
‐ SEOmoz metric (mR, external mR, PA, DA, VS, all)
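The paper does not specify the format of the
configuration file; a hypothetical key-value layout,
parsed with Python's standard configparser during
the initiation phase, could look as follows (all file,
section and parameter names as well as values are
illustrative):

import configparser

# Hypothetical ldarank.ini:
# [ldarank]
# queries = software engineering, agile practices
# lambda = 10
# tau = 5
# beta = 80
# alpha = 10
# xi = 0.01

cfg = configparser.ConfigParser()
cfg.read("ldarank.ini")
params = cfg["ldarank"]
queries = [q.strip() for q in params["queries"].split(",")]
lam = params.getint("lambda")   # search engine results threshold (lambda)
tau = params.getint("tau")      # LDA topics of analysis threshold (tau)
beta = params.getfloat("beta")  # convergence percentage (beta)
alpha = params.getint("alpha")  # top words per topic (alpha)
xi = params.getfloat("xi")      # probability threshold (xi)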
4 EXPERIMENTS AND RESULTS
In order to provide evidence on the applicability of
our model, we discuss an indicative test case. Let’s
assume that a web content provider would like to set
up a website on Software Engineering practices. In
order to increase website visibility, and given the
preceding analysis on the importance of website
content in SE ranking, he/she would like to identify
the dominant keywords that he/she should use, in
order to achieve his/her goal. To this end, LDArank
is employed. The following analysis provides a set
of experiments and conclusions; nevertheless, one
may perform an even wider range of experiments by
tuning any of the LDArank mechanism parameters.
4.1 Experiment Setup
Various alternatives have been explored in order to
illustrate LDArank's versatility and ease of use (some
are omitted due to space limitations). The aims of the
experiments were to: a) identify whether the size of
the resulting word cloud is related to the SE ranking
of webpages, b) identify whether the type of words
residing in a webpage (generic or more specialized)
affects SE ranking and, c) evaluate the convergence
capabilities of all the metrics considered.
To this end, two sets of terms are considered for
the analysis: a) a set comprising 15 more generic
terms on Software Engineering and b) a set
comprising 40 terms, more focused on software
engineering processes.
Parameters M, K, l, cl, and pl were kept constant
in the performed LDArank experiments. The
experiments varied the values of α, λ, ξ and τ, and
were evaluated against the core SE metrics
identified: sR, mR, PA, DA, VS, mR with merged
engines (mRm), PA with merged engines (PAm),
and DA with merged engines (DAm).
4.2 Results
Experiments run on the first set of terms (generic)
resulted in three groups, according to the size of the
word cloud generated, with respect to the values of
the number of topics, the number of top words and
the probability threshold. These groups are: a) Group
A – a small-scale group, b) Group B – a medium-scale
group and, c) Group C – a large-scale group.
Group A produced a total of 44 words, group B
554 words, and group C 921 words. Comparing the
top words of group A against the top words of the
other two groups (Figure 3), it can be argued that the
groups are well separated.
Group C produced more content than group B,
which in turn produced more content than group A.