
tion on probabilistic models of language: for instance, methods like probabilistic Latent Semantic Indexing (pLSI) (Hofmann, 1999) and, later, Latent Dirichlet Allocation (LDA) (Blei et al., 2003) are well-known probabilistic methods for the automatic categorization of documents.
As mentioned above, existing Web search engines primarily solve syntactic queries, and as a side effect the majority of search results do not completely satisfy user intentions, especially when the queries are informational. This work shows how a classic search engine can improve its search results, bringing them closer to the user's intentions, by means of a tool for the automatic creation and manipulation of ontologies based on an extension of the LDA topic model cited above. To this end, a new search engine, iSoS, based on the existing open source software Lucene, was developed. More details are given in the next sections, together with experiments aimed at comparing the behaviour of iSoS with that of a classic engine. The comparison method relies on an innovative procedure based on human judgment, which we discuss at length below and which represents the real core of this paper.
2 iSoS: A TRADITIONAL WEB SEARCH ENGINE WITH ADVANCED FUNCTIONALITIES
As discussed above, iSoS is a web search engine with advanced functionalities. It is a web-based server-side application, entirely written in Java and Java Server Pages, which embeds a customized version of the open source Apache Lucene API (http://lucene.apache.org/) for its indexing and searching functionalities. In the next sections we present the main properties and functionalities of the iSoS framework. Some use cases will be shown, including how to build a new index, include one or more ontologies, perform a query, and build a new ontology.
2.1 Web Crawling
Each web search engine works by storing information about web pages retrieved by a Web crawler, a program which essentially follows every link it finds while browsing the Web. Due to hardware limitations, our application does not implement its own crawling system; instead, a smaller environment is created in order to evaluate performance: the crawling stage is performed by submitting a specific query to the well-known web search engine Google (www.google.com) and extracting the URLs from the retrieved results. The application then downloads the corresponding web pages, which are collected in specific folders and indexed. The GUI allows users to choose the query and the number of pages they want to index.
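The sketch below illustrates this pseudo-crawling stage using the jsoup HTML library; the class and parameter names are our own illustrative choices, not those of iSoS, and scraping a live Google results page is fragile in practice (its markup changes over time and automated access is restricted by its terms of service).

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Pseudo-crawling: fetch a results page for a query, collect the
// absolute URLs it links to, and save each linked page to a folder
// so that it can be indexed later.
public class PseudoCrawler {

    public static void crawl(String resultsPageUrl, Path outDir, int maxPages)
            throws IOException {
        Files.createDirectories(outDir);
        Document results = Jsoup.connect(resultsPageUrl)
                                .userAgent("Mozilla/5.0").get();
        int saved = 0;
        for (Element link : results.select("a[href]")) {
            if (saved >= maxPages) break;
            String url = link.attr("abs:href"); // resolve relative links
            if (!url.startsWith("http")) continue;
            try {
                String html = Jsoup.connect(url)
                                   .userAgent("Mozilla/5.0").get().outerHtml();
                Files.write(outDir.resolve("page" + saved + ".html"),
                            html.getBytes(StandardCharsets.UTF_8));
                saved++;
            } catch (IOException skip) {
                // unreachable or non-HTML pages are simply skipped
            }
        }
    }
}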
2.2 Indexing
The main aim of the indexing stage is to store statistics about terms in order to make searching them more efficient. A preliminary document analysis is needed in order to recognize tags, metadata, and informative content: this step is often referred to as parsing. A standard Lucene index is made of a sequence of documents, where each document, identified by an integer number, is a sequence of fields, with every field containing index terms; such an index belongs to the inverted index family because it can list, for each term, the documents that contain it (a minimal sketch of this structure is given below). Correct parsing helps to build well-categorized field sets, improving subsequent searching and scoring.
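As an illustration, the following toy in-memory inverted index (a deliberate simplification of Lucene's actual on-disk structures) maps each term to the list of document identifiers containing it.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index: each term is mapped to the list of ids of the
// documents that contain it (a simplified posting list).
public class TinyInvertedIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            List<Integer> list =
                postings.computeIfAbsent(term, t -> new ArrayList<>());
            // avoid duplicate entries when a term repeats within the same
            // document (documents are added in increasing id order)
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    // Returns the ids of all documents containing the given term.
    public List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(),
                                     Collections.emptyList());
    }
}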
There are different approaches to web page indexing. For example, some engines do not index whole words but only their stems. The stemming process reduces inflected words to their root form and is a very common element in query systems such as Web search engines, since words with the same root are supposed to carry similar informative content. In order to avoid indexing common words such as prepositions, conjunctions, and articles, which do not carry any additional information, stopword filtering can also be adopted. Since indexing stemmed words and filtering stopwords often result in a loss of search precision, although they can help reduce the index size, they are not adopted by major search engines (such as Google). For this application, we developed a custom Lucene analyzer which indexes both words and their stems without stopword filtering; it is then possible to include in the searching process ontologies made of stemmed words, and thus optimize ontology-based search without penalizing the precision of the original query.
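A minimal sketch of this dual indexing scheme is given below, assuming a recent Lucene release (the analyzer API has changed across versions) and illustrative field names; it is not the actual iSoS implementation. One field keeps the original words, a parallel field keeps their Porter stems, and no stopword filter appears in either analysis chain.

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Analyzer that lowercases and Porter-stems every token; note that
// no stopword filter appears anywhere in the chain.
class StemmingAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream sink = new PorterStemFilter(new LowerCaseFilter(source));
        return new TokenStreamComponents(source, sink);
    }
}

public class DualFieldAnalyzer {
    // "content" keeps the original words (empty stopword set), while the
    // parallel field "content_stem" holds the stemmed variants of the
    // same text; both field names are illustrative.
    public static Analyzer build() {
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("content_stem", new StemmingAnalyzer());
        return new PerFieldAnalyzerWrapper(
                new StandardAnalyzer(CharArraySet.EMPTY_SET), perField);
    }
}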
2.3 Searching and Scoring
The heart of a search engine lies in its ability to rank-order the documents matching a query. This can be done through specific score computations and ranking policies.
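As a toy illustration of one such score computation, the following sketch scores documents by cosine similarity between plain term-frequency vectors, a simple instance of the vector space model discussed next; it is not the ranking formula actually used by Lucene or iSoS.

import java.util.HashMap;
import java.util.Map;

// Toy vector space model scoring: documents and queries become
// term-frequency vectors, and the score is their cosine similarity.
public class CosineScorer {

    static Map<String, Integer> vectorize(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0)
                ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}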
Several information retrieval (IR) operations (including scoring documents against a query, document classification, and clustering) often rely on the vector space model, where documents are represented as vectors in a common vector space (Christopher D. Man-