matching and semantic matching categories, since these techniques are decentralized
and have proven to be effective in other search domains. Within these categories the
approaches can be differentiated by the degree of user involvement required. Although
several standards have been proposed to add semantic annotations to web services [3],
all of these methods require both choosing an ontology and adding semantic annotations
to the services. On the other hand, approaches based on information retrieval methods
use the information already available in the service description files. We believe these
latter approaches are likely to be more successful because currently there is little incentive for developers to manually annotate web services.
Several researchers have implemented service discovery tools that rely solely on
the information contained in the web service description files. It is difficult to measure
the relative performance of these implementations against each other, and the results
found in the literature vary. In [3] the authors compared several approaches to web service discovery, including information retrieval based methods and semantics based methods, and found that the semantics based approach using Latent Semantic Analysis outperformed the information retrieval approach. However, in [1] the authors showed that a system based on Latent Semantic Analysis improved recall but reduced precision compared with a system based on the Vector Space Model.
In our own experiments we implemented a web service search engine based on the
Vector Space Model [4] and a semantics based engine using Explicit Semantic Analysis.
In the classic Vector Space Model approach, text documents (in this case, WSDL files)
are represented as vectors in a term space. If a term occurs in a document then its
value in that dimension of the vector is non-zero. The value is typically derived from
the frequency of the term in the document and the commonness of the term in the
whole document collection. One of the most common weighting formulas is Term Frequency-Inverse Document Frequency (TFIDF), of which several variations exist; our implementation uses a variant with sublinear term-frequency scaling.
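The following minimal sketch in Python illustrates one way to compute such sublinear TFIDF weights for a collection of tokenized documents; it is illustrative only, and the exact variant used in our implementation may differ in its normalization details.

```python
import math
from collections import Counter

def sublinear_tfidf_vectors(docs):
    """Illustrative sublinear TFIDF: weight = (1 + log tf) * log(N / df).

    `docs` is a list of tokenized documents (lists of terms); returns one
    sparse vector (a dict of term -> weight) per document.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {}
        for term, freq in tf.items():
            # Sublinear scaling dampens the influence of high raw counts.
            vec[term] = (1.0 + math.log(freq)) * math.log(n_docs / df[term])
        vectors.append(vec)
    return vectors
```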
The documents relevant to a query are found by representing the query as a vector in the same term space and then calculating the similarity between the query vector and the vector representations of the documents in the collection. In this implementation we used cosine similarity, which compares the angle between the vectors.
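Ranking then reduces to computing the cosine between the query vector and each document vector, as in the sketch below; the vectors are represented as sparse term-to-weight dictionaries, and rank_documents is a hypothetical helper named here for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vectors):
    """Return (index, score) pairs sorted by decreasing similarity to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```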
Explicit Semantic Analysis is a method introduced in [5–7] in which text in nat-
ural language form is represented in a concept space derived from articles found in
Wikipedia. In this method each article in Wikipedia is treated as a concept in a general-
purpose ontology and the text in the articles is used to determine the degree of related-
ness between the concept and a text snippet. An inverted index is used to build a vector
for each individual term. The inverted index records, for each term, the Wikipedia articles that contain it and the term's weight in each of those articles (typically calculated using some variant of TFIDF). The resulting vector for a text snippet is the centroid of the vectors derived for its terms. The similarity between two text snippets can then be calculated based on these "concept vectors" using standard vector-based similarity
metrics such as cosine similarity. The method is said to be explicit because the concepts
are explicitly defined in Wikipedia in the form of articles.
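The sketch below illustrates this construction, assuming a precomputed inverted index that maps each term to its concept vector (a dictionary from Wikipedia article identifiers to TFIDF weights); the function name and index layout are assumptions made for illustration.

```python
def esa_concept_vector(snippet_terms, inverted_index):
    """Map a tokenized text snippet to its concept vector.

    `inverted_index` is assumed to map each term to a dict of
    {article_id: tfidf_weight}; the snippet's vector is the centroid
    of the concept vectors of its terms.
    """
    centroid = {}
    n_terms = 0
    for term in snippet_terms:
        term_vector = inverted_index.get(term)
        if term_vector is None:
            continue  # term does not occur in any Wikipedia article
        n_terms += 1
        for article_id, weight in term_vector.items():
            centroid[article_id] = centroid.get(article_id, 0.0) + weight
    if n_terms:
        centroid = {a: w / n_terms for a, w in centroid.items()}
    return centroid
```

Two snippets can then be compared by applying the same cosine similarity as above to their concept vectors rather than their term vectors.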
Compared to the Vector Space Model, Explicit Semantic Analysis produces vectors