purpose of locating appropriate component gadgets
to use in a mashup, far more precision is required.
By indexing all available metadata in a semantic
format, it will be possible to locate suitable
candidates using very specific criteria.
6.2 Indexing Gadget Metadata
The first step in providing search results must be to
identify and index the data which will provide such
results. Google provides a directory of gadgets
available for public syndication purposes. For each
entry in this directory, the primary item of interest
is the web address of the gadget’s specification
file, which is where the gadget’s full metadata and
details can be found.
Google provides access to its gadget directory in
two formats: the standard graphical interface for
end users, and an RSS feed more suitable for
programmatic access. The RSS feed is the one used
for retrieving the list of gadget specification file
URLs. Since both this RSS feed and the listed
gadget specification files meet specific XML
schema requirements, Java’s JAXB (Java
Architecture for XML Binding) was used to access
the target data elements within the files.
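As a rough illustration of this extraction step, the following sketch pulls candidate specification file URLs out of an RSS document. It is not the binding code used in this work: it substitutes the JDK's built-in DOM parser for JAXB (which is no longer bundled with the JDK since Java 11), and it assumes that each feed item carries the specification URL in its link element. The class name FeedLinks and the method specUrls are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class FeedLinks {
    /**
     * Returns the text of the first <link> inside every <item> of an RSS document.
     * Assumption: the directory feed stores the spec file URL in that element.
     */
    static List<String> specUrls(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The feed comes from the network, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));

            List<String> urls = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList links = item.getElementsByTagName("link");
                if (links.getLength() > 0) {
                    urls.add(links.item(0).getTextContent().trim());
                }
            }
            return urls;
        } catch (Exception e) {
            throw new IllegalArgumentException("feed could not be parsed", e);
        }
    }
}
```

Items without a link element are skipped rather than reported, which matches the later policy of discarding entries that cannot be processed.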
As threads run to retrieve these lists of gadget
specification file URLs, 100 results at a time, other
threads run to request those XML files from the
various hosts across the internet where their authors
uploaded them. Using the official XSD (XML
Schema Definition) provided as the standard by
Google, JAXB is used to validate those specification
files. Some of the files are now missing from their
web host, contain syntax errors, or for various other
reasons do not meet the Google Gadget standard;
those gadgets are discarded before proceeding
further.
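The validate-and-discard step can be sketched as follows, under several stated assumptions. The sketch uses the JDK's javax.xml.validation API rather than JAXB, it takes already-fetched file contents as strings instead of performing network requests, and TOY_XSD is a deliberately tiny stand-in schema, not the official Google schema. The names SpecFilter, toySchema, isValid and validUrls are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

class SpecFilter {
    // Stand-in schema (an assumption): Module > ModulePrefs with a required title.
    static final String TOY_XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"Module\"><xs:complexType><xs:sequence>"
        + "<xs:element name=\"ModulePrefs\"><xs:complexType>"
        + "<xs:attribute name=\"title\" type=\"xs:string\" use=\"required\"/>"
        + "<xs:attribute name=\"author\" type=\"xs:string\"/>"
        + "</xs:complexType></xs:element>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    static Schema toySchema() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(TOY_XSD)));
        } catch (Exception e) {
            throw new IllegalStateException("toy schema failed to compile", e);
        }
    }

    static boolean isValid(Schema schema, String specXml) {
        try {
            // Validator instances are not thread-safe, so each call creates its own.
            schema.newValidator().validate(new StreamSource(new StringReader(specXml)));
            return true;
        } catch (Exception e) {
            // Malformed XML and schema violations both end up here.
            return false;
        }
    }

    /** Validates all specs on a thread pool and returns the URLs that pass, in input order. */
    static List<String> validUrls(Map<String, String> specsByUrl, int threads) {
        final Schema schema = toySchema();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<Boolean>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : specsByUrl.entrySet()) {
                Callable<Boolean> task = () -> isValid(schema, entry.getValue());
                pending.put(entry.getKey(), pool.submit(task));
            }
            List<String> kept = new ArrayList<>();
            for (Map.Entry<String, Future<Boolean>> entry : pending.entrySet()) {
                if (entry.getValue().get()) {
                    kept.add(entry.getKey());
                }
            }
            return kept;
        } catch (Exception e) {
            throw new IllegalStateException("validation task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A single compiled Schema object is shared across threads, while validators are created per file; this keeps the per-file cost low without sharing mutable state between worker threads.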
Those gadgets meeting the standard are then
processed. The Jena library is used to store all
gadgets’ metadata in a semantic format, with TDB
providing the high-speed storage backend. For each
gadget, each item of metadata is treated as a 3-tuple
(triple), with the XML specification file’s URL as
the subject, the name of the metadata item as the
predicate, and the gadget author’s provided data as
the object.
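The triple layout described above can be illustrated with a small sketch in plain Java. It does not use the Jena or TDB APIs; it only shows how one specification file might be flattened into (subject, predicate, object) statements. It assumes, for illustration, that the metadata of interest is the set of attributes on the ModulePrefs element of the specification file. GadgetTriples, Triple and fromSpec are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class GadgetTriples {
    /** One metadata statement: (spec file URL, metadata name, author-supplied value). */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return "(" + subject + ", " + predicate + ", " + object + ")";
        }
    }

    /** Emits one triple per ModulePrefs attribute; attribute order is not guaranteed. */
    static List<Triple> fromSpec(String specUrl, String specXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(specXml)))
                    .getDocumentElement();

            List<Triple> triples = new ArrayList<>();
            Node prefs = root.getElementsByTagName("ModulePrefs").item(0);
            if (prefs == null) {
                return triples; // no metadata element, so nothing to record
            }
            NamedNodeMap attrs = prefs.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                triples.add(new Triple(specUrl, attr.getNodeName(), attr.getNodeValue()));
            }
            return triples;
        } catch (Exception e) {
            throw new IllegalArgumentException("spec could not be parsed", e);
        }
    }
}
```

For a specification file at http://example.org/clock.xml with a title attribute of "Clock", this yields a statement whose subject is that URL, whose predicate is title, and whose object is Clock.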
Once all gadgets have been processed, this
database of metadata for publicly listed gadgets is
exported into RDF format. While not strictly
necessary for the indexing phase, this leaves the data
available for other uses in the future.
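The text does not state which RDF serialization is used for the export. Purely as an illustration, the sketch below writes one statement in the line-based N-Triples syntax. The namespace in NS is a hypothetical placeholder for whatever vocabulary the predicates belong to, and the subject URL is assumed to be a well-formed IRI already. NTriplesExport, escape and line are hypothetical names.

```java
class NTriplesExport {
    // Hypothetical namespace used to turn a metadata name into a predicate IRI.
    static final String NS = "http://example.org/gadget#";

    /** Escapes the characters that N-Triples requires to be escaped in a literal. */
    static String escape(String value) {
        return value.replace("\\", "\\\\")
                    .replace("\"", "\\\"")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r");
    }

    /** Formats one statement as a single N-Triples line with a plain string literal. */
    static String line(String subjectUrl, String name, String value) {
        return "<" + subjectUrl + "> <" + NS + name + "> \"" + escape(value) + "\" .";
    }
}
```

The backslash is escaped first so that the backslashes introduced by the later replacements are not escaped a second time.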
With this summary of all gadget metadata to
work from, an index suitable for search engine use is
built next. Lucene is a powerful search engine
library, and SIREn provides a semantic extension
which preserves the semantic nature of the data
rather than requiring a specific, structured set of
fields. Lucene organizes data in a different manner
than a triple-based system such as that provided by
Jena; any item which the user wishes to retrieve as a
result must be a “document”. Thus the data is
indexed by considering each gadget XML
specification file as a document, with all of the
associated metadata statements being filed as
contents of that document.
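The document-per-gadget layout can be pictured with a toy inverted index in plain Java. This is not the Lucene or SIREn API, only a sketch of the idea that every metadata statement is filed under its gadget's specification URL, and that a query on a metadata name and a word returns those URLs. GadgetIndex, add and search are hypothetical names, and the tokenization is deliberately naive.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class GadgetIndex {
    // Key "predicate:token" -> URLs of the gadget documents that contain it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Files one metadata statement under the document identified by specUrl. */
    void add(String specUrl, String predicate, String value) {
        // Naive tokenizer: lower-case, split on anything that is not an ASCII word character.
        for (String token : value.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(predicate + ":" + token, k -> new TreeSet<>())
                    .add(specUrl);
        }
    }

    /** Returns the URLs of gadgets whose given metadata item contains the word. */
    Set<String> search(String predicate, String word) {
        Set<String> hits = postings.get(predicate + ":" + word.toLowerCase(Locale.ROOT));
        return hits == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(hits);
    }
}
```

Because the metadata name is part of each key, a word appearing in one gadget's title and another gadget's author field produces different hits for the two queries, which is the kind of precision the semantic index is meant to preserve.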
At this point, with all of the metadata having
been indexed by Lucene through the semantic filter
of SIREn, the search pre-processing is complete. It
must be noted that this crawling and indexing
process is distinct from the gadget and associated
servlet that the user accesses to make use of this
data. Additionally, this entire process must be re-run
periodically to pick up new entries in the
public gadget directory.
7 CONCLUSIONS
The creation of composite web widgets – mashups –
using other web widgets and components shows
great promise. Though most web widgets have not
been developed with the intent of re-use as a
component, there are an enormous number of web
widgets available which serve a multitude of
purposes. Component use of web widgets can result
in highly useful end products while taking advantage
of the benefits of a modular design.
The prerequisite tools developed as part of this
paper provide a crucial stepping stone towards
further work in this area.
ACKNOWLEDGEMENTS
The first author would like to thank NSERC for
funding this project and for funding Mr. Adam
McLellan to work on it under the first
author’s supervision.
REFERENCES
Hoyer, V., Stanoevska-Slabeva, K., Janner, T., and
Schroth, C. 2008. Enterprise Mashups: Design
Principles towards the Long Tail of User Needs. In