Profiles are used by the system to determine users' interests. A profile is formed by one or more queries, and each query is associated with exactly one kind of information (news, currency conversion or weather forecasting).
In the case of news, the user has to specify the keywords the system will use to find relevant news. This strategy works like Boolean queries in traditional Web search engines: it is possible to use operators such as AND (when two or more words are placed together in the same query) and NOT (by adding the sign "–" before a word).
The OR operator can be obtained by creating separate queries. The AND operator makes the system find texts where all the words are present, and the NOT operator restricts the results to texts where the associated word is not present.
It is also possible to use expressions or compound words; in this case, the expression must be placed between quotes (for example, "information filtering"). Another possibility is to use word stems to find word variations (for example, "work*" represents "work", "working", "worker", etc.).
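A minimal sketch of how such keyword matching could be implemented is shown below; the function names and the exact matching rules are illustrative assumptions and do not correspond to the actual FastNews code.

    import re

    def matches(query_terms, text):
        """Return True if a text satisfies one query.

        query_terms is a list of strings from a single query:
          - plain words are combined with AND,
          - a leading "-" marks a NOT term,
          - quoted strings are treated as exact expressions,
          - a trailing "*" asks for word variations (stem search).
        """
        text_lower = text.lower()
        for term in query_terms:
            negated = term.startswith("-")
            if negated:
                term = term[1:]
            term = term.strip('"').lower()
            if term.endswith("*"):
                # stem search: "work*" matches work, working, worker, ...
                pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
            else:
                # exact word or quoted expression
                pattern = r"\b" + re.escape(term) + r"\b"
            found = re.search(pattern, text_lower) is not None
            if found == negated:   # every plain term must appear,
                return False       # every "-" term must be absent
        return True

    # The OR operator corresponds to defining several queries and
    # accepting a text that matches at least one of them:
    def matches_any(queries, text):
        return any(matches(q, text) for q in queries)

    # Example: matches(['"information filtering"', "-sports", "work*"], text)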
For each query (news, currency conversion or weather forecasting), the user has to mark the information sources over which the query should be performed. Besides that, the user can state the days of the week when the query will run, as well as the initial and final times. This allows the user to receive information only on the specified days and time periods, which is especially useful for currency conversion (for example, to receive the closing value of the day or to avoid receiving information on holidays or weekends). Time restrictions are also useful for receiving information only once a day (for example, the weather forecast).
Queries run while the user marks them as active, but it is possible to temporarily set a query as inactive or to change its specification.
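One possible way to represent such a query, with its schedule and active flag, is sketched below; the field names are assumptions made for illustration, not the system's internal model.

    from dataclasses import dataclass
    from datetime import datetime, time

    @dataclass
    class Query:
        """Illustrative record for one profile query."""
        kind: str              # "news", "currency conversion" or "weather forecasting"
        keywords: list[str]    # Boolean terms (used only for news queries)
        sources: list[str]     # predefined sources the query runs over
        week_days: set[int]    # 0 = Monday ... 6 = Sunday
        start: time            # beginning of the delivery window
        end: time              # end of the delivery window
        active: bool = True    # queries only run while marked as active

    def should_run(query: Query, now: datetime) -> bool:
        """Decide whether an active query is due at the moment 'now'."""
        return (query.active
                and now.weekday() in query.week_days
                and query.start <= now.time() <= query.end)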
The information resulting from the selective search (or filtering) is sent to the account registered by the user (an e-mail address or a cellular phone number). Information is sent to cellular phones using the Short Message Service (SMS). Accounts must be provided by the user in his/her profile, and the user may register different accounts for receiving different kinds of information.
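A simple sketch of how a filtered item might be routed to a registered account is shown below; the send helpers are hypothetical placeholders, not real FastNews interfaces.

    def send_email(address: str, message: str) -> None:
        print(f"[e-mail to {address}] {message}")   # placeholder gateway

    def send_sms(number: str, message: str) -> None:
        print(f"[SMS to {number}] {message}")       # placeholder gateway

    def deliver(message: str, account: str) -> None:
        """Route one filtered item to a registered account: e-mail
        addresses contain '@', anything else is treated as a phone
        number and reached through SMS."""
        if "@" in account:
            send_email(account, message)
        else:
            send_sms(account, message)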
The information sources are predefined. Currently, there are about 15 different sources, each one dealing with a specific kind of news (for example, general news about the world and Brazil, sports, economy, computers, etc.).
To extract information from the sources, the system was developed with special templates, one for each source. These templates are designed to "understand" the structure and format of each Web source: a template identifies the region where the desired information appears (Information Filtering) and extracts exactly this information and nothing more (Information Extraction).
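The sketch below illustrates the idea of a source-specific template; the HTML layout, field names and regular expression are hypothetical, since the real templates depend on each source's actual structure.

    import re

    # Illustrative, hand-written template for one hypothetical source.
    # It locates the page region where headlines appear and extracts
    # exactly the desired fields from it.
    SPORTS_SOURCE_TEMPLATE = re.compile(
        r'<div class="headline">\s*'      # region where headlines appear
        r'<a href="(?P<url>[^"]+)">'      # link to the full article
        r'(?P<title>[^<]+)</a>',          # headline text
        re.IGNORECASE)

    def extract(html: str):
        """Apply one source-specific template to a page of that source."""
        return [m.groupdict() for m in SPORTS_SOURCE_TEMPLATE.finditer(html)]

    # extract(html) returns e.g. [{'url': '...', 'title': '...'}, ...]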
Templates were generated manually by human experts who analyzed the structure and format of each Web source. Each template only extracts information from its specified source; if the source changes its format or structure, the template may no longer work properly.
The system is under test in the Knowledge
Discovery Portal, available at the URL
www.descobertadeconhecimento.com.br.
The history of the information sent to the user is kept, allowing users to review past messages or to visualize the original Web source. Additionally, the user can control the maximum number of messages to receive per day.
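A possible way to enforce such a daily cap is sketched below; the counter layout is an assumption made for illustration.

    from collections import defaultdict
    from datetime import date

    sent_today = defaultdict(int)   # (account, date) -> messages already sent

    def may_send(account: str, daily_limit: int) -> bool:
        """Check the per-day cap before delivering one more message."""
        key = (account, date.today())
        if sent_today[key] >= daily_limit:
            return False
        sent_today[key] += 1
        return True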
4.1 Predefined sources vs. new information sources
Traditionally, clipping or information filtering systems use predefined sources, that is, the search is performed over sources from an existing list. Unfortunately, a predefined list never covers all the desired sources, because some relevant information may only be present elsewhere. The alternative is to find new information sources.
Along these lines, PERKOWITZ ET AL. (1997) present the ShopBot system, which visits Web sites of CD and software vendors in order to extract information about products.
The novelty of this system is that the extraction rules are automatically discovered by the system based on the analysis of preexisting sites. Using supervised learning strategies, the system receives from human experts a list of sites about some subject and identifies patterns common to all of them (for example, positions, keywords, formats and proper names used to refer to a specific piece of information).
Afterwards, the system searches the Web looking for similar sites. When it finds one, it applies these rules to extract information.
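The sketch below illustrates this idea with a very simple form of wrapper induction: given example pages labelled by a human expert, it learns the left and right delimiters common to all examples and then applies them to new, similar pages. It is only an illustration of the general technique, not ShopBot's actual algorithm.

    def induce_wrapper(examples):
        """Given (page, target_value) pairs labelled by a human expert,
        find the longest left and right delimiters common to every
        example; these can then be applied to similar, unseen pages."""
        lefts, rights = [], []
        for page, value in examples:
            pos = page.index(value)
            lefts.append(page[:pos])
            rights.append(page[pos + len(value):])
        # longest common suffix of the left contexts
        left = lefts[0]
        for other in lefts[1:]:
            while not other.endswith(left):
                left = left[1:]
        # longest common prefix of the right contexts
        right = rights[0]
        for other in rights[1:]:
            while not other.startswith(right):
                right = right[:-1]
        return left, right

    def apply_wrapper(page, left, right):
        """Extract the value between the learned delimiters on a new page."""
        start = page.index(left) + len(left)
        end = page.index(right, start)
        return page[start:end]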
At the moment, FastNews analyzes only predefined information sources (Web sites). An additional functionality is being developed to allow users to enter a URL that they want to monitor. Using machine learning techniques and some help from the user (supervised learning), the system will be able to automatically extract information from new sources (Web sites).
The system receives a URL from the user and automatically identifies the structure of the related site. Afterwards, the system presents this structure to the