a strategic initiative of the Polish Ministry of Education and Scientific Research aiming to create a regional ICT infrastructure supporting the storing, processing and sharing of scientific research data and results.
This paper is structured as follows. Section 2 describes the challenges targeted by the project and existing research results that can be used to address them. Section 3 lists the requirements for the PASSIM project. Section 4 proposes solutions for system implementation. Finally, Section 5 summarizes the results.
2 Challenges and Existing Solutions
As mentioned in the introduction, one of the main challenges in PASSIM is the automated acquisition of knowledge from various structured and unstructured sources. Among these sources the Internet will play a major role. Despite the overwhelming amount of irrelevant and low-quality data, it offers several useful kinds of resources, including researchers' homepages and blogs, homepages of research and open source projects, emerging open access journals, university tutorials, software and hardware documentation, conference and workshop information, etc. Finding, evaluating and harvesting such information is a complex task, but it nevertheless has to be taken up in order to provide PASSIM users with a wide range of up-to-date resources regarding science as well as past and ongoing research activities.
Several approaches to harvesting information from the Internet have been proposed in the past. The most popular approach nowadays is the use of search engines. Improvements in search quality have led the vast majority of users to regard the Internet as a good place to find everyday information [6]. Sites like Google.com, Yahoo.com and Ask.com provide tools for ad-hoc queries based on keywords and page rankings. This approach, while very helpful on a day-to-day basis, is not sufficient for searching large amounts of specialized information. General-purpose search engines harvest any type of information regardless of its relevance, which reduces the efficiency and quality of the process. Another, even more important drawback for scientists is that they constitute only a tiny fraction of the population generating web traffic, and truly valuable pages constitute only a fraction of the entire web. Page rankings built by general-purpose solutions, suited to the general public, will not satisfy the quality demands of scientists. One can use Google Scholar, CiteSeer or other sites to obtain more science-oriented search results. Although this may work for scientific papers and some other types of resources, countless potentially valuable resources remain difficult to discover.
Another approach to the problem is web harvesting, based on crawlers that search the Internet for pages related to a predefined subject. This part of information retrieval is done for us if we use search engines. However, if we want to influence the process and impose constraints on document selection or the depth of the search, we have to perform it ourselves. A special case of web harvesting is focused crawling. This method, introduced by Chakrabarti et al. [3], uses labeled examples of relevant documents, which serve as a starting point in the search for new resources, as sketched below.
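To make the idea concrete, the following minimal sketch illustrates one common way a focused crawler can be organized; it is an illustrative assumption of ours, not the algorithm of [3] nor the PASSIM implementation. Pages labeled as relevant seed a topic vocabulary, and the crawl frontier is then expanded best-first, following links only from pages whose lexical overlap with that vocabulary exceeds a threshold.

```python
# Hypothetical best-first focused crawler sketch (not the PASSIM implementation):
# labeled seed pages define a topic vocabulary, and newly discovered links are
# prioritized by the relevance score of the page on which they were found.
import heapq
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page; return its text, or None on failure."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="ignore")
    except Exception:
        return None


def terms(text):
    """Crude tokenization into lower-case word counts."""
    return Counter(re.findall(r"[a-z]{3,}", text.lower()))


def relevance(page_terms, topic_vocab):
    """Fraction of the page's term occurrences that belong to the topic vocabulary."""
    total = sum(page_terms.values()) or 1
    overlap = sum(c for t, c in page_terms.items() if t in topic_vocab)
    return overlap / total


def focused_crawl(seed_urls, max_pages=50, threshold=0.05):
    # Build the topic profile from the labeled seed documents.
    topic = Counter()
    for url in seed_urls:
        html = fetch(url)
        if html:
            topic += terms(html)
    topic_vocab = {t for t, _ in topic.most_common(200)}

    # Best-first search: the frontier is a priority queue ordered by relevance.
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen, relevant = set(seed_urls), []

    while frontier and len(relevant) < max_pages:
        _priority, url = heapq.heappop(frontier)
        html = fetch(url)
        if html is None:
            continue
        score = relevance(terms(html), topic_vocab)
        if score < threshold:
            continue  # off-topic page: do not expand its links
        relevant.append((url, score))
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                heapq.heappush(frontier, (-score, absolute))
    return relevant
```

In a realistic system the naive term-overlap score would typically be replaced by a trained relevance classifier, and politeness constraints (robots.txt, rate limiting) and depth limits would be enforced; these concerns are omitted here for brevity.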
The task of retrieving scientific information from the web has already been approached. In [8] it is proposed to use meta-search enhanced focused crawling, which