scrapers to extract both the full text and the meta
data from the website.
The current prototype consists of two
applications:
A standard Java application fetches the newest
press releases every few minutes by reading
the RSS feeds and web scraping additional
data for any newly retrieved message.
The search engine itself is implemented as a
standard Java web application running on an
Apache Tomcat web server.
All data is currently stored in a MySQL
database, but we will soon migrate to a Lucene
search index to improve retrieval performance.
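The fetch step described above can be sketched as follows. This is a
minimal illustration, not the prototype's actual code: the class name
FeedPoller and the method newLinks are hypothetical, the feed is assumed
to be RSS 2.0 with one link element per item, and the set of known links
is kept in memory rather than in the database.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: remembers which feed items were already seen. */
final class FeedPoller {
    // Links of all items returned by earlier polls (assumption: in memory only).
    private final Set<String> seenLinks = new HashSet<>();

    /** Parses one RSS document and returns the item links not seen before. */
    List<String> newLinks(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
        NodeList items = doc.getElementsByTagName("item");
        List<String> fresh = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList links = item.getElementsByTagName("link");
            if (links.getLength() == 0) {
                continue; // skip items without a link
            }
            String link = links.item(0).getTextContent().trim();
            if (seenLinks.add(link)) { // add() returns false for known links
                fresh.add(link);
            }
        }
        return fresh;
    }
}
```

In a setting like the one described, such a method would be called every
few minutes for each feed, and only the links it returns would be handed
to the scraper for full-text and meta data extraction.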
4 DISCUSSION AND CONCLUSIONS
In this paper we introduced a prototype of a
new press release search engine that offers advanced
search possibilities by taking into account the
specific structure and meta data of press releases.
The findings so far indicate that systematic press
release observation can help expose facts and
developments that are highly useful within
strategic innovation management. Nevertheless, the
examples we provided are still rather simple. To get
more interesting results, a larger and more carefully
selected body of data is needed.
One of our findings is that the publishing activity
among the available press release distribution portals
follows a typical long-tail distribution. One way to
enlarge the search corpus is therefore to integrate as
many data sources as possible, i.e., not only the four
or five biggest portals have to be accounted for, but
also the many small providers.
The extraction and processing of meta data must
be enhanced; for instance, company names like IBM
and IBM Corp. are currently treated as two different
organisations. In future releases we intend to apply
named entity recognition algorithms to reliably
identify and unify company names. Additionally, the
use of a well-maintained toponym reference list
will improve the geolocation functions. The
classification of press releases currently depends on
explicit markup, which is not offered by all
providers. We are therefore evaluating a stochastic
topic detection approach to automatically classify
press releases based on their textual content.
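To make the company name problem concrete, the following sketch maps
simple spelling variants such as IBM and IBM Corp. to one key. It is a
plain suffix-stripping heuristic and not the named entity recognition
approach mentioned above; the class name CompanyNames, the method
normalize, and the short list of legal-form suffixes are assumptions made
for illustration only.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper that derives a comparison key from a company name. */
final class CompanyNames {
    // Illustrative and deliberately incomplete list of legal-form suffixes.
    private static final Set<String> LEGAL_FORMS =
            Set.of("corp", "corporation", "inc", "ltd", "gmbh", "ag", "co");

    /** Returns a key that is identical for simple variants of one name. */
    static String normalize(String name) {
        // Lower-case the name and turn punctuation into single spaces.
        String cleaned = name.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                .trim();
        String[] tokens = cleaned.split(" ");
        int end = tokens.length;
        // Drop trailing legal-form tokens, but always keep the first token.
        while (end > 1 && LEGAL_FORMS.contains(tokens[end - 1])) {
            end--;
        }
        return String.join(" ", Arrays.copyOfRange(tokens, 0, end));
    }
}
```

With this key, CompanyNames.normalize("IBM") and
CompanyNames.normalize("IBM Corp.") both yield "ibm", so the two spellings
could be stored as one organisation. Abbreviations, subsidiaries, and
renamed companies are beyond such a heuristic, which is where entity
recognition would be needed.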
To further improve the search result quality, we
are currently adding a semantics-based query
expansion mechanism. This will increase the recall
of the meta search engine and improve the outcomes
of the trend monitoring and cluster identification
approaches, as both rely heavily on the amount of
relevant data retrieved.
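The general idea of query expansion can be illustrated with a small
sketch. The paper does not describe how the semantic expansion is
realised, so the version below simply assumes a hand-made table of
related terms; the class name QueryExpander and the table contents are
hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical expander that adds related terms to each query term. */
final class QueryExpander {
    // Assumed lookup table: query term -> related terms to search as well.
    private final Map<String, List<String>> relatedTerms;

    QueryExpander(Map<String, List<String>> relatedTerms) {
        this.relatedTerms = relatedTerms;
    }

    /** Returns the original terms plus their related terms, in order. */
    Set<String> expand(String query) {
        Set<String> terms = new LinkedHashSet<>(); // keeps order, drops duplicates
        for (String term : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // an empty query yields no terms
            }
            terms.add(term);
            terms.addAll(relatedTerms.getOrDefault(term, List.of()));
        }
        return terms;
    }
}
```

Because every added term can only match additional documents, the
expanded query retrieves at least as many press releases as the original
one, which is the recall effect described above. Precision may drop if
the related terms are chosen too loosely.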
The feedback we have received from cooperating
companies so far supports our assumption that the
tool is regarded as useful by innovation professionals.
However, more systematic user studies have to be
arranged to evaluate formally how useful the
implemented features are with regard to different
user types and different business areas.
As part of our further activities we will integrate
the press release search engine as one module
among others into a so-called innovation mining
cockpit. This cockpit will form a single point of
entry for all of the innovation professional’s
web-search-related activities (see Stathel et al., 2008,
for details).
REFERENCES
Goss, P., and Hagenhoff, S., 2003.
Strategisches Innovationsmanagement: Eine
Bestandsaufnahme. In Schumann, M. (ed.),
Arbeitsbericht Nr. 11/2003 des Instituts für
Wirtschaftsinformatik der Georg-August-Universität
Göttingen, Göttingen.
Heyer, L. J., Kruglyak, S., and Yooseph, S., 1999.
Exploring Expression Data: Identification and
Analysis of Coexpressed Genes. Genome Research
9, 1106-1115.
Magnani, M., and Montesi, D., 2007. A study on company
name matching for database integration. Technical
Report UBLCS-07-15. May 2007.
Novanet, 2006. Information in der Internetökonomie. 2nd
newsletter of the NovaNet project,
http://www.nova-net.de/fhg/Images/nova-net_2-Newsletter_tcm231-60869.pdf.
Accessed September 9th, 2008.
Stathel, S., Finzen, J., Riedl, C., and May, N., 2008.
Service Innovation in Business Value Networks.
In Proceedings of the XVIII International RESER
Conference.
Stock, W. G., and Lewandowski, D., 2006.
Suchmaschinen und wie sie genutzt werden. WISU
35(2006)8-9, 1078-1083.
1 The project was funded by means of the German Federal
Ministry of Economy and Technology under the promotional
reference “01MQ07012”. The authors take responsibility
for the contents.
STRATEGIC INNOVATION MANAGEMENT ON THE BASIS OF SEARCHING AND MINING PRESS RELEASES