INNOVATION MINING

Supporting Web Mining in Early Innovation Phases

Jan Finzen and Maximilien Kintz

Fraunhofer IAO, Nobelstreet 12, 70569 Stuttgart, Germany

Keywords: Innovation management, Open innovation, Web searching, Semantic search.

Abstract: Fraunhofer IAO conducted a study among 1,000 German innovation professionals regarding their web-

based information acquisition needs. The study showed the need for search tools and methods optimised for

the target group of innovation professionals. In this paper we derive corresponding concepts from the study’s

results. We suggest an “Innovation Mining Process” as a structured approach to web-based information

acquisition for early innovation phases. Our software prototype - the Innovation Mining Cockpit (IMC) -

picks up essential concepts of this process and implements them as an easy-to-use web portal. The IMC is

intended as a central point of contact for innovation-related search activities.

1 WEB SEARCHING IN INNOVATION MANAGEMENT

Innovative ideas can be the result of both formal and

unstructured search processes and can have many

different origins. The Internet provides convenient access to a wide range of external innovation sources. However, the analysis process remains complicated: it is often unclear where the relevant information is located. Furthermore, classical search engines do not offer sufficient precision to separate the relevant results from the large number of irrelevant ones.

In a 2009 survey, Fraunhofer IAO analyzed the Web searching requirements of innovation professionals in Germany (Finzen, Krepp and Heubach, 2009). Figure 1 summarizes the findings on the importance of different Web-based information sources for the individual steps of the early innovation phases: “innovation push”, “idea collection”, “idea creation”, and “idea evaluation”.

While online journals and research and technology

portals are used for finding innovation impulses and

collecting ideas, encyclopedias and especially patent

databases are most useful for idea evaluation.

Internet-based information sources are considered

most useful for collecting ideas, but less useful for

the actual creation of ideas.

Figure 1: Importance of Internet-based information

sources for different (early) innovation phases (n=142).

We asked the respondents about the most annoying problems they encounter with search engines. More interesting than the obvious aspects of quality and time efficiency are the remaining ones: almost half of the respondents stated that they miss a ranking by up-to-dateness, and almost 40 percent are not satisfied with the available filter mechanisms.


Finzen J. and Kintz M.

INNOVATION MINING - Supporting Web Mining in Early Innovation Phases.

DOI: 10.5220/0003276202410247

In Proceedings of the 7th International Conference on Web Information Systems and Technologies (WEBIST-2011), pages 241-247

ISBN: 978-989-8425-51-5

© 2011 SCITEPRESS (Science and Technology Publications, Lda.)

Figure 2: Importance of search engine features (n=142).

The results clearly show that there is a demand

for a new generation of search tools tailored to the

needs of innovation professionals.

2 INNOVATION MINING CONCEPTS

From the results of the survey described above and from discussions with industrial customers, we derive several requirements that innovation mining tools must take into consideration. These requirements are discussed in the following sections.

2.1 Incorporate User Knowledge

Innovation professionals are considered to have deep domain competence and broad knowledge of where and how to find information that meets their needs (“source competence”). It seems promising to make use of both the domain and the source competence as much as possible within a professional information acquisition tool, e.g. by

• allowing the users to integrate their favourite information sources with little effort, and

• integrating domain knowledge into the search process. This can be done, for example, by structuring domain knowledge hierarchically or ontologically and providing both search queries and result sets for the respective nodes.

2.2 Deal with Multitude of Relevant Information Sources

As shown in Figure 1 there are many kinds of

information sources that are relevant to innovation

management. Depending on the actual phase of the

innovation process, some of these information

sources are more important than others. According

to our survey, the majority of respondents liked the

idea of having multiple information sources

integrated and accessible via a unified point of

access. We divide the information sources available

on the web into three different levels:

1. The Document Level: On the most basic level, single documents are accessed directly. E.g., a user who expects to find the Fraunhofer website at www.fraunhofer.de can open it directly.

2. The Database or Search Engine Level: For most

types of documents, specialised search tools already

exist that offer advanced functionality to retrieve and

present information. E.g., there are numerous search

engines for scientific content, and web-based patent

databases make it easy to search for patents on a

given subject. To integrate such sources, a tool has

to “speak” and “understand” the same “language”,

i.e., know how to formulate and send a search query

to the search engine and how to grab and interpret

the results. This is commonly known as “meta

search”.

3. The Meta Search Level: For some kinds of information, meta search engines already exist. Accessing a meta search engine involves basically the same requirements and process steps as accessing a “normal” search engine, but it has to be kept in mind that one relies on the “language translation step” being executed by the meta search engine, which is thus not under one’s own control.

One challenge of providing an integrated

information acquisition tool for innovation mining

thus lies in the handling of “different languages” of

different information sources.
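
The “language” problem can be illustrated with a small adapter abstraction. The following Python code is a minimal sketch under stated assumptions, not the IMC implementation: the class names, the endpoint URLs, the query parameter names, and the JSON result layout are all hypothetical and serve only to show where query formulation and result interpretation would differ per source.

```python
import json
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class Result:
    title: str
    url: str
    source: str


class SourceAdapter:
    """Translates a generic query into one source's 'language' and back."""

    name = "generic"
    base_url = "https://search.example.org/api"  # hypothetical endpoint
    query_param = "q"                            # hypothetical parameter name

    def build_request_url(self, query: str) -> str:
        # Formulate the query the way this particular source expects it.
        return f"{self.base_url}?{urlencode({self.query_param: query})}"

    def parse_response(self, body: str) -> list[Result]:
        # Interpret the answer; the JSON layout
        # {"hits": [{"title": ..., "link": ...}]} is an assumption.
        hits = json.loads(body).get("hits", [])
        return [Result(h["title"], h["link"], self.name) for h in hits]


class PatentAdapter(SourceAdapter):
    """Same interface, different endpoint and parameter name."""

    name = "patents"
    base_url = "https://patents.example.org/find"  # hypothetical endpoint
    query_param = "text"
```

A tool built this way would need one adapter per integrated source; a source with a different response format would additionally override `parse_response`.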

2.3 Use Document-specific Metadata

There are many different document types relevant to the innovation mining process, e.g. patent data, scientific papers, press releases, or blog entries. Each of these document types has specific attributes and metadata that can be exploited within innovation mining: Press releases always have a publishing date and address information, which make it easy to order and visualize them accordingly (Finzen, Kintz, Koch and Kett, 2009). Scientific literature can be a basis for expert identification using co-authorship analyses. Patent data can be exploited for, e.g., white spot analysis (Siwczyk, 2010).

WEBIST 2011 - 7th International Conference on Web Information Systems and Technologies


2.4 Improve Ranking Mechanisms

One very clear result from our survey was the immense importance of ranking algorithms: People are quite unhappy that the result ranking methods of today’s search engines are only partly transparent and only poorly adaptable to the user. For innovation professionals, ranking according to up-to-dateness and regional aspects turned out to be especially important. An information acquisition tool for innovation professionals should thus offer such ranking options. Determining up-to-dateness, of course, is a challenging task and depends heavily on the document being analysed: Certain documents, like blog entries, patents, or press releases, usually contain appropriate metadata such as a publishing date. If such dates are explicitly marked, e.g. using an appropriate HTML or XML tag, identifying them is trivial. Sometimes, however, they have to be extracted from the text, which is rather straightforward using regular expressions. In the case of arbitrary web documents, the task is much more difficult: though timestamps may be found in the document, it is not always obvious whether they denote the point in time the information in the text refers to. If no timestamp is given at all, the freshness might be estimated by comparing the content with the version that was indexed the last time the page was visited. Web monitoring tools do exactly that: they visit a web page at certain intervals and detect changes. By integrating carefully selected web monitoring patterns, a search engine might take recently changed documents into account in its ranking method.
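
Both ideas, extracting dates with regular expressions and detecting changes between two visits of a page, can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes ISO-style dates (YYYY-MM-DD) and uses a hash of the whitespace-normalised page text as a simple change indicator; the function names are hypothetical, and real pages would require more date patterns and a more tolerant comparison.

```python
import hashlib
import re
from datetime import date

# ISO-style dates such as 2011-05-06; other formats would need more patterns.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_dates(text: str) -> list[date]:
    """Return all valid ISO dates mentioned in the text, in order."""
    found = []
    for year, month, day in ISO_DATE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            # Strings like 2011-13-45 look like dates but are not.
            continue
    return found


def fingerprint(text: str) -> str:
    """Hash of the whitespace-normalised content of a page."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, current_text: str) -> bool:
    """True if the content differs from the previously indexed version."""
    return fingerprint(current_text) != previous_fingerprint
```

Note that an extracted date only says that the date is mentioned in the text; as discussed above, deciding whether it is the publishing date remains a separate problem.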

2.5 Support Long-term Information Needs

Search queries are often classified as being either

1. navigational (searching one or more specific

document(s)),

2. informational (seeking information for a given

topic), or

3. transactional (perform a particular action),

depending on the user’s intention (Broder, 2002; Manning, Raghavan and Schütze, 2007).

Almost 50 percent of all queries issued by users of general purpose search engines belong to the first class (Broder, 2002). Consequently, current general purpose search engines mainly address navigational information needs. Nevertheless, the information needs of professional end-users (like market researchers, innovation professionals, etc.) often fall into the second or third category. Springer (2006) stated that “the further development of tools to enable the detection of trends and the finding of information in the Internet can account for improving the innovation performance of companies”.

Recurring searches are widely used within professional information management (cf. Finzen et al., 2009). The possibility to save and automatically repeat complex queries should thus be combined with effective ways to notify the user of newly found results. According to our study’s results, techniques like RSS feeds are still considered less important than more traditional communication channels like e-mail. Nevertheless, we expect that the acceptance of such techniques will grow in the future, as they offer a good means of integrating search results with further applications (e.g. knowledge management systems).
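
The core of such a recurring search is a stored query that remembers what has already been reported. The following Python sketch illustrates this under simple assumptions: results are identified by their URL, and the `search` callable stands in for whatever search backend is used. The class and its names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SavedSearch:
    """A stored query that remembers which result URLs were already reported."""

    query: str
    seen_urls: set[str] = field(default_factory=set)

    def run(self, search: Callable[[str], list[str]]) -> list[str]:
        """Execute the query and return only results not reported before."""
        new_urls = []
        for url in search(self.query):
            if url not in self.seen_urls:
                self.seen_urls.add(url)
                new_urls.append(url)
        return new_urls
```

Calling `run` repeatedly, e.g. from a scheduler, yields exactly the newly found results that a notification by e-mail or feed would contain.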

2.6 Offer Advanced Interaction and Visualization Concepts

Common use cases in innovation management include patent mining, competitor observation, and trend monitoring. To support these use cases, information must be either extracted from one or more documents or derived from a set of documents. Such information needs not only require special result presentation techniques but also affect the query frontend: the users have to make clear that they do not want to be confronted with a list of documents but rather with, e.g., information extracted from documents (“new products of competitor X”) or statistical information (“pie chart comparing the positive and negative utterances of forum users who wrote about company X in the last month”, “bar chart showing the trend for a recent search topic for selected companies”). Search tools for professional end-users thus require suitable navigation concepts and powerful user interfaces for both search query formulation and result interpretation.

2.7 Foster Integration and Collaboration

Web-based information gathering is a very individual task even in professional information work. However, the need to exchange search artifacts grows with the size of an organisation. It is quite common for larger companies to outsource search tasks to a special department (Finzen et al., 2009).

With people searching for information together, the


need for methods and tools to foster collaboration arises. Ways to achieve this include the exchange of bookmarks, persistable search spaces, and the sharing of search results and analysis reports.

3 INNOVATION MINING PROCESS

Building on the “tech mining” process described by

Porter and Cunningham (2004) we suggest a five-

step Web mining approach depicted in Figure 3.

1. Identify Information Need: Although the

outcomes of any mining process might be surprising

(after all, the idea of data mining is the detection of

previously unknown facts), it should be as directed

as possible. This means that the overall strategic

goal of the mining process should be defined as the

very first step, as it influences subsequent steps like

source selection or visualization parameterization.

2. Collect Information: Depending on the information need, a variety of information sources is selected for mining. If the information need embraces temporal developments, e.g., trend analysis of a given topic, the document corpus must be created over a larger timeframe. However, in the case of innovation mining, it is usually important to obtain information as soon as possible, i.e. ideally in real time, as soon as it shows up on the web. Therefore, either sophisticated crawl-and-scrape approaches or (even better) a push supply of information is needed.

3. Process Results: Once the source corpus is defined (and most probably being frequently expanded), the result processing starts. Though the approaches applied might vary considerably, the main task of this step is to match a specified information need (e.g., a search query) against the documents of the corpus. In navigational search, the task usually ends with weighting each document’s relevance in relation to the information need. This allows a suitable ranking of documents in a subsequent step. Information needs of a more informational kind typically involve additional processing steps: information or metadata extraction forms a basis for appropriate analyses in the subsequent step.

4. Analyze and Interpret: The information collected in the processing step is analyzed and condensed, and finally put into graphs that suit the information need. This step aims at supporting the user’s analysis of the data as well as possible. It includes choosing the best-fitting visualization, tailoring the visualization to the results to be displayed, and possibly providing additional information that helps to interpret the data in the right way.

5. Disseminate and Act: Once interesting results are found, subsequent processes can be triggered: further results showing up in the future might be taken into account automatically, resulting in new versions of the result analysis. When new findings are available, the user might want to be notified by an appropriate alerting mechanism. Results might be saved and reloaded, printed, exported into complementary software tools, and passed on to or shared with other users.
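
The matching and weighting described in step 3 can be illustrated with a deliberately simple relevance measure. The Python sketch below scores each document by how often the query terms occur in it; this plain term-frequency count is an assumption chosen for brevity and not the weighting used by any particular search engine, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def score(query: str, document: str) -> int:
    """Count how often the query terms occur in the document."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in set(tokenize(query)))


def rank(query: str, corpus: dict[str, str]) -> list[tuple[str, int]]:
    """Return (document id, score) pairs with a non-zero score, best first."""
    scored = [(doc_id, score(query, text)) for doc_id, text in corpus.items()]
    hits = [pair for pair in scored if pair[1] > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

For informational needs, the ranked list would only be an intermediate result that feeds the extraction and analysis steps.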

4 INNOVATION MINING COCKPIT

To evaluate our deductions regarding search tool

requirements of innovation professionals, we

implemented a search engine prototype. The

Innovation Mining Cockpit (IMC) picks up essential

concepts of this process and implements them as an

easy-to-use web portal.

[Figure 3: five consecutive process steps, each with examples.
1. Identify information need(s): competitors, technologies, products, tenders, events, campaigns, user ideas, …
2. Collect information: internal data, competitors’ websites, patent databases, press releases, scientific content, blogs, forums, …
3. Process results: text and data mining, semantic annotation, statistical approaches, …
4. Analyze and interpret: dashboards/cockpits, trends and events, reporting, …
5. Disseminate and act: monitoring, automatic notification (e-mail, SMS, RSS), collaboration & integration, …]

Figure 3: Innovation Mining Process.


We present the IMC’s main features with regard to the requirements discussed above in the following sections.

4.1 Source Identification

The Source Identification module implements a meta search engine approach: the keywords entered in the text field are passed to various general purpose search engines. The results of the different search engines are combined, ranked, and displayed in a typical search results list. For each result, either the complete URL or the Web domain can be added to the search space (the websites initially show up in a special folder “New sites” within the Search Space Configuration module). The Source Identification module aims mainly at step 2 (“collect information”) of the Innovation Mining Process and is particularly important for quickly adding new information sources to the mining process. In combination with the Search Space Configuration module described in the next section, the Source Identification module illustrates one basic design paradigm of the IMC: the search and mining process is restricted to those sources that the user considers potentially relevant. This approach not only lowers the number of irrelevant results in the search process but also significantly reduces the resource requirements (regarding computation and storage hardware) compared to a broad crawling approach. On the other hand, it makes finding and selecting the significant sources a crucial preprocessing step.
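
How the result lists of several engines are combined is not specified here. One common technique for this task is reciprocal rank fusion, which the following Python sketch uses purely as an illustration; it is not claimed to be the method implemented in the IMC. The function name and the constant k = 60 (a value frequently used with this technique) are assumptions.

```python
from collections import defaultdict


def fuse_rankings(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked URL lists; URLs ranked high by several engines come first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for position, url in enumerate(ranking, start=1):
            # Each engine contributes more for better (smaller) positions.
            scores[url] += 1.0 / (k + position)
    # Sort by descending score; break ties by URL for a stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))
```

The approach needs only the positions in each list, which is convenient because relevance scores of different search engines are usually not comparable.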

4.2 Search Space Configuration

The Search Space Configuration module supports steps 1 (“identify information need”) and 2 (“collect information”) of the Innovation Mining Process. It

combines a sophisticated bookmarking system with

several search engine specific adjustments, like

crawl depth, support of named entity recognition, or

automatic recognition of RSS feeds. Figure 4 shows

a screenshot of the Search Space Configuration

module.

It allows distinguishing between “normal” websites and feeds that offer an information push supply. The main difference lies in the information supply paradigm. As RSS-based information offers several advantages compared to a classical website, we built the IMC’s whole data aggregation mechanism on the feed principle. Based on some heuristics (like text block length), significant changes and new content are detected and provided to the IMC in an RSS-like format. Thus, the IMC internally builds its own feed for any website that does not offer one explicitly. This way, the user can easily be notified of any (relevant) newly found information on any site by RSS, e-mail, or SMS.
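
A text block length heuristic of this kind might look as follows. The Python sketch is an assumption-laden illustration rather than the IMC’s actual heuristic: it presumes that a page has already been converted to plain text with blank lines between blocks, and the minimum length of 80 characters and the function names are hypothetical.

```python
def significant_blocks(page_text: str, min_length: int = 80) -> list[str]:
    """Split a page into text blocks and keep only the longer ones."""
    blocks = [" ".join(b.split()) for b in page_text.split("\n\n")]
    # Short blocks are typically navigation, menus, or footers.
    return [b for b in blocks if len(b) >= min_length]


def new_feed_items(old_text: str, new_text: str, min_length: int = 80) -> list[str]:
    """Return significant blocks present in the new version but not the old."""
    known = set(significant_blocks(old_text, min_length))
    return [b for b in significant_blocks(new_text, min_length) if b not in known]
```

Each returned block could then be wrapped as one item of an internally generated feed, while small changes such as a new menu entry are ignored.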

Figure 4: Search space configuration.

4.3 Feed Search

The feed search portlet offers advanced functionality

to search and analyze feeds:

• Any configured search can be turned into an RSS or Atom feed using the respective buttons at the bottom of the dialog.

• The timeframe can be limited using the calendar pickers at the top of the dialog.

• A feed search configuration can be saved and loaded. This is especially important as configurations can become quite complex – consisting of source restriction, filter settings and the actual search query. This supports the long-term information needs of innovation professionals.

Figure 5: Feed search frontend.


Figure 5 depicts the feed search module showing the results for a “wind energy” query. The result list is accompanied by trend chart and tag cloud visualizations, which (i) provide additional information about “hot topics” and (ii) allow the user to navigate further through the search result set.
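
The data behind such visualizations are simple frequency counts. The following Python sketch shows one possible way to compute them, assuming each result is a pair of publication date and title; the stop word list is illustrative only, and the function names are hypothetical.

```python
from collections import Counter
from datetime import date

STOPWORDS = {"the", "a", "of", "in", "and", "for"}  # illustrative only


def trend(results: list[tuple[date, str]]) -> dict[date, int]:
    """Number of results per publication day, ordered by date."""
    counts = Counter(day for day, _ in results)
    return dict(sorted(counts.items()))


def tag_weights(results: list[tuple[date, str]], top: int = 10) -> list[tuple[str, int]]:
    """Most frequent title words, as a basis for tag cloud font sizes."""
    words = Counter(
        w for _, title in results for w in title.lower().split() if w not in STOPWORDS
    )
    return words.most_common(top)
```

The per-day counts can be plotted directly as a trend chart, and clicking a tag would correspond to adding the word to the current query.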

4.4 Change Monitoring and Notifications

Information needs in innovation management are often informational rather than navigational and long-term rather than ad hoc (Finzen et al., 2009). The Feed Search portlet therefore provides means to automatically execute the query in the background and notify the user of newly found results either by e-mail or using a feed reader. The e-mail notification report can be scheduled for any search either on a regular basis (e.g., once per day or week) or as soon as new results have been found (which of course depends on the interval at which the IMC’s crawler compares the website’s content, or on the feed polling interval configured in the Search Space Configuration module). E-mail addresses are configured within the portal server’s user management system.

4.5 Semantic Annotations

Depending on the document type and the information source, a document may contain metadata or even semantic markup which can be utilised to offer advanced result visualisations and browsing functionality, e.g. faceted search. Unfortunately, the amount of semantically annotated web content is still very limited today.

The IMC integrates the OpenCalais web service (http://www.opencalais.com) on demand during the search space configuration process. As OpenCalais currently does not support German, we also integrated the AlchemyAPI web service (http://www.alchemyapi.com) by Orchestr8, which offers similar functionality.

Figure 6 shows results in the Feed Search portlet that have been annotated via the OpenCalais web service. Countries and organisations are provided in the metadata columns on the left. Clicking on a metadata link restricts the current search accordingly, e.g. when clicking on “U.S. Department of Energy”, only results that include a reference to this organisation will be displayed.
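
The restriction by metadata link corresponds to a simple faceted filter over annotated results. The Python sketch below illustrates the principle; the data structure and the entity type names (“Organization”, “Country”) are assumptions made for this example and do not reproduce the output format of OpenCalais or AlchemyAPI.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AnnotatedResult:
    title: str
    # Entity type -> entity names found in the document (assumed layout).
    entities: dict[str, set[str]] = field(default_factory=dict)


def facet_counts(results: list[AnnotatedResult], entity_type: str) -> Counter:
    """How many results mention each entity of the given type."""
    return Counter(e for r in results for e in r.entities.get(entity_type, set()))


def restrict(results: list[AnnotatedResult], entity_type: str, name: str) -> list[AnnotatedResult]:
    """Keep only results annotated with the selected entity."""
    return [r for r in results if name in r.entities.get(entity_type, set())]
```

`facet_counts` provides the entries of a metadata column, and `restrict` implements the effect of clicking on one of them.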

Figure 6: Semantic annotations.

4.6 Visualization

Special emphasis has been put on search result

visualization.

Frequency Analysis: trend monitoring and event detection are important use cases within the area of technology and innovation management. The Innovation Mining Cockpit offers tag clouds as well as classical bar and line diagrams for any search query.

Geographical Analysis: geographical information can often be extracted from texts fairly easily (e.g., country names can be maintained in look-up lists). Longitude and latitude can then be assigned using geo-coding web services. This allows the visualization of objects on maps.

Association Analysis: association graphs are used to

show and analyze relations between objects, e.g.

companies or persons. Generally, the nodes

represent objects and the edges the relations between

these objects. Various layout algorithms help to

analyze structural information within the graph. For

example, the degree of cross linking may indicate a

company’s importance (or at least its activeness)

within a technological area.

5 CONCLUSIONS AND OUTLOOK

The current version of the IMC has been presented to several industrial partners and received very encouraging reactions. We are currently conducting a long-term evaluation with an industrial partner from the automotive sector to assess how the different modules are accepted in an innovation professional’s daily work. Even though final evaluation results are not available yet, we have already


received valuable feedback that will be addressed during short-term development of the tool:

• The current implementation forces users to index a whole website (using the main URL combined with a high crawl depth). This leads to a large amount of irrelevant content, as the main intention of the IMC is to mine only content that is very fresh. In the next version of the IMC, a wizard will help users to identify the relevant parts of a website by generating a sitemap and letting the user select branches of this map, e.g., company news, press releases, discussion forums, etc.

• Our association analysis is currently based solely on co-occurrence analysis. This proves useful in selected use cases such as “who knows whom” or “who works with whom” analyses but requires a very well selected source corpus, as results can easily be polluted by a misplaced source containing lots of named entities, e.g., a stock report mentioning 100 companies. Although using thresholds on the number of occurrences proves very helpful in getting rid of such problems, combining the co-occurrence approach with more sophisticated approaches based on linguistic and semantic analysis seems promising.
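
A co-occurrence analysis with thresholds can be sketched as follows in Python. The code is an illustration under stated assumptions: each document is reduced to the set of named entities found in it, and the two thresholds (a minimum number of co-occurrences per pair and a maximum number of entities per document) are one possible reading of the thresholds mentioned above. The function names and default values are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(
    docs: list[set[str]], min_count: int = 2, max_entities: int = 20
) -> dict[tuple[str, str], int]:
    """Count entity pairs that occur in the same document.

    Documents with more than max_entities entities (e.g. a stock report)
    are skipped, and pairs seen fewer than min_count times are dropped.
    """
    pairs: Counter = Counter()
    for entities in docs:
        if len(entities) > max_entities:
            continue
        # Sorting makes (A, B) and (B, A) the same pair.
        pairs.update(combinations(sorted(entities), 2))
    return {pair: n for pair, n in pairs.items() if n >= min_count}


def degree(edges: dict[tuple[str, str], int]) -> Counter:
    """Number of distinct partners per entity (degree of cross-linking)."""
    result: Counter = Counter()
    for a, b in edges:
        result[a] += 1
        result[b] += 1
    return result
```

The resulting edges form the association graph, and the degree of a node corresponds to the degree of cross-linking that may indicate a company’s importance within a technological area.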

Table 1 summarizes further needs and plans for

future extensions.

Table 1: Future work.

Requirement          Future work
User knowledge       Semantic domain models to utilize user domain knowledge
Integration          Specific support for scientific literature, patent data, forums & blogs, and open innovation portals
Semantic annotation  Use uniform annotation framework based on Apache UIMA (http://uima.apache.org)
Ranking              Improve configurability of relevance parameters
Collaboration        Reports, private and public spaces

More information on Innovation Mining can be

found on http://www.innovation-mining.net.

ACKNOWLEDGEMENTS

The project was funded by the German Federal Ministry of Economics and Technology under the promotional reference “01MQ07017”. The authors are responsible for the contents.

REFERENCES

Broder, A. (2002). A taxonomy of web search. SIGIR Forum, 36(2), 3-10.

Finzen, J., Kintz, M., Koch, S., and Kett, H. (2009). Strategic Innovation Management on the Basis of Searching and Mining Press Releases. Proceedings of the 5th International Conference on Web Information Systems and Technologies (WEBIST). Lisbon.

Finzen, J., Krepp, T., and Heubach, D. (2009). Web Searching in Early Innovation Phases: a Survey among German Companies. Proceedings of the 2nd ISPIM Innovation Symposium. New York.

Manning, C. D., Raghavan, P., and Schütze, H. (2007). Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

Porter, A. L., and Cunningham, S. W. (2004). Tech Mining: Exploiting New Technologies for Competitive Advantage. Wiley Series in Systems Engineering and Management. John Wiley & Sons.

Springer, S. (2006). Nutzung von Internet und Intranet für die Entwicklung neuer Produkte und Dienstleistungen. nova-net Werkstattreihe. Stuttgart: Fraunhofer Verlag.

Siwczyk, Y. (2010). IT-gestützte White-Spot-Analyse: Potenziale von Patentinformationen am Beispiel Elektromobilität erkennen. Stuttgart: Fraunhofer Verlag.
