Collaboration Spotting Cite: An Exploration System for the

Bibliographic Information of Publications and Patents

André Rattinger1,2, Jean-Marie Le Goff and Christian Guetl

IPT-DI, CERN, Espl. des Particules, Meyrin, Switzerland

Institute of Interactive Systems and Data Science, Graz University of Technology, Graz, Austria

Keywords:

Bibliometrics, Information Retrieval, Relevance Feedback, Visual Analytics, Patent Analysis.

Abstract:

Collaboration Spotting is a knowledge discovery web platform that visualizes linked data as graphs. The platform lets users manipulate the graph to see and explore different facets of complex networks with multiple node and edge types. It combines information retrieval and graph analysis to effectively explore arbitrary data-sets. The platform is designed so that non-expert users without data science knowledge can explore the data. For this, the data has to be mapped to the graph by a specifically crafted schema. The paper explores the platform in a bibliometrics context and demonstrates its search and relevance feedback mechanisms, which are applied through the navigation of an underlying knowledge graph based on publication and patent metadata. This demonstrates a novel way to interactively explore linked datasets by combining visual analytics for graphs with relevance feedback.

1 INTRODUCTION

Defining and solving problems often starts with the exploration of data. Exploring publication and patent metadata together with the textual content is a complex and time-consuming task, especially when a person is new to a particular domain. Without a clear view of what is available or some knowledge of the domain, it is difficult to know how a problem can be solved, or whether it is even solvable with what is at hand. A newcomer to a scientific field can easily be overwhelmed by the massive amount of publications and information that is available. Finding the most relevant authors, papers, companies, universities or topics of a field is a challenge that takes up a lot of time. If someone wants to create a new invention, for example, searching the patents for existing or similar work is a very time-consuming task that requires expert knowledge in the choice of keywords and categories. Collaboration Spotting is a tool that can help with this data exploration problem. It is designed to work with any kind of data, but preferably with heavily linked data or even a knowledge graph associated with textual content.

Collaboration Spotting Cite is a specific version of the Collaboration Spotting platform developed at INSTITUTION to explore bibliometric data from publications and patents. It enables users to view different facets of their connected data and to manipulate certain aspects of it, such as the selection of subsets or the viewpoint on the data. This version combines the graphical navigation with information retrieval procedures. The first step, as shown later in this paper, is retrieving a list of indexed documents and automatically transforming them into a graph based on a schema blueprint. Afterwards the user navigates the graph, with different options to manipulate it so that the system shows a subset that is closer to the user's information need. In addition, a new search can then be performed that uses the user's input from the navigation as relevance feedback for a renewed retrieval.

The remainder of this paper is organized as follows: The next section outlines related work on science mapping and information retrieval with relevance feedback. Section 3 describes the Collaboration Spotting platform with its navigation and retrieval mechanisms. Section 4 demonstrates how the platform can be used with citation data, and Section 5 concludes the paper and outlines future work.


Rattinger, A., Goff, J. and Guetl, C.

Collaboration Spotting Cite: An Exploration System for the Bibliographic Information of Publications and Patents.

DOI: 10.5220/0008366105480554

In Proceedings of the 11th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2019), pages 548-554

ISBN: 978-989-758-382-7

2 RELATED WORK

Mapping science and scientific processes through citation data has been explored by (Small, 1999), where multiple approaches are reviewed and the data is arranged in different ways. In addition, there are multiple visualization applications for general graph-based data (Bastian et al., 2009) as well as for co-citation networks or bibliometric networks. The available viewers provide different tools to view and visualize the graph based on its properties and to perform different manipulations on the graph data. A tool similar to the work presented here is VOSviewer. VOSviewer (van Eck and Waltman, 2009) is a tool for the visualization of bibliometric data and combines this with natural language processing to also create term co-occurrence networks from textual information. Collaboration Spotting can also be applied to generic data, and it offers information retrieval methods and novel ways to navigate through the data which the usual bibliometric visualization tools do not provide. CiteSpace (Chen, 2006) is another tool that helps to explore and visualize scientific knowledge domains. The key differences are in how the retrieval aspects of the navigation are handled. Similar to VOSviewer, CiteSpace does not offer information retrieval functionality, which is included in Collaboration Spotting. The information shown in these visualization platforms has to be provided beforehand from external datasets. In the case of CiteSpace, it can also be downloaded directly from the Web of Science search interface. The procedure in Collaboration Spotting has the advantage that users can rapidly perform multiple searches and even combine them to create a suitable result graph for their data exploration. In comparison to other systems, the data can come directly from the indexed documents, but a manual blueprint of the data mapping has to be created.

Parts of the retrieval process rely on methods that can be described as relevance feedback through graph navigation. Relevance feedback as a way to refine the information retrieval process has been well defined and explored in the literature (Rocchio, 1971) (Salton and Buckley, 1990), and there are many approaches that use the fully automated pseudo-relevance feedback method to refine queries with good success (Cao et al., 2008). In addition, there are methods that utilise pseudo-relevance feedback for citation recommendation (Liu et al., 2014), but the authors do not know of any methods that directly use graph exploration and navigation as a mechanism for the application of relevance feedback.

Figure 1: A schematic representation of a schema as it is used for the transformation of the data and the navigation in the graph. The publication forms the central point for navigation between the available metadata. Search (START) represents the connection to the search keywords or seed document the graph is based on; publication references the actual document.

3 COLLABORATION SPOTTING CITE

Collaboration Spotting is a visualisation and navigation platform for exploring and manipulating large and complex data-sets (Agocs et al., 2017). It combines aspects of information retrieval and visual analytics to let users explore their data without having a background in data science or other related fields. A typical search and navigation process in the Cite version of the web application is performed in multiple steps: retrieval of the relevant documents and construction of the graph, navigation and exploration of the data, and finally refinement of the search through relevance feedback. The following sections explain each of the stages in more detail.

3.1 Information Retrieval and Graphs

Figure 2: A sample graph based on the schema. The publications have multiple elements that connect them, which enables the navigation in the system. As classifications do not have any connection to more than a single node, they would also not have any connections in the facets view.

The system operates in the following way: First, the user performs a full-text search on the indexed documents, and the retrieval process returns a list of items and their relevance. The Collaboration Spotting platform is not limited to text documents, but the search procedures have been optimized for this application. Parts of the retrieval process are described in more detail in (Rattinger et al., 2018a). The retrieval process takes either full documents or keywords defined by the user to perform the initial search, since source documents are mostly available when searching for patents and publications. When a full document is used, keywords are extracted from its different sections and weighted by tf-idf (Ramos et al., 2003). The list of result documents is then transformed into a graph according to a predefined schema. The schema acts as a blueprint for the transformation and for later navigation in the graph, and tells the system how the data has to be transformed to fit into the graph structure. For publications and patents, a star-like schema is the simplest schema for transformation and navigation, with the text document forming the central element. More extensive schemas are possible as well, but overly complex schemas might be difficult for a user to construct or might make the data difficult to interpret. This might be alleviated by specific domain knowledge. A basic star schema can be seen in Fig. 1. In this simple example, there is only a limited amount of metadata in the graph. In addition, the document is attached to a search node, which allows different search graphs to be combined as an additional search refinement or expansion mechanism. The search node provides the starting point for further navigation: one of the facets of the graph is visualized starting from this search node. The search node for a single search has only a single instance and is named after the keywords or the seed document. Fig. 2 shows a sample graph based on the previous schema. Each of the nodes in the schema other than the starting search node will have multiple instances with connections between them. A node of a certain type is never connected directly to another node of the same type. This is an important principle for how the navigation takes place later on. The procedure of visualizing and performing graph navigation operations is explained in the next section.
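The transformation of a result list into a graph can be illustrated with a minimal sketch. It assumes a star schema like the one in Fig. 1; the names `SCHEMA` and `build_graph`, the document field names and the sample records are hypothetical and only serve the illustration. They are not taken from the platform.

```python
from collections import defaultdict

# Hypothetical star schema: the publication is the central node type and
# every other facet is linked only to it, never to a node of its own type.
SCHEMA = {"publication": ["author", "citation", "classification"]}


def build_graph(search_name, documents):
    """Turn a list of retrieved documents into typed nodes and edges.

    Nodes are (type, label) tuples and edges are stored in an undirected
    adjacency dict. The field names in `documents` are illustrative only.
    """
    adjacency = defaultdict(set)

    def link(a, b):
        adjacency[a].add(b)
        adjacency[b].add(a)

    # A single search has exactly one search node, named after the query.
    search_node = ("search", search_name)
    for doc in documents:
        pub_node = ("publication", doc["id"])
        link(search_node, pub_node)
        # Attach every metadata value listed in the schema to the publication.
        for facet in SCHEMA["publication"]:
            for value in doc.get(facet, []):
                link(pub_node, (facet, value))
    return adjacency


documents = [
    {"id": "P1", "author": ["A", "B"], "citation": ["C1"], "classification": ["H01L"]},
    {"id": "P2", "author": ["B", "C"], "citation": ["C1", "C2"]},
]
graph = build_graph("superconductor", documents)
```

In this sketch, author "B" ends up connected to both publications, while no edge ever joins two nodes of the same type, which mirrors the principle stated above.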

Figure 3: Extract of only the facets publications and authors from the sample graph. This is the basis for the final representation in the application. In this example the publication nodes are used as reference nodes to visualize relationships between authors.

Figure 4: The remaining graph as seen by the user in the Collaboration Spotting application. Only nodes from the single facet "author" are represented in the final graph.

3.2 Graph Navigation

An important aspect of Collaboration Spotting is how a graph that has been created by the search can be explored. The principle is always the following: The user chooses a single facet of the data. A facet corresponds in this case to one of the nodes in the schema shown in Fig. 1 (author, citation, classification, publication). In the next step, only the facets relevant for the navigation are selected, as can be seen in Fig. 3. In this example the user selects the facet "author", which is visualized from the perspective of the publications. We call the publication nodes in this case the "reference" nodes, as they are used as the basis for the resulting graph. As long as a connection exists, any direction can be visualized.

KDIR 2019 - 11th International Conference on Knowledge Discovery and Information Retrieval


Figure 5: Screenshot of the facet keywords in Collaboration Spotting. This represents a different view of the same knowledge graph after a facet switch.

The resulting graph from this user interaction, which is called navigation in Collaboration Spotting, can be seen in Fig. 4. The same process can be repeated with the citation facet. It is notable that the system does not allow direct connections between nodes of the same facet type. This means that the citation facet is represented as a separate type from the publication facet. In addition, before visualizing the results, the Louvain community detection algorithm (Blondel et al., 2008) is applied. Fig. 6 shows an example of an author network for a search with the keyword "superconductor". The user can then either change the view to another facet of the graph or select a subset of the graph to explore further. Subsets can be the detected communities, connected components, or one or more nodes separately selected by the user. Other selections based on graph metrics would also be possible but are not in the system at the moment. Selections based on metadata can be made by coloring the nodes by separate metadata properties and then selecting on that basis. Based on the selection that the user has made, a new graph is created in the same fashion as explained at the beginning of this section, but only based on the selected reference nodes. Changing from the author view with a subset of central authors, one can arrive at another facet through navigation, as seen in the Collaboration Spotting screenshot in Fig. 5. In this way, the graph selection can be applied to publication, patent or other highly connected textual data. Another possibility is to combine multiple keyword searches to see how they overlap or what they have in common. As the resulting networks are always connected to an entry facet, called "Search", it is possible to select multiple search nodes at the same time to combine the search results. Additional technical and mathematical descriptions of the navigation operations can be found in (Agocs et al., 2017).
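The navigation principle of this section can be sketched as follows. The sketch assumes that two facet nodes are linked whenever they share a reference node, and it uses connected components as the selectable subset; Louvain community detection is not implemented here. The function names and the toy data are hypothetical, and the formal definition of the operations is the one given in (Agocs et al., 2017).

```python
from collections import defaultdict
from itertools import combinations


def project_facet(memberships, selected_refs=None):
    """Link facet nodes that share a reference node.

    memberships maps a reference node (e.g. a publication) to its set of
    facet nodes (e.g. authors). The edge weight counts shared references.
    """
    weights = defaultdict(int)
    for ref, facet_nodes in memberships.items():
        if selected_refs is not None and ref not in selected_refs:
            continue  # only the selected reference nodes contribute
        for a, b in combinations(sorted(facet_nodes), 2):
            weights[(a, b)] += 1
    return dict(weights)


def connected_components(nodes, edges):
    """Return the connected components as a list of node sets."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy data: publications act as reference nodes for both facets.
pub_authors = {"P1": {"A", "B"}, "P2": {"B", "C"}, "P3": {"D", "E"}, "P4": {"A", "B"}}
pub_keywords = {"P1": {"magnet", "cable"}, "P2": {"cable", "cryostat"},
                "P3": {"virus"}, "P4": {"magnet"}}

author_edges = project_facet(pub_authors)
all_authors = set().union(*pub_authors.values())
components = connected_components(all_authors, author_edges)

# The user selects the first component, then switches to the keyword facet.
selected_pubs = {p for p, authors in pub_authors.items() if authors & components[0]}
keyword_edges = project_facet(pub_keywords, selected_pubs)
```

The facet switch at the end rebuilds the keyword view from the selected reference nodes only, so keywords that belong exclusively to unselected publications no longer appear.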

3.3 Relevance Feedback

Through the navigation process, the user can arrive at an entirely new version of the graph. This happens either by filtering through the selection of subsets or by the combination of multiple searches. With this, the user might have arrived at a better expression of their search interest than the initial keywords or document, which can only provide limited information (no synonyms, the user might not know the field well initially). The system can apply the following relevance feedback mechanism to the new refined subset of the graph:

As the central node in the blueprint is always connected to a text document, we find a vector representation for each of the documents. For this purpose, document embeddings based on the doc2vec models (Mikolov et al., 2013), (Le and Mikolov, 2014) were trained. The training of those document embeddings for patents and their application to information retrieval with bibliographic information is described in (Rattinger et al., 2018b), (Rattinger et al., 2018a). The process for publications is the same as the one for patents. A separate doc2vec model was trained for each of the document types. Next, every document from the graph the user selected is assigned a vector by the model. A k-means clustering algorithm (Hartigan and Wong, 1979) is then applied to find clusters of topics. We select the N closest documents for the new graph, where N is a hyper-parameter defined by the user. This hyper-parameter will be set automatically in the future. The newly performed search creates a new search node to which the latest retrieved search results are attached, so that the user can continue the refinement process and repeat it as many times as desired.
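The selection of the N closest documents can be illustrated with a simplified sketch. It is not the system's implementation: instead of running k-means, it uses a single centroid of the selected documents, it assumes Euclidean distance, and it replaces the doc2vec vectors with toy two-dimensional vectors. All names (`centroid`, `closest_documents`, `embeddings`) are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def closest_documents(selected_ids, embeddings, n):
    """Rank all indexed documents by distance to the selection's centroid.

    Simplification: one centroid stands in for the cluster centers that a
    k-means run over the selected documents would produce.
    """
    center = centroid([embeddings[d] for d in selected_ids])
    # Ties are broken by document id to keep the ranking deterministic.
    ranked = sorted(embeddings, key=lambda d: (math.dist(embeddings[d], center), d))
    return ranked[:n]


# Toy embeddings standing in for doc2vec vectors.
embeddings = {
    "d1": [0.0, 0.0],
    "d2": [0.2, 0.0],
    "d3": [1.0, 1.0],
    "d4": [0.1, 0.3],
    "d5": [5.0, 5.0],
}
selected = ["d1", "d2"]          # documents referenced by the user's selection
new_results = closest_documents(selected, embeddings, n=3)
```

With several cluster centers, the same ranking could be computed per center and the results merged; how the system combines them is not specified in the text above.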

4 USE-CASE

This section presents a typical use-case in bibliometric search with Collaboration Spotting. A subset of articles is selected based on the user keywords. The use-case presents the current system and the data it is used with, and shows some of its capabilities in graph exploration.

4.1 Data

Collaboration Spotting can run on any data-set that contains highly connected data. Two different types of data are used in the current version, publications and patents. The metadata records and textual information of publications come from the Web of Science Core Collection (Analytics, 2017). Patent texts and data come from the PATSTAT database developed by the European Patent Office (EPO) (Office, 2017) and from full text documents provided by the United States Patent Office (USPTO) (https://bulkdata.uspto.gov/). The subset that is chosen for the current system is made up of all patent documents between 2004 and 2016. This still provides the system with an enormous amount of data to work with, as it consists of 2,843,182 documents for the patents alone.

Figure 6: Sample author network for the search keyword "superconductor". All the examples of connected components, communities, single nodes or combinations of all three can be used for further navigation and selection.

4.2 Application

A user searches for the very general abbreviation "tsv", which results in documents from different domains, notably one domain related to physics called "through silicon via", a chip interconnection technique, and another topic related to the medical domain called "Taura syndrome virus". Fig. 7 shows the citation network resulting from the search. As mentioned before, there are multiple ways to color the nodes; in the example they were colored by the automatically detected communities. Notable is the big pink community in the upper left corner. The user can now select a community with a right-click and switch to another network view, such as the keyword network shown in Fig. 5, or use this new selection to start a new refined search from the documents referenced by the selection. In this case, all of the communities other than the one that references "Taura syndrome virus" could be chosen. Fig. 8 shows the keywords of the new selection after the facet switch from citations, giving an overview of which keywords are important for this particular community of citations. The overall most relevant keywords weighted by tf-idf also change after this process, so that they no longer include taura, syndrome or virus. With this, another search based only on the most important keywords could be performed as well.
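The change in the top keywords can be reproduced on toy data with a short sketch. It assumes one common tf-idf variant (relative term frequency times log(N/df), summed over the documents) and computes the inverse document frequency over the documents passed in; the exact weighting used by the system is not specified here. The function name `top_keywords` and the token lists are made up for the illustration.

```python
import math
from collections import Counter


def top_keywords(docs, k):
    """Rank terms by their summed tf-idf score over a list of token lists."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            scores[term] += (count / len(doc)) * math.log(n / df[term])
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Toy result list for the query "tsv" covering both domains.
all_docs = [
    ["tsv", "silicon", "via", "chip"],
    ["tsv", "silicon", "interconnect"],
    ["tsv", "taura", "syndrome", "virus", "shrimp"],
    ["tsv", "virus", "taura"],
]
# Documents referenced by the selected "through silicon via" community.
selected_docs = all_docs[:2]

keywords_before = top_keywords(all_docs, 5)
keywords_after = top_keywords(selected_docs, 5)
```

Before the selection the ranking mixes terms from both domains; after it, only terms from the selected community remain, which is the effect described above.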


Figure 7: Selection of a community from the search. The search results are displayed as this citation network. In the upper right corner the user selected a community which should be used for further navigation or to create a new search based on the user's relevance feedback.

Figure 8: Network of the new selection made by the user, including newly calculated communities.

5 CONCLUSION AND FUTURE WORK

This paper demonstrates the information retrieval and navigation mechanism of the Collaboration Spotting web platform. The platform enables its users to effectively navigate complex data-sets and to use the navigation capabilities to refine the search process and create more relevant search results. This is shown with a qualitative example of a sample research problem, where a subset based on a community in the citation graph is chosen to show a more pertinent version of another facet of the graph, the keyword graph. This keyword graph can then be used for another retrieval run based on document embeddings. This refinement process can be repeated multiple times by the user to create a better knowledge graph representing the search interest.

The search functionality in Collaboration Spotting Cite is still a work in progress and needs to be evaluated on a quantitative basis. Optimal values for some of the hyper-parameters have yet to be identified. The number of relevant documents retrieved by the relevance feedback method currently has to be specified by the user and could be automated. The proximity of the embedded documents to the cluster center or the relevance of the document to the improved search in the ranking would be two methods for automation. Representing the search as a graph and using this graph to represent the enhanced search interest of the user is a novel way to explore and search effectively, even if the user is not familiar with the explored data or is new to a field and wants to find out the most important concepts, people or institutions.

REFERENCES

Agocs, A., Dardanis, D., Forster, R., Le Goff, J.-M., Ouvrard, X., and Rattinger, A. (2017). Collaboration spotting: A visual analytics platform to assist knowledge discovery. ERCIM NEWS, (111):46–47.

Analytics, C. (2017). Web of science core collection. Citation database. Web of Science.

Bastian, M., Heymann, S., and Jacomy, M. (2009). Gephi: an open source software for exploring and manipulating networks. In Third international AAAI conference on weblogs and social media.

Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008.

Cao, G., Nie, J.-Y., Gao, J., and Robertson, S. (2008). Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 243–250. ACM.

Chen, C. (2006). Citespace ii: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the American Society for information Science and Technology, 57(3):359–377.

Hartigan, J. A. and Wong, M. A. (1979). Algorithm as 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108.

Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In International conference on machine learning, pages 1188–1196.

Liu, X., Yu, Y., Guo, C., and Sun, Y. (2014). Meta-path-based ranking with pseudo relevance feedback on heterogeneous graph for citation recommendation. In Proceedings of the 23rd acm international conference on conference on information and knowledge management, pages 121–130. ACM.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Office, E. P. (2017). Patstat - worldwide patent statistical database.

Ramos, J. et al. (2003). Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning, volume 242, pages 133–142. Piscataway, NJ.

Rattinger, A., Le Goff, J.-M., and Guetl, C. (2018a). Local word embeddings for query expansion based on co-authorship and citations.

Rattinger, A., Le Goff, J.-M., Meersman, R., and Guetl, C. (2018b). Semantic and topological patent graphs: Analysis of retrieval and community structure. In 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), pages 51–58. IEEE.

Rocchio, J. J. (1971). Relevance feedback in information retrieval. The SMART retrieval system: experiments in automatic document processing, pages 313–323.

Salton, G. and Buckley, C. (1990). Improving retrieval performance by relevance feedback. Journal of the American society for information science, 41(4):288–297.

Small, H. (1999). Visualizing science by citation mapping. Journal of the American society for Information Science, 50(9):799–813.

van Eck, N. and Waltman, L. (2009). Software survey: Vosviewer, a computer program for bibliometric mapping. Scientometrics, 84(2):523–538.
