Discovering Communities of Similar R&D Projects

Martin Víta

NLP Centre, Faculty of Informatics, Botanická 68a, 602 00, Brno, Czech Republic

Keywords:

Document Similarity, Latent Semantic Analysis, Community Discovery, Eigenvector Centrality.

Abstract:

Datasets about research projects contain knowledge that is valuable to many kinds of actors in the R&D field, including innovative companies, research institutes, universities, individual researchers and research teams, as well as funding providers. The main goal of this paper is to introduce a software tool, based on a reusable methodology, that uses the similarity of projects to group them and to give a deeper, visual insight into the structure of a given set of projects. Our approach uses several concepts developed in social network analysis.

1 INTRODUCTION

Successful cooperation in R&D requires up-to-date knowledge about the state of a particular field from different perspectives, including reports about research projects, research teams and the companies involved. Research teams preparing a new project proposal are interested in information about institutions and teams working on similar problems, successfully completed projects and projects in progress. Policy makers are interested in condensed information about the orientation of research financed from public sources. Data about researchers participating in certain groups of projects are of interest to HR departments of innovative companies that want to build talent pools.

The main goal of this work is to develop a handy software tool based on a reusable methodology for exploring the structure of a given collection of projects with respect to their content similarity (affinity). Since project descriptions are stored in textual form, this can be treated as a text mining problem. Hints on the implementation in R are included in this paper.

Our basic requirements are simplicity and reusability in practice, together with easy implementation using standard packages and libraries for text mining and visualization (in R or Python). For this reason we also do not deal with explicit knowledge artifacts such as ontologies in the manner presented in (Ma et al., 2012).

2 METHODOLOGY

The key steps of our work are summarized in the outline below. This approach is inspired by (Trigo and Brazdil, 2014) and (Brazdil et al., 2015), but it differs in the following two aspects:

• Domain – we obtain an affinity graph whose nodes are projects, whereas (Trigo and Brazdil, 2014) and (Brazdil et al., 2015) deal with researchers,

• Using LSA – the computation of similarity/affinity among projects is improved by latent semantic analysis, which offers a partial solution to the problem of synonymy and to the problem of similarity of text snippets that describe similar things in different words (i.e. with a low or zero number of common words).

The feasibility of LSA, and of text mining approaches in general, for detecting similarity between patent documents and scientific publications was discussed in (Magerman et al., 2010). We would also point out that these steps can be naturally embedded into standard data mining methodologies such as CRISP-DM (Wirth and Hipp, 2000).

Outline of Our Approach

1. Creating a corpus – a collection of text documents (“profiles of projects”) – and obtaining their vector representations

2. Computing the dissimilarity matrix using LSA

3. Discovering communities of similar projects

4. Visualization of the similarity graph

5. Identifying important projects in the similarity graph

Víta, M. Discovering Communities of Similar R&D Projects. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2015) - Volume 3: KMIS, pages 460-465. ISBN: 978-989-758-158-8

2.1 Creating the Corpus

In our approach, each project is represented by its title, keywords and abstract (summarized aims of the project etc.). Obviously, the process of gathering these data depends on the source used. For extracting the mentioned fields from web pages, different techniques of parsing HTML code can be employed (in the R environment the XML package can usually be used successfully; other approaches may use XSLT transformations etc.).
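As an illustration of the extraction step, the following is a minimal Python sketch that reads the cells of an HTML table using only the standard library `html.parser` module. It is not the R/XML code used in this work; the class name `TableCellParser`, the helper `extract_rows` and the column names in the sample fragment are hypothetical, and a real export will have a different layout.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return the table rows found in an HTML string as lists of cell texts."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical export fragment; real column names and order will differ.
sample = """
<table>
  <tr><th>Ident</th><th>Title</th><th>Keywords</th></tr>
  <tr><td>TD020121</td><td>Publication of statistical yearbook data as Open Data</td>
      <td>Linked open data - data transformation</td></tr>
</table>
"""
rows = extract_rows(sample)
header, records = rows[0], rows[1:]
projects = [dict(zip(header, record)) for record in records]
```

Each dictionary in `projects` can then be written out as one plaintext file per project.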

The textual data representing a single project form one plaintext file, i.e. the set of considered projects is mapped one-to-one onto a collection of plaintext files. After tokenization of these texts, a sequence of standard preprocessing steps is performed. It consists of:

1. Transformation to lowercase

2. Punctuation removal

3. Numbers removal

4. Stopwords removal

5. Whitespace stripping

This sequence seems to be sufficient for English textual data, which is probably the most typical case. When dealing with highly inflected fusional languages, standard NLP procedures such as lemmatization or stemming need to be applied.
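The five preprocessing steps listed above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the function name `preprocess` is ours, and the stopword list is a tiny placeholder, whereas a real run would use a full English stopword list.

```python
import re
import string

# Tiny illustrative stopword list (an assumption, not the list used in this work).
STOPWORDS = {"a", "an", "and", "the", "of", "is", "in", "for", "to", "we", "will", "on"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, stopwords=STOPWORDS):
    """Apply the five preprocessing steps and return a cleaned string."""
    text = text.lower()                                        # 1. lowercase
    text = text.translate(PUNCT_TABLE)                         # 2. punctuation removal
    text = re.sub(r"\d+", " ", text)                           # 3. numbers removal
    tokens = [t for t in text.split() if t not in stopwords]   # 4. stopwords removal
    return " ".join(tokens)                                    # 5. whitespace stripping


cleaned = preprocess("The aim of the project is creation of a SW, in 2014!")
```

Replacing punctuation by spaces rather than deleting it keeps hyphenated keywords such as "self-learning" as two separate tokens; whether that is desirable depends on the corpus.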

The collection of preprocessed texts is turned into a term-document vector representation, i.e. we obtain a term-document matrix (TDM). The tf-idf weighting is used (Feldman and Sanger, 2007) and words shorter than 3 characters are not taken into account.
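A minimal sketch of building such a matrix in plain Python is given below. It assumes one common tf-idf variant (raw term frequency times log(N/df)); text mining packages differ in the exact weighting, so the numbers need not match those of any particular library. The function name `tfidf_tdm` and the three toy documents are illustrative only.

```python
import math
from collections import Counter


def tfidf_tdm(docs, min_len=3):
    """Return (terms, matrix); matrix[i][j] is the tf-idf weight of terms[i] in docs[j]."""
    # Keep only tokens with at least min_len characters.
    tokenized = [[t for t in d.split() if len(t) >= min_len] for d in docs]
    counts = [Counter(tokens) for tokens in tokenized]
    terms = sorted(set().union(*tokenized))
    n_docs = len(docs)
    matrix = []
    for term in terms:
        df = sum(1 for c in counts if term in c)   # number of documents containing the term
        idf = math.log(n_docs / df)
        matrix.append([c[term] * idf for c in counts])
    return terms, matrix


docs = [
    "linked open data public sector",
    "open data statistics",
    "graph algorithms complexity",
]
terms, tdm = tfidf_tdm(docs)
```

Rows of `tdm` correspond to terms and columns to documents, matching the orientation of the TDM described in the text.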

2.2 Computing the Dissimilarity Matrix using LSA

Unlike the approach presented in (Brazdil et al., 2015), the similarity among documents is not computed directly from the TDM using cosine similarity; instead, the TDM is first decomposed using latent semantic analysis (LSA for short).

In the TDM, rows represent unique words and columns represent documents. LSA is a method for lowering the rank of this matrix (the TDM is usually very sparse) based on singular value decomposition (SVD). SVD has a solid linear algebraic background. Using LSA in an information retrieval context is sometimes called latent semantic indexing (LSI).

Given a term-document matrix M, LSA computes its rank-k approximation $M_k$, where k is a chosen number of factors, called latent semantic dimensions. The value of k is usually between 100 and 300 and is chosen empirically (Rehurek, 2008). The number of dimensions has to be smaller than the number of documents involved. In this contribution we omit the mathematical form, since it is thoroughly studied and described in the literature. We point out only the basic idea: LSA is a technique that maps documents into the space of latent semantic dimensions, where words that are semantically similar (measured by the ratio of co-occurrences in documents) are mapped into the same dimensions and semantically different words into different dimensions. So, instead of dealing with a “word × document” matrix, we have a “concept × document” matrix, where the number of concepts is just the number of latent semantic dimensions. For our purposes, LSA has two main, closely related advantages: it can handle synonymy and it can be used for computing the similarity of documents that have similar content but a low number of common words.

The LSA decomposition of our TDM is computed using standard packages that are available for all major programming or data manipulation languages. The dissimilarity matrix is obtained by cosine similarity from the matrices of the LSA process (particularly from $S_k \cdot V_k^T$, where $M_k = U_k \cdot S_k \cdot V_k^T$; for notation and a detailed explanation see (Rehurek, 2008)).
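To make this step concrete, here is a small sketch in Python with NumPy (an assumption; the implementation described in this paper uses R). It truncates the SVD to k dimensions, takes the columns of $S_k \cdot V_k^T$ as document vectors and computes their pairwise cosine similarity. The function name `lsa_document_similarity` and the toy matrix are illustrative.

```python
import numpy as np


def lsa_document_similarity(tdm, k):
    """Cosine similarity between documents in a k-dimensional LSA space.

    tdm: (terms x documents) array; k: number of latent dimensions kept.
    """
    u, s, vt = np.linalg.svd(np.asarray(tdm, dtype=float), full_matrices=False)
    # Columns of S_k V_k^T are the documents; transpose to get one row per document.
    docs = (np.diag(s[:k]) @ vt[:k, :]).T
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty documents
    unit = docs / norms
    return unit @ unit.T


# Toy TDM. Rows: car, automobile, engine, flower. Columns: three documents.
# Documents 0 and 1 share only "engine"; document 2 is unrelated.
toy = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 2],
]
sim = lsa_document_similarity(toy, k=2)
```

On the raw toy matrix the cosine similarity of documents 0 and 1 is 0.5, because "car" and "automobile" are different words. In the two-dimensional LSA space both documents are mapped to the same direction, which is the synonymy effect described above.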

From the dissimilarity matrix we can read off the similarity of all pairs of considered projects. Values are truncated to two decimal digits, and those lower than a certain threshold are considered irrelevant and set to zero.
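The truncation and thresholding can be sketched as follows. The function name `sparsify` and the threshold value in the example are illustrative, and zeroing the diagonal (so that the resulting graph has no self-loops) is our assumption rather than something stated above.

```python
import math


def sparsify(sim, threshold):
    """Truncate similarities to two decimal places and zero those below threshold."""
    n = len(sim)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue    # assumption: no self-loops in the similarity graph
            # The small epsilon guards against binary rounding such as 0.29 * 100 = 28.999...
            value = math.trunc(sim[i][j] * 100 + 1e-9) / 100
            out[i][j] = value if value >= threshold else 0.0
    return out


sim = [
    [1.0, 0.579, 0.123],
    [0.579, 1.0, 0.301],
    [0.123, 0.301, 1.0],
]
adjacency = sparsify(sim, threshold=0.3)
```

The resulting matrix can be used directly as a weighted adjacency matrix in the following steps.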

2.3 Discovering Communities of Similar Projects

The dissimilarity matrix can be regarded as the adjacency matrix of a (similarity) graph – an undirected graph with weighted edges. The main aim of our work is the discovery of communities, i.e. groups of similar projects (unlike hard clustering, where each entity is assigned to exactly one cluster, with communities we admit that an entity may belong to more than one community). From a graph theory point of view, a community is a densely connected subgraph. Community discovery is a common task in social network analysis (Combe et al., 2010).

We use the Walktrap algorithm (Pons and Latapy, 2006) for detecting communities. This algorithm is based on the idea that short random walks tend to stay in the same community (see the manual page of igraph). The length of the random walk k is a parameter of the algorithm – after experiments we have chosen k = 4. Roughly speaking, shorter walks lead to “a bigger number of communities consisting of a smaller number of nodes”, whereas longer walks lead to “a small number of big communities”. In the igraph implementation of the Walktrap algorithm, k = 4 is the default value. To each node (project) a set of idents of communities is assigned. The results of the Walktrap algorithm can also be provided in the form of a dendrogram.
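The intuition that short random walks stay inside a community can be illustrated with a small Python sketch. It does not implement Walktrap itself, only the random-walk distance between two nodes on which, to our understanding, (Pons and Latapy, 2006) base their agglomerative merging. The walk length is written `t` here to avoid a clash with the number of LSA dimensions; the function names are ours, the graph is assumed to have no isolated nodes, and the self-loops used in the original paper are omitted.

```python
import math


def walk_distributions(adj, t):
    """Row i of the first result is the distribution of a t-step walk started at node i."""
    n = len(adj)
    degree = [sum(row) for row in adj]     # weighted degree of each node
    step = [[adj[i][j] / degree[i] for j in range(n)] for i in range(n)]
    dist = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        dist = [[sum(dist[i][m] * step[m][j] for m in range(n)) for j in range(n)]
                for i in range(n)]
    return dist, degree


def walk_distance(adj, i, j, t=4):
    """Degree-normalized distance between the t-step walk distributions of i and j."""
    dist, degree = walk_distributions(adj, t)
    return math.sqrt(sum((dist[i][k] - dist[j][k]) ** 2 / degree[k]
                         for k in range(len(adj))))


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the single edge 2-3.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
inside = walk_distance(adj, 0, 1)    # both nodes in the same triangle
across = walk_distance(adj, 0, 5)    # nodes in different triangles
```

Nodes within one triangle end up much closer than nodes from different triangles, so an agglomerative procedure that merges the closest groups first recovers the two triangles as communities.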

2.4 Visualization of the Similarity Graph

Visualization of graph-like data is a traditional task, hence many tools are available for this purpose. In our setting, we visualize a graph represented by the adjacency (similarity) matrix. Nodes correspond to projects and the thickness of the edge connecting two nodes (projects) is proportional to the similarity value. Communities are bounded by shapes in the background.
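One possible way to produce such a picture, sketched here under the assumption that Graphviz is used for rendering (the figures in this paper were produced in R), is to emit a DOT description in which the `penwidth` of an edge is proportional to the similarity and each community becomes a `cluster_` subgraph, which the `dot` layout draws with a bounding box. The function `to_dot` is illustrative and, for simplicity, takes exactly one community id per node.

```python
def to_dot(sim, labels, communities, max_width=5.0):
    """Return Graphviz DOT text: edge width grows with similarity, one cluster per community."""
    lines = ["graph similarity {"]
    for c in sorted(set(communities)):
        members = [labels[i] for i, m in enumerate(communities) if m == c]
        lines.append(f"  subgraph cluster_{c} {{")
        lines.extend(f'    "{name}";' for name in members)
        lines.append("  }")
    # Largest similarity in the matrix; fall back to 1.0 if all values are zero.
    top = max(max(row) for row in sim) or 1.0
    for i in range(len(sim)):
        for j in range(i + 1, len(sim)):
            if sim[i][j] > 0:
                width = max_width * sim[i][j] / top
                lines.append(f'  "{labels[i]}" -- "{labels[j]}" [penwidth={width:.2f}];')
    lines.append("}")
    return "\n".join(lines)


# Placeholder labels and similarities, for illustration only.
sim = [
    [0.0, 0.8, 0.0],
    [0.8, 0.0, 0.4],
    [0.0, 0.4, 0.0],
]
dot = to_dot(sim, labels=["A", "B", "C"], communities=[0, 0, 1])
```

The resulting text can be saved to a file and rendered with the Graphviz command line tools.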

2.5 Identifying Typical Projects in the Given Set

For identifying important nodes in a social network, several measures of centrality, such as degree centrality, betweenness centrality or eigenvector centrality, have been introduced (Ruhnau, 2000). Since we deal with graph structures, we can apply them in our setting as well.

In order to select typical projects of a given set, we use eigenvector centrality. We demonstrate the idea behind this measure in the original social network setting: a person is more central if they are related to other persons who are themselves central; therefore the centrality of a given node depends not only on the number of its adjacent nodes, but also on their centrality values. Transferring this idea into our “project-similarity environment”, projects with a high eigenvector centrality are similar to a large number of projects that are themselves similar to many projects – hence we can treat them as characteristic representatives of a given set. From the opposite point of view, low values of betweenness centrality indicate that a given project is an outlier in the sense of similarity.

The computation of eigenvector centrality is based on the eigenvalues and eigenvectors of the adjacency matrix and can be found in (Ruhnau, 2000) or (Bonacich, 1972). Again, these computations are implemented in relevant programming and data manipulation languages including R or Python.
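For illustration, eigenvector centrality can be approximated by power iteration, as in the following plain Python sketch. It is not the library routine used in this work; the function name, the iteration limit and the tolerance are our choices. The result is scaled so that the most central node has value 1.

```python
def eigenvector_centrality(adj, iterations=1000, tol=1e-10):
    """Approximate the dominant eigenvector of a symmetric, non-negative adjacency matrix."""
    n = len(adj)
    x = [1.0] * n
    for _ in range(iterations):
        # Iterating with (A + I) instead of A avoids oscillation on bipartite graphs
        # and has the same dominant eigenvector as A.
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        top = max(y)
        y = [v / top for v in y]       # scale so the largest entry equals 1
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


# Star graph: node 0 is connected to nodes 1, 2 and 3.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
centrality = eigenvector_centrality(star)
ranking = sorted(range(len(star)), key=lambda i: centrality[i], reverse=True)
```

In the star graph the hub obtains centrality 1 and every leaf 1/√3, so the hub is ranked first; in the project setting the top-ranked nodes would be reported as typical projects.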

3 EXAMPLES OF RESULTS AND POSSIBLE INTERPRETATION

At this “proof-of-concept” stage, the methodology was applied to real-world data, namely data about research projects funded from public sources in the Czech Republic. This choice was made because the data are easy to obtain in a suitable format. All research projects funded by any public provider in the Czech Republic have to be registered in the ISVAV system (Information System of Research, Experimental Development and Innovations) run by the Czech authorities. It has gathered information about all R&D projects since the mid-1990s and currently contains data about more than 42 000 projects. The system provides a web interface for querying and filtering by different criteria. Results can easily be exported in the form of zipped HTML files containing a single HTML table with the considered items (data and metadata of projects). For the purposes of this paper it is also an advantage that these data sets are probably not known to a wide community, hence they constitute a good source for experiments in data exploration.

As an example we have chosen innovation and research projects in the Informatics, Computer Science branch that were running during 2014. This dataset contains 157 projects. An example of the content of one plaintext file – project (ident: TA02010182, title: “Intelligent Library - INTLIB”) – is provided below:

Intelligent Library - INTLIB / processing

of technical data - self-learning system -

ontologies - data semantics - Linked Data

/ The aim of the project is creation of a

certified methodology and a self-learning

system for processing of semantics of

technical documents and respective semantic

searching. In particular we will focus

on processing of legislative documents and

documents from the area of environment. We

will utilize and connect results from areas of

linguistics, data mining, databases, Linked

Data, user interfaces etc. and we will

create a SW that will have both theoretical

background and practical application.

3.1 Selected Features of the Implementation

The implementation was done within the R environment. The widely known libraries lsa and igraph were used. The whole code, excluding the preprocessing scripts, contains fewer than 100 lines.

ISVAV: http://www.isvav.cz

KITA 2015 - 1st International Workshop on the design, development and use of Knowledge IT Artifacts in professional communities and aggregations

Figure 1: Overview of the projects space.

Key functions used are:

• Corpus – from package tm – for creating a corpus from text files in a given directory

• lsa – from package lsa – for constructing LSA space

• walktrap.community – from package igraph – for computing communities

• evcent – from package igraph – for obtaining the eigenvector centrality of each node.

The overall result with marked communities can be seen in Figure 1. Due to the high number of elements it serves for getting a first impression; in practice it is reasonable to work with it interactively and in color. Marked communities correspond to different disciplines of informatics/computer science/IT.

For instance, project TA02010182 belongs to a three-element community that also contains project TD020277 (title: “Public sector budgetary data in the form of Open Data”, keywords: Public sector - Open Data - Linked Open Data - public sector budgetary data) and project TD020121 (title: “Publication of statistical yearbook data as Open Data”, keywords: Linked open data - public pension statistics - presentation of data - predictive modelling - data transformation - public administration - Open Government). Roughly speaking, this community can be described as the Linked Open Data group.

After obtaining the similarity graph, following our methodology, we computed the eigenvector centrality of each node (project) and selected the top five. In our case, the five projects with the highest eigenvector centrality are focused on algorithms, graphs and complexity (in the figure they all belong to the big community on the left):

1. GA14-10003S – Restricted computations: Algorithms, models, complexity

2. GA13-03538S – Algorithms, Dynamics and Geometry of Numeration systems

3. GA14-03501S – Parameterized algorithms and kernelization in the context of discrete mathematics and logic

4. GA13-21988S – Enumeration in informatics and optimization

5. GP14-13017P – Parameterized Algorithms for Fundamental Network Problems Related to Connectivity

The projects can be inspected using ISVAV: http://www.isvav.cz/projectDetail.do?rowId=ABC, where ABC stands for the ident of the project, e.g. TD020121

According to the opinion of experts, these fields belong to the priorities of computer science in the Czech Republic.

4 CONCLUSION AND FURTHER WORK

We have proposed a software tool for visualizing the structure of collections of research projects with respect to their content similarity. The approach is based on the application of latent semantic analysis and it can be easily implemented in R or Python. The results are easy-to-understand images/graphs that provide a quick overview of the considered set of projects. In future, this visualization tool

Communities of similar projects can be subsequently elaborated: reports in the form of lists of institutions/researchers participating in the projects of a community can also be generated.

Plans for further work include the development of evaluation methods and improvements that concern mainly:

• Experimenting with Different Representations of Projects: in this experiment we use only titles, keywords and abstracts. We will investigate the influence of using more textual data – full proposals, descriptions of project results (abstracts of papers assigned to the project etc.)

• Other Methods of Calculating Similarity: when a big corpus of textual data is available, we will use the word2vec model (Mikolov et al., 2013) for similarity computations

• Enriching the Visualization by Additional Data: the size of a node can be proportional to the budget of the project, the opacity of a node can represent the value of a certain centrality measure in the graph, and the classification of a project (fundamental/applied research etc.) can be represented by different colors

• Employing External Data Sources: in our work, the edges represent content similarity. We can also add an additional layer where edges (in a different color) represent other connections among projects (e.g. an edge can link a pair of projects having a common institution as a participant).
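The additional layer mentioned in the last item could be derived as in the following sketch. The input format (a mapping from project ident to a set of participating institutions), the function name and all names in the example are hypothetical placeholders and do not come from the ISVAV data.

```python
from itertools import combinations


def shared_institution_edges(participants):
    """Map each pair of projects to the set of institutions they have in common.

    participants: dict mapping a project ident to a set of institution names.
    Pairs without a common institution are left out.
    """
    edges = {}
    for a, b in combinations(sorted(participants), 2):
        common = participants[a] & participants[b]
        if common:
            edges[(a, b)] = common
    return edges


# Placeholder idents and institution names, for illustration only.
participants = {
    "P1": {"Institution A", "Institution B"},
    "P2": {"Institution B"},
    "P3": {"Institution C"},
}
extra_layer = shared_institution_edges(participants)
```

Each key of `extra_layer` is an edge of the second layer, which can be drawn in a different color on top of the similarity edges.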

4.1 Other Possible Applications

The application of the proposed tool is not limited to the domain of projects. Analogously, it can be used for grouping patent proposals etc. In the R&D environment, other possible applications are:

• Exploration of the structure of research institutions: each institution can be represented as a plaintext file containing titles, keywords and abstracts of the projects in which the institution has participated

• Project reviewer matching and/or expert search: in our setting it is not necessary that all entities are of the same type. We can analogously represent researchers together with projects (by lists of titles of their publications and keywords as in (Trigo and Brazdil, 2014)) and calculate mutual similarities of the type “researcher-project (proposal)”. Researchers that have the highest similarity to a given project proposal can be considered potential reviewers (after satisfying possible constraints such as “independence of the researcher from the reviewed project”). This principle can also be applied to searching for experts for a newly prepared project.
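The reviewer matching idea can be sketched in a few lines of Python. For brevity the sketch uses plain bag-of-words cosine similarity instead of the LSA space, which is a simplification; the names `cosine` and `rank_reviewers`, the `excluded` argument standing for the independence constraint, and the toy researcher profiles are all illustrative assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_reviewers(proposal, researchers, excluded=()):
    """Rank researcher profiles by similarity to a proposal text, skipping excluded names."""
    target = Counter(proposal.lower().split())
    scores = {name: cosine(target, Counter(text.lower().split()))
              for name, text in researchers.items() if name not in excluded}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder researcher profiles built from publication titles and keywords.
researchers = {
    "R1": "linked open data ontologies",
    "R2": "parameterized algorithms graphs",
    "R3": "open data statistics",
}
proposal = "linked open data for public sector"
# R3 is assumed to take part in the proposal and is therefore not independent.
candidates = rank_reviewers(proposal, researchers, excluded={"R3"})
```

The first entries of `candidates` are the suggested reviewers; the same ranking without the exclusion list can serve as a simple expert search.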

REFERENCES

Bonacich, P. (1972). Factoring and weighting approaches to status scores and clique identification. Journal of Mathematical Sociology, 2(1):113–120.

Brazdil, P., Trigo, L., Cordeiro, J., Sarmento, R., and Valizadeh, M. (2015). Affinity mining of documents sets via network analysis, keywords and summaries. Oslo Studies in Language, 7(1).

Combe, D., Largeron, C., Egyed-Zsigmond, E., and Géry, M. (2010). A comparative study of social network analysis tools. In International Workshop on Web Intelligence and Virtual Enterprises, volume 2, page 1.

Feldman, R. and Sanger, J. (2007). The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press.

Ma, J., Xu, W., Sun, Y.-h., Turban, E., Wang, S., and Liu, O. (2012). An ontology-based text-mining method to cluster proposals for research project selection. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 42(3):784–790.

Magerman, T., Van Looy, B., and Song, X. (2010). Exploring the feasibility and accuracy of latent semantic analysis based text mining techniques to detect similarity between patent documents and scientific publications. Scientometrics, 82(2):289–306.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Pons, P. and Latapy, M. (2006). Computing communities in large networks using random walks. J. Graph Algorithms Appl., 10(2):191–218.

Rehurek, R. (2008). Semantic-based plagiarism detection.

Ruhnau, B. (2000). Eigenvector-centrality - a node-centrality? Social Networks, 22(4):357–365.

Trigo, L. and Brazdil, P. (2014). Affinity analysis between researchers using text mining and differential analysis of graphs. ECML/PKDD 2014 PhD session Proceedings, pages 169–176.

Wirth, R. and Hipp, J. (2000). CRISP-DM: Towards a standard process model for data mining. In Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, pages 29–39. Citeseer.
