Retrieval, Visualization and Validation of Affinities between Documents
Luís Trigo¹, Martin Víta², Rui Sarmento¹ and Pavel Brazdil¹
¹LIAAD – INESC TEC, Porto, Portugal
²NLP Centre, Faculty of Informatics, Botanická 68a, 602 00, Brno, Czech Republic
Keywords:
Information Retrieval, Knowledge Artifacts, Graph-based Representation of Documents, Centrality Measures,
Affinity Network, Comparison of Rankings.
Abstract:
We present an Information Retrieval tool that facilitates the task of a user searching for particular information of interest. Our system processes a given set of documents to produce a graph, where nodes represent documents and links represent similarities between them. The aim is to offer the user a tool to navigate this space in an easy way. It is possible to collapse/expand nodes. Our case study shows affinity groups based on the similarities in the text production of researchers. This goes beyond the already established communities revealed by co-authorship. The system characterizes the activity of each author by a set of automatically generated keywords and by membership of a particular affinity group. The importance of each author is highlighted visually by the size of the node, corresponding to the number of publications, and by different measures of centrality. Regarding the validation of the method, we analyse the impact of using different combinations of titles, abstracts and keywords on capturing the similarity between researchers.
1 INTRODUCTION
A great part of today's knowledge appears in the form of documents that can be accessed via the Internet or directly on web pages. These knowledge sources can be accessed easily. The aim of Information Retrieval (IR) is to facilitate the task of a user searching for particular information of interest. The information is normally returned in the form of a ranked list of documents, which is not very helpful if the number of documents is large and the user does not have a clear picture of what he or she is looking for. Additional techniques, like the characterization of documents by keywords or brief snippets, help the user to focus on an item of interest or else to skip it. Another technique that is sometimes employed is clustering the documents as they are being returned. This way the user can focus on a relevant cluster while searching for the relevant item.
Our system starts by processing the documents in a manner that is common to many text mining (TM) tasks. To facilitate the processing, normally only parts of the documents are used, such as titles or abstracts, as these are sufficient to get an idea of what the document is about. The usual bag-of-words (BoW) representation has been adopted.
In the next stage, we calculate the similarity among the individual documents and represent the resulting information in the form of a graph. This can be shown to the user for inspection. The aim is to offer the user a tool to navigate the space of documents in an easy way.
It is clear that larger graphs represent a problem, as the user may easily get lost. This is one of the reasons why the graph is processed to discover what we call affinity groups. The user can thus consider the node of interest (e.g. representing a particular document or set of documents) and identify the affinity group that this node belongs to. Our system enables the identification of all other nodes with the greatest affinity score. If a node represents, for instance, some author, it is possible to identify all authors that work on similar topics (as judged by words in the BoW representation).
As was mentioned before, inspecting large graphs
may be a bit tedious. Therefore the system offers the
facility to condense all nodes of a given affinity group
into a single metanode. This is useful for obtaining a
global perspective of the full network or to obliterate
those groups that are of lesser interest to the user. Any
condensed nodes can be expanded at will.
One special facet is provided that helps the user
to determine whether a particular node is important
or not. This is done by attaching keywords to each
affinity group. The user can thus scan these keywords
and determine whether the group is of interest or not
and condense/expand the details accordingly.
It is common knowledge that not all nodes in a graph are equally important. Various graph-based measures, such as betweenness and eigenvector centrality, were introduced to enable the identification of the important nodes. Our prototype software uses these measures with exactly this aim. As the absolute values are not meaningful in themselves, we provide a relative value in a boxplot. This again helps to make search/retrieval more effective.
The methods described are general and applica-
ble to many diverse domains. These can include doc-
uments describing R&D projects, legal documents,
court cases or medical procedures.
The case study we present in this paper used these techniques to analyse affinities between researchers. This enables uncovering relations that go beyond the already established communities revealed by co-authorship networks.
As a validation step, we investigate the impact of including additional information in the researchers' profiles: besides titles, we consider adding paper keywords and abstracts. The key question we wish to answer is whether and how the computed similarity among researchers changes if more information is taken into account. In other words, the question is whether the keywords and abstracts provide some additional value for dealing with similarities. In order to evaluate three types of "publication-based" profiles (titles only, titles+keywords and titles+keywords+abstracts), the computed similarities were compared to real-world data: quantified opinions about similarity among researchers provided by the researchers themselves. These real-world data were obtained via questionnaires disseminated within a concrete research institute. Our results show that adding the keywords accompanying articles is beneficial, but also adding abstracts does not seem to lead to further improvements.
1.1 Our Prototype in the Light of Knowledge Artifacts
As some researchers (Goldstone and Rogosky, 2002) have pointed out, the meaning of a concept (in our work, a "document") depends on its relationship with the other concepts in the conceptual framework. Our prototype (Gallicyadas, 2015) can be seen as a representational knowledge artifact aiming to give the user a broad sense of the document space, enabling fast browsing and efficient decision support.
Humans are very efficient at processing visual information and obtaining insights regarding properties and their relationships, and this enables an Augmented Intelligence approach (Schmitt, 1998). Thus, in the context of searching for learning objects (documents) (Bednar et al., 2007), the simple relational model providing content and its contextual material (the related content) enables a good user interpretation of the given knowledge space. Regarding the view that documents represent users in a community, our system may be seen as capturing a structured representation aiming to identify potential collaborations/knowledge sharing in some collective. New knowledge potentiates new practices and their exploitation in a spiral (Nissen et al., 2007).
2 RELATED WORK
Considering that our case study targets the academic
domain, the related work discussed here was chosen
accordingly.
The discovery of similarities between researchers was addressed before (Price et al., 2010), with the aim of facilitating the process of distributing papers to reviewers. Their web-based methodology, called SubSift, retrieves researchers' profiles based on their publications. These profiles enable a typical Information Retrieval task: the papers submitted to a scientific conference (playing the role of the query in IR) are compared with the different profiles, in order to optimize the task of attributing articles to suitable reviewers.
Another application in the academic field considers the curricula organization of some courses and analyses "communities" and the centrality of their learning units (Víta et al., 2015).
Regarding the process of automatic extraction of publications for each researcher, there are some challenges that were addressed by others (Bugla, 2009). Beyond the bibliographic sources, the main issue in the retrieval of publications is name disambiguation, which helps to overcome two problems. The first one involves attributing a publication to someone else with the same name. The second one is failing to attribute a publication to the correct person simply because he or she used a different variant of the name. One of the techniques to determine whether a given publication of P in some bibliographic database should be attributed to person P' on a given site involves a check of whether both (i.e. P and P') have the same home institution. Such strategies are followed in a web application used by the University of Porto (Authenticus, 2014), which provides a compilation of authors' publications from several major bibliographic databases (ISI, SCOPUS, DBLP, ORCID and Google Scholar).
There is also work on recommender systems for academic papers. In the literature, the relationships between academic papers are often represented by graphs that include authors in the nodes (Arnold and Cohen, 2009), (Zhou et al., 2008). A content-based approach was proposed, including terms from the papers' titles in the graph (Lao and Cohen, 2010). More recently, some approaches build graph networks representing papers that are connected through citations (Baez et al., 2011), (Küçüktunç et al., 2012). Others (Lee et al., 2013) use a collaborative filtering approach to recommend papers (items) to researchers (users). AMRec (Huang et al., 2013) extracts concepts from academic corpora that are categorized as tasks and methods, as well as their relations. These are processed to provide recommendations of methods to researchers.
3 METHODOLOGY
In our practical application we considered that each document includes the list of publication titles of a particular author/researcher. This section presents the main steps undertaken to uncover the unknown information regarding affinities, as well as its validation. The method involves the following steps:

- identify institutions and obtain researchers' names;
- use web/text mining to process researchers' publications;
- discover potential communities linked by affinities;
- identify important nodes (researchers) in the graph;
- characterize nodes using keywords;
- compare the rankings obtained from dissimilarity matrices with the rankings obtained via questionnaires.
The first step is to select the institutions and obtain the researchers' names from their webpages. This information can be extracted by an expression in the XPath query language. Each researcher's name can then be used to search the chosen bibliographic database, such as DBLP, which enables direct access to each researcher's list of publications. Matching author names between the institution's website and the bibliographic databases raises the name ambiguity problem, which was addressed before (Bugla, 2009). The publications of each author were kindly provided by the Authenticus team, which in fact eliminated the need to use web mining.
Publication titles are stored in plain text files/documents, each representing a particular author. The text files are retrieved and preprocessed in the usual manner. We use the bag-of-words (BoW) vector representation (Feldman and Sanger, 2007) and perform the usual preprocessing, including removal of numbers, stop-words, punctuation and other spurious elements. After this task, the list of documents is transformed into a document-term matrix representation, each line (document) representing a vector of its terms, with tf-idf weighting. The vector representation is used to obtain the cosine similarity matrix. This matrix can be visualized in the form of an affinity graph and is used as the basis for further processing.
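As an illustration, this pipeline can be approximated in a few lines of Python. The following is a minimal sketch assuming scikit-learn; the directory of per-author text files (authors/) is a hypothetical placeholder, and the prototype itself does not necessarily use this library.

```python
# Sketch of the BoW/tf-idf/cosine pipeline described above (assumes
# scikit-learn; the authors/ directory is a hypothetical placeholder).
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One plain text file per researcher, holding his/her publication titles.
docs = [p.read_text() for p in sorted(Path("authors").glob("*.txt"))]

# Bag-of-words with tf-idf weighting; the token pattern drops numbers and
# punctuation, and the built-in list removes English stop-words.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english",
                             token_pattern=r"[a-zA-Z]{2,}")
dtm = vectorizer.fit_transform(docs)   # document-term matrix

sim = cosine_similarity(dtm)           # cosine similarity (affinity) matrix
```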
The affinity network enables the calculation of some measures of the importance of individual researchers. Two centrality measures (Iacobucci, 1994) were extracted from the graph. The betweenness centrality indicates the number of times a vertex lies on the shortest path between two other vertices. The eigenvector centrality gives more importance to nodes that are connected to the most influential nodes.
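A sketch of how these two measures could be obtained with python-igraph, reusing the sim matrix from the previous sketch; the 0.1 pruning threshold and the similarity-to-distance conversion are our assumptions, not details prescribed by the prototype.

```python
# Build the weighted affinity graph and extract the two centrality
# measures (python-igraph assumed; the 0.1 threshold is illustrative).
import igraph as ig

n = sim.shape[0]
edges, weights = [], []
for i in range(n):
    for j in range(i + 1, n):
        if sim[i, j] > 0.1:            # keep only non-negligible links
            edges.append((i, j))
            weights.append(float(sim[i, j]))

g = ig.Graph(n=n, edges=edges)
g.es["weight"] = weights

# igraph treats edge weights as path costs for shortest paths, so the
# similarities are inverted into distances for betweenness (assumption).
betweenness = g.betweenness(weights=[1.0 / w for w in weights])
eigenvector = g.eigenvector_centrality(weights="weight")
```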
For the affinity group extraction task, we have
selected the Walktrap algorithm (Pons and Latapy,
2005). This technique finds densely connected sub-
graphs, also referred to as communities, through ran-
dom walks. It assumes that short random walks tend
to stay in the same community.
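With the weighted graph g from the sketch above, the Walktrap communities can be obtained directly from python-igraph's implementation of the algorithm:

```python
# Affinity group extraction with Walktrap (Pons and Latapy, 2005);
# random walks of 4 steps are the igraph default.
dendrogram = g.community_walktrap(weights="weight", steps=4)
communities = dendrogram.as_clustering()
for k, members in enumerate(communities):
    print(f"affinity group {k}: researchers {members}")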
Furthermore, the TextRank algorithm (Mihalcea and Tarau, 2004) was used to extract keywords from the existing text (publication titles).
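The following is a simplified TextRank-style sketch: words co-occurring within a sliding window are linked in a graph and ranked by PageRank. The original algorithm additionally filters candidate words by part of speech and merges adjacent keywords into multi-word phrases, which is omitted here; the function name is ours.

```python
# Simplified TextRank-style keyword extraction: co-occurrence graph of
# words within a sliding window, scored with PageRank (python-igraph).
import re
import igraph as ig

def textrank_keywords(text, window=3, top=10):
    words = re.findall(r"[a-z]{3,}", text.lower())
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    pairs = set()
    for i, w in enumerate(words):
        for u in words[i + 1:i + window]:   # co-occurrence window
            if u != w:
                pairs.add(tuple(sorted((index[w], index[u]))))
    g = ig.Graph(n=len(vocab), edges=list(pairs))
    scores = g.pagerank()                   # rank words by centrality
    ranked = sorted(zip(scores, vocab), reverse=True)
    return [w for _, w in ranked[:top]]
```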
The validation process compares the rankings re-
trieved from the graph and rankings obtained via
questionnaires. More details on validation can be
found in section 5.
4 CASE STUDY
Our method and the corresponding prototype use data involving 120 researchers from seven units of a Portuguese R&D institution (INESC-TEC, 2015) and their 4153 publications.
In the main screen, the prototype presents the
affinity network of the 7 R&D units/centers consid-
ered. Further exploration and browsing features are
presented below.
In the graph, nodes represent researchers and links represent similarities. There are several ways to infer the importance of a particular author/node. The most immediate visual clues are the node size, which is proportional to the number of the author's publications, and the number and quality of connections to other researchers.
The strength of the similarity between researchers is represented by the thickness of the edge. The node border color identifies the affiliation unit and the core color identifies the affinity group. If the user selects one of the research units in the left-hand drop-down menu, the corresponding network is shown in the canvas. Fig. 1 shows the network of one R&D unit and its three affinity groups, discriminated by the core color of the nodes.
Figure 1: Network of an R&D unit/center.
The other graphic elements for this relevance assessment are the betweenness and eigenvector centrality boxplots (Fig. 2), which show the relative position of the author with respect to the others.
Figure 2: Betweenness and eigenvector centrality boxplots for the selected researcher.
The user has several options to select the node(s) he or she is looking for. This can be done either visually, by clicking on a node, or by textual search, typing the researcher's name in the main tab or keywords in a secondary tab. Each node (researcher) is characterized by keywords. Keyword descriptors for the selected author are presented in the left-hand side panel, and keyword descriptors of the researcher's affinity group are presented on the right-hand side (Fig. 3).
Figure 3: Keyword descriptors for a selected researcher in the left-hand side panel and for his/her affinity group on the right-hand side.
The quality of the characterizing keywords generated by our prototype is quite reasonable. So far, we have performed an informal evaluation, simply comparing the generated keywords with the keywords extracted from researchers' web pages. For instance, for the selected author, the keywords that were collected from his webpage (Data Mining and Decision Support; Knowledge Discovery from Data Streams; Artificial Intelligence) have a significant overlap with the ones that were automatically extracted. We plan to carry out a more thorough quantitative study later, using conventional term overlap metrics (e.g. precision, recall), as sketched below.
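Such an evaluation could take the following form. This is only a sketch under the assumption that both keyword sets are available as plain lists; the function name is hypothetical.

```python
# Term-overlap evaluation of extracted keywords against a reference set.
def overlap_metrics(extracted, reference):
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)       # keywords found in both sets
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall
```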
Beyond the visual clues in the network canvas, the interface presents two pie charts that give insight into the distribution of the members that belong to the same affinity group and R&D unit as the selected researcher.
Figure 4: Searching network authors by keyword.
The prototype also includes extra tabs at the top. One of the tabs enables searching for the researchers that are characterized by some particular keyword. Thus, finding authors associated with specific keywords or research areas is simple and intuitive. Fig. 4 shows a specific search by the keyword parallel.
There is an additional aspect that we would like to mention here: folding and unfolding. Fig. 5 shows the folding of the full network into the existing affinity groups. On the right-hand side, the network is folded, hiding many of the details of the full network. The size of the nodes is proportional to the number of elements of the affinity group, and the width of the connections/edges is determined by the mean of the similarity weights of the connections between the corresponding groups.
Figure 5: Unfolded and folded version of the full network.
5 VALIDATING RETRIEVAL
Our prototype used only paper titles for generating the similarity matrix. In this section we describe the work carried out to validate our method and to test whether adding abstracts and paper keywords improves the results. The validation step was applied to data about the Institute for Fuzzy Modeling and Application (IRAFM, 2015), University of Ostrava, Czech Republic, which has slightly more than 30 researchers, including Ph.D. students. About two thirds of the researchers completed the questionnaire.
The source of the data about researchers' publications is the Information System of the Research, Experimental Development and Innovations (ISVAV, 2015), run by the Czech authorities. It gathers information about all R&D results over more than 10 years. The advantage of using this data source is that we do not need to deal with the traditional IR problems of different variants of researchers' names and identical names of different persons. Names are stored in a normalized variant and researchers have a unique ID, with some exceptions that were handled manually.
We have focused this task on the graph that was automatically generated, analysing each node. The similarities of each node to the other nodes are compared to the similarities obtained from the questionnaires.
5.1 Computing the Dissimilarity Matrix for Each Corpus – Preprocessing Issues
For each researcher we have generated a single plain text file that contains:

- titles (T);
- titles plus keywords (T+K);
- titles plus keywords plus abstracts (T+K+A).

The three corpora T, T+K and T+K+A consist of the corresponding sets of files, one file per researcher. Using the preprocessing tools contained in the tm package, a standard sequence of preprocessing steps was applied to these text data (transformation to lowercase; then punctuation, numbers and white spaces were removed).
Analogously to previous work (Brazdil et al., 2015), we have chosen a bag-of-words representation for the documents in each collection (corpus), and a document-term matrix (DTM) with tf-idf weighting (Feldman and Sanger, 2007) was generated. We thus obtained one DTM per corpus (T, T+K, T+K+A), and from those, three dissimilarity matrices were computed using the cosine similarity. Values were rounded to two decimal digits and values lower than a certain threshold were replaced by zeros.
By inspecting the dissimilarity matrices, we can assign a list of the k most similar researchers to a given researcher. The setting for the further analysis is that we have a list of the top-5 most similar researchers obtained from the questionnaire (i.e., the "true ordering") and three top-5 lists of most similar researchers obtained from the corresponding dissimilarity matrices (T, T+K, T+K+A). Hence, we are able to compare the true orderings/rankings with our "computed" ones and evaluate them.
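A hedged NumPy sketch of how such a top-5 list could be derived from a dissimilarity matrix; one row per researcher is an assumed layout, and the function name is ours.

```python
# Derive the top-k most similar researchers from a dissimilarity matrix.
import numpy as np

def top_k_similar(dissim, person, k=5):
    d = np.round(dissim[person].astype(float), 2)  # two decimal digits
    d[person] = np.inf                  # exclude the researcher itself
    return list(np.argsort(d)[:k])      # smallest dissimilarity first
```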
5.2 Ranking Comparison Measures
For comparing these pairs of rankings, we have used three measures. Before introducing them, a small example will show how the rankings have been obtained.
Let us assume that we have the following ordering ("true ordering") of the top five elements: $L_1 = (P_1, P_2, P_3, P_4, P_5)$, and a "computed ordering" $L_2 = (P_6, P_2, P_1, P_7, P_4)$. Fig. 6 shows the graphs that are used as the basis for these orderings. For instance, the element $P_1$ precedes $P_3$ in the true ordering, because $Sim(P_1, P_0) = 0.6$ and $Sim(P_3, P_0) = 0.4$, and obviously $0.6 > 0.4$ (here $Sim$ represents similarity).
Let us now examine the measures. The first and simplest measure is the size of the overlap of the two rankings. We used the normalized variant, i.e., the size of the overlap divided by the length of the lists. Thus, as we deal with top-5 lists, we can obtain only six values: 0, 0.2, ..., 1, where 1 is the result for equal lists and 0 for lists having no common elements. Returning to our example, the lists $L_1$ and $L_2$ have 3 elements in common, so the overlap measure has a value of 0.6.
The obvious advantages of this method are its trivial implementation and straightforward interpretation.
Figure 6: "Computed" and "true" similarity graphs that are the basis for the ordering of the top 5 elements.
However, this measure cannot capture the differences
arising from the changes in the ordering within the
(top-five) lists.
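A minimal sketch of the normalized overlap, reproducing the value 0.6 of the running example:

```python
# Normalized overlap of two top-k lists of equal length.
def overlap(l1, l2):
    return len(set(l1) & set(l2)) / len(l1)

print(overlap(["P1", "P2", "P3", "P4", "P5"],
              ["P6", "P2", "P1", "P7", "P4"]))   # prints 0.6
```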
The second measure is based on Spearman's footrule (Bar-Ilan et al., 2006). Since this measure is designed for use on two rankings of the same set, it has been applied to reduced lists, where the non-overlapping elements have been removed from both lists. That is, we have obtained two lists of the same length containing only the elements common to both original lists. Moreover, the mutual ordering of the remaining elements is the same as in the original lists. The reduced lists can be represented as rankings $\sigma_1$ and $\sigma_2$ over the set $\{1, \ldots, |S|\}$, where $S$ is the set of all common elements of the original lists.
Suppose, for instance, we have the two lists $L_1$ and $L_2$ shown earlier. The reduced lists containing only the common elements are $L'_1 = (P_1, P_2, P_4)$ and $L'_2 = (P_2, P_1, P_4)$, where $L'_1$ represents the correct ordering. The variables $\sigma_1$ and $\sigma_2$ represent the ranks of the elements of $L'_1$ in each list. So for $\sigma_1$ we get the obvious ordering $(1, 2, 3)$ and for $\sigma_2$ we have $(2, 1, 3)$.
The value of the Spearman footrule for permutations $\sigma_1$ and $\sigma_2$ can be computed in the following way (Bar-Ilan et al., 2006):

$$Fr^{|S|}(\sigma_1, \sigma_2) = \sum_{i=1}^{|S|} |\sigma_1(i) - \sigma_2(i)| \quad (1)$$

The formula for $Fr^{|S|}$ calculates the sum of the differences of all ranks. In our example, we get $|1-2| + |2-1| + |3-3| = 2$. Obviously, the value of $Fr^{|S|}$ is 0 in case $\sigma_1 = \sigma_2$, i.e. when the lists are identical. To obtain a normalized value of the Spearman footrule, it is necessary to divide $Fr^{|S|}$ by the maximal value $\max Fr^{|S|}$. The maximal value is $\frac{1}{2}|S|^2$ for $|S|$ even and $\frac{1}{2}(|S|-1)(|S|+1)$ for $|S|$ odd.
Let us see what happens in our example. As we have an odd number of elements, we apply the formula $\frac{1}{2}(|S|-1)(|S|+1)$, which gives $\frac{1}{2}(3-1)(3+1) = \frac{1}{2} \cdot 2 \cdot 4 = 4$. In our example, the value of $Sp^{|S|}$ is $1 - Fr^{|S|}/\max Fr^{|S|} = 1 - 2/4 = 0.5$.
The value of the correlation $Sp^{|S|}$ is calculated as follows:

$$Sp^{|S|}(\sigma_1, \sigma_2) = 1 - \frac{Fr^{|S|}(\sigma_1, \sigma_2)}{\max Fr^{|S|}} \quad (2)$$
Note that if the two original lists contain two common elements having the same ordering in both lists, but these elements differ in their placement within each list, the value of $Sp$ remains 1.
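A sketch of the normalized Spearman footrule on the reduced lists, reproducing the value 0.5 computed above (the function name is ours):

```python
# Normalized Spearman footrule on the reduced (common-element) lists.
def spearman_footrule(l1, l2):
    common = [x for x in l1 if x in l2]        # reduced list, l1 order
    r1 = {x: i + 1 for i, x in enumerate(common)}
    r2 = {x: i + 1 for i, x in enumerate([x for x in l2 if x in r1])}
    fr = sum(abs(r1[x] - r2[x]) for x in common)
    s = len(common)
    max_fr = s * s // 2 if s % 2 == 0 else (s - 1) * (s + 1) // 2
    return 1 - fr / max_fr

print(spearman_footrule(["P1", "P2", "P3", "P4", "P5"],
                        ["P6", "P2", "P1", "P7", "P4"]))   # prints 0.5
```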
The last measure is an extension of the previous method, described in (Bar-Ilan et al., 2006), and is called the G-measure. The main idea of this approach is to assign the rank $k+1$ to any element that does not appear in the top-$k$ list. For two permutations, say $\tau_1$ and $\tau_2$, on the same set containing $n$ elements, the extended metric for the $k$ top elements is defined as

$$F^{(k+1)}(\tau_1, \tau_2) = 2(k-z)(k+1) + \sum_{i \in Z} |\tau_1(i) - \tau_2(i)| - \sum_{i \in S} \tau_1(i) - \sum_{i \in T} \tau_2(i) \quad (3)$$

where $Z$ is the set of common elements (i.e. elements appearing in both top-$k$ lists), $z = |Z|$, $S$ is the set of elements that appear only in $L_1$, and $T$ is the set of elements that appear only in $L_2$. A normalization is needed, and in the case of this measure we obtain the G-measure (Fagin et al., 2003):

$$G^{(k+1)} = 1 - \frac{F^{(k+1)}}{\max F^{(k+1)}} \quad (4)$$

It can easily be proved that the value of $\max F^{(k+1)}$ is $k(k+1)$.
Let us consider our example. To compute $G^{(k+1)}$ for $k = 5$ we use the rule for $F^{(k+1)}$. In our case $z$, i.e. the cardinality $|Z|$ of the set of common elements, is equal to 3, so the first expression $2(k-z)(k+1)$ is equal to 24. The sum of the differences of the ranks of the common elements of both lists is 3. This is because the difference of ranks for $P_1$ is 2, for $P_2$ it is 0 and for $P_4$ it is 1. Then the sum of the ranks of the elements contained only in $L_1$ (the set $S$), i.e. elements $P_3$ and $P_5$, is equal to 8. The sum of the ranks of the elements contained only in $L_2$ (the set $T$), i.e. $P_6$ and $P_7$, is 5. Therefore we have $F^{(k+1)} = 24 + 3 - 8 - 5 = 14$. Since $\max F^{(k+1)} = k(k+1) = 30$, we get $G^{(k+1)} = 1 - \frac{14}{30}$. After rounding, this gives 0.533.
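The whole computation fits in a few lines; the sketch below reproduces the value of approximately 0.533 obtained above (the function name is ours).

```python
# G-measure of two top-k lists (Bar-Ilan et al., 2006; Fagin et al., 2003).
def g_measure(l1, l2):
    k = len(l1)
    r1 = {x: i + 1 for i, x in enumerate(l1)}
    r2 = {x: i + 1 for i, x in enumerate(l2)}
    common = set(l1) & set(l2)
    f = 2 * (k - len(common)) * (k + 1)            # penalty for non-overlap
    f += sum(abs(r1[x] - r2[x]) for x in common)   # rank differences
    f -= sum(r1[x] for x in set(l1) - common)      # elements only in l1
    f -= sum(r2[x] for x in set(l2) - common)      # elements only in l2
    return 1 - f / (k * (k + 1))                   # normalize by max F

print(g_measure(["P1", "P2", "P3", "P4", "P5"],
                ["P6", "P2", "P1", "P7", "P4"]))   # prints 0.5333...
```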
In conclusion, the value of the G-measure is also influenced by the positions of the common elements in the original lists, while the Spearman footrule value takes into account only the mutual orderings of the common elements in the original lists.
5.3 Evaluation and Results on Rankings
The three ranking measures were applied to the different combinations of document elements; the results are summarized in Table 1. The table shows that adding keywords to the corpus provides slightly better results with respect to all measures, most notably with respect to the Spearman measure. On the other hand, adding abstracts has no impact on two of the three measures; only the G-measure shows a slight improvement.
Table 1: Evaluation of ranking results for the different com-
binations of document elements (T, T+K, T+K+A).
Corpus Overlap Spearman G-measure
T 0.478 0.500 0.426
T+K 0.496 0.533 0.452
T+K+A 0.496 0.533 0.458
We show the distributions of the Sp and G-measure values for several combinations of T, T+K and T+K+A. We note that the distributions differ quite substantially. We observe, for instance, that if we add keywords, the frequency of the values of the Spearman coefficient changes (Fig. 7 and 8).
Figure 7: Distribution of Sp for the T corpus.
Figure 8: Distribution of Sp for the T+K corpus.
When abstracts are added, the frequency of the G-measure values changes too (Fig. 9 and 10).
Figure 9: Distribution of the G-measure for the T+K corpus.
Figure 10: Distribution of the G-measure for the T+K+A corpus.
The values of the G-measure for the T+K and T+K+A corpora are highly correlated; this can be observed in Fig. 11.
Figure 11: Correlation between the G-measures of T+K and T+K+A.
Thus, we can state the hypothesis that adding keywords to titles leads to a significant improvement, while adding abstracts does not make a significant difference. This needs to be confirmed in future work, where more data needs to be collected.
6 CONCLUSION AND FUTURE WORK
The prototype described here enables a better understanding of a large set of documents for various reasons. It portrays the information in the form of a graph which can be inspected visually. Various means are provided to navigate the graph when searching for relevant information. These include compacting/expanding nodes, annotating nodes with automatically generated keywords, and graph centrality measures that show the relative importance of each document. Instead of showing absolute numbers, which are difficult to interpret, the values are relativized and presented in the form of a boxplot.
Scaling up the network presents some challenges for some text mining operations, such as the similarity calculation that is needed for the graph. We plan to adopt a streaming approach to overcome this difficulty. Besides, we also need to address the problem of visualizing large graphs. This could be done by automatically collapsing nodes that are of no interest to the user.
Regarding the validation step, this preliminary study shows that abstracts do not add significant information to the titles. In contrast, keywords seem to make some contribution to the calculation of similarity. Our future work will aim to scale up the questionnaire to more researchers in different environments, in order to gain more confidence in these results.
The methods described are general and applicable to many diverse domains. These can include documents describing R&D projects, legal documents, court cases or medical procedures. One of the authors of this paper applied this methodology to medical curricula (Víta et al., 2015). Our case study focused on one R&D institution, permitting us to obtain potentially important information about possible collaborators for each researcher.
ACKNOWLEDGEMENTS
This work has been partially funded by FCT/MEC through PIDDAC and ERDF/ON2 within project NORTE-07-0124-FEDER-000059, through the COMPETE Programme (operational programme for competitiveness), and by National Funds through the FCT – Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within project FCOMP-01-0124-FEDER-037281. The authors wish to express their gratitude to the Authenticus team for providing the list of titles of publications of INESC TEC researchers needed for this study, and to all members of the Institute for Fuzzy Modeling and Applications, University of Ostrava, Czech Republic, for their cooperation in the process of obtaining the data for validating the system.
REFERENCES
Arnold, A. and Cohen, W. W. (2009). Information extrac-
tion as link prediction: Using curated citation net-
works to improve gene detection. In Proc. of the
3rd Int. Conf. on Weblogs and Social Media, ICWSM
2009, San Jose, California, USA, May 17-20, 2009.
Authenticus (2014). Authenticus bibliographic database.
https://authenticus.up.pt/. Accessed: 2014-09-30.
Baez, M., Mirylenka, D., and Parra, C. (2011). Understand-
ing and supporting search for scholarly knowledge.
In 7th European Computer Science Summit, Milano,
Italy.
Bar-Ilan, J., Mat-Hassan, M., and Levene, M. (2006). Meth-
ods for comparing rankings of search engine results.
Computer networks, 50(10):1448–1463.
Bednar, P., Welch, C., and Graziano, A. (2007). Learn-
ing objects and their implications on learning: A case
of developing the foundation for a new knowledge in-
frastructure. Learning objects: Applications, implica-
tions & future directions.
Brazdil, P., Trigo, L., Cordeiro, J., Sarmento, R., and Val-
izadeh, M. (2015). Affinity mining of documents sets
via network analysis, keywords and summaries. Oslo
Studies in Language, 7(1).
Bugla, S. (2009). Name identification in scientific publi-
cations. Master’s thesis, FCUP, University of Porto,
Portugal.
Fagin, R., Kumar, R., and Sivakumar, D. (2003). Compar-
ing top k lists. SIAM Journal on Discrete Mathemat-
ics, 17(1):134–160.
Feldman, R. and Sanger, J. (2007). The text mining hand-
book: advanced approaches in analyzing unstruc-
tured data. Cambridge University Press.
Gallicyadas (2015). Affinity miner online prototype.
http://gallicyadas.pt/affinity-miner.
Goldstone, R. L. and Rogosky, B. J. (2002). Using relations
within conceptual systems to translate across concep-
tual systems. Cognition, 84(3):295–320.
Huang, S., Wan, X., and Tang, X. (2013). Amrec: An intel-
ligent system for academic method recommendation.
In Workshops at the Twenty-Seventh AAAI Conference
on Artificial Intelligence.
Iacobucci, D. (1994). Graphs and matrices. In Wasserman, S. (ed.), Social Network Analysis: Methods and Applications, pages 92–166. Cambridge University Press, New York.
INESC-TEC (2015). INESC TEC. http://www.inesctec.pt/.
IRAFM (2015). Institute for Fuzzy Modeling and Application. http://irafm.osu.cz/.
ISVAV (2015). Information System of the Research, Experimental Development and Innovations. http://www.isvav.cz.
Küçüktunç, O., Saule, E., Kaya, K., and Çatalyürek, Ü. V. (2012). Recommendation on academic networks using direction aware citation analysis. CoRR, abs/1205.1143.
Lao, N. and Cohen, W. W. (2010). Relational retrieval us-
ing a combination of path-constrained random walks.
Machine Learning, 81(1):53–67.
Lee, J., Lee, K., and Kim, J. G. (2013). Personalized aca-
demic research paper recommendation system. arXiv
preprint arXiv:1304.5457.
Mihalcea, R. and Tarau, P. (2004). TextRank: Bringing Or-
der into Texts. In Conference on Empirical Methods
in Natural Language Processing, Barcelona, Spain.
Nissen, H.-E., Bednar, P., and Welch, C. (2007). Use and
Redesign in IS: Double Helix Relationships? Inform-
ing Science.
Pons, P. and Latapy, M. (2005). Computing communities in
large networks using random walks. In Proceedings
of the 20th International Conference on Computer
and Information Sciences, ISCIS’05, pages 284–293,
Berlin, Heidelberg. Springer-Verlag.
Price, S., Flach, P. A., and Spiegler, S. (2010). Subsift: a
novel application of the vector space model to support
the academic research process. In WAPA, pages 20–
27.
Schmitt, G. (1998). Design and construction as computer-
augmented intelligence processes. Caadria, Osaka.
Víta, M., Komenda, M., and Pokorná, A. (2015). Exploring medical curricula using social network analysis methods. In 5th International Workshop on Artificial Intelligence in Medical Applications, Lodz, Poland.
Zhou, D., Zhu, S., Yu, K., Song, X., Tseng, B. L., Zha, H.,
and Giles, C. L. (2008). Learning multiple graphs for
document recommendations. In Proceedings of the
17th International Conference on World Wide Web,
WWW 2008, Beijing, China, April 21-25, 2008, pages
141–150.