Efficient Academic Retrieval System Based on Aggregated Sources

Virginia Niculescu, Horea Greblă, Adrian Sterca and Darius Bufnea

Computer Science Department, “Babeș-Bolyai” University, 1 M. Kogălniceanu, Cluj-Napoca, Romania

Keywords:

Data Engineering, Retrieval Systems, Recommender Systems, Research Paper Databases, Academic Sources,

Big-Data Processing, NLP, Apache Spark, Graph-Databases.

Abstract:

Owing to the rapid expansion of scientific research paper databases, the use of search and recommender systems in this area has increased, as they help researchers find appropriate papers in enormous indexed datasets. Depending on where the papers are published, there may be strict policies that force the author to also add the needed metadata, but for other venues these metadata are incomplete. As a result, many of the current solutions for searching and recommending papers are biased toward a certain database.

This paper proposes a retrieval system that can overcome these problems by aggregating data from different databases in a dynamic and efficient way. Extracting data from different sources dynamically, and not only statically from a certain database, is important for ensuring a complete interrogation, but at the same time it incurs complex operations that may affect the performance of the system. The performance can be maintained by using a carefully designed architecture that relies on tools that allow a high level of parallelization.

The main original characteristic of the system is the hybrid interrogation of static data (stored in databases) and dynamic data (obtained through real-time web interrogations).

1 INTRODUCTION

Recommender systems have a large number of applications in many fields, including economics, education, and scientific research. On the other hand, data engineering is the practice of designing and building

systems for collecting, storing, and analyzing data at

scale. These systems collect, manage, and convert

raw data into usable information for data scientists

and business analysts to interpret. The goal of these

systems is to make data accessible, and so they enable

subsequent data analysis (Reis and Housley, 2022).

In academic research it is essential to be able to

easily ﬁnd the papers that treat speciﬁc themes in or-

der to have a complete and correct view over the pre-

vious work. In addition, it is very useful to ﬁnd ex-

perts and possible collaborators in speciﬁc domains,

based on their published work.

Paper recommender systems aim to help researchers mitigate information overload and find relevant papers for their research, while author recommender systems can provide collaborator suggestions for researchers and help them find specialists in certain domains.

Authors' ORCID iDs:
https://orcid.org/0000-0002-9981-0139
https://orcid.org/0000-0002-8529-5797
https://orcid.org/0000-0002-5911-0269
https://orcid.org/0000-0003-0935-3243

Nowadays, there are different scientiﬁc databases

that index scientiﬁc papers such as Scopus, Web of

Science, DBLP, Crossref, ACM Digital Library, IEEE

Xplore, Semantic Scholar, Google Scholar, etc. Also,

researchers may share their research ﬁndings and pub-

lications via digital platforms (e.g. arXiv, Research-

Gate) for free for knowledge exchange (Sun et al.,

2014). A complete search implies looking in many (if not all) of these sources in order to identify all the needed information. Many of the current solutions that index scientific papers (e.g., ConnectedPapers (Alex Tarnavsky et al., 2020), OpenCitations (Peroni and Shotton, 2020), SciGraph (SciGraph, 2022)) are usually biased toward a certain scientific database.

Given that the number of published papers grows exponentially, it is hard to keep track of the latest findings in a field of interest. For this reason, the metadata associated with the papers (i.e., information that describes the resource: name, creator, description, date of creation, keywords) is crucial in a modern information system for easily finding the corresponding resources.

Depending on where the papers are published, there may be strict policies that force the author to add the needed metadata, or less strict ones that do not.

Even when the metadata are missing, they remain a key part of an information retrieval system, and it would be useful to extract them automatically when necessary.

Niculescu, V., Greblă, H., Sterca, A. and Bufnea, D.
Efficient Academic Retrieval System Based on Aggregated Sources.
DOI: 10.5220/0011850600003464
In Proceedings of the 18th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2023), pages 436-443
ISBN: 978-989-758-647-7; ISSN: 2184-4895
© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

There are ranked databases that allow publishing

the work even in the manuscript form without even

requiring important meta-ﬁelds, such as the subject,

the abstract, the afﬁliations of the authors. Some

of those can be automatically generated (subject, ab-

stract), while others might be ﬁlled in through a pro-

cess called metadata harvesting which aggregates the

same resource from multiple sources, which might

have those ﬁelds ﬁlled in, or have some additional in-

formation (Gill et al., 2008).

For any recommender or retrieval system it is es-

sential to have fast access to a complete and up-to-

date set of data from which it can extract the most

appropriate solutions.

In addition, systematic literature reviews are increasingly important for solid research, so a system that can easily extract all the relevant results for a particular domain would be extremely useful.

Objectives and Contribution. The main objective of this research is to investigate how to efficiently collect and aggregate academic papers from multiple sources, in order to obtain a complete academic base for academic recommendation. By a complete academic base we mean an interrogation space (which may have static but also dynamic parts) of scientific papers with complete metadata information.

• A purely dynamic approach would imply searching dynamically (using web crawlers) on different sources, aggregating the results, and storing these results only temporarily.

• A purely static approach would imply collecting all the information from the remote sources beforehand and periodically updating the locally stored global database. This implies not only very high initial costs, but also continuous costs required by the periodic updating.

We propose here a hybrid variant that stores in-

ternally data from one source (up to a certain date)

and then uses a dynamic approach to obtain informa-

tion from other sources. In order to demonstrate the

proposed approach we developed a tool called ARRS

(Aggregator Research Retrieval System) that allows

the user to investigate a ﬁeld of interest: see what pa-

pers have already been published on that topic, the au-

thors that worked on a speciﬁc paper and all the other

authors they collaborated with.

We assert that a graph database is appropriate for the system storage because it naturally represents the connections between the data (papers and authors). In addition, the storage is updated and supplemented through every use of the system, so the users of the system are the ones who carry out its maintenance.

Performance is another important concern, which can be addressed by employing parallel computing for the dynamic search and the database updating. For this reason, the metadata normalization process is done on an Apache Spark cluster (Haines, 2022; Apache Software Foundation, 2022).

Most of the open-source solutions (e.g. (Alex Tar-

navsky et al., 2020), (Corporation for Digital Schol-

arship, 2006), (Peroni and Shotton, 2020), (SciGraph,

2022)) that allow a user to query papers on a speciﬁc

topic have behind the scenes an indexed dataset which

is used to deliver a response in real-time. Even though

this aspect greatly increases responsiveness, this also

means that the newest papers might not be present in

this dataset. It is also the case for the tools that rely

entirely on an external data source (e.g.: VOSviewer

(Centre for Science and Technology Studies, )), which

allows the user to build a network of papers dynami-

cally, using API calls.

In contrast, besides using an indexed dataset stored in a graph database, the proposed approach also interrogates multiple academic data sources to update the dataset while each user searches for the topics of interest. Thus, this hybrid approach achieves responsiveness through the use of an indexed dataset, brings in newly published papers through the APIs of the platforms that provide such services (such as IEEE, Scopus and Crossref), and uses crawlers to gather data from sources that do not have an available API.

In terms of efﬁciency, the ARRS was designed as

a scalable solution, by using multiple parallelization

techniques.

At the same time, the system represents a proof of concept for a general hybrid big data system that interrogates static data (stored in databases) and dynamic data (obtained through web interrogations).

Paper Structure. After this first section, which gives a short introduction to the context, objectives and contribution, Section 2 analyzes the related work in the field of academic recommender and search systems, with a special interest in freely accessible web tools. Section 3 describes the design of the proposed solution: the system requirements and functionalities, the architecture and some implementation challenges. The evaluation is based on three criteria (usability, accuracy, and performance) and is reported in Section 4. Finally, Section 5 draws the conclusions and outlines possible directions for future research.


2 RELATED WORK

Recommender systems (also known as recommenda-

tion engines) are tools that offer useful item sugges-

tions based on the user input or proﬁle. Ideally, a

recommender system provides recommendations au-

tomatically by inferring the needs from the user’s item

interactions. Alternatively, the recommender system

asks users to specify their needs by providing a list of

keywords or through some other methods. Even though introducing keywords makes the system more similar to a search engine or a retrieval system, it is estimated that about 80% of the recommender systems require users to either explicitly provide keywords or to provide text snippets (Beel et al., 2016).

In this regard, recommender and retrieval sys-

tems are alike, meaning they both search for relevant

items/documents to the user’s query. Still, a recom-

mender system can also provide a ranked list of sug-

gestions by checking the importance of each resource

found (Ricci et al., 2022) or additional information

that could be discovered from the initial suggestion

list.

The surveys (Beel et al., 2016; Bai et al., 2020)

emphasize the following recommendation techniques

as being the most appropriate approaches in the ﬁeld

of research-paper recommender systems: Stereotyp-

ing, Content-based Filtering, Collaborative Filtering,

Co-Occurrence, Graph-based, Global Relevance, or

Hybrid.

From the same surveys it may be noticed that more

than half of the recommendation approaches applied

content-based ﬁltering (Pazzani and Billsus, 2007;

Caragea et al., 2014), while collaborative ﬁltering and

graph-based (Gori and Pucci, 2006; Xia et al., 2016;

Zhou et al., 2014) recommendations were applied

each in around 15% of the approaches. In addition,

most approaches neglected the user-modeling process

and did not infer information automatically, but let

users provide keywords, text snippets or a single paper as input. In this regard, they are very similar to retrieval systems. During the development of our system, we considered content-based, graph-based, global relevance and hybrid techniques (Burke, 2007).

Following an analysis of the open-source solutions that tackle the problem of academic paper retrieval based on keywords or specific queries, the following tools were identified as representative.

CitNetExplorer is a tool developed at Leiden University that can be used to visualize and analyze

citation patterns of scientiﬁc publications. It uses var-

ious algorithms to detect the connected components,

most relevant papers, shortest and longest paths. It

also allows indirect citation relations to be visible on

the graph, and supports direct import from Web of

Science. Its main advantage is that it can handle large

citation networks (millions of papers and ten times

more relations) (van Eck and Waltman, 2014).

VOSviewer is another tool from Leiden University, used to build and visualize bibliometric networks. The networks can contain many types of publications, and the relations between the nodes can be based on citation, bibliographic coupling or co-authorship. This tool also allows building and visualizing networks that are related to a specific query. It can download data from Web of Science, Scopus, Dimensions, Lens and PubMed. Through the APIs provided by Crossref, Semantic Scholar, OpenCitations and WikiData, it can build co-authorship and co-occurrence networks (Centre for Science and Technology Studies, ). Even though the VOSviewer tool might have full subscriptions to all the APIs above, querying those databases through the API might not be suitable, depending on the required size of the network, because of the high latency.

ConnectedPapers is a visual tool, very easy to use,

that provides a graph overview of the academic land-

scape on a speciﬁc topic. It’s mainly used to discover

relevant prior work on the subject of interest and to

create bibliographies for research papers.

Unlike the previous solutions, ConnectedPapers

doesn’t use a citation tree. It builds a graph in which

the papers are organized depending on how similar

they are. As a result, even if the papers don’t nec-

essarily cite one another, they can still be highly re-

lated and put into the graph close to each other. The

metric used for determining the similarity uses co-

citations and bibliographic coupling, which means

that if two papers have similar references, they’re

most likely related. When building the graph, Con-

nectedPapers searches through approximately 50,000

papers, groups similar papers in clusters, while high-

lighting the popular papers (highly cited) with larger

nodes, and recent papers with darker colors (Alex Tar-

navsky et al., 2020). The only limitation is the fact

that the queried papers are retrieved from a single

database, namely the Semantic Scholar Paper Corpus.

Besides having this single point of failure, the papers

might not be relevant for criteria such as geographic

location.

Zotero, first proposed in 2006, is an open-source tool used to gather, arrange, and analyze research papers. The user can manage a personal library, as well as generate bibliographies, citations, and reports (Corporation for Digital Scholarship, 2006). Zotero can be seen more as a research paper manager than as a recommender system because, although the user can supply RSS feeds and other data sources, the user has to introduce them manually.

OpenCitations is a tool for open scholarship that tackles the publication of open citation data in the form of Linked Open Data; in other words, it builds an open-source citation index. It currently contains information about approximately 14 million citations. The main goal of this tool is to make scholarly bibliographic and citation data widely available to any user (Peroni and Shotton, 2020). Although it provides a huge index of research papers, this tool is more a database than a recommender system, but it can serve well as an input for such a software tool.

SciGraph is a knowledge graph from Springer, which offers linked open data aggregated from the following sources: Springer Nature, Digital Science and Unsilo. This aggregated data generates a rich semantic description of how various pieces of information are related. It can currently hold between 1.5 and 2 billion relations between various types of publications, such as journals, articles, books, and conferences (SciGraph, 2022). SciGraph's dataset is highly biased toward the indexes provided by its partners; there are plans to extend the dataset with more high-quality data.

3 ARRS – FUNCTIONALITY AND DESIGN

The proposed system, ARRS, uses aggregated academic sources in order to extract the relevant papers on a topic specified through a keyword-based interrogation. When the user chooses one of the resulting papers, its authors and their collaborators are extracted and shown as connected graphs that also emphasize the degree of connectivity.

3.1 System Functionality

Papers query.

The keyword interrogation can be given in a simple implicit way, as an enumeration of words (in which case an OR operator is assumed between them), but also in a more complex way, using the operators AND, OR, and NOT.

An example of interrogation is:

"software bot" OR robotic AND software OR "robotic process automation" OR "software robot" NOT "cognitive automation"
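The paper does not state how such a query is parsed, so the following is only a minimal sketch of one possible evaluator. It assumes the common precedence (AND and NOT bind tighter than OR), reads "A NOT B" as "A and not B", treats adjacent terms without an operator as an implicit OR, and matches terms by plain case-insensitive substring search (no stemming). The names `tokenize` and `matches` are hypothetical.

```python
import re

# A token is either a quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"([^"]+)"|(\S+)')
OPERATORS = {"AND", "OR", "NOT"}


def tokenize(query):
    """Split a query into ("TERM", text) and (operator, operator) tokens."""
    tokens = []
    for phrase, word in TOKEN.findall(query):
        if phrase:
            tokens.append(("TERM", phrase.lower()))
        elif word in OPERATORS:
            tokens.append((word, word))
        else:
            tokens.append(("TERM", word.lower()))
    return tokens


def matches(query, text):
    """Return True if the text satisfies the boolean keyword query."""
    tokens = tokenize(query)
    text = text.lower()
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def term():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != "TERM":
            raise ValueError("expected a keyword or a quoted phrase")
        value = tokens[pos][1]
        pos += 1
        return value in text  # assumption: substring match, no stemming

    def and_expr():
        nonlocal pos
        result = term()
        while peek() in ("AND", "NOT"):
            op = tokens[pos][0]
            pos += 1
            right = term()
            result = (result and right) if op == "AND" else (result and not right)
        return result

    def or_expr():
        nonlocal pos
        result = and_expr()
        while peek() is not None:
            if peek() == "OR":
                pos += 1  # explicit OR; otherwise the OR is implicit
            right = and_expr()
            result = result or right
        return result

    return or_expr() if tokens else False
```

Under these assumptions the example interrogation is read as ("software bot") OR (robotic AND software) OR ("robotic process automation") OR ("software robot" and not "cognitive automation"). Substring matching is deliberately crude: for instance, the term bot would also match inside the word robot.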

The result of such a query is a graph of papers, which is also represented graphically on the interface. An edge between two nodes in this papers graph represents one or more common keywords from the query; the more common keywords, the wider the edge (each edge has an associated weight).
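To make the edge weighting concrete, here is a minimal sketch that builds the weighted edges of a papers graph from a dictionary of papers and their keywords. The data layout and the function name `build_paper_graph` are assumptions made for illustration; the paper only states that the weight grows with the number of query keywords two papers share.

```python
from itertools import combinations


def build_paper_graph(papers, query_keywords):
    """papers: dict mapping a title to its set of keywords.

    Returns a dict {(title_a, title_b): weight}, where the weight is the
    number of query keywords that both papers have in common.
    """
    query = {k.lower() for k in query_keywords}
    # Keep, for every paper, only the keywords that also occur in the query.
    matched = {
        title: {k.lower() for k in keywords} & query
        for title, keywords in papers.items()
    }
    edges = {}
    for a, b in combinations(sorted(matched), 2):
        shared = matched[a] & matched[b]
        if shared:  # no common query keyword means no edge
            edges[(a, b)] = len(shared)
    return edges


papers = {
    "P1": {"machine learning", "summarization"},
    "P2": {"machine learning", "summarization", "text"},
    "P3": {"text"},
}
edges = build_paper_graph(papers, ["machine learning", "text", "summarization"])
```

In this toy input, P1 and P2 share two query keywords and P2 and P3 share one, so the first edge would be drawn wider than the second, while P1 and P3 are not connected.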

Authors Network. After the relevant papers are shown, the user may select a specific paper, and then a graph of the corresponding authors and of all their collaborators is shown (a collaborator is an author who co-authored at least one paper with them). In the authors' graph, an edge means co-authorship of at least one paper.

Author’s Papers. On the authors’ graph, if the user

selects any author node (which can be an author cor-

responding to the initial selected paper, or a collabo-

rator), the system will display the papers published by

that author, as well as their metadata. This makes it possible to inspect the fields in which those researchers have worked in general, not only the fields described by the initial query.

3.2 Data Normalisation

Behind the scenes, there is a very important process:

data normalization. The collected data come from

different sources with various representation schemas

and they should be transformed in order to have a

common representation of the associated metadata

(e.g.: title, authors, authors’ afﬁliations, subject, pub-

lisher, publish year, abstract, DOI, type of article). If

for example the keywords are missing, ARRS will try

to extract keywords from the abstract (if available),

or from the title. The collected data are then used to

update the database, either by adding missing infor-

mation for certain papers or just adding new ones.

The system uses NLP (Natural Language Process-

ing) in the searching process (the roots of the key-

words are extracted using a stemming algorithm) and

in the normalisation process, too. For example, NLP algorithms are employed in order to extract keywords from an abstract or from a title.

3.3 Architecture, Design and

Implementation Challenges

Since we propose the creation of a system that can

gather data from multiple sources, normalize them on

a cluster, and aggregate them into a graph database,

this implies a very high level of complexity, and so a

hierarchical decomposition through modularisation is

needed.

In order to provide a high level of scalability, ARRS uses multiple parallelization techniques.


Figure 1: Architecture of the ARRS Retrieval System.

We have chosen Apache Spark in order to improve the performance of the normalization process, as it is appropriate for efficient streaming computation. (Many other recommender systems use Spark's machine learning component, MLlib, for generating recommendations; this could be added in a further development for AI recommendations.) The Spark cluster uses a load balancer to distribute the workload to the worker nodes.

The architecture of the system is depicted in Fig-

ure 1.

Each of the depicted components raised specific implementation challenges, and we emphasise the most important of them in what follows. Many of these challenges were overcome by using appropriate specialized frameworks, but making all of them work together was also a challenge.

Spark Streaming Component. The Spark component handles the data normalization. Once the data normalization and keyword extraction are done, the result is inserted into the graph database.

The Spark component accepts data as a socket text stream. It first has to deserialize the data and apply a flatMap operation to ensure that the result is a single collection, instead of a collection of collections (each operation that processes the data is executed on the Spark worker nodes). Then, after obtaining the deserialized items, a transform operation is triggered on all the RDDs (Resilient Distributed Datasets), which is chained with a map operation per RDD; this map operation calls the function that normalizes the items. After the processing is done, foreach operations are used to call the database manager function that inserts or updates the items.

Apache Spark: https://spark.apache.org
Spark MLlib: https://spark.apache.org/mllib/
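The order of the stages described above (deserialize, flatten, normalize, insert or update) can be sketched without a cluster. The block below is a plain-Python stand-in that mirrors those stages on a single batch of lines; it does not use the Spark API. The JSON-list-per-line format, the field names, and the helpers `deserialize`, `normalize`, `upsert` and `process_batch` are all assumptions made for illustration.

```python
import json
from itertools import chain


def deserialize(line):
    """Assumed wire format: each text line carries a JSON list of raw records."""
    return json.loads(line)


def normalize(record):
    """Hypothetical normalizer mapping source-specific fields to one schema."""
    return {
        "title": record.get("title", "").strip(),
        "doi": record.get("doi") or record.get("DOI"),
        "year": record.get("year"),
    }


def upsert(store, item):
    """Stand-in for the database manager: insert, or fill in missing fields."""
    key = item["doi"] or item["title"].lower()
    existing = store.setdefault(key, {})
    for field, value in item.items():
        if existing.get(field) in (None, "") and value not in (None, ""):
            existing[field] = value


def process_batch(lines, store):
    # Flattening step: a collection of lists becomes one collection of records.
    records = chain.from_iterable(deserialize(line) for line in lines)
    # Mapping step: every record is normalized to the common schema.
    normalized = map(normalize, records)
    # Final step: every normalized item is inserted or merged into the store.
    for item in normalized:
        upsert(store, item)


store = {}
process_batch(
    [
        '[{"title": " Paper A ", "DOI": "10.1/a"}]',
        '[{"title": "Paper A", "doi": "10.1/a", "year": 2022}, {"title": "Paper B"}]',
    ],
    store,
)
```

In the toy batch, the second record for the same DOI only contributes the year that was missing from the first one, which is the kind of metadata completion the paper describes. In the real system these steps run on the Spark worker nodes and the store is the graph database.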

Publish-Subscribe System. Due to the fact that the

results obtained from the crawlers are yielded at dif-

ferent times, there is a need to transform them into a

stream, such that they can be processed when they are

available. To achieve this, a custom publish-subscribe

system based on TCP sockets was created that allows

the Web Sources Aggregator to send its data, as soon

as it becomes available, to the TCP server, which will

then forward it to its subscribers, meaning the Spark

system that builds a stream from the data received

through the TCP socket.
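The wire format of the custom publish-subscribe system is not given in the paper. As a rough sketch, assuming newline-delimited JSON messages, the sending and receiving sides could look as follows. The functions `publish` and `subscribe` are hypothetical, and the intermediate TCP server that forwards messages to its subscribers is left out: a connected socket pair stands in for the two ends of one connection.

```python
import json
import socket


def publish(sock, item):
    """Send one item as a single JSON line (assumed framing)."""
    sock.sendall((json.dumps(item) + "\n").encode("utf-8"))


def subscribe(sock):
    """Yield items one by one as complete lines arrive on the socket."""
    with sock.makefile("r", encoding="utf-8") as stream:
        for line in stream:
            yield json.loads(line)


# A connected pair of sockets stands in for aggregator and subscriber.
sender, receiver = socket.socketpair()
publish(sender, {"source": "Crossref", "title": "Paper A"})
publish(sender, {"source": "arXiv", "title": "Paper B"})
sender.close()  # end of stream for the subscriber

received = list(subscribe(receiver))
receiver.close()
```

Line-based framing fits a consumer that reads a socket as a text stream, since every complete line can be handed over as soon as it arrives, without waiting for the slower sources.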

Web Sources Aggregator. This component is responsible for retrieving papers from a large variety of data sources, such as Crossref, IEEE, ACM, Scopus, ResearchGate and arXiv, in an efficient way, using thread pool executors (employing thread pool executors was essential in order to ensure good performance). This component also allows the use of Web crawlers that simulate the user interaction with a browser to get the data; a specialised framework, Selenium, was used for this purpose.

The Web crawling process is triggered each time a user enters a new query. Even when parallelised, crawling on the spot takes tens of seconds, which hinders the responsiveness of the system. To overcome this, a priority was assigned to each dataset: the highest priority for the database, medium priority for API calls (since they are nearly instantaneous), and the lowest priority for crawling. These priorities determine which data are shown first on the graph, which is regenerated whenever a new dataset becomes available.
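A minimal sketch of this idea, using Python's standard `concurrent.futures` thread pool, is given below. The source names, the stub fetch functions and the `query_sources` helper are assumptions made for illustration; real fetchers would call the database, the APIs or the crawlers. For simplicity the sketch waits for all sources and returns them in priority order, whereas the system described above regenerates the graph each time a dataset arrives.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Lower value means the dataset is shown first (priorities from the text above).
PRIORITY = {"database": 0, "api": 1, "crawler": 2}


def query_sources(query, sources, max_workers=8):
    """sources: list of (name, kind, fetch), where fetch(query) returns a list.

    All sources are queried in parallel; the results are returned ordered by
    the priority of their kind, then by source name.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(fetch, query): (name, kind)
            for name, kind, fetch in sources
        }
        for future in as_completed(futures):
            name, kind = futures[future]
            try:
                papers = future.result()
            except Exception:
                papers = []  # one failing source must not block the others
            results.append((PRIORITY[kind], name, papers))
    results.sort(key=lambda r: (r[0], r[1]))
    return [(name, papers) for _, name, papers in results]


# Stub fetchers standing in for the real database, API and crawler calls.
sources = [
    ("arXiv", "crawler", lambda q: [q + " (crawled)"]),
    ("Crossref", "api", lambda q: [q + " (via API)"]),
    ("neo4j", "database", lambda q: [q + " (stored)"]),
]
ordered = query_sources("text summarization", sources)
```

A thread pool suits this workload because the tasks mostly wait on network input and output, so many of them can overlap even on a single machine.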

Graph Database Manager. Considering the many-to-many relationship between the papers and their authors, as well as the complex join operations that the dataset involves, the chosen graph database was neo4j. Initially, this database is populated with data from Crossref, which contains around 120 million papers. Still, this contains only the work published until 2021, which means that the system also has to query more recent papers by crawling multiple databases, such as IEEE, ACM, Scopus, arXiv, and ResearchGate. Each query produces an update of the database, by filling in missing metadata of some papers (if they were found on the web sources) and by adding the new papers that were found. This way the database is dynamically and implicitly maintained through the usage of the system by the users.

Selenium: https://www.selenium.dev
neo4j: https://neo4j.com

User Interface. The entries selected from the graph

database based on the user’s query are displayed on

the interface, as lists of papers and authors, as well as

graphical representations of the created graphs. When

the user enters a query, ARRS will ﬁrst match the

papers from the database based on three meta-ﬁelds:

the title, the associated keywords, and the abstract.

The result of an interrogation is updated after the web

sources interrogation is ﬁnalized.

In addition to these main components, we empha-

size the following two modules:

Network Analysis Module. The network analysis

module was introduced as a part of ARRS controller

and uses the framework networkx in order to compute

degree, betweenness, closeness and eigenvector cen-

tralities, as well as some metrics such as the diameter,

the average shortest path length and the density of the

graph. It is important to note that all the graphs in the

ARRS use weighted edges, to better highlight what

the most important nodes are. Their weight is com-

puted as the number of common keywords between

two nodes in the papers graph, and the number of col-

laborations between nodes in the authors graph (in

this context, a collaboration means that two authors

wrote a paper together).
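As an illustration of the authors graph, the sketch below computes co-authorship weights and degree centrality in plain Python. ARRS itself relies on networkx for these metrics, so this is not the system's code; the helper names and the author lists are made up for the example. Degree centrality is taken here as the fraction of the other nodes a node is connected to, which is the usual unweighted definition.

```python
from collections import Counter
from itertools import combinations


def coauthor_weights(papers):
    """papers: iterable of author lists.

    The weight of an edge is the number of papers two authors wrote together.
    """
    weights = Counter()
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights


def degree_centrality(weights):
    """Fraction of the other authors that each author is directly linked to."""
    neighbours = {}
    for a, b in weights:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    if n < 2:
        return {}
    return {author: len(adj) / (n - 1) for author, adj in neighbours.items()}


papers = [
    ["Ana", "Bogdan", "Carla"],
    ["Ana", "Bogdan"],
    ["Carla", "Dan"],
]
weights = coauthor_weights(papers)
centrality = degree_centrality(weights)
```

In this toy network Ana and Bogdan collaborated twice, so their edge has weight 2, and Carla is linked to all three other authors, which gives her the highest degree centrality. Authors who only have single-author papers do not appear, since they have no edges.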

NLP Processing Module. During the normalisation process, the extraction of keywords from the abstract or the title is based on NLP, using RAKE, an algorithm that extracts keywords from individual documents by first splitting the text into a vector of words according to a list of delimiters. This vector is then divided into contiguous sequences of words, using the stop words as delimiters. The roots of the words are computed using a stemming function based on the Porter algorithm (Porter, 2006) (e.g.: 'running' becomes 'run'). Stemming allows the same word to match multiple documents which might otherwise have been excluded because they used another form of the word.
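The splitting steps described above can be sketched as a simplified version of RAKE. The block assumes a tiny illustrative stop-word list and the usual RAKE scoring (each word is scored by its degree divided by its frequency, and a phrase by the sum of its word scores); the implementation used in ARRS may differ in its stop words, delimiters and scoring. Stemming is not included.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; a real list is much longer.
STOP_WORDS = {"a", "an", "and", "are", "by", "for", "from", "in", "is",
              "of", "on", "that", "the", "to", "with"}


def candidate_phrases(text):
    """Split on punctuation, then cut each fragment at the stop words."""
    phrases = []
    for fragment in re.split(r'[.,;:!?()\[\]"]', text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9][a-z0-9\-]*", fragment):
            if word in STOP_WORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text, top_n=3):
    """Return the top_n candidate phrases ranked by their RAKE-style score."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, the word included
    score = {p: sum(degree[w] / freq[w] for w in p) for p in set(phrases)}
    ranked = sorted(score, key=lambda p: (-score[p], p))
    return [" ".join(p) for p in ranked[:top_n]]


text = ("Abstractive text summarization with neural networks. "
        "Neural networks improve text summarization of scientific papers.")
keywords = rake(text, top_n=2)
```

With this input the two best candidates are "neural networks improve text summarization" and "abstractive text summarization". The example also shows a known trait of this scoring: long runs of words between stop words tend to rank high.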

Normalization functions were created for each data source.

NLP processing is also used to achieve a uniform style for the DOI (e.g.: only the ID, not the full URL) and to build the item following a chosen schema.
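A small sketch of the DOI clean-up, under the assumption that the sources deliver DOIs either bare, prefixed with "doi:", or as doi.org URLs. The function name `normalize_doi` and the choice to lowercase the result (DOIs are case-insensitive) are illustrative, not taken from the paper.

```python
import re

# Prefixes that may wrap the identifier (assumed set of variants).
DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(raw):
    """Return only the DOI identifier, or None if the value is not a DOI."""
    if not raw:
        return None
    doi = DOI_PREFIX.sub("", raw.strip())
    # Every DOI starts with the directory indicator "10."
    return doi.lower() if doi.startswith("10.") else None


print(normalize_doi("https://doi.org/10.5220/0011850600003464"))
```

Keeping a single canonical form matters because the DOI is a natural key for recognising that records coming from different sources describe the same paper.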

Cluster of Docker Containers. The system uses a cluster of Docker containers. The container images used are similar, and we wrote an image specification for each of them. Since there are multiple containers that need to be started in a specific order, and they require a custom configuration, docker-compose was used to handle the creation of the containers. Figure 2 shows the cluster of Docker containers and how they interact.

RAKE: Rapid Automatic Keyword Extractor
Docker: https://www.docker.com/

Figure 2: Docker containers cluster.

Heuristic Optimization. The crawling process is slow: depending on the data source and on whether it needs to simulate a browser interaction with Selenium, it can take anywhere between half a minute and a few hours. For this reason, the number of entries brought back by the crawlers is limited, based on the factors above, to values that were estimated heuristically. Furthermore, each data source provides the papers most relevant to the user's query first. This keeps the impact of the limitation small and ensures an acceptable response time for ARRS.

4 ARRS – EVALUATION

The evaluation of the proposed solution is addressed from the perspective of software quality attributes, with a special interest in the following three: usability, accuracy, and performance. For the evaluation, a main test case was considered, based on the following simple query: “machine learning text summarization”.

4.1 Usability

The user interface was inspired by the one provided by ConnectedPapers (Alex Tarnavsky et al., 2020), providing a base layout with a search bar on the top, two

side columns that show information about the papers,

and a center section that displays an interactive graph.

The authors graph is shown when a node in the papers graph is selected (double click). The graph contains the authors of the selected paper, as well as their peers (other researchers with whom they collaborated). The user can further inspect the papers published by each author, and see the metadata corresponding to each of the resulting papers. In the papers' graph, some edges are wider than the rest: the weight of an edge that links two papers depends on the number of matches between their keywords.

The responsiveness of the system is assured by ob-

taining the results in steps, based on some priorities

(as described in the previous section), and by caching

them in memory.

4.2 Accuracy

Even though ARRS is similar to ConnectedPapers

tool in terms of UI, the way the network of papers is

built is entirely different. ConnectedPapers starts, for

example, from a user query, then it asks the user to

select the most relevant papers, based on which it will

build the graph, whereas the ARRS uses the user’s

query to directly build the graph. Overall, this al-

lows a better overview of the research published for

the topic of interest, given that another important pur-

pose of the system is to allow the user to check the

authors graph to look for possible collaborators.

When comparing the results obtained from ConnectedPapers and ARRS, the former yielded 40 papers, while the proposed solution provided over 100. The overlap between the two result sets varies greatly depending on the paper that the ConnectedPapers user chose to build the graph. As a consequence, they cannot be directly compared, as in most cases ARRS will yield an entirely different dataset. However, in terms of searching for future collaborators, ARRS proved to offer a better overview of the literature, which gives the user the chance to inspect the papers most relevant to the query, based on which edges are wider (meaning that they match more of the keywords in the interrogation).

4.3 Performance

The parallelisation implemented through the thread pool executor for crawling and through Spark for normalisation significantly improves the obtained performance. Several experiments were conducted to emphasize the importance of the parallel computation introduced in the system.

In terms of hardware speciﬁcation, the system was

tested on two virtual machines, each with 16 CPU

cores and 64GB RAM. The storage used was 1TB.

In terms of performance obtained for crawling,

for the main test query the parallel approach yielded

the results in 29.678 seconds, while the sequential

one (without the executor thread pool) provided the

results in 6 minutes and 9.018 seconds. So, the

speedup (speedup = sequentialTime/parallelTime)

is for this process 12.434.

Regarding the performance of the normalization

process, the sequential execution took 32.304 sec-

onds, while the execution on the Spark cluster only

took 2.461 seconds, resulting in a speedup of 13.126.
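The two speedups reported above follow directly from the stated timings. The short check below recomputes them, converting 6 minutes and 9.018 seconds to 369.018 seconds; the variable names are only illustrative.

```python
def speedup(sequential_s, parallel_s):
    """speedup = sequentialTime / parallelTime, both in seconds."""
    return sequential_s / parallel_s


# Crawling: 6 minutes and 9.018 seconds sequentially vs 29.678 seconds in parallel.
crawl_sequential_s = 6 * 60 + 9.018
crawl_speedup = speedup(crawl_sequential_s, 29.678)

# Normalization: 32.304 seconds sequentially vs 2.461 seconds on the Spark cluster.
normalization_speedup = speedup(32.304, 2.461)

print(f"{crawl_speedup:.3f} {normalization_speedup:.3f}")  # prints "12.434 13.126"
```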

Table 1: ARRS Performance Analysis. T_s denotes the sequential execution time in seconds, and T_p denotes the parallel execution time in seconds.

Query                                                    T_s (s)    T_p (s)   Speedup
abstractive text summarization                           222.788    33.624    6.626
abstractive AND text AND summarization NOT extractive    252.619    26.792    9.429
state-of-the-art nlp techniques for data cleaning        207.006    33.966    6.095
GAN networks                                             245.739    32.214    7.628
human face generation using GAN networks                 243.652    31.967    7.622
text to image translation                                253.977    31.681    8.016
clothing AND translation NOT "GAN network"               215.248    28.517    7.548
automatic metadata generator                             248.436    31.389    7.915
"data encryption" AND "data decryption"                  255.079    30.650    8.322
blockchain technology in the banking industry            246.831    30.251    8.159

In order to evaluate the performance in more detail, we ran several other experiments besides the main test case, using different queries; the results are shown in Table 1. In this table, T_s denotes the sequential execution time in seconds, T_p denotes the parallel execution time in seconds, and the last column gives the speedup obtained through parallelization. From these results, the average performance can be derived:

• Average sequential time: 239.127 sec.

• Average parallel time: 30.215 sec.

• Average speedup: 7.736.
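
The speedup values can be recomputed from the timings in Table 1. The sketch below is only an illustration: the names `TIMINGS` and `speedup` are chosen here and do not come from the ARRS code base, and the average is taken over the per-query speedups.

```python
# (query, sequential seconds, parallel seconds), copied from Table 1.
TIMINGS = [
    ("abstractive text summarization", 222.788, 33.624),
    ("abstractive AND text AND summarization NOT extractive", 252.619, 26.792),
    ("state-of-the-art nlp techniques for data cleaning", 207.006, 33.966),
    ("GAN networks", 245.739, 32.214),
    ("human face generation using GAN networks", 243.652, 31.967),
    ("text to image translation", 253.977, 31.681),
    ('clothing AND translation NOT "GAN network"', 215.248, 28.517),
    ("automatic metadata generator", 248.436, 31.389),
    ('"data encryption" AND "data decryption"', 255.079, 30.650),
    ("blockchain technology in the banking industry", 246.831, 30.251),
]


def speedup(sequential_time, parallel_time):
    """Speedup as defined in the text: sequential time divided by parallel time."""
    return sequential_time / parallel_time


speedups = [speedup(t_seq, t_par) for _, t_seq, t_par in TIMINGS]
average_speedup = sum(speedups) / len(speedups)

for (query, _, _), value in zip(TIMINGS, speedups):
    print(f"{query}: {value:.3f}")
print(f"average speedup: {average_speedup:.3f}")  # prints 7.736
```

The same helper reproduces the figures of the main test case, for example `speedup(369.018, 29.678)` for crawling and `speedup(32.304, 2.461)` for normalization.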

5 CONCLUSIONS

We proposed the Aggregator Researcher Retrieval System (ARRS), a hybrid system that retrieves data from a large variety of data sources, such as

Crossref, IEEE, ACM, Scopus, ResearchGate, and arXiv, through crawlers and APIs. The system merges duplicate papers and fills in their missing metadata when possible (keyword extraction, metadata harvesting).

Parallelizing the crawling process with thread pool executors made it considerably faster (a speedup of about 12 on the main test query), which greatly improved the responsiveness of the application. The normalization process was run on a Spark cluster deployed on Docker containers. The ARRS system provides a solid base that can be further improved by scaling and by aggregating data from more data sources.

The system we proposed could serve as the data engineering component of a more complex recommender system, which would add an AI data analysis component that extracts the recommendations best fitting the user profile. This is the reason for using the term 'retrieval system', even though, besides paper extraction and aggregation, the system also provides the possibility to extract the network of authors and their collaborators.

More broadly, this system can be considered a proof of concept for a general, efficient approach to building and updating big databases from different web sources. In many cases, creating and managing a database that aggregates information from different sources is difficult and implies considerable cost. A dynamic approach, in which the database is updated and supplemented by a tool that searches both the information stored in the database and the associated sources, could be an efficient and productive solution.

ACKNOWLEDGEMENTS

The implementation of the ARRS system started in 2021 with the dissertation thesis of the student O. Oprisan, from the master's specialization High Performance Computing and Big Data Analytics of "Babeș-Bolyai" University. The thesis was coordinated by Dr. Virginia Niculescu.

REFERENCES

Alex Tarnavsky, E., Eddie, S., Itay Knaan, H., and Sahar, P. (2020). Connected Papers - Find and explore academic papers. https://www.connectedpapers.com/. Accessed 10.05.2022.

Apache Software Foundation (2022). Apache Spark - Unified Engine for large-scale data analytics. https://spark.apache.org. Accessed 5.05.2022.

Bai, X., Wang, M., Lee, I., Yang, Z., Kong, X., and Xia, F. (2020). Scientific paper recommendation: A survey.

Beel, J., Gipp, B., Langer, S., and Breitinger, C. (2016). Research-paper recommender systems: A literature survey. Int. J. Digit. Libr., 17(4):305–338.

Burke, R. (2007). Hybrid Web Recommender Systems, pages 377–408. Springer Berlin Heidelberg.

Caragea, C., Bulgarov, F. A., Godea, A., and Das Gollapalli, S. (2014). Citation-enhanced keyphrase extraction from research papers: A supervised approach. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1435–1446. Assoc. for Comp. Linguistics.

Centre for Science and Technology Studies (). VOSviewer - Visualizing scientific landscapes. https://www.vosviewer.com. Accessed 10.05.2022.

Corporation for Digital Scholarship (2006). Zotero - Your personal research assistant. https://www.zotero.org. Accessed 10.05.2022.

Gill, T., Gilliland, A. J., Whalen, M., and Woodley, M. S. (2008). Introduction to Metadata. Getty Publications.

Gori, M. and Pucci, A. (2006). Research paper recommender systems: A random-walk based approach. In 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings)(WI'06), pages 778–781.

Haines, S. (2022). Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications. Apress.

Pazzani, M. J. and Billsus, D. (2007). Content-Based Recommendation Systems, pages 325–341. Springer Berlin Heidelberg, Berlin, Heidelberg.

Peroni, S. and Shotton, D. (2020). OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies, 1(1):428–444.

Porter, M. (2006). The Porter stemming algorithm. https://tartarus.org/martin/PorterStemmer/.

Reis, J. and Housley, M. (2022). Fundamentals of Data Engineering. O'Reilly Media.

Ricci, F., Rokach, L., and Shapira, B. (2022). Recommender Systems Handbook. Springer, 3rd ed. 2022 edition.

SciGraph (2022). SciGraph - A Linked Open Data platform for the scholarly domain. https://www.springernature.com/gp/researchers/scigraph. Accessed 10.05.2022.

Sun, J., Ma, J., Liu, Z., and Miao, Y. (2014). Leveraging Content and Connections for Scientific Article Recommendation in Social Computing Contexts. The Computer Journal, 57(9):1331–1342.

van Eck, N. J. and Waltman, L. (2014). CitNetExplorer: A new software tool for analyzing and visualizing citation networks. Journal of Informetrics, 8(4):802–823.

Xia, F., Liu, H., Lee, I., and Cao, L. (2016). Scientific article recommendation: Exploiting common author relations and historical preferences. IEEE Transactions on Big Data, 2(2):101–112.

Zhou, Q., Chen, X., and Chen, C. (2014). Authoritative scholarly paper recommendation based on paper communities. In 2014 IEEE 17th International Conference on Computational Science and Engineering, pages 1536–1540.
