Crossref, IEEE, ACM, Scopus, ResearchGate, and
arXiv, through crawlers and APIs. The system merges
duplicate papers and, when possible, fills in their
missing metadata (keyword extraction, metadata
harvesting).
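As an illustration only, the merge-and-fill step could be sketched as below. The record fields (`title`, `doi`, `keywords`, `source`), the title-based matching rule, and the sample DOI are assumptions made for this sketch, not details of the ARRS implementation.

```python
import re


def dedup_key(paper):
    """Build a matching key from the title (assumed rule): lowercase, punctuation dropped."""
    title = (paper.get("title") or "").lower()
    return re.sub(r"[^a-z0-9]+", " ", title).strip()


def merge_records(papers):
    """Merge records that share a key; earlier records win, later ones fill empty fields."""
    merged = {}
    for paper in papers:
        target = merged.setdefault(dedup_key(paper), {})
        for field, value in paper.items():
            # Only fill a field that is still missing or empty in the merged record.
            if value and not target.get(field):
                target[field] = value
    return list(merged.values())


# Hypothetical records for the same paper as returned by two sources.
papers = [
    {"title": "Deep Learning for Graphs", "doi": "10.0000/example",
     "keywords": None, "source": "Crossref"},
    {"title": "deep learning for graphs.", "doi": None,
     "keywords": ["graphs"], "source": "arXiv"},
]
merged = merge_records(papers)
```

Matching on a normalized title is a deliberately simple choice; when a DOI is present in every copy of a record, it would be a more reliable key.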
Parallelizing the crawling process with thread pool
executors made it considerably faster, which in turn
greatly improved the responsiveness of the
application. The normalization process was run on a
Spark cluster deployed in Docker containers. The ARRS
system provides a solid base that can be further
improved by scaling and by aggregating data from more
sources.
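A minimal sketch of crawling several sources in parallel with a thread pool executor is given below. The fetcher interface (a callable that takes the query and returns a list of records) and the two stub fetchers are hypothetical; real crawlers would issue network requests instead.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_all(query, fetchers, max_workers=8):
    """Run every source fetcher concurrently; return (results, errors) keyed by source name."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, query): name for name, fetch in fetchers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # A failing source is recorded and does not stop the other sources.
                errors[name] = exc
    return results, errors


def fake_arxiv(query):
    """Stub standing in for a real arXiv crawler (hypothetical)."""
    return [{"title": query + " (arXiv)"}]


def failing_source(query):
    """Stub that simulates an unavailable source (hypothetical)."""
    raise TimeoutError("source unavailable")


results, errors = crawl_all("graphs", {"arxiv": fake_arxiv, "broken": failing_source})
```

Threads suit this workload because crawling is dominated by waiting on network I/O, so the requests to different sources overlap instead of running one after another.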
The proposed system could serve as the data
engineering component of a more complex recommender
system, one that adds an AI data analysis component
to extract the recommendations that best fit the
user profile. This is why we use the term ‘retrieval
system’, even though, besides paper extraction and
aggregation, the system can also extract the network
of authors and their col-
More broadly, this system can be considered a proof
of concept for a general, efficient approach to
building and updating large databases from different
web sources. In many cases, creating and managing a
database that aggregates information from different
sources is difficult and costly. This dynamic
approach, in which the database is updated and
supplemented by a tool that searches both the stored
information and the associated sources, could be an
efficient and productive solution.
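One possible reading of this dynamic approach is a search that answers from the database when it can and otherwise crawls the sources and stores what it finds. The sketch below assumes a dictionary keyed by query string as the database and a `crawl` callable; both are simplifications introduced here for illustration.

```python
def search(query, database, crawl):
    """Return stored papers for the query; on a miss, crawl the sources and store the result."""
    stored = database.get(query)
    if stored:
        return stored
    fresh = crawl(query)
    if fresh:
        # The search itself supplements the database for later requests.
        database[query] = fresh
    return fresh


# Hypothetical crawler that records how often the sources are contacted.
calls = []


def fake_crawl(query):
    calls.append(query)
    return [{"title": query}]


database = {}
first = search("spark", database, fake_crawl)
second = search("spark", database, fake_crawl)
```

In this sketch the second search is answered from the database, so the sources are contacted only once for the same query.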
The implementation of the ARRS system started in
2021 with the dissertation thesis of the student O.
Oprisan, from the master's program High Performance
Computing and Big Data Analytics of “Babeș-Bolyai”
University. The thesis was coordinated by
Dr. Virginia Niculescu.
Efficient Academic Retrieval System Based on Aggregated Sources