Crossref, IEEE, ACM, Scopus, ResearchGate, and
arXiv, through crawlers and APIs. The system merges
duplicate papers into a single record and fills in
missing metadata where possible (keyword extraction,
metadata harvesting).
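The merging step can be illustrated with a minimal sketch. It is not the ARRS implementation: the record fields (`doi`, `title`, `keywords`, `abstract`) and the helper names `is_missing`, `dedup_key` and `merge_records` are assumptions made for this example. Records are grouped by DOI when one is present, otherwise by a normalized title, and the first non-empty value seen for each field is kept.

```python
import re
from typing import Any

Record = dict[str, Any]  # hypothetical record shape, not the ARRS schema


def is_missing(value: Any) -> bool:
    """Treat None, empty strings and empty lists as missing metadata."""
    return value is None or value == "" or value == []


def dedup_key(record: Record) -> str:
    """Prefer the DOI as identity; fall back to a normalized title."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    title = (record.get("title") or "").lower()
    return "title:" + re.sub(r"[^a-z0-9]+", " ", title).strip()


def merge_records(records: list[Record]) -> list[Record]:
    """Merge duplicates, filling each missing field from later records."""
    merged: dict[str, Record] = {}
    for record in records:
        target = merged.setdefault(dedup_key(record), {})
        for field, value in record.items():
            if is_missing(value):
                continue
            if is_missing(target.get(field)):
                target[field] = value
    return list(merged.values())
```

A limitation of this sketch is that a record without a DOI is never matched against a record of the same paper that has one; a real system would need a second pass over titles or authors to handle that case.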
Parallelizing the crawling process with thread pool
executors made it considerably faster, which in turn
greatly improved the responsiveness of the application.
The normalization process was run on a Spark cluster
deployed in Docker containers. The ARRS system provides
a solid base that can be further improved by scaling
and by aggregating data from more sources.
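As an illustration of the parallel crawling idea, the sketch below queries several sources concurrently with Python's standard `concurrent.futures.ThreadPoolExecutor`. The function `crawl_all`, the `Fetcher` type and the per-source fetch callables are hypothetical names chosen for this example; the actual ARRS crawlers and their programming language are not described here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable

# A fetcher takes a search query and returns a list of paper records.
Fetcher = Callable[[str], list[dict]]


def crawl_all(
    query: str,
    fetchers: dict[str, Fetcher],
    max_workers: int = 8,
) -> dict[str, list[dict]]:
    """Run one fetcher per source in parallel and collect results by source."""
    results: dict[str, list[dict]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every source at once so slow sources overlap in time.
        futures = {
            pool.submit(fetch, query): name for name, fetch in fetchers.items()
        }
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception:
                # Assumption: a failing source yields no results
                # instead of aborting the whole search.
                results[name] = []
    return results
```

Threads suit this workload because crawling is dominated by network waits: with one thread per source, the total time approaches that of the slowest source rather than the sum over all sources.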
The system we proposed could serve as the data
engineering component of a more complex recommender
system, one that adds an AI data analysis component to
extract the recommendations that best fit the user
profile. This is why we use the term ‘retrieval system’,
even though, besides paper extraction and aggregation,
the system also makes it possible to extract the network
of authors and their col-
More broadly, this system could be considered a proof
of concept for a general, efficient approach to building
and updating large databases from different web sources.
In many cases, creating and managing a database that
aggregates information from different sources is
difficult and costly. A dynamic approach, in which the
database is updated and supplemented through a tool that
searches both the stored information and the associated
sources, could be an efficient and productive solution.
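One possible reading of this dynamic approach is sketched below: a search is answered from the local database when it has matches, and otherwise the associated sources are queried and their results are stored for later searches. This is an assumption about how such a tool could behave, not a description of ARRS. The table layout, the in-memory SQLite store, and the names `open_store`, `search` and `fetch_remote` are all hypothetical.

```python
import sqlite3
from typing import Callable

# Hypothetical remote lookup: returns (title, source) pairs for a term.
RemoteFetch = Callable[[str], list[tuple[str, str]]]

QUERY = "SELECT title, source FROM papers WHERE title LIKE ? ORDER BY title"


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the local database and create the papers table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS papers (title TEXT PRIMARY KEY, source TEXT)"
    )
    return conn


def search(
    conn: sqlite3.Connection, term: str, fetch_remote: RemoteFetch
) -> list[tuple[str, str]]:
    """Answer from the database; on a miss, fetch, store, and search again."""
    pattern = f"%{term}%"
    rows = conn.execute(QUERY, (pattern,)).fetchall()
    if rows:
        return rows
    # Local miss: supplement the database from the associated sources.
    conn.executemany(
        "INSERT OR IGNORE INTO papers (title, source) VALUES (?, ?)",
        fetch_remote(term),
    )
    conn.commit()
    return conn.execute(QUERY, (pattern,)).fetchall()
```

With this design the database grows only in the areas users actually search for, and a repeated query does not contact the external sources again.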
The implementation of the ARRS system started in 2021
with the dissertation thesis of the student O. Oprisan,
from the master's specialization High Performance
Computing and Big Data Analytics of “Babeș-Bolyai”
University. The thesis was coordinated by Dr. Virginia
Niculescu.
Efficient Academic Retrieval System Based on Aggregated Sources