Crossref, IEEE, ACM, Scopus, ResearchGate, and
arXiv, through crawlers and APIs. The system merges
duplicate records of the same paper and fills in
missing metadata when possible (keyword extraction,
metadata harvesting).
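The merging step can be illustrated with a minimal Python sketch. It is not the ARRS implementation: the record layout (plain dictionaries with `source`, `doi`, `title` and other fields), the choice of key (DOI when present, otherwise a normalized title) and the function names `record_key` and `merge_records` are assumptions made for illustration only.

```python
import re


def record_key(record: dict) -> str:
    """Identify a paper by its DOI when present, else by a normalized title."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    title = (record.get("title") or "").lower()
    # Collapse punctuation and whitespace so minor formatting differences match.
    return "title:" + re.sub(r"[^a-z0-9]+", " ", title).strip()


def merge_records(records: list[dict]) -> list[dict]:
    """Merge records that share a key; earlier records win, later ones fill gaps."""
    merged: dict[str, dict] = {}
    for record in records:
        target = merged.setdefault(record_key(record), {"sources": []})
        for field, value in record.items():
            if field == "source":
                # Remember every source that reported this paper.
                if value not in target["sources"]:
                    target["sources"].append(value)
            elif value not in (None, "", []) and not target.get(field):
                # Fill a field only if it is still missing in the merged record.
                target[field] = value
    return list(merged.values())


records = [
    {"source": "Crossref", "doi": "10.1000/XYZ", "title": "A Paper", "keywords": []},
    {"source": "arXiv", "doi": "10.1000/xyz", "title": "A paper",
     "keywords": ["spark"], "abstract": "Text"},
    {"source": "Scopus", "title": "Another  Paper!"},
]
papers = merge_records(records)
```

In this sketch a record without a DOI is never matched against a record that has one, even if the titles agree; a production system would need a more careful matching rule.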
Parallelizing the crawling process with thread pool
executors considerably sped it up, which in turn
improved the responsiveness of the application. The
normalization process ran on a Spark cluster deployed
in Docker containers. The ARRS system provides a solid
base that can be improved further by scaling and by
aggregating data from more sources.
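The parallel crawling idea can be sketched as follows, assuming Python's standard `concurrent.futures` module. The implementation language of ARRS is not stated in this section, and the names `crawl_sources` and `fetchers`, as well as the stand-in fetch functions, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable

Fetcher = Callable[[str], list[dict]]


def crawl_sources(query: str, fetchers: dict[str, Fetcher], max_workers: int = 8):
    """Query every source concurrently; return (results, errors) keyed by source."""
    results: dict[str, list[dict]] = {}
    errors: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per source; the calls are I/O-bound, so threads overlap the waits.
        futures = {pool.submit(fetch, query): name for name, fetch in fetchers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # A failing source must not abort the other crawls.
                errors[name] = exc
    return results, errors


# Stand-in fetchers (hypothetical); real ones would call an API or a crawler.
def fake_crossref(query: str) -> list[dict]:
    return [{"source": "Crossref", "title": query}]


def fake_broken(query: str) -> list[dict]:
    raise TimeoutError("source unavailable")


results, errors = crawl_sources(
    "spark", {"Crossref": fake_crossref, "Broken": fake_broken}
)
```

Collecting errors per source, instead of letting the first exception propagate, keeps a slow or unavailable source from blocking the results of the others.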
The proposed system could serve as the data
engineering component of a more complex recommender
system, one that adds an AI data analysis component to
extract the recommendations that best fit the user
profile. This is why we use the term ‘retrieval
system’, even though, besides paper extraction and
aggregation, the system can also extract the network
of authors and their collaborators.
More broadly, this system can be regarded as a proof
of concept for a general, efficient approach to
building and updating large databases from different
web sources. In many cases, creating and managing a
database that aggregates information from different
sources is difficult and costly. Updating and
supplementing a database dynamically, through a tool
that searches both the database itself and its
associated sources, could be an efficient and
productive solution.
ACKNOWLEDGEMENTS
The implementation of the ARRS system started in
2021 with the dissertation thesis of the student O.
Oprisan, in the master's specialization High
Performance Computing and Big Data Analytics at
“Babeș-Bolyai” University. The thesis was coordinated
by Dr. Virginia Niculescu.
Efficient Academic Retrieval System Based on Aggregated Sources