
ings underscore the pivotal role of data preprocess-
ing, highlighting its impact on subsequent stages, es-
pecially when interfaced with LLMs. Our incremen-
tal approach offers several advantages, including flex-
ibility and the ability to adapt individual steps to align
with specific user goals. This ensures that users can
effectively sift through vast information sources, dif-
ferentiating between pertinent and non-pertinent data,
ultimately facilitating a more focused investigation of the phenomenon under study.
However, we also acknowledge certain limitations
inherent in our approach. A significant challenge
is the preparatory phase of data curation and model
training, which we aim to alleviate in future iterations
through enhanced automation of the pipeline. Additionally, the Flan-T5 model's substantial hardware demands make the analysis of sizable corpora time-intensive.
Despite these challenges, we believe our approach retains a key strength: the pipeline's design allows users to customize the solution. At each step, users can decide how accurate the results need to be and what computational cost they are willing to incur. This flexibility lets them optimize the task to their requirements, whether for computational cost and speed or for overall accuracy.
ACKNOWLEDGEMENTS
The work is partially supported by grant SGS13/PřF-MF/2023. This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90254).
ICPRAM 2024 - 13th International Conference on Pattern Recognition Applications and Methods