highlights the challenges of standalone generative
models when addressing data-intensive tasks.
In contrast, our RAG system delivered
outstanding performance by retrieving relevant
information from external datasets. Its ability to
ground responses in specific data resulted in outputs
that were both precise and actionable. These results
also suggest that our RAG system can be applied to
a wide range of use cases and can serve as a
practical tool for Q&A systems in various fields.
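As a minimal illustration of the retrieve-then-ground idea summarized above, the following Python sketch ranks a few passages against a question and assembles a context-restricted prompt. It is not the system evaluated in this paper: the bag-of-words cosine scoring, the function names (`tokenize`, `cosine`, `retrieve`, `build_prompt`), the prompt wording, and the sample passages in `PASSAGES` are all assumptions made for illustration, and the call to a generative model is deliberately left out.

```python
import math
import re
from collections import Counter

# Hypothetical passages standing in for an external dataset.
PASSAGES = [
    "K-means clustering groups daily air pollutant readings into distinct patterns.",
    "The cafeteria menu changes every Monday.",
    "Cluster centroids summarize typical pollutant levels for each pattern.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, passages, k=2):
    """Return up to k passages with nonzero similarity, best match first."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]

def build_prompt(question, passages, k=2):
    """Build a prompt that restricts the answer to the retrieved context."""
    context = retrieve(question, passages, k)
    lines = ["Answer using only the context below."]
    lines += [f"[{i}] {p}" for i, p in enumerate(context, 1)]
    lines.append(f"Question: {question}")
    return "\n".join(lines)

if __name__ == "__main__":
    question = "How does K-means clustering group air pollutant patterns?"
    print(build_prompt(question, PASSAGES))
```

A production system would typically replace the term-count scoring with dense embeddings and a vector index; the sketch only shows how retrieved passages constrain what the generator is asked to answer from.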
In future work, we aim to further explore the
potential of RAG and enhance its overall
performance in handling complex and data-intensive
tasks. A particularly promising application of this
integration lies in client engagement, specifically in
generating detailed, data-driven reports tailored to the
needs of various stakeholders.
ACKNOWLEDGEMENTS
This work was supported by the UST Young Scientist+
Research Program 2024 through the University of
Science and Technology (No. 2024YS12).
REFERENCES
Rana, N. P., Slade, E. L., Sahu, G. P., Kizgin, H., Singh, N.,
Dey, B., Gutierrez, A., & Dwivedi, Y. K. (2019).
Digital and social media marketing. In Advances in
theory and practice of emerging markets.
Afzal, A., Vladika, J., Braun, D., & Matthes, F. (2023).
Challenges in Domain-Specific Abstractive
Summarization and How to Overcome Them. arXiv
preprint arXiv:2307.00963.
Han, S., Wang, M., Zhang, J., Li, D., & Duan, J. (2024). A
Review of Large Language Models: Fundamental
Architectures, Key Technological Evolutions,
Interdisciplinary Technologies Integration,
Optimization and Compression Techniques,
Applications, and Challenges. Electronics, 13(24),
5040.
Shahriar, S., Lund, B. D., Mannuru, N. R., Arshad, M. A.,
Hayawi, K., Bevara, R. V. K., Mannuru, A., & Batool,
L. (2024). Putting GPT-4o to the sword: A
comprehensive evaluation of language, vision, speech,
and multimodal proficiency. Applied Sciences, 14(17),
7782.
Chen, J., Lin, H., Han, X., & Sun, L. (2024). Benchmarking
large language models in retrieval-augmented
generation. In Proceedings of the AAAI Conference on
Artificial Intelligence, 38(16), 17754-17762.
Ranade, P., & Joshi, A. (2023). Fabula: Intelligence report
generation using retrieval-augmented narrative
construction. In Proceedings of the International
Conference on Advances in Social Networks Analysis
and Mining, 603-610.
Poositaporn, A., & Jung, H. (2025). Towards a clustering-
based pattern prediction framework. [Manuscript
submitted for publication].
APPENDIX
A. Generated responses of the RAG system and a
GPT-4o model
Q1: What actionable recommendations can be
derived from clustering results to mitigate high
pollution levels?
RAG: The clustering-based pattern prediction
framework suggests several actionable
recommendations to mitigate high pollution levels:
- Utilizing K-means clustering helps identify
distinct air pollution patterns, allowing for
targeted mitigation strategies that are tailored
to specific clusters, which can lead to more
effective resource allocation and policy
development.
- Integration of machine learning models such as
XGBoost and LSTM with clustering can
improve the accuracy of pollution predictions,
enabling proactive measures to be
implemented in anticipation of high pollution
days.
GPT-4o: Clustering results can identify specific
areas or sources contributing most to high pollution
levels. Based on these insights, targeted
interventions such as implementing stricter
emissions regulations for industries in high-
pollution clusters, enhancing public
transportation options to reduce vehicle emissions,
and increasing green spaces in urban areas can be
recommended. Additionally, community
awareness programs can be tailored to educate
residents in affected clusters about pollution
reduction practices.
Q2: Explain the significance of using K-means to
cluster air pollutant pattern for next day's
prediction.
RAG: The significance of employing K-means
clustering to analyze air pollutant patterns lies in its
ability to efficiently group complex data into
meaningful clusters, which represent distinct patterns
of air pollution. By calculating the centroids that
represent these patterns, K-means allows for the
identification of key characteristics of similar data