6 CONCLUSIONS
This survey systematically compiles and synthesizes existing research on the security of LLMs, with the overarching goal of facilitating further exploration of this domain. Through comprehensive analysis, it categorizes and examines key aspects of LLM security, including hallucinations, adversarial attacks, AI alignment, and privacy concerns. The outcomes of the experiments on hallucinations and adversarial attacks underscore the need to investigate LLM security more deeply. The findings show that many security vulnerabilities are intrinsic to the design and operation of LLMs: while practical measures can mitigate some of them, others are rooted in the models' fundamental mechanisms and require continued study. Moreover, improving the security of LLMs can conflict with efforts to optimize model performance. Future research should therefore focus on making theoretically viable security mechanisms practical, accounting for feasibility and real-world deployment scenarios. Such a holistic approach will offer valuable guidance to researchers and practitioners navigating the complex landscape of LLM security.