Analysis of the impact of prompt obfuscation on the effectiveness of language models in detecting prompt injections
- Authors: Krohin A.S., Gusev M.M.
- Issue: No. 2 (2025)
- Pages: 44-62
- Section: Articles
- URL: https://journal-vniispk.ru/2454-0714/article/view/359371
- DOI: https://doi.org/10.7256/2454-0714.2025.2.73939
- EDN: https://elibrary.ru/FBOXHC
- ID: 359371
About the authors
- Aleksei Sergeevich Krohin (Email: askrokhin@edu.hse.ru)
- Maksim Mihailovich Gusev (Email: gusevmaxim04@mail.ru)
References
1. Liu Y. et al. Formalizing and benchmarking prompt injection attacks and defenses // 33rd USENIX Security Symposium (USENIX Security 24). – 2024. – pp. 1831-1847.
2. Greshake K. et al. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection // Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security. – 2023. – pp. 79-90.
3. Shi J. et al. Optimization-based prompt injection attack to LLM-as-a-judge // Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security. – 2024. – pp. 660-674.
4. Sang X., Gu M., Chi H. Evaluating prompt injection safety in large language models using the PromptBench dataset. – 2024.
5. Xu Z. et al. LLM Jailbreak Attack versus Defense Techniques: A Comprehensive Study // arXiv preprint arXiv:2402.13457. – 2024.
6. Hu K. et al. Efficient LLM jailbreak via adaptive dense-to-sparse constrained optimization // Advances in Neural Information Processing Systems. – 2024. – Vol. 37. – pp. 23224-23245.
7. Wei A., Haghtalab N., Steinhardt J. Jailbroken: How does LLM safety training fail? // Advances in Neural Information Processing Systems. – 2023. – Vol. 36. – pp. 80079-80110.
8. Li J. et al. Getting more juice out of the SFT data: Reward learning from human demonstration improves SFT for LLM alignment // Advances in Neural Information Processing Systems. – 2024. – Vol. 37. – pp. 124292-124318.
9. Kwon H., Pak W. Text-based prompt injection attack using mathematical functions in modern large language models // Electronics. – 2024. – Vol. 13. – No. 24. – p. 5008.
10. Steindl S. et al. Linguistic obfuscation attacks and large language model uncertainty // Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024). – 2024. – pp. 35-40.
11. Kim M. et al. Protection of LLM Environment Using Prompt Security // 2024 15th International Conference on Information and Communication Technology Convergence (ICTC). – IEEE, 2024. – pp. 1715-1719.
12. Wei Z., Liu Y., Erichson N. B. Emoji Attack: A Method for Misleading Judge LLMs in Safety Risk Detection // arXiv preprint arXiv:2411.01077. – 2024.
13. Rahman M. A. et al. Applying Pre-trained Multilingual BERT in Embeddings for Improved Malicious Prompt Injection Attacks Detection // 2024 2nd International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings). – IEEE, 2024. – pp. 1-7.
14. Chen Q., Yamaguchi S., Yamamoto Y. LLM Abuse Prevention Tool Using GCG Jailbreak Attack Detection and DistilBERT-Based Ethics Judgment // Information. – 2025. – Vol. 16. – No. 3. – p. 204.
15. Aftan S., Shah H. A survey on BERT and its applications // 2023 20th Learning and Technology Conference (L&T). – IEEE, 2023. – pp. 161-166.
16. Chan C. F., Yip D. W., Esmradi A. Detection and defense against prominent attacks on preconditioned LLM-integrated virtual assistants // 2023 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE). – IEEE, 2023. – pp. 1-5.
17. Biarese D. AdvBench: a framework to evaluate adversarial attacks against fraud detection systems. – 2022.
18. Liu W. et al. DrBioRight 2.0: an LLM-powered bioinformatics chatbot for large-scale cancer functional proteomics analysis // Nature Communications. – 2025. – Vol. 16. – No. 1. – p. 2256. – doi: 10.1038/s41467-025-57430-4. – EDN: JUMWJQ.
19. Pannerselvam K. et al. SetFit: A robust approach for offensive content detection in Tamil-English code-mixed conversations using sentence transfer fine-tuning // Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages. – 2024. – pp. 35-42.
20. Akpatsa S. K. et al. Online News Sentiment Classification Using DistilBERT // Journal of Quantum Computing. – 2022. – Vol. 4. – No. 1.
21. Gritsay G. M., Khabutdinov I. A., Grabovoy A. V. Stackmore LLMs: efficient detection of machine-generated texts via perplexity value approximation // Doklady Rossiyskoy Akademii Nauk. Matematika, Informatika, Protsessy Upravleniya. – 2024. – Vol. 520. – No. 2. – pp. 228-237. – doi: 10.31857/S2686954324700590. – EDN: ASZIOX. (In Russian)
22. Pape D. et al. Prompt obfuscation for large language models // arXiv preprint arXiv:2409.11026. – 2024.
23. Evglevskaya N. V., Kazantsev A. A. Securing complex systems that integrate large language models: an analysis of threats and defense methods // Ekonomika i kachestvo sistem svyazi. – 2024. – No. 4 (34). – pp. 129-144. – EDN: CJEAAZ. (In Russian)
24. Shang S. et al. IntentObfuscator: a jailbreaking method via confusing LLM with prompts // European Symposium on Research in Computer Security. – Cham: Springer Nature Switzerland, 2024. – pp. 146-165.