Analysis of the impact of prompt obfuscation on the effectiveness of language models in detecting prompt injections

Aleksei Sergeevich Krohin; Крохин Алексей Сергеевич; Maksim Mihailovich Gusev; Гусев Максим Михайлович

doi:10.7256/2454-0714.2025.2.73939

Analysis of the impact of prompt obfuscation on the effectiveness of language models in detecting prompt injections

Authors: Krohin A.S.¹, Gusev M.M.¹
Affiliations:
Issue: No 2 (2025)
Pages: 44-62
Section: Articles
URL: https://journal-vniispk.ru/2454-0714/article/view/359371
DOI: https://doi.org/10.7256/2454-0714.2025.2.73939
EDN: https://elibrary.ru/FBOXHC
ID: 359371

Cite item

Full Text

Abstract
About the authors
References
Supplementary files
Statistics

Abstract

The article addresses the issue of prompt obfuscation as a means of circumventing protective mechanisms in large language models (LLMs) designed to detect prompt injections. Prompt injections represent a method of attack in which malicious actors manipulate input data to alter the model's behavior and cause it to perform undesirable or harmful actions. Obfuscation involves various methods of changing the structure and content of text, such as replacing words with synonyms, scrambling letters in words, inserting random characters, and others. The purpose of obfuscation is to complicate the analysis and classification of text in order to bypass filters and protective mechanisms built into language models. The study conducts an analysis of the effectiveness of various obfuscation methods in bypassing models trained for text classification tasks. Particular attention is paid to assessing the potential implications of obfuscation for security and data protection. The research utilizes different text obfuscation methods applied to prompts from the AdvBench dataset. The effectiveness of the methods is evaluated using three classifier models trained to detect prompt injections. The scientific novelty of the research lies in analyzing the impact of prompt obfuscation on the effectiveness of language models in detecting prompt injections. During the study, it was found that the application of complex obfuscation methods increases the proportion of requests classified as injections, highlighting the need for a thorough approach to testing the security of large language models. The conclusions of the research indicate the importance of balancing the complexity of the obfuscation method with its effectiveness in the context of attacks on models. Excessively complex obfuscation methods may increase the likelihood of injection detection, which requires further investigation to optimize approaches to ensuring the security of language models. The results underline the need for the continuous improvement of protective mechanisms and the development of new methods for detecting and preventing attacks on large language models.

Keywords

LLM, prompt injection, obfuscation, jailbreak, AI, adversarial attacks, encoder, transformers, AI security, fuzzing

About the authors

Aleksei Sergeevich Krohin

Email: askrokhin@edu.hse.ru

Maksim Mihailovich Gusev

Email: gusevmaxim04@mail.ru

References

Liu Y. et al. Formalizing and benchmarking prompt injection attacks and defenses // 33rd USENIX Security Symposium (USENIX Security 24). – 2024. – С. 1831-1847.
Greshake K. et al. Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection // Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security. – 2023. – С. 79-90.
Shi J. et al. Optimization-based prompt injection attack to llm-as-a-judge // Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. – 2024. – С. 660-674.
Sang X., Gu M., Chi H. Evaluating prompt injection safety in large language models using the promptbench dataset. – 2024.
Xu Z. et al. LLM Jailbreak Attack versus Defense Techniques--A Comprehensive Study // arXiv e-prints. – 2024. – С. arXiv: 2402.13457.
Hu K. et al. Efficient llm jailbreak via adaptive dense-to-sparse constrained optimization // Advances in Neural Information Processing Systems. – 2024. – Т. 37. – С. 23224-23245.
Wei A., Haghtalab N., Steinhardt J. Jailbroken: How does llm safety training fail? // Advances in Neural Information Processing Systems. – 2023. – Т. 36. – С. 80079-80110.
Li J. et al. Getting more juice out of the sft data: Reward learning from human demonstration improves sft for llm alignment // Advances in Neural Information Processing Systems. – 2024. – Т. 37. – С. 124292-124318.
Kwon H., Pak W. Text-based prompt injection attack using mathematical functions in modern large language models // Electronics. – 2024. – Т. 13. – №. 24. – С. 5008.
Steindl S. et al. Linguistic obfuscation attacks and large language model uncertainty // Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024). – 2024. – С. 35-40.
Kim M. et al. Protection of LLM Environment Using Prompt Security // 2024 15th International Conference on Information and Communication Technology Convergence (ICTC). – IEEE, 2024. – С. 1715-1719.
Wei Z., Liu Y., Erichson N. B. Emoji Attack: A Method for Misleading Judge LLMs in Safety Risk Detection // arXiv preprint arXiv:2411.01077. – 2024.
Rahman M. A. et al. Applying Pre-trained Multilingual BERT in Embeddings for Improved Malicious Prompt Injection Attacks Detection // 2024 2nd International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings). – IEEE, 2024. – С. 1-7.
Chen Q., Yamaguchi S., Yamamoto Y. LLM Abuse Prevention Tool Using GCG Jailbreak Attack Detection and DistilBERT-Based Ethics Judgment // Information. – 2025. – Т. 16. – №. 3. – С. 204.
Aftan S., Shah H. A survey on bert and its applications // 2023 20th Learning and Technology Conference (L&T). – IEEE, 2023. – С. 161-166.
Chan C. F., Yip D. W., Esmradi A. Detection and defense against prominent attacks on preconditioned llm-integrated virtual assistants // 2023 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE). – IEEE, 2023. – С. 1-5.
Biarese D. AdvBench: a framework to evaluate adversarial attacks against fraud detection systems. – 2022.
Liu W. et al. DrBioRight 2.0: an LLM-powered bioinformatics chatbot for large-scale cancer functional proteomics analysis // Nature communications. – 2025. – Т. 16. – №. 1. – С. 2256. doi: 10.1038/s41467-025-57430-4 EDN: JUMWJQ.
Pannerselvam K. et al. Setfit: A robust approach for offensive content detection in tamil-english code-mixed conversations using sentence transfer fine-tuning // Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages. – 2024. – С. 35-42.
Akpatsa S. K. et al. Online News Sentiment Classification Using DistilBERT // Journal of Quantum Computing. – 2022. – Т. 4. – №. 1.
Грицай Г. М., Хабутдинов И. А., Грабовой А. В. Stackmore LLMs: эффективное обнаружение машинно-сгенерированных текстов с помощью аппроксимации значений перплексии // Доклады Российской академии наук. Математика, информатика, процессы управления. – 2024. – Т. 520. – №. 2. – С. 228-237. doi: 10.31857/S2686954324700590 EDN: ASZIOX.
Pape D. et al. Prompt obfuscation for large language models // arXiv preprint arXiv:2409.11026. – 2024.
Евглевская Н. В., Казанцев А. А. Обеспечение безопасности сложных систем с интеграцией больших языковых моделей: анализ угроз и методов защиты // Экономика и качество систем связи. – 2024. – №. 4 (34). – С. 129-144. EDN: CJEAAZ.
Shang S. et al. Intentobfuscator: a jailbreaking method via confusing LLM with prompts // European Symposium on Research in Computer Security. – Cham : Springer Nature Switzerland, 2024. – С. 146-165.

Supplementary files

Supplementary Files

Action

1. JATS XML

Download

Username
Password
Remember me

Forgot password?	Register

Username
Password
Remember me

Forgot password?	Register

No 3 (2025)

No 3 (2025)

Analysis of the impact of prompt obfuscation on the effectiveness of language models in detecting prompt injections

Full Text

Abstract

Keywords

About the authors

Aleksei Sergeevich Krohin

Maksim Mihailovich Gusev

References

Supplementary files