Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?

Published in International Conference on Learning Representations (ICLR) 2025, 2024

Recommended citation: E. Zverev, S. Abdelnabi, S. Tabesh, M. Fritz, C. H. Lampert. (2024). "Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?" International Conference on Learning Representations (ICLR) 2025. https://arxiv.org/abs/2403.06833

This study formalises the “instruction–data separation” problem, builds automatic metrics, and demonstrates that even aligned LLMs frequently entangle system prompts with user content, enabling prompt-injection attacks. Through controlled synthetic tests and real-world jailbreaks, the paper quantifies leakage pathways and offers architectural and training-time mitigations that improve separation scores without harming task performance.

Access paper here