ASIDE: Architectural Separation of Instructions and Data in Language Models
Published in Workshop on Building Trust in LMs, ICLR 2025
ASIDE proposes an architectural retrofit that duplicates a model's embedding layer and applies an orthogonal rotation to one copy, creating disjoint subspaces for instruction tokens and user-data tokens. The retrofit can be applied to any transformer without retraining from scratch, improves instruction–data separation metrics by orders of magnitude, and already rivals specialised safety fine-tuning on prompt-injection benchmarks, all while preserving generative quality.
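The core mechanism can be sketched in a few lines. The following is a minimal NumPy illustration under stated assumptions: the embedding table, the choice of a random orthogonal rotation, and the `embed` routing function are hypothetical stand-ins, not the paper's implementation, which operates on a transformer's own embedding layer.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 100, 16

# Stand-in for a pretrained model's embedding table (instruction copy).
E_instr = rng.normal(size=(vocab, dim))

# Fixed orthogonal rotation R, built here via QR of a random matrix
# (the paper's specific rotation choice may differ).
R, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

# Second copy of the embeddings, rotated into a distinct subspace for data.
E_data = E_instr @ R

def embed(token_ids, is_data):
    """Route each token through the instruction or the data embedding table."""
    ids = np.asarray(token_ids)
    mask = np.asarray(is_data, dtype=bool)
    return np.where(mask[:, None], E_data[ids], E_instr[ids])

tokens = [3, 7, 7, 42]
roles = [0, 0, 1, 1]  # 0 = instruction token, 1 = data token
emb = embed(tokens, roles)

# The rotation is an isometry: the same token id keeps its norm
# but lands at a different point depending on its role.
assert np.allclose(np.linalg.norm(emb[1]), np.linalg.norm(emb[2]))
assert not np.allclose(emb[1], emb[2])
```

Because the rotation is orthogonal, per-token geometry (norms, pairwise angles within each copy) is unchanged, which is consistent with the claim that generative quality is preserved while the two roles become linearly separable.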
Recommended citation: E. Zverev, E. Kortukov, A. Panfilov, A. Volkova, S. Tabesh, S. Lapuschkin, W. Samek, C. H. Lampert. (2025). "ASIDE: Architectural Separation of Instructions and Data in Language Models." Workshop on Building Trust in LMs, ICLR 2025. https://arxiv.org/abs/2503.10566