ASIDE: Architectural Separation of Instructions and Data in Language Models
Published in Workshop on Building Trust in LMs, ICLR 2025
ASIDE proposes an architectural retrofit that duplicates a model's embedding layer and applies an orthogonal rotation to one copy, creating disjoint subspaces for instruction tokens and user-data tokens. The retrofit can be applied to any transformer without retraining from scratch, improves instruction–data separation metrics by orders of magnitude, and already rivals specialised safety fine-tuning on prompt-injection benchmarks, all while preserving generative quality.
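The core mechanism can be sketched in a few lines. The following is a minimal NumPy illustration under stated assumptions: the embedding table, the choice of a random orthogonal rotation, and the `embed` routing function are hypothetical stand-ins, not the paper's implementation, which operates on a transformer's own embedding layer.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 100, 16

# Stand-in for a pretrained model's embedding table (instruction copy).
E_instr = rng.normal(size=(vocab, dim))

# Fixed orthogonal rotation R, built here via QR of a random matrix
# (the paper's specific rotation choice may differ).
R, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

# Second copy of the embeddings, rotated into a distinct subspace for data.
E_data = E_instr @ R

def embed(token_ids, is_data):
    """Route each token through the instruction or the data embedding table."""
    ids = np.asarray(token_ids)
    mask = np.asarray(is_data, dtype=bool)
    return np.where(mask[:, None], E_data[ids], E_instr[ids])

tokens = [3, 7, 7, 42]
roles = [0, 0, 1, 1]  # 0 = instruction token, 1 = data token
emb = embed(tokens, roles)

# The rotation is an isometry: the same token id keeps its norm
# but lands at a different point depending on its role.
assert np.allclose(np.linalg.norm(emb[1]), np.linalg.norm(emb[2]))
assert not np.allclose(emb[1], emb[2])
```

Because the rotation is orthogonal, per-token geometry (norms, pairwise angles within each copy) is unchanged, which is consistent with the claim that generative quality is preserved while the two roles become linearly separable.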
Recommended citation: E. Zverev, E. Kortukov, A. Panfilov, A. Volkova, S. Tabesh, S. Lapuschkin, W. Samek, C. H. Lampert. (2025). "ASIDE: Architectural Separation of Instructions and Data in Language Models." Workshop on Building Trust in LMs, ICLR 2025. https://arxiv.org/abs/2503.10566