NLE_IWSLT25_IF_SHORT_en-it_constrained_primary

The Naver Labs Europe (NLE) submission to the instruction-following short task is a single model handling ASR, ST, and SQA across all languages.
Training: The model is trained in 3 steps.
1. A Transformer-based speech projector is trained on ASR/ST tasks to learn to project the representations of SeamlessM4T layer 24 (averaged every 3 frames) into the embedding space of Llama 3.1.
2. In parallel, text-only LoRA fine-tuning on MT/QA tasks is performed on top of Llama 3.1.
3. A quick merging step is performed by loading the models from steps 1 and 2 and training on multilingual multimodal data (all speech and text tasks at once). Our batching strategy better aligns modalities by sampling equivalent tasks together (e.g. for every ST en-de batch sampled during training, the dataloader samples an MT en-de batch next).
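The speech projector of step 1 can be sketched as follows (a minimal illustration, not the actual implementation; hidden sizes, layer counts, and head counts are assumptions — SeamlessM4T layer-24 features are mean-pooled over every 3 frames, passed through a small Transformer, and linearly mapped into the LLM embedding space):

```python
import torch
import torch.nn as nn

class SpeechProjector(nn.Module):
    """Sketch of a Transformer-based projector from speech encoder states
    into an LLM embedding space. Dimensions are hypothetical (e.g. 1024 for
    SeamlessM4T encoder states, 4096 for Llama 3.1 embeddings)."""

    def __init__(self, d_speech=1024, d_llm=4096, n_layers=2, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_speech, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(d_speech, d_llm)

    def forward(self, feats):
        # feats: (batch, frames, d_speech) from SeamlessM4T layer 24.
        b, t, d = feats.shape
        t3 = t - t % 3
        # Average every 3 consecutive frames (3x temporal downsampling).
        pooled = feats[:, :t3].reshape(b, t3 // 3, 3, d).mean(dim=2)
        # Contextualize, then map into the LLM embedding space.
        return self.proj(self.encoder(pooled))  # (batch, t3 // 3, d_llm)
```

During step 1, outputs of this module would replace text token embeddings at the LLM input while the LLM itself stays frozen.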
Data: We leverage all data from the constrained setting, together with automatically obtained text translation data (SeamlessM4T) and reformulations of SpokenSQUAD answers into fluent form (Llama 3.1). We also replace the audio input of the SpokenSQUAD train split with TTS audio generated by SeamlessM4T.
From\To  de  en  it     zh
en               0.422