IST_IWSLT25_IF_SHORT_en-de_unconstrained_primary

A speech-text LM with a wav2vec 2.0 speech processor and the Qwen2.5 1.5B LLM as the text backbone. We train with a two-stage curriculum: a modality alignment phase followed by an instruction fine-tuning phase. A linear projector performs the modality alignment, and all parameters remain trainable throughout training.

We use CC-BY licensed data for training. The following datasets are used for each of the tasks:

ASR

LibriSpeech
Multilingual LibriSpeech
VoxPopuli 
GigaSpeech 
FLEURS
Common Voice

AST

CoVoST-2

SQA

Spoken-SQuAD (en)
Spoken-SQuAD machine-translated into zh, de, and it, with quality filtering through COMET-QE at a 0.85 threshold. We use a mix of EuroLLM, Seamless, and Tower70B for the translations.
Spoken-SQuAD unanswerable: unanswerable questions generated with Qwen2.5 70B from the same contexts as Spoken-SQuAD.
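
The quality filtering step for the translated Spoken-SQuAD data can be sketched roughly as follows. This is an illustrative sketch only, not the submission's actual pipeline: it assumes the COMET-QE scores have already been computed, that a translation is kept when its score is at or above 0.85, and that when several systems translate the same item the highest-scoring one is chosen. The names Candidate and select_translation are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Threshold stated in the description above.
QE_THRESHOLD = 0.85


@dataclass
class Candidate:
    """One machine translation of a source item (hypothetical container)."""
    system: str       # e.g. "EuroLLM", "Seamless", "Tower70B"
    translation: str
    qe_score: float   # assumed to be a precomputed COMET-QE score


def select_translation(
    candidates: List[Candidate], threshold: float = QE_THRESHOLD
) -> Optional[Candidate]:
    """Return the best candidate scoring at or above the threshold, else None.

    Picking the single highest-scoring candidate is an assumption; the
    description does not say how the outputs of the systems are combined.
    """
    kept = [c for c in candidates if c.qe_score >= threshold]
    if not kept:
        return None  # the item is dropped from the translated set
    return max(kept, key=lambda c: c.qe_score)
```

With this sketch, an item whose candidates all score below 0.85 is discarded, so the translated sets can be smaller than the English original.
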
From\To de en it zh
en 0.220