IST_IWSLT25_IF_SHORT_en-zh_unconstrained_primary
Speech-text LM with a wav2vec 2.0 speech encoder and Qwen2.5 1.5B LLM as the text backbone. We use a two-stage curriculum: a modality-alignment phase followed by an instruction fine-tuning phase. A linear projector handles the modality alignment, and all parameters remain learnable during training.
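The modality-alignment step above can be sketched as a single linear map from the speech encoder's feature space into the LLM's embedding space. The dimensions below are assumptions (1024 for a wav2vec 2.0 large encoder, 1536 for Qwen2.5 1.5B), and the NumPy projector stands in for the trained module; it is a minimal illustration, not the actual implementation.

```python
import numpy as np

# Hypothetical dimensions: wav2vec 2.0 large hidden size -> Qwen2.5 1.5B hidden size.
SPEECH_DIM = 1024
LLM_DIM = 1536

rng = np.random.default_rng(0)
W = rng.standard_normal((SPEECH_DIM, LLM_DIM)) * 0.02  # learnable projector weights
b = np.zeros(LLM_DIM)                                  # learnable projector bias

def project(speech_feats):
    """Map a (T, SPEECH_DIM) sequence of speech features into the LLM embedding space."""
    return speech_feats @ W + b

feats = rng.standard_normal((50, SPEECH_DIM))  # 50 frames of encoder output
emb = project(feats)
assert emb.shape == (50, LLM_DIM)
```

The projected frames can then be concatenated with text token embeddings and fed to the LLM, which is what makes the single linear layer sufficient as a bridge during the alignment phase.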
We use CC-BY licensed data for training. The following datasets are used for each of the tasks:
ASR
LibriSpeech
Multilingual LibriSpeech
VoxPopuli
GigaSpeech
Fleurs
CommonVoice
AST
CoVoST-2
SQA
Spoken-SQuAD (en)
Spoken-SQuAD with machine translation and quality filtering through COMET-QE with a 0.85 threshold (zh, de, it). We use a soup of EuroLLM, Seamless, and Tower70B for translations.
Spoken-SQuAD unanswerable, generated with Qwen2.5 70B, which produces unanswerable questions from the same contexts as Spoken-SQuAD.
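The COMET-QE filtering step above can be sketched as follows. The helper and the example scores are hypothetical: in practice the scores would come from a reference-free COMET-QE model, and only pairs at or above the 0.85 threshold are kept.

```python
QE_THRESHOLD = 0.85  # quality threshold used in our filtering

def filter_by_qe(pairs, scores, threshold=QE_THRESHOLD):
    """Keep (source, translation) pairs whose QE score meets the threshold.

    `pairs` and `scores` are parallel lists; scores are assumed to be
    COMET-QE outputs in roughly the [0, 1] range.
    """
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]

# Toy example with made-up scores: only the first pair survives filtering.
pairs = [("How many players are there?", "有多少名球员？"),
         ("Where did it happen?", "它发生在哪里？")]
scores = [0.91, 0.62]
kept = filter_by_qe(pairs, scores)
assert kept == [("How many players are there?", "有多少名球员？")]
```

Applying this per target language (zh, de, it) keeps only translations the QE model rates as reliable, which is what makes the machine-translated Spoken-SQuAD usable as training data.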
From\To | de | en | it | zh
---|---|---|---|---
en | | | | 0.212