IST_IWSLT25_IF_SHORT_en-zh_unconstrained_primary
Speech-text LM with a wav2vec 2.0 speech encoder and Qwen2.5 1.5B LLM as the text backbone. We use a two-stage curriculum: a modality-alignment phase followed by an instruction fine-tuning phase. A linear projector handles the modality alignment, and all parameters remain learnable during training.
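The modality-alignment step above can be sketched as a single linear map from the speech encoder's feature space into the LLM's embedding space. The dimensions below are assumptions (1024 for a wav2vec 2.0 large encoder, 1536 for Qwen2.5 1.5B), and the NumPy projector stands in for the trained module; it is a minimal illustration, not the actual implementation.

```python
import numpy as np

# Hypothetical dimensions: wav2vec 2.0 large hidden size -> Qwen2.5 1.5B hidden size.
SPEECH_DIM = 1024
LLM_DIM = 1536

rng = np.random.default_rng(0)
W = rng.standard_normal((SPEECH_DIM, LLM_DIM)) * 0.02  # learnable projector weights
b = np.zeros(LLM_DIM)                                  # learnable projector bias

def project(speech_feats):
    """Map a (T, SPEECH_DIM) sequence of speech features into the LLM embedding space."""
    return speech_feats @ W + b

feats = rng.standard_normal((50, SPEECH_DIM))  # 50 frames of encoder output
emb = project(feats)
assert emb.shape == (50, LLM_DIM)
```

The projected frames can then be concatenated with text token embeddings and fed to the LLM, which is what makes the single linear layer sufficient as a bridge during the alignment phase.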
We use CC-BY licensed data for training. The following datasets are used for each of the tasks:
ASR
LibriSpeech
Multilingual LibriSpeech
VoxPopuli
GigaSpeech
Fleurs
CommonVoice
AST
CoVoST-2
SQA
Spoken-SQuAD (en)
Spoken-SQuAD with machine translation and quality filtering through COMET-QE with a 0.85 threshold (zh, de, it). We use a soup of EuroLLM, Seamless, and Tower70B for translations.
Spoken-SQuAD unanswerable, generated with Qwen2.5 70B, which produces unanswerable questions from the same contexts as Spoken-SQuAD.
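The COMET-QE filtering step above can be sketched as follows. The helper and the example scores are hypothetical: in practice the scores would come from a reference-free COMET-QE model, and only pairs at or above the 0.85 threshold are kept.

```python
QE_THRESHOLD = 0.85  # quality threshold used in our filtering

def filter_by_qe(pairs, scores, threshold=QE_THRESHOLD):
    """Keep (source, translation) pairs whose QE score meets the threshold.

    `pairs` and `scores` are parallel lists; scores are assumed to be
    COMET-QE outputs in roughly the [0, 1] range.
    """
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]

# Toy example with made-up scores: only the first pair survives filtering.
pairs = [("How many players are there?", "有多少名球员？"),
         ("Where did it happen?", "它发生在哪里？")]
scores = [0.91, 0.62]
kept = filter_by_qe(pairs, scores)
assert kept == [("How many players are there?", "有多少名球员？")]
```

Applying this per target language (zh, de, it) keeps only translations the QE model rates as reliable, which is what makes the machine-translated Spoken-SQuAD usable as training data.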
From\To | de | en | it | zh
---|---|---|---|---
en | | | | 0.212