KIT_IWSLT25_IF_LONG_en-it_constrained_primary

Our model follows the constrained setting and is designed as an end-to-end SpeechLLM that integrates Meta-Llama-3.1-8B-Instruct as the language model and Seamless-m4t-v2-large as the audio encoder. These components are connected via a Qformer module (4 query tokens, 4 transformer layers), which acts as a learnable projector.
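The Qformer projector can be illustrated with a minimal sketch: a small fixed set of learnable query tokens cross-attends to the (variable-length) audio encoder output, so the LLM always receives a fixed-size prefix. Everything below is an illustrative stand-in, not our implementation: the dimensions, the random weights, and the single-head attention are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qformer_layer(queries, encoder_out, rng):
    """One simplified Q-Former-style layer: learnable queries cross-attend
    to the audio encoder output. Random weights stand in for trained
    parameters; a real layer also has self-attention and a feed-forward block."""
    d = queries.shape[-1]
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    q, k, v = queries @ Wq, encoder_out @ Wk, encoder_out @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return queries + attn @ v  # residual connection

rng = np.random.default_rng(0)
d_model = 16                                        # toy dimension (assumption)
queries = rng.standard_normal((4, d_model))         # 4 query tokens
encoder_out = rng.standard_normal((137, d_model))   # arbitrary-length audio features

x = queries
for _ in range(4):                                  # 4 transformer layers
    x = qformer_layer(x, encoder_out, rng)

print(x.shape)  # (4, 16): fixed-size prefix for the LLM, regardless of audio length
```

The key property the sketch demonstrates is that the output length is set by the number of query tokens, not by the audio duration.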

To manage memory constraints, we split long audio inputs into 60-second chunks, encode each chunk separately, and concatenate the encoder outputs before passing them through the Qformer and into the language model.
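The chunking step above can be sketched as follows; the 16 kHz sample rate, the dummy encoder, and its ~20 ms frame rate are assumptions for illustration, not details of our system.

```python
import numpy as np

def chunk_audio(waveform, sample_rate=16000, chunk_seconds=60):
    """Split a 1-D waveform into consecutive chunks of at most
    chunk_seconds; the final chunk may be shorter."""
    step = sample_rate * chunk_seconds
    return [waveform[i:i + step] for i in range(0, len(waveform), step)]

# 150 s of (dummy) 16 kHz audio -> chunks of 60 s, 60 s, 30 s
wav = np.zeros(150 * 16000, dtype=np.float32)
chunks = chunk_audio(wav)
print([len(c) / 16000 for c in chunks])  # [60.0, 60.0, 30.0]

# Each chunk is encoded separately; a dummy feature extractor stands in
# for the audio encoder here. Outputs are concatenated along time before
# the projector.
def dummy_encode(chunk):
    return np.zeros((len(chunk) // 320, 8))  # hypothetical 20 ms frames

features = np.concatenate([dummy_encode(c) for c in chunks], axis=0)
```

Encoding per chunk keeps peak memory bounded by the longest chunk rather than the full recording, while concatenation preserves the full temporal context for the projector.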

During training, both the audio encoder and the LLM are frozen; only the Qformer is trained. We first pretrain the system with contrastive learning on ASR data from the constrained setup, aligning audio-text pairs via a Wasserstein loss. This is followed by supervised finetuning on the constrained-task data.
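The details of our Wasserstein alignment loss are not spelled out here, but the underlying idea can be illustrated with the closed-form 1-D Wasserstein-1 distance between empirical distributions (mean absolute difference of sorted samples); the feature dimensionality and noise level below are arbitrary assumptions.

```python
import numpy as np

def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two 1-D empirical distributions
    with equal sample counts: mean absolute difference of sorted samples."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

rng = np.random.default_rng(0)
audio_feats = rng.standard_normal(256)                  # toy pooled audio embedding
text_feats = audio_feats + 0.1 * rng.standard_normal(256)  # matched text embedding

# A matched audio-text pair should incur a small transport cost,
# while a mismatched (shifted, shuffled) pair should incur a larger one.
aligned = wasserstein_1d(audio_feats, text_feats)
mismatched = wasserstein_1d(audio_feats, rng.permutation(text_feats) + 2.0)
print(aligned < mismatched)  # True
```

In a contrastive setup, minimizing such a cost for paired audio-text embeddings (and implicitly keeping it large for unpaired ones) pulls the two modalities into a shared space.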

For data, we use all allowable resources in the constrained track. We also apply domain-specific augmentation:
- For SQA, we chunk and transcribe NUTSHELL talks using Seamless, and use Llama to generate multilingual QA pairs (en, de, it, zh).
From\To   de      en      it      zh
en        0.389