CUNI-NL_IWSLT25_Offline_en-de_constrained_contrastive

Our systems follow the end-to-end approach. Each system consists of a pretrained, frozen speech encoder and a medium-sized Large Language Model (LLM) fine-tuned with LoRA on three tasks: 1) transcribing the English audio; 2) translating the English audio directly into German text; and 3) a combination of the two, i.e. simultaneously transcribing the English audio and translating it into German text. The hidden audio features produced by the speech encoder are fed directly to the LLM together with a text prompt, written in either English or German depending on the task. Under the ``constrained+LLM`` setting for both the Offline track and the Instruction-Following Short track, the speech encoder is taken only from the encoder part of the ``SeamlessM4T-v2`` architecture, and the LLMs include ``EuroLLM-9B-Instruct``, ``Llama-3.1-8B-Instruct``, and ``gemma-3-12b-it``. All systems were trained on the ``CoVoST2 en-de`` dataset.
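
The three fine-tuning tasks can be pictured as three kinds of training example built from the same (audio, transcript, translation) triple. The following is only a minimal sketch of that idea, not the actual training code: the prompt wording, the ``<audio>`` placeholder marking where the encoder features would be inserted, the choice of prompt language per task, the newline-joined target for the combined task, and the names ``PROMPTS``, ``Example`` and ``build_example`` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical placeholder for the position where the speech encoder's
# hidden features would be spliced into the LLM input.
AUDIO_SLOT = "<audio>"

# Hypothetical prompt wording. The description only states that the prompt
# is in English or German depending on the task; the mapping is assumed.
PROMPTS = {
    "transcribe": "Transcribe the English audio.",
    "translate": "Übersetze das englische Audio ins Deutsche.",
    "both": "Transcribe the English audio, then translate it into German.",
}


@dataclass
class Example:
    task: str
    prompt: str
    target: str


def build_example(task: str, transcript: str, translation: str) -> Example:
    """Build one text-side training example for the given task."""
    if task not in PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    if task == "transcribe":
        target = transcript
    elif task == "translate":
        target = translation
    else:
        # Combined task: transcript followed by translation (assumed format).
        target = f"{transcript}\n{translation}"
    return Example(task, f"{AUDIO_SLOT}\n{PROMPTS[task]}", target)


# One example per task from a single made-up sentence pair.
examples = [build_example(t, "Good morning.", "Guten Morgen.") for t in PROMPTS]
```

In a setup like this, only the LoRA adapter weights of the LLM would receive gradients from these examples, while the speech encoder stays frozen.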

From\To	de	zh
en	0.504