NAIST_IWSLT25_Offline_en-zh_constrained_primary

GENERAL APPROACH
This submission follows the constrained-with-large-language-models condition. The model is an end-to-end system consisting of a Whisper encoder, a connection module (a 2D adaptive average pooling layer followed by two linear layers), and the Qwen2.5 7B LLM.

DATA
MODEL ARCHITECTURE
Our model comprises the Whisper encoder (openai/whisper-large-v3), a connection module consisting of a 2D adaptive average pooling layer and two linear layers, and the Qwen2.5 7B LLM (Qwen/Qwen2.5-7B). The system is fine-tuned end-to-end on the datasets mentioned above: the parameters of the Whisper encoder and Qwen2.5 are frozen, while the connection module and the LoRA adapters for Qwen2.5 are trained.
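The connection module described above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: the pooled sequence length (375), the Whisper-large-v3 encoder output shape (1500 frames of dimension 1280), the Qwen2.5-7B hidden size (3584), the random weights, and the ReLU between the two linear layers are all assumptions for demonstration.

```python
import numpy as np

def adaptive_avg_pool2d(x, out_shape):
    """2D adaptive average pooling over a (time, dim) feature map,
    mirroring torch.nn.AdaptiveAvgPool2d applied to a single channel."""
    t_out, d_out = out_shape
    t_in, d_in = x.shape
    out = np.empty((t_out, d_out))
    for i in range(t_out):
        # window [t0, t1) covers input rows mapped to output row i
        t0 = (i * t_in) // t_out
        t1 = -((-(i + 1) * t_in) // t_out)  # ceil division
        for j in range(d_out):
            d0 = (j * d_in) // d_out
            d1 = -((-(j + 1) * d_in) // d_out)
            out[i, j] = x[t0:t1, d0:d1].mean()
    return out

class Connector:
    """Pooling followed by two linear layers; weights are random placeholders."""
    def __init__(self, pooled_shape=(375, 1280), llm_dim=3584, seed=0):
        rng = np.random.default_rng(seed)
        self.pooled_shape = pooled_shape
        self.w1 = rng.normal(0.0, 0.02, (pooled_shape[1], llm_dim))
        self.w2 = rng.normal(0.0, 0.02, (llm_dim, llm_dim))

    def __call__(self, enc_out):
        h = adaptive_avg_pool2d(enc_out, self.pooled_shape)
        h = np.maximum(h @ self.w1, 0.0)  # linear 1 (+ assumed ReLU)
        return h @ self.w2                # linear 2 -> LLM embedding space

enc_out = np.zeros((1500, 1280))  # placeholder Whisper encoder output
emb = Connector()(enc_out)
print(emb.shape)  # (375, 3584): one LLM-dimension vector per pooled frame
```

In training, only these connector weights (and the LoRA adapters inside the LLM) would receive gradient updates, while the encoder and LLM backbones stay frozen.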

Since the model supports audio inputs of up to 30 seconds, longer recordings are segmented before translation. After generating the Chinese translation for each segment, we combine the segment translations into the translation of the full recording by simple string concatenation.
From\To    de    zh
en         -     0.724