NAIST_IWSLT25_Offline_en-zh_constrained_primary

GENERAL APPROACH
This submission follows the constrained-with-large-language-models condition. The model is an end-to-end system consisting of a Whisper encoder, a connection module (a 2D adaptive average pooling layer followed by two linear layers), and the Qwen2.5 7B LLM.

DATA
MODEL ARCHITECTURE
Our model comprises the Whisper encoder (openai/whisper-large-v3), a connection module consisting of a 2D adaptive average pooling layer and two linear layers, and the Qwen2.5 7B LLM (Qwen/Qwen2.5-7B). The system is fine-tuned end-to-end on the datasets mentioned above: the parameters of the Whisper encoder and Qwen2.5 are frozen, while the connection module and the LoRA adapters for Qwen2.5 are trained.
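The connection module described above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: the pooled sequence length (375), the Whisper-large-v3 encoder output shape (1500 frames of dimension 1280), the Qwen2.5-7B hidden size (3584), the random weights, and the ReLU between the two linear layers are all assumptions for demonstration.

```python
import numpy as np

def adaptive_avg_pool2d(x, out_shape):
    """2D adaptive average pooling over a (time, dim) feature map,
    mirroring torch.nn.AdaptiveAvgPool2d applied to a single channel."""
    t_out, d_out = out_shape
    t_in, d_in = x.shape
    out = np.empty((t_out, d_out))
    for i in range(t_out):
        # window [t0, t1) covers input rows mapped to output row i
        t0 = (i * t_in) // t_out
        t1 = -((-(i + 1) * t_in) // t_out)  # ceil division
        for j in range(d_out):
            d0 = (j * d_in) // d_out
            d1 = -((-(j + 1) * d_in) // d_out)
            out[i, j] = x[t0:t1, d0:d1].mean()
    return out

class Connector:
    """Pooling followed by two linear layers; weights are random placeholders."""
    def __init__(self, pooled_shape=(375, 1280), llm_dim=3584, seed=0):
        rng = np.random.default_rng(seed)
        self.pooled_shape = pooled_shape
        self.w1 = rng.normal(0.0, 0.02, (pooled_shape[1], llm_dim))
        self.w2 = rng.normal(0.0, 0.02, (llm_dim, llm_dim))

    def __call__(self, enc_out):
        h = adaptive_avg_pool2d(enc_out, self.pooled_shape)
        h = np.maximum(h @ self.w1, 0.0)  # linear 1 (+ assumed ReLU)
        return h @ self.w2                # linear 2 -> LLM embedding space

enc_out = np.zeros((1500, 1280))  # placeholder Whisper encoder output
emb = Connector()(enc_out)
print(emb.shape)  # (375, 3584): one LLM-dimension vector per pooled frame
```

In training, only these connector weights (and the LoRA adapters inside the LLM) would receive gradient updates, while the encoder and LLM backbones stay frozen.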

Since the model supports audio inputs of up to 30 seconds, longer recordings are segmented before translation. After generating the Chinese translation for each segment, we combine the segment translations into the translation of the full recording by simple string concatenation.
From\To    de    zh
en         -     0.724