NAIST_IWSLT25_Offline_en-de_constrained_primary
GENERAL APPROACH
This submission follows the "constrained with large language models" condition. The model is an end-to-end system consisting of a Whisper encoder, a connection module (a 2D adaptive average pooling layer followed by two linear layers), and the Qwen2.5 7B LLM.
DATA
- Train: CoVoST v2 en-de train set + Europarl-ST v1.1 en-de train set
- Dev: IWSLT Past Editions Development en-de tst2022 set
- Test: IWSLT Past Editions Development en-de tst2021 set
MODEL ARCHITECTURE
Our model comprises the Whisper encoder (openai/whisper-large-v3), a connection module (a 2D adaptive average pooling layer followed by two linear layers), and the Qwen2.5 7B LLM (Qwen/Qwen2.5-7B). These components are fine-tuned end-to-end using the datasets mentioned above. During fine-tuning, the parameters of the Whisper encoder and Qwen2.5 are frozen, while the connection module and the LoRA adapters for Qwen2.5 are trained.
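As a rough illustration of the connection module described above, the sketch below implements 2D adaptive average pooling (following the region-splitting convention of PyTorch's AdaptiveAvgPool2d) and two linear projections in plain NumPy. All dimensions, the output pooling size, and the absence of an activation between the linear layers are assumptions for illustration, not details from the system description.

```python
import numpy as np


def adaptive_avg_pool_2d(x: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Adaptive average pooling: split the (H, W) input into an out_h x out_w
    grid of (possibly overlapping-boundary) regions and average each region."""
    in_h, in_w = x.shape
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        h0 = (i * in_h) // out_h
        h1 = -((-(i + 1) * in_h) // out_h)  # ceil division
        for j in range(out_w):
            w0 = (j * in_w) // out_w
            w1 = -((-(j + 1) * in_w) // out_w)
            out[i, j] = x[h0:h1, w0:w1].mean()
    return out


def connection_module(encoder_out, w1, b1, w2, b2, out_h=8, out_w=64):
    """Pool the Whisper encoder output (frames x features) to a fixed size,
    then project each pooled row into the LLM embedding space with two
    linear layers (hypothetical shapes; no activation assumed)."""
    pooled = adaptive_avg_pool_2d(encoder_out, out_h, out_w)
    return pooled @ w1 + b1 @ np.ones((1,)) * 0 + (pooled @ w1 + b1) * 0 + pooled @ w1 + b1 if False else (pooled @ w1 + b1) @ w2 + b2


# Toy example: a 100-frame, 128-dim "encoder output" mapped to 8 tokens
# of a hypothetical 256-dim LLM embedding space.
rng = np.random.default_rng(0)
enc = rng.normal(size=(100, 128))
w1, b1 = rng.normal(size=(64, 192)) * 0.01, np.zeros(192)
w2, b2 = rng.normal(size=(192, 256)) * 0.01, np.zeros(256)
tokens = connection_module(enc, w1, b1, w2, b2)
print(tokens.shape)  # (8, 256)
```

In the real system only this module and the LoRA adapters receive gradient updates, while the Whisper encoder and Qwen2.5 weights stay frozen.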
Since the model supports audio inputs of up to 30 seconds, we segment longer recordings as follows:
- Development set: segmented using Gentle Forced Aligner (https://github.com/strob/gentle), based on alignment with $set.en-de.en.xml.
- Test set: segmented using Silero-VAD (https://github.com/snakers4/silero-vad).
After generating the German translation for each segment, we combine the translations for the long audio using simple string concatenation.
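The pipeline for long audio can be sketched as follows: merge VAD speech segments into chunks of at most 30 seconds, translate each chunk, and join the outputs by string concatenation. The greedy merging policy, the `translate` placeholder, and the single-space separator are assumptions for illustration; the system description does not specify these details.

```python
def merge_segments(segments, max_len=30.0):
    """Greedily merge consecutive (start, end) speech segments (seconds)
    into chunks whose total span stays within max_len.

    Note: a single segment longer than max_len is kept as-is here;
    handling that case is outside this sketch.
    """
    chunks = []
    cur_start = cur_end = None
    for start, end in segments:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_len:
            cur_end = end  # extend the current chunk
        else:
            chunks.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks


def translate_long_audio(segments, translate):
    """`translate` is a hypothetical callable mapping a (start, end) chunk
    to its German translation; outputs are joined by simple concatenation."""
    return " ".join(translate(chunk) for chunk in merge_segments(segments))


# Example: four VAD segments collapse into two <=30 s chunks.
print(merge_segments([(0, 10), (12, 25), (27, 40), (41, 50)]))
# [(0, 25), (27, 50)]
```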
| From\To | de | zh |
|---|---|---|
| en | 0.554 | |