NAIST_IWSLT25_Offline_en-zh_unconstrained_primary
GENERAL APPROACH
We use SALMONN [1], an end-to-end speech LLM, as our base model, fine-tuning the official SALMONN v1 checkpoint on English-to-Chinese speech translation (ST) data.
DATA
- Train: CoVoST v2 en-zh train set
- Dev: CoVoST v2 en-zh dev set
- Test: IWSLT Past Editions Development en-zh tst2022 set
MODEL ARCHITECTURE
SALMONN integrates dual auditory encoders, a Whisper speech encoder (openai/whisper-large-v2) and a fine-tuned BEATs non-speech audio encoder, with a Q-Former connection module and a Vicuna LLM (lmsys/vicuna-13b-v1.1). SALMONN is trained through a three-stage cross-modal process: pre-training, instruction tuning, and task-specific fine-tuning on diverse data.
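The dataflow through these components can be sketched schematically. Everything below is an illustrative toy: all dimensions, function bodies, and the random projections are placeholders of our own, not SALMONN's actual sizes or weights; only the component ordering (dual encoders, feature fusion, Q-Former, LLM-space tokens) follows the description above.

```python
import numpy as np

# Schematic dataflow through SALMONN's components. All dimensions are
# illustrative placeholders (toy sizes), not the real model's.
T, D_WHISPER, D_BEATS = 100, 8, 4   # frames, encoder feature dims (toy)
N_QUERY, D_LLM = 6, 16              # Q-Former queries, LLM embedding dim (toy)

rng = np.random.default_rng(0)

def whisper_encode(audio):          # speech encoder -> (T, D_WHISPER)
    return rng.standard_normal((T, D_WHISPER))

def beats_encode(audio):            # non-speech audio encoder -> (T, D_BEATS)
    return rng.standard_normal((T, D_BEATS))

def q_former(features, n_query=N_QUERY, d_llm=D_LLM):
    # Stand-in for the Q-Former: compress variable-length fused audio
    # features into a fixed number of LLM-space tokens (toy projection).
    w = rng.standard_normal((features.shape[1], d_llm))
    return features[:n_query] @ w   # (N_QUERY, D_LLM)

audio = None                        # placeholder waveform
features = np.concatenate(          # fuse the two encoders' outputs
    [whisper_encode(audio), beats_encode(audio)], axis=1)
audio_tokens = q_former(features)   # tokens fed to the Vicuna LLM
print(audio_tokens.shape)           # (6, 16)
```

The fixed-size token output is what lets the LLM consume audio of varying length alongside its text prompt.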
In this work, we fine-tuned the SALMONN v1 checkpoint using the datasets above, following the original hyperparameters provided in the SALMONN source code.
Since SALMONN supports audio inputs of up to 30 seconds, we segmented longer recordings as follows:
- IWSLT Past Editions Development set: segmented using the Gentle forced aligner (https://github.com/strob/gentle), based on alignment with $set.en-de.en.xml.
- Test set: segmented using Silero-VAD (https://github.com/snakers4/silero-vad).
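As an illustration of the VAD-based step, speech regions returned by a detector can be greedily merged into segments that respect the 30-second limit. The merging logic below is our own sketch, not the Silero-VAD API; only the timestamp format ({'start': ..., 'end': ...} in seconds) mirrors Silero-VAD's get_speech_timestamps output.

```python
# Sketch: merge VAD speech regions (seconds) into segments no longer than
# MAX_LEN seconds, matching SALMONN's 30-second input limit. A real
# pipeline would also split any single region that exceeds MAX_LEN.
MAX_LEN = 30.0

def merge_speech_regions(regions, max_len=MAX_LEN):
    """Greedily merge adjacent speech regions into <= max_len segments."""
    segments = []
    cur_start, cur_end = None, None
    for r in regions:
        if cur_start is None:
            cur_start, cur_end = r["start"], r["end"]
        elif r["end"] - cur_start <= max_len:
            cur_end = r["end"]              # extend the current segment
        else:
            segments.append((cur_start, cur_end))
            cur_start, cur_end = r["start"], r["end"]
    if cur_start is not None:
        segments.append((cur_start, cur_end))
    return segments

regions = [
    {"start": 0.0, "end": 12.5},
    {"start": 13.0, "end": 24.0},
    {"start": 24.5, "end": 41.0},   # would push the first segment past 30 s
    {"start": 42.0, "end": 55.0},
]
print(merge_speech_regions(regions))
# -> [(0.0, 24.0), (24.5, 41.0), (42.0, 55.0)]
```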
After generating the Chinese translation for each segment, we reconstruct the translation of the full recording by simple string concatenation of the segment translations.
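A minimal sketch of this reconstruction step follows; translate_segment is a hypothetical stand-in for a SALMONN inference call and is not part of any real API.

```python
def translate_segment(audio_segment):
    """Hypothetical stand-in for one SALMONN inference call: the real
    system prompts the fine-tuned model to translate a <=30 s English
    speech segment into Chinese text."""
    raise NotImplementedError

def translate_long_audio(segments, translate=translate_segment):
    # Translate each segment independently, then rebuild the full
    # translation by plain string concatenation (no reordering, no
    # cross-segment merging); Chinese text needs no separator spaces.
    return "".join(translate(seg) for seg in segments)
```

Usage with a stub translator: translate_long_audio(["a", "b"], translate=lambda s: {"a": "你", "b": "好"}[s]) yields "你好".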
[1] Tang, C., Yu, W., Sun, G., Chen, X., Tan, T., Li, W., ... & Zhang, C. (2023). SALMONN: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289.
From\To | de | zh
--------|----|------
en      |    | 0.600