NAIST_IWSLT25_Offline_en-zh_unconstrained_primary
GENERAL APPROACH
We use SALMONN [1], an end-to-end speech LLM, as our base model, fine-tuning the official SALMONN v1 checkpoint on English-to-Chinese speech translation (ST) data.
DATA
- Train: CoVoST v2 en-zh train set
- Dev: CoVoST v2 en-zh dev set
- Test: IWSLT Past Editions Development en-zh tst2022 set
MODEL ARCHITECTURE
SALMONN integrates dual auditory encoders, a Whisper speech encoder (openai/whisper-large-v2) and a fine-tuned BEATs non-speech audio encoder, with a Q-Former connection module and a Vicuna LLM (lmsys/vicuna-13b-v1.1). SALMONN is trained through a three-stage cross-modal process: pre-training, instruction tuning, and task-specific fine-tuning on diverse data.
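The dataflow through these components can be sketched schematically. Everything below is an illustrative toy: all dimensions, function bodies, and the random projections are placeholders of our own, not SALMONN's actual sizes or weights; only the component ordering (dual encoders, feature fusion, Q-Former, LLM-space tokens) follows the description above.

```python
import numpy as np

# Schematic dataflow through SALMONN's components. All dimensions are
# illustrative placeholders (toy sizes), not the real model's.
T, D_WHISPER, D_BEATS = 100, 8, 4   # frames, encoder feature dims (toy)
N_QUERY, D_LLM = 6, 16              # Q-Former queries, LLM embedding dim (toy)

rng = np.random.default_rng(0)

def whisper_encode(audio):          # speech encoder -> (T, D_WHISPER)
    return rng.standard_normal((T, D_WHISPER))

def beats_encode(audio):            # non-speech audio encoder -> (T, D_BEATS)
    return rng.standard_normal((T, D_BEATS))

def q_former(features, n_query=N_QUERY, d_llm=D_LLM):
    # Stand-in for the Q-Former: compress variable-length fused audio
    # features into a fixed number of LLM-space tokens (toy projection).
    w = rng.standard_normal((features.shape[1], d_llm))
    return features[:n_query] @ w   # (N_QUERY, D_LLM)

audio = None                        # placeholder waveform
features = np.concatenate(          # fuse the two encoders' outputs
    [whisper_encode(audio), beats_encode(audio)], axis=1)
audio_tokens = q_former(features)   # tokens fed to the Vicuna LLM
print(audio_tokens.shape)           # (6, 16)
```

The fixed-size token output is what lets the LLM consume audio of varying length alongside its text prompt.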
In this work, we fine-tuned the SALMONN v1 checkpoint using the datasets above, following the original hyperparameters provided in the SALMONN source code.
Since SALMONN supports audio inputs of up to 30 seconds, we segmented longer recordings as follows:
- IWSLT Past Editions Development set: segmented using the Gentle forced aligner (https://github.com/strob/gentle), based on alignment with $set.en-de.en.xml.
- Test set: segmented using Silero-VAD (https://github.com/snakers4/silero-vad).
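As an illustration of the VAD-based step, speech regions returned by a detector can be greedily merged into segments that respect the 30-second limit. The merging logic below is our own sketch, not the Silero-VAD API; only the timestamp format ({'start': ..., 'end': ...} in seconds) mirrors Silero-VAD's get_speech_timestamps output.

```python
# Sketch: merge VAD speech regions (seconds) into segments no longer than
# MAX_LEN seconds, matching SALMONN's 30-second input limit. A real
# pipeline would also split any single region that exceeds MAX_LEN.
MAX_LEN = 30.0

def merge_speech_regions(regions, max_len=MAX_LEN):
    """Greedily merge adjacent speech regions into <= max_len segments."""
    segments = []
    cur_start, cur_end = None, None
    for r in regions:
        if cur_start is None:
            cur_start, cur_end = r["start"], r["end"]
        elif r["end"] - cur_start <= max_len:
            cur_end = r["end"]              # extend the current segment
        else:
            segments.append((cur_start, cur_end))
            cur_start, cur_end = r["start"], r["end"]
    if cur_start is not None:
        segments.append((cur_start, cur_end))
    return segments

regions = [
    {"start": 0.0, "end": 12.5},
    {"start": 13.0, "end": 24.0},
    {"start": 24.5, "end": 41.0},   # would push the first segment past 30 s
    {"start": 42.0, "end": 55.0},
]
print(merge_speech_regions(regions))
# -> [(0.0, 24.0), (24.5, 41.0), (42.0, 55.0)]
```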
After generating the Chinese translation for each segment, we reconstruct the translation of the full recording by simple string concatenation of the segment translations.
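A minimal sketch of this reconstruction step follows; translate_segment is a hypothetical stand-in for a SALMONN inference call and is not part of any real API.

```python
def translate_segment(audio_segment):
    """Hypothetical stand-in for one SALMONN inference call: the real
    system prompts the fine-tuned model to translate a <=30 s English
    speech segment into Chinese text."""
    raise NotImplementedError

def translate_long_audio(segments, translate=translate_segment):
    # Translate each segment independently, then rebuild the full
    # translation by plain string concatenation (no reordering, no
    # cross-segment merging); Chinese text needs no separator spaces.
    return "".join(translate(seg) for seg in segments)
```

Usage with a stub translator: translate_long_audio(["a", "b"], translate=lambda s: {"a": "你", "b": "好"}[s]) yields "你好".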
[1] Tang, C., Yu, W., Sun, G., Chen, X., Tan, T., Li, W., ... & Zhang, C. (2023). SALMONN: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289.
From\To | de | zh
--------|----|------
en      |    | 0.600