NAIST_IWSLT25_Offline_en-de_unconstrained_primary

GENERAL APPROACH
We use SALMONN [1], an end-to-end speech LLM, as our base model, fine-tuning the official SALMONN v1 checkpoint on English-to-German speech translation (ST) data.

DATA

MODEL ARCHITECTURE
SALMONN combines two auditory encoders, a Whisper speech encoder (openai/whisper-large-v2) and a fine-tuned BEATs non-speech audio encoder, with a window-level Q-Former connection module and a Vicuna LLM (lmsys/vicuna-13b-v1.1). The original SALMONN model is trained in a three-stage cross-modal process on varied data: pre-training, instruction tuning, and task-specific fine-tuning.
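As a rough illustration of the data flow described above, the sketch below chains stand-in functions for the two encoders and the Q-Former using random tensors. All functions and the window length are hypothetical placeholders; only the feature dimensions (1280 for the Whisper large encoder, 768 for BEATs, 5120 for Vicuna-13B) reflect the actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

def whisper_encode(n_frames):
    # stand-in for the openai/whisper-large-v2 encoder: frames -> speech features
    return rng.standard_normal((n_frames, 1280))

def beats_encode(n_frames):
    # stand-in for the BEATs encoder: frames -> non-speech audio features
    return rng.standard_normal((n_frames, 768))

def window_qformer(features, window=17, queries_per_window=1):
    # stand-in for the window-level Q-Former: each window of encoder frames
    # is compressed into a fixed number of query embeddings for the LLM
    n_windows = -(-features.shape[0] // window)  # ceiling division
    return rng.standard_normal((n_windows * queries_per_window, 5120))

frames = 1500  # roughly 30 s of audio at the Whisper encoder frame rate
speech = whisper_encode(frames)
audio = beats_encode(frames)
fused = np.concatenate([speech, audio], axis=1)  # frame-wise concatenation
llm_inputs = window_qformer(fused)               # fed to Vicuna with the prompt
print(llm_inputs.shape)
```

The key design point is that the Q-Former output length scales with audio duration, keeping the LLM input sequence short relative to the raw frame count.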

In this work, we fine-tuned the SALMONN v1 checkpoint on the datasets above, following the original hyperparameters provided in the SALMONN source code.
Since SALMONN supports audio inputs of up to 30 seconds, we segmented longer recordings into chunks within this limit. After generating the German translation for each segment, we reconstruct the translation of the full recording by simple string concatenation of the segment translations.
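The long-audio handling can be sketched as follows. This is a minimal illustration, assuming fixed-length cutting at the 30-second limit and a space when joining segment translations; `translate_segment` is a hypothetical stand-in for the fine-tuned SALMONN model, and the exact segmentation used in the submission may differ.

```python
MAX_SECONDS = 30.0  # SALMONN's maximum supported input duration

def split_audio(samples, sample_rate, max_seconds=MAX_SECONDS):
    """Cut a long waveform into chunks no longer than max_seconds."""
    chunk = int(max_seconds * sample_rate)
    return [samples[i:i + chunk] for i in range(0, len(samples), chunk)]

def translate_long_audio(samples, sample_rate, translate_segment):
    """Translate each chunk and join the outputs by string concatenation."""
    segments = split_audio(samples, sample_rate)
    # assumption: segments are joined with a single space
    return " ".join(translate_segment(seg) for seg in segments)

# demo with a dummy translator that just reports chunk lengths
waveform = [0.0] * (70 * 16000)  # 70 s of audio at 16 kHz
print(translate_long_audio(waveform, 16000, lambda seg: f"<{len(seg)}>"))
```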

[1] Tang, C., Yu, W., Sun, G., Chen, X., Tan, T., Li, W., ... & Zhang, C. (2023). SALMONN: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289.
RESULTS
From\To    de       zh
en         0.448    -