KIT_IWSLT25_Offline_en-de_unconstrained_contrastive1

This submission is generated by a cascaded speech translation system in four steps.

First, we segment the long-form audio into chunks using voice activity detection (VAD).
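The chunking step can be sketched as follows. This is a minimal energy-based VAD stand-in for illustration only; the actual submission may use a dedicated VAD model, and the frame length, threshold, and gap tolerance here are illustrative assumptions.

```python
# Minimal energy-based VAD sketch. Thresholds and frame sizes are
# illustrative assumptions, not the submission's actual settings.

def vad_chunks(samples, frame_len=160, threshold=0.01, max_gap=2):
    """Return (start, end) sample ranges of speech, bridging short gaps."""
    # 1) mark each frame as speech if its mean energy exceeds the threshold
    flags = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / max(len(frame), 1)
        flags.append(energy > threshold)
    # 2) merge consecutive speech frames into chunks, tolerating gaps
    #    of up to max_gap non-speech frames inside a chunk
    chunks, start, gap = [], None, 0
    for idx, is_speech in enumerate(flags):
        if is_speech:
            if start is None:
                start = idx
            gap = 0
        elif start is not None:
            gap += 1
            if gap > max_gap:
                chunks.append((start * frame_len, (idx - gap + 1) * frame_len))
                start, gap = None, 0
    if start is not None:
        chunks.append((start * frame_len, len(samples)))
    return chunks
```

Each returned range can then be cut out of the waveform and passed to the ASR systems independently.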

Then, for each audio chunk, we generate ASR transcripts from four different systems (Whisper-large-v3, Whisper-large-v2, Whisper-large-v2 fine-tuned on Bazinga, and Phi-4). We concatenate the chunk-level transcripts at document level and feed them to a fine-tuned Llama 3.1 post-editor to generate the final ASR transcript. We hypothesize that different models have different strengths, and that combining them at talk level with Llama 3.1 can correct transcription errors.
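One way to assemble the talk-level input for the post-editor is sketched below. The prompt template and system labels are illustrative assumptions, not the exact prompt used in the submission.

```python
# Sketch of building a talk-level post-editing prompt from multiple
# ASR hypotheses. The template text is an illustrative assumption.

def build_postedit_prompt(hypotheses):
    """hypotheses: {system_name: [chunk transcripts...]} -> prompt string."""
    parts = []
    for system, chunks in hypotheses.items():
        # concatenate chunk-level transcripts into one talk-level block
        parts.append(f"### {system}\n{' '.join(chunks)}")
    body = "\n\n".join(parts)
    return ("Below are transcripts of the same talk from several ASR systems.\n"
            "Combine them into a single corrected transcript.\n\n" + body)
```

The resulting string is what a fine-tuned LLM post-editor would consume to produce the merged transcript.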

Next, we segment the ASR transcripts into sentences using the NLTK sentence tokenizer.
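The submission uses NLTK's `sent_tokenize` for this step; the regex splitter below is a rough stdlib stand-in shown only to illustrate the idea, and handles abbreviations far less robustly than NLTK's Punkt model.

```python
import re

# Rough stdlib stand-in for sentence segmentation. The submission
# itself uses nltk.tokenize.sent_tokenize, which is more robust.

def split_sentences(text):
    """Split on ., ! or ? followed by whitespace and an uppercase letter."""
    pieces = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [p for p in pieces if p]
```

The resulting sentence list is the unit of input to the translation step.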

Finally, we translate two segments at a time (the difference from the primary submission) using our Tower-Instruct 7B v0.2 model fine-tuned on the speech domain to generate the final translation.
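One plausible reading of the two-segment setup is a sliding window where each call sees the current sentence plus its predecessor as context. The `translate` callable below is a hypothetical stand-in for the fine-tuned Tower-Instruct model, and this windowing scheme is an assumption about the submission's setup.

```python
# Sketch of translating with a two-segment window: each model call sees
# the previous sentence as context plus the current sentence. `translate`
# is a hypothetical stand-in for the fine-tuned Tower-Instruct 7B model.

def translate_with_context(sentences, translate):
    outputs = []
    for i, sent in enumerate(sentences):
        context = sentences[i - 1] if i > 0 else ""
        # the model sees context + current sentence, but only the
        # translation of the current sentence is kept
        outputs.append(translate(context, sent))
    return outputs
```

Feeding the model local context like this lets it resolve pronouns and terminology consistently across sentence boundaries.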
From\To    de       zh
en         0.347