KIT_IWSLT25_Offline_en-de_unconstrained_primary

This submission is generated by a cascaded speech translation system with five steps.

First, we segment the long-form audio into chunks using voice activity detection.
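As a rough illustration of this step, the sketch below segments a waveform by frame energy; it is a simple stand-in for the VAD actually used in the pipeline, and the frame size and threshold values are illustrative.

```python
import numpy as np

def vad_chunks(samples, sr=16000, frame_ms=30, energy_thresh=0.01):
    """Split a waveform into (start, end) sample ranges of contiguous
    voiced frames, using per-frame RMS energy as a crude VAD."""
    frame = int(sr * frame_ms / 1000)
    n = len(samples) // frame
    voiced = [np.sqrt(np.mean(samples[i * frame:(i + 1) * frame] ** 2)) > energy_thresh
              for i in range(n)]
    chunks, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * frame          # speech onset
        elif not v and start is not None:
            chunks.append((start, i * frame))  # speech offset
            start = None
    if start is not None:
        chunks.append((start, n * frame))
    return chunks

# toy example: one second of silence, one second of tone, one second of silence
sr = 16000
sig = np.concatenate([np.zeros(sr),
                      0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr),
                      np.zeros(sr)])
print(vad_chunks(sig, sr))
```

A production system would use a trained VAD model instead of an energy threshold, but the chunking logic around it is the same.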

Then, for each chunk, we generate ASR transcripts from four different systems (Whisper-large-v3, Whisper-large-v2, Whisper-large-v2 fine-tuned on Bazinga, and Phi-4). We concatenate the chunk-level hypotheses at document level and feed them to a fine-tuned Llama 3.1 post-editor to generate the final ASR transcript. We hypothesize that different models have different strengths, and that combining them at talk level with Llama 3.1 can correct transcription errors.
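The talk-level combination can be sketched as building one prompt that presents each system's concatenated hypotheses to the post-editor LLM. The prompt wording and system names below are illustrative, not the actual fine-tuning template.

```python
def build_postedit_prompt(chunk_hyps):
    """Concatenate per-chunk hypotheses from each ASR system into a
    single talk-level prompt for the post-editor LLM.

    chunk_hyps: dict mapping system name -> list of chunk transcripts.
    """
    lines = ["Combine the following ASR hypotheses into one corrected transcript."]
    for name in sorted(chunk_hyps):
        doc = " ".join(chunk_hyps[name])  # document-level concatenation
        lines.append(f"[{name}] {doc}")
    lines.append("Corrected transcript:")
    return "\n".join(lines)

prompt = build_postedit_prompt({
    "whisper-large-v3": ["hello world", "this is a test"],
    "phi-4": ["hello word", "this is a test"],
})
print(prompt)
```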

Next, we segment the ASR transcripts into sentences using the NLTK sentence tokenizer.
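The pipeline uses NLTK's `sent_tokenize` for this step; the regex splitter below is a lightweight, dependency-free stand-in that shows the idea of splitting on sentence-final punctuation.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., !, or ? followed by
    whitespace (a stand-in for nltk.tokenize.sent_tokenize)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

print(split_sentences("Hello there. How are you? Fine!"))
```

Unlike this sketch, `sent_tokenize` handles abbreviations and other non-terminal periods via its trained Punkt models.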

Further, we generate the initial translation using our Tower-Instruct 7B v0.2 fine-tuned on the speech domain.
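A per-sentence translation prompt for this step might look like the sketch below. The chat template shown is an assumption based on common Tower-Instruct usage; the fine-tuned model may use its own format.

```python
def build_mt_prompt(src_sentence, src_lang="English", tgt_lang="German"):
    """Build a chat-style translation prompt for an instruction-tuned
    translation LLM (template is illustrative)."""
    user = (f"Translate the following text from {src_lang} into {tgt_lang}.\n"
            f"{src_lang}: {src_sentence}\n"
            f"{tgt_lang}:")
    return f"<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n"

print(build_mt_prompt("Good morning."))
```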

Finally, we send the ASR transcript and the initial translation to our fine-tuned Tower-Instruct 13B for automatic post-editing.
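The post-editing step can be sketched as a prompt that pairs the source transcript with the draft translation and asks for an improved output. The wording is illustrative, not the actual fine-tuning template.

```python
def build_ape_prompt(transcript, draft, src_lang="English", tgt_lang="German"):
    """Automatic post-editing prompt: the model sees the source
    transcript and a draft translation, and is asked to improve the
    draft (template is illustrative)."""
    return (f"Source ({src_lang}): {transcript}\n"
            f"Draft translation ({tgt_lang}): {draft}\n"
            f"Improved translation ({tgt_lang}):")

print(build_ape_prompt("Good morning everyone.", "Guten Morgen alle."))
```

Conditioning on the source transcript, not just the draft, lets the post-editor fix translation errors that were introduced by ASR mistakes.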
From\To    de       zh
en         0.337