KIT_IWSLT25_Offline_en-de_unconstrained_primary
This submission is generated by a cascaded speech translation system with five steps.
First, we segment the long-form audio using voice activity detection to chunk the audio.
Then, for each audio chunk, we generate ASR transcripts from four different systems (Whisper-large-v3, Whisper-large-v2, Whisper-large-v2 fine-tuned on Bazinga, and Phi-4). We then concatenate each system's chunk-level hypotheses at the document level and pass them to a fine-tuned Llama 3.1 post-editor, which generates the final ASR transcript. We hypothesize that different models have different strengths, and that combining them at the talk level with Llama 3.1 can correct transcription errors.
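The document-level fusion step above can be sketched as prompt construction over the per-system hypotheses. This is a minimal, hedged sketch: the prompt wording, tags, and function name are illustrative assumptions, not the exact format used with the fine-tuned Llama 3.1 post-editor.

```python
# Hypothetical sketch: build a document-level fusion prompt from the
# chunk-level hypotheses of several ASR systems. The instruction text and
# bracket tags are assumptions for illustration only.

def build_fusion_prompt(hypotheses_per_system: dict) -> str:
    """Map each system name to its per-chunk transcripts and flatten them
    into one prompt for an LLM-based post-editor."""
    parts = ["Combine the following ASR transcripts into one corrected transcript.\n"]
    for system, chunks in hypotheses_per_system.items():
        # Concatenate chunk-level hypotheses to document level per system.
        parts.append(f"[{system}]\n{' '.join(chunks)}\n")
    return "\n".join(parts)

prompt = build_fusion_prompt({
    "whisper-large-v3": ["hello world", "this is a test"],
    "phi-4": ["hello word", "this is a test"],
})
print(prompt)
```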
Next, we segment the ASR transcript into sentences using the NLTK sentence tokenizer.
We then translate each sentence with our Tower-Instruct 7B v0.2, fine-tuned on the speech domain, to generate the initial translation.
Finally, we feed the ASR transcript and the initial translation to our fine-tuned Tower-Instruct 13B for automatic post-editing.
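The five stages compose as shown in this sketch. Every helper here is a hypothetical stub standing in for the models named above (VAD, the four ASR systems, Llama 3.1 fusion, NLTK segmentation, Tower-Instruct 7B translation, Tower-Instruct 13B post-editing); only the data flow is meaningful.

```python
# Illustrative composition of the five-stage cascade. All helpers are
# hypothetical stubs, not the real model calls.

def vad_chunks(audio: str) -> list:
    # Stub: VAD segmentation of long-form audio into chunks.
    return [audio]

def asr(chunk: str, system: str) -> str:
    # Stub: one ASR system's transcript of one chunk.
    return f"<{system}:{chunk}>"

def fuse_transcripts(doc_hyps: dict) -> str:
    # Stub: Llama 3.1 document-level fusion / post-editing.
    return next(iter(doc_hyps.values()))

def split_sentences(doc: str) -> list:
    # Stub: NLTK sentence segmentation.
    return [doc]

def translate(sent: str) -> str:
    # Stub: Tower-Instruct 7B initial translation.
    return sent

def post_edit(src: str, hyp: str) -> str:
    # Stub: Tower-Instruct 13B automatic post-editing.
    return hyp

def cascade(audio: str, systems: list) -> list:
    chunks = vad_chunks(audio)
    # Document-level hypothesis per ASR system.
    doc_hyps = {s: " ".join(asr(c, s) for c in chunks) for s in systems}
    transcript = fuse_transcripts(doc_hyps)
    return [post_edit(s, translate(s)) for s in split_sentences(transcript)]

print(cascade("talk.wav", ["whisper-large-v3", "phi-4"]))
```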
From\To | de | zh |
---|---|---|
en | 0.337 |