Instruction-Following Short

Participants are asked to build a model capable of performing the following tasks (all tasks listed for each track are mandatory):

Automatic Speech Recognition (ASR): the speech is transcribed into the same language;
Speech-to-text Translation (S2TT): the speech is translated into the target language;
Spoken Question Answering (SQA): textual questions have to be answered based on the spoken content, both when the question is in the same language as the speech and when it is in a different language (the question and its answer are always in the same language).

We release the source audio and the instructions, and participants submit their outputs. Participants may modify the instructions to match their system’s prompts. In SQA, questions are provided both in the same language as the speech (English) and in other languages (German, Italian, Chinese), and they must always be answered in the language of the question (e.g., an Italian question should be answered in Italian). Questions can also be unanswerable; in this case, only the answer “Not answerable.” (or the corresponding translation: Italian “Non è possibile rispondere.”, German “Nicht zu beantworten.”, Chinese “无法回答。”) will be considered correct.
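The rule for unanswerable questions can be sketched as a small lookup. This is only an illustration and not part of the official scripts: the language codes used as keys and the helper names `unanswerable_reply` and `is_canonical_unanswerable` are assumptions made for this example, and whether the official scoring tolerates surrounding whitespace is not stated here.

```python
# Canonical "unanswerable" strings quoted in the task description.
# The language codes used as keys are an assumption made for this sketch.
NOT_ANSWERABLE = {
    "en": "Not answerable.",
    "it": "Non è possibile rispondere.",
    "de": "Nicht zu beantworten.",
    "zh": "无法回答。",
}


def unanswerable_reply(question_lang):
    """Return the canonical reply in the language of the question."""
    return NOT_ANSWERABLE[question_lang]


def is_canonical_unanswerable(answer, question_lang):
    """Check whether a system answer equals the canonical string.

    Leading and trailing whitespace is ignored here; that leniency is an
    assumption of this sketch, not a documented scoring rule.
    """
    return answer.strip() == NOT_ANSWERABLE[question_lang]
```

For instance, a system that decides an Italian question cannot be answered from the audio would emit `unanswerable_reply("it")` rather than a free-form explanation.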

The Short Track will handle the same audio files as the Long Track, in WAV format, but they will be automatically segmented with SHAS into audio segments of 15–20 seconds on average.
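As a quick sanity check on downloaded data, the average segment length can be measured with Python's standard `wave` module. This is a minimal sketch that assumes the segments are uncompressed PCM WAV files stored directly in one directory; the directory layout and the function names are hypothetical and not taken from the released scripts.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def mean_segment_duration(directory):
    """Average duration over all *.wav files directly inside `directory`."""
    paths = sorted(Path(directory).glob("*.wav"))
    durations = [wav_duration_seconds(p) for p in paths]
    if not durations:
        raise ValueError(f"no WAV files found in {directory}")
    return sum(durations) / len(durations)
```

Individual segments may be shorter or longer than the stated range, since the 15–20 seconds figure is an average.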

We provide an example for the Long Track, downloadable here. Participants may also use it as a 1-shot example for their model. We also provide useful scripts for parsing inputs and outputs, downloadable here.