Nos_TranscriSpeech-GL: Galician ASR corpus

  1. Vladu, Adina Ioana 1
  2. Vázquez Abuín, Marta 1
  3. Fernández Rei, Elisa 1
  4. García Díaz, Noelia 1
  5. Vidal Miguéns, Adrián 1
  6. Magariños, Carmen 1
  1. 1 Universidade de Santiago de Compostela

    Santiago de Compostela, Spain

    ROR https://ror.org/030eybx10

Publisher: Zenodo

Year of publication: 2023

Tipo: Dataset

Abstract

Manually transcribed and speech-to-text aligned Galician ASR corpus containing 53 hours of multi-domain speech. The corpus is divided into four subcorpora according to the type of audio: conferences, debates, speeches, and interviews.

The audio and transcription files share a naming scheme: an ID identifying the speaker, followed, where necessary, by a number distinguishing successive recordings by the same speaker. Parts of the same recording are marked by a number separated from the speaker ID by an underscore (e.g., Alberti1_1.wav).

The audio files are released in 44.1 kHz 16-bit WAV format, and the transcriptions are available in .stm and .trf formats. The corpus is accompanied by the corresponding speaker metadata and the guide detailing the conventions used for the manual transcription.

Funding and acknowledgements

“The Nós project: Galician in the society and economy of Artificial Intelligence” is made possible by funding from agreement 2021-CP080 between the Xunta de Galicia and the University of Santiago de Compostela, and by the Investigo program, within the National Recovery, Transformation and Resilience Plan, within the framework of the European Recovery Fund (NextGenerationEU). We would like to thank the Corpus Oral Informatizado da Lingua Galega (CORILGA) project for their kind collaboration in providing the original data.

For more information, please go to https://nos.gal/ or contact the Nós project at proxecto.nos@usc.gal.
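As a rough illustration of the file naming scheme described in the abstract, the sketch below splits a file name into speaker ID, recording number, and part number. The regular expression is an assumption inferred from the description and the single example Alberti1_1.wav: it presumes that speaker IDs contain no digits, underscores, or dots, which the corpus documentation may not guarantee. The names `RecordingName` and `parse_name` are hypothetical helpers, not part of any tooling released with the corpus.

```python
import re
from typing import NamedTuple, Optional


class RecordingName(NamedTuple):
    speaker: str              # speaker ID, e.g. "Alberti"
    recording: Optional[int]  # number of the recording by this speaker, if present
    part: Optional[int]       # part number within the recording, if present
    extension: str            # "wav", "stm" or "trf"


# Assumed pattern: <speaker>[<recording>][_<part>].<ext>
# Inferred from the abstract only; real file names may deviate.
_NAME_RE = re.compile(
    r"(?P<speaker>[^\d_.]+)"
    r"(?P<recording>\d+)?"
    r"(?:_(?P<part>\d+))?"
    r"\.(?P<ext>wav|stm|trf)"
)


def parse_name(filename: str) -> RecordingName:
    """Split a corpus file name into its assumed components."""
    match = _NAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    recording = match.group("recording")
    part = match.group("part")
    return RecordingName(
        speaker=match.group("speaker"),
        recording=int(recording) if recording is not None else None,
        part=int(part) if part is not None else None,
        extension=match.group("ext"),
    )


if __name__ == "__main__":
    print(parse_name("Alberti1_1.wav"))
```

Grouping parsed names by `speaker` would be one way to join the audio and transcription files with the speaker metadata, provided the metadata uses the same speaker IDs.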