Search in the Catalogues and Directories

BAS Edition of German Distant Speech Data Corpus 2014/2015
Abstract: General information: The corpus contains read German speech from 179 different speakers (50 female, 129 male). Each speaker read randomly selected sentences from four text collections: Wikipedia, the Europarl Corpus, a list of German command and control sentences, and a corpus of web-crawled sentences that represent direct speech. The recordings took place at the Language Technology and Telecooperation labs, TU-Darmstadt, Germany, in 2014-2015. Speakers were asked to read fluently and precisely (no dialectal variation). Up to 5 microphones were recorded in parallel: Kinect 1 beamformed audio signal through the Kinect SDK, Kinect 1 direct access as a normal microphone, internal Realtek microphone of an Asus PC (near a noisy fan), Samson C01U, Yamaha PSG-01S. The distance to the mouth was approx. 100 cm for all microphones. Room: 'dry' acoustics ('quiet office'), no noise. Sampling rate: 16 kHz, resolution: 16 bit. The speech data was collected in a controlled environment (same room, same microphone distances, etc.). Each recording has an XML transcription file that also includes speaker metadata. The data is curated (manually checked and corrected) to reduce errors and artefacts. The speech data is divided into three independent data sets: Training / Test / Dev. Test and Dev contain new sentences and new speakers that are not part of the training set, in order to assess model quality in a speaker-independent, open-vocabulary setting. Information about the data collection procedure: (1) Train set (recordings in 2014): Sentences to be read by the speakers were randomly chosen from the German Wikipedia and the Europarl Corpus. The Europarl Corpus (Release v7) is a collection of the proceedings of the European Parliament between 1996 and 2011, compiled by Philipp Koehn (Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005, http://www.statmt.org/europarl/). As a third data set, German command and control sentences were manually specified; they are typical of a command and control setting in living rooms. (2) Test/dev set (recordings in 2015): Additional sentences from the German Wikipedia and from the Europarl Corpus were selected for the recordings. In addition, German sentences were collected from the web by crawling the German top-level domain and applying language filtering and deduplication. Only sentences starting with quotation marks were selected, and these were randomly sampled. The three text sources are represented with approximately equal amounts of recordings in the test/dev set.
Keyword: phonetics
URL: http://hdl.handle.net/11022/1009-0000-0007-F5CB-0
BASE
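
The abstract describes how the direct-speech sentences for the test/dev set were chosen: deduplicate the crawled sentences, keep only those starting with a quotation mark, then sample at random. The following is a minimal Python sketch of that selection step, not the authors' actual pipeline. The function name `select_quoted_sentences`, the set of opening quotation marks, and the fixed seed are assumptions made for illustration; the language-filtering step mentioned in the abstract is omitted because the record does not say how it was done.

```python
import random

# Assumed set of opening quotation marks; the record does not list
# which characters the corpus authors actually accepted.
OPENING_QUOTES = ('"', '„', '»', '“')


def select_quoted_sentences(sentences, k, seed=0):
    """Deduplicate sentences, keep those that start with a quotation
    mark, and return a random sample of at most k of them."""
    seen = set()
    candidates = []
    for sentence in sentences:
        text = sentence.strip()
        if not text or text in seen:
            continue  # skip empty lines and exact duplicates
        seen.add(text)
        if text.startswith(OPENING_QUOTES):
            candidates.append(text)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(candidates, min(k, len(candidates)))
```

Called with a list of crawled sentences and a target count, the function returns at most that many distinct sentences, all of which begin with one of the assumed quotation marks.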
© 2013 - 2024 Lin|gu|is|tik