Speech corpus

A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models, which can then be used with a speech recognition or speaker identification engine. In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology and other fields.

A corpus is one such database; corpora, the plural of corpus, refers to several such databases.
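As an illustration of the pairing of audio files with transcriptions, the following is a minimal sketch in Python. It assumes a hypothetical directory layout in which each recording `name.wav` sits next to a transcription `name.txt`; the names `Utterance` and `load_corpus` are invented for this example, and real corpora use a wide variety of layouts and annotation formats.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Utterance:
    """One corpus entry: an audio file and its text transcription."""
    utterance_id: str
    audio_path: Path
    transcript: str


def load_corpus(root: Path) -> list[Utterance]:
    """Pair each .wav file under `root` with the .txt file sharing its stem.

    Assumed layout (hypothetical): root/utt001.wav + root/utt001.txt.
    Audio files without a transcription are skipped.
    """
    utterances = []
    for audio_path in sorted(root.glob("*.wav")):
        text_path = audio_path.with_suffix(".txt")
        if not text_path.exists():
            continue  # no transcription available for this recording
        transcript = text_path.read_text(encoding="utf-8").strip()
        utterances.append(Utterance(audio_path.stem, audio_path, transcript))
    return utterances
```

Calling `load_corpus(Path("my_corpus"))` on such a directory would return one `Utterance` per transcribed recording, which could then be passed to whatever training or analysis tool is in use.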

There are two types of speech corpora:

Read Speech, which includes:
* Book excerpts
* Broadcast news
* Lists of words
* Sequences of numbers
Spontaneous Speech, which includes:
* Dialogs: between two or more people (includes meetings; one such corpus is the KEC);
* Narratives: a person telling a story (one such corpus is the Buckeye Corpus);
* Map-tasks: one person explains a route on a map to another;
* Appointment-tasks: two people try to find a common meeting time based on individual schedules.

A special kind of speech corpus is the non-native speech database, which contains speech with a foreign accent.
