GlobalTIMIT: Acoustic-Phonetic Datasets for the World's Languages - Center of Excellence in Southeast Asian Linguistics

Although the TIMIT acoustic-phonetic dataset was created three decades ago, it remains in wide use, with more than 20,000 Google Scholar references, including more than 1,000 since 2017. Despite its age and relatively small size, inspection of these references shows that it is still used in many research areas: speech recognition, speaker recognition, speech synthesis, speech coding, speech enhancement, voice activity detection, speech perception, overlap detection and source separation, diagnosis of speech and language disorders, and linguistic phonetics, among others.

Nevertheless, comparable datasets are not available even for other widely studied languages, much less for under-documented languages and varieties. We have therefore developed a method for creating TIMIT-like datasets in new languages with modest effort and cost, and we have applied this method to Standard Thai, Standard Mandarin Chinese, English spoken by Chinese L2 learners, the Guanzhong dialect of Mandarin Chinese, and the Ga language of West Africa. Other collections are planned or underway.

The resulting datasets will be published through the Linguistic Data Consortium (LDC), along with instructions and open-source tools for replicating this method in other languages, covering the steps of sentence selection and assignment to speakers, speaker recruitment and recording, proof-listening, and forced alignment.

Related Research Program

Clinical linguistics and phonetics in the context of Thailand