Center of Excellence in Southeast Asian Linguistics

Survey and Selection of Texts for Thai National Historical Corpus

ChulaSEAL author(s):
APA: Pittayaporn, Pittayawat, Pothipath, Vipas, Jatuthasri, Thaneerat, Sanah, Nopparut, Matheethammawat, Phongpat, Maspong, Sireemas, Iamdanush, Jakrabhop, and Laimanoo, Ponlawat. (2015). Survey and Selection of Texts for Thai National Historical Corpus. Thai language and Culture, 32.2: 1-41.

Abstract

This article reports on the pilot project for the Thai National Historical Corpus, a diachronic corpus that represents the different stages of the Thai language. The project led to three key decisions. First, texts will be selected according to the criteria designed for the British National Corpus and also adopted by the Thai National Corpus. To keep the data balanced, approximately 25% of the texts in the corpus will be imaginative and approximately 75% informative. Second, texts will be tagged for both historical era and year of composition, because an exact date cannot be specified for a great number of texts. Third, special features found in poetic texts will also be tagged, as they are considered part of the text as intended by the authors. In the next phase of the Thai National Historical Corpus, 2.0 million words of text will be processed, comprising 0.7 million words of imaginative texts and 1.3 million words of informative texts. The corpus is expected to be launched by April 2016.