Thai Word Segmentation (Version 2.1)
Syllable segmentation is done by applying Thai syllable rules. Segmentation ambiguities are resolved with a syllable trigram model trained on a corpus of 630,000 syllables of newspaper text.
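As a rough illustration of how a syllable trigram model can choose between competing segmentations, here is a minimal sketch in Python. It is not the program's actual code. The function names, the add-one smoothing, the sentence padding symbols, and the three-sentence toy corpus are all assumptions made for the example; the real model is trained on the 630,000-syllable corpus.

```python
import math
from collections import Counter

def train_trigrams(corpus):
    """Count syllable trigrams and their two-syllable contexts (hypothetical helper)."""
    tri, ctx = Counter(), Counter()
    vocab = set()
    for sentence in corpus:
        padded = ["<s>", "<s>"] + list(sentence) + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            ctx[tuple(padded[i - 2:i])] += 1
    return tri, ctx, len(vocab)

def log_prob(syllables, tri, ctx, vocab_size):
    """Log probability of a syllable sequence with add-one smoothing (an assumption)."""
    padded = ["<s>", "<s>"] + list(syllables) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        context = tuple(padded[i - 2:i])
        count = tri[context + (padded[i],)]
        total += math.log((count + 1) / (ctx[context] + vocab_size))
    return total

def best_segmentation(candidates, tri, ctx, vocab_size):
    """Return the candidate segmentation the trigram model scores highest."""
    return max(candidates, key=lambda c: log_prob(c, tri, ctx, vocab_size))

# Toy training data: the string "ตากลม" can be split as ตา|กลม or ตาก|ลม.
corpus = [["ตา", "กลม"], ["ตา", "กลม"], ["ตาก", "ลม"]]
tri, ctx, vocab_size = train_trigrams(corpus)

candidates = [["ตาก", "ลม"], ["ตา", "กลม"]]
best = best_segmentation(candidates, tri, ctx, vocab_size)
print("|".join(best))
```

In this toy setting the split seen more often in the training data wins. In practice the candidate list would come from the syllable rules, which is the part this sketch leaves out.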
Word segmentation is performed using a maximum collocation approach (see the paper "Collocation and Thai Word Segmentation", submitted to the SNLP-COCOSDA2002 conference).
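The paper gives the actual formulation. The sketch below shows only one plausible reading of a "maximum collocation" search, and every detail in it is an assumption: syllables are grouped into dictionary words so that the summed collocation strength of adjacent syllables inside each word is as large as possible. The function name, the scoring scheme, the placeholder syllables, and the strength values are all hypothetical.

```python
def segment_words(syllables, dictionary, strength):
    """Group syllables into words, maximizing total within-word collocation strength.

    dictionary: set of tuples of syllables that form known multi-syllable words.
    strength:   dict mapping (syllable, next_syllable) to a collocation score.
    Single syllables are always allowed as words, so a result always exists.
    """
    n = len(syllables)
    # best[i] holds (score, words) for the first i syllables, or None if not yet reached.
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is None:
                continue
            word = tuple(syllables[start:end])
            if len(word) > 1 and word not in dictionary:
                continue
            gain = sum(strength.get(pair, 0.0) for pair in zip(word, word[1:]))
            candidate = (best[start][0] + gain, best[start][1] + ["".join(word)])
            if best[end] is None or candidate[0] > best[end][0]:
                best[end] = candidate
    return best[n][1]

# Placeholder syllables and made-up scores, for illustration only.
syllables = ["a", "b", "c", "d"]
dictionary = {("a", "b"), ("b", "c"), ("c", "d")}
strength = {("a", "b"): 2.0, ("b", "c"): 5.0, ("c", "d"): 2.5}

words = segment_words(syllables, dictionary, strength)
print(words)
```

Here the grouping a|bc|d (score 5.0) beats ab|cd (score 4.5), because the single strong pair outweighs the two weaker ones. How collocation strength is actually measured is described in the paper and is not reproduced here.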
The dictionary used in the program is adapted from the Royal Institute Dictionary, which is made available by LINKS. Some obsolete words have been removed from the dictionary. There is no routine yet to handle proper names or abbreviations directly, so segmentation of sentences containing a proper name could be incorrect.
A standalone version for Windows can be downloaded <here> (version 2.1).
A DOS version can be downloaded <here>. You will need to unrar all files into a directory of your choice. To run the program, type "thaiseg INPUTFILE OUTPUTFILE /w" or "thaiseg INPUTFILE OUTPUTFILE /s". Adding "/vb" (verbose) as the last option is optional.
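For batch use, the command line above can be assembled from a script. The following Python sketch is only an illustration: build_command and run_thaiseg are hypothetical helper names, the file names are placeholders, and the meaning of /w and /s is not spelled out here (presumably word and syllable segmentation, which is an assumption).

```python
import subprocess

def build_command(input_file, output_file, mode="w", verbose=False):
    """Build the argument list for thaiseg (hypothetical helper).

    mode is "w" or "s", matching the /w and /s options; what each option
    selects is assumed, not documented on this page.
    """
    if mode not in ("w", "s"):
        raise ValueError("mode must be 'w' or 's'")
    cmd = ["thaiseg", input_file, output_file, "/" + mode]
    if verbose:
        cmd.append("/vb")  # optional verbose flag, always placed last
    return cmd

def run_thaiseg(input_file, output_file, mode="w", verbose=False):
    """Run thaiseg; assumes the executable can be found by the system."""
    return subprocess.run(build_command(input_file, output_file, mode, verbose), check=True)

# Only builds and prints the command; nothing is executed here.
cmd = build_command("input.txt", "output.txt", mode="w", verbose=True)
print(" ".join(cmd))
```

Calling run_thaiseg with the same arguments would execute the program, provided the unpacked files are in a location where the executable can be found.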
Last updated in December 2010. This program may be used for non-commercial purposes.
This program is a part of a project supported by the Research Division of the Faculty of Arts, 2000-2.
Written by Wirote Aroonmanakun. Copyright 2002-4.
Department of Linguistics