Textbooks frequency list overview
Overview v.2.4
The original frequency list is the 2016 work of Dr. Tantong Champaiboon (Ph.D. from Chulalongkorn University, Linguistics Department). She studied a corpus of textbooks for Thai students aged 3 to 16. The list is organised along several dimensions: measures of vocabulary complexity, and comparisons across 4 age ranges and 4 historical and current curricula.
The แจ่มไพบูลย์/แรช Frequency List for Thai Learners v2 is the enhanced version of the list as adapted for (English-speaking) Thai learners.
Reddit r/learnthai
- Post for v2.4
- Older post for v1 in the same sub.
Major caveat
The original study is useful to us adult Thai learners because of its domain: school textbooks. The small size of the corpus, however, is an issue (only around 3 M words). As you go down the index number (first column), the probability that the word has that rank in real life decreases rapidly, and not linearly. In other words: words number 1 to 9,000-10,000 are highly likely to be among the 20,000 most used words IRL; but for word number 16,000, say, all you can assert is that it is likely among the 50,000 most used words. The index is indicative of rank in the corpus, but it is not strictly a rank, so take it with a pinch of salt. If your preferred domain for learning Thai is lakorn or news, แล้วแต่คุณ (it's up to you).
How many words do we need?
Do we need all 19,494 words? No.
110 words represent half the corpus, and slightly fewer than 2,100 represent 90%. With, say, 6,000-7,000 words, you could read any of the textbooks at Extensive Reading level (95-98% coverage; Paul Nation, 2005): the first word reaching 95% cumulative frequency is at rank 3,856, and the last word at 98% is at rank 8,361. On the other hand, 13,600 words are present in 3 or all 4 of the source dictionaries (see section ‘sources’), so they compose a ‘hard’ core of the Thai language (see the hexagon-based chart in the doc).
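As a rough illustration of how such coverage figures are derived, here is a minimal sketch that finds the rank at which cumulative frequency first reaches a target share of the corpus. The function name `rank_reaching` and the toy counts are invented for this example; they are not taken from the sheet or the thesis.

```python
from itertools import accumulate


def rank_reaching(counts, target):
    """Return the 1-based rank at which cumulative coverage first reaches `target`.

    `counts` must be ordered from the most frequent word to the least frequent.
    Returns None if the target is never reached (e.g. empty input).
    """
    total = sum(counts)
    for rank, running in enumerate(accumulate(counts), start=1):
        if running / total >= target:
            return rank
    return None


# Toy occurrence counts, invented for illustration only (they sum to 100).
toy_counts = [50, 20, 10, 8, 5, 3, 2, 1, 1]

print(rank_reaching(toy_counts, 0.50))  # rank covering half the toy corpus
print(rank_reaching(toy_counts, 0.90))  # rank covering 90% of the toy corpus
```

Run against the real per-word counts, the same loop would give the ranks for the 50%, 90%, 95% and 98% thresholds quoted above.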
Furthermore, if you want to produce a list of 2,000 words with complex spelling, or 3,000 compound words, which are more than the sum of their parts, (see section ‘examples of use’), you need more than 2-3,000 overall. So, this long list gives us learners the flexibility we need, based on individuals’ goals.
For a description of all columns and their possible values, see the ‘Notice’ tab in the sheet, or the full docs here on GitHub. We will highlight the key changes since v1. More dimensions have been added in this version (see below).
Quick stats: 19,494 words, 1,169 repeat-words, and two-thirds of the words have examples. About 60% have audio available. Audio caveat: the links to Wikimedia work, but have not been verified one by one. I have not yet received authorisation to share the files for the ‘audio’ column (value=1).
Audio files
The licence for the audio files has not been clarified; so, they are simply not available at this stage.
Key changes since v1
- all words in the original list are now included (19,494 instead of ~16k).
- all words have IPA phonetics and a sensible romanisation, with tones;
- only 329 words have no meaning attached;
- there should be no repeated meanings; meanings have been tidied up. 93% of the list now has only 1-2 senses.
- Experimental features: (these are denoted in the sheet with a tag of [exper.])
- repeat-words point back to their base-word, when it exists in the list.
- some compounds not found in dictionaries point to their (possible) component-words, when these exist in the list.
- loan-words: most are translated and have a transliteration (though a few defeat us). The transliteration is included so that we can learn to pronounce these words the Thai way, and thus be understood.
- new column: Classifiers – out of 9,178 nouns, 3,244 (35%) have 1 or more classifiers (Thai word + transliteration).
- changed: column 1 is now 'index'. Use it in combo with the last 2-3 columns on the right to produce your learning lists.
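Combining the index with another column to build a learning list could look like the sketch below. It assumes the sheet has been exported to CSV. The column names `word` and `is_compound`, the flag value `1`, and the sample rows are all hypothetical; check the ‘Notice’ tab for the real column names and values.

```python
import csv
import io

# Hypothetical CSV export: column names, values and rows are placeholders,
# not the real contents of the sheet.
SAMPLE = """\
index,word,is_compound
12,word_a,0
340,word_b,1
2750,word_c,1
4100,word_d,1
9800,word_e,1
"""


def build_list(rows, flag_column, max_index, limit):
    """Keep rows flagged '1' in `flag_column` with index <= max_index, lowest index first."""
    selected = [
        row for row in rows
        if row[flag_column] == "1" and int(row["index"]) <= max_index
    ]
    selected.sort(key=lambda row: int(row["index"]))
    return selected[:limit]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in build_list(rows, "is_compound", max_index=5000, limit=2):
    print(row["index"], row["word"])
```

Raising `max_index` trades frequency for variety: the deeper you go into the index, the more candidates match, which is why a long list is needed to extract a few thousand words of one specific kind.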
A note on meanings/senses
Why are all senses of a word aggregated? Can you not emphasise the most frequent meaning? One of the key findings of the original thesis is that when a word is introduced to children at a given level, all senses/facets of this word are also introduced, i.e. they are not developed over time.
Read this page for more details on the columns of the spreadsheet
Examples of usage
430 grammar words have a sense, and most have one or more examples. This is a good way to find out which you already know, and which you should research or ask your teacher about. Note that most rank pretty high in frequency, which figures.
Conversely, filter out grammar words and use the result to "go to town with Anki."
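One possible way to do this is to write the non-grammar rows to a tab-separated text file, a format Anki can import. This is only a sketch: the field names (`word`, `romanisation`, `meaning`, `pos`), the tag `grammar`, and the sample rows are assumptions made for illustration, not the sheet's actual layout.

```python
import csv
import io


def to_anki_tsv(rows, grammar_tag="grammar"):
    """Return tab-separated text (front, back), skipping rows tagged as grammar words."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in rows:
        if row["pos"] == grammar_tag:
            continue  # grammar words are studied separately
        front = row["word"]
        back = f'{row["romanisation"]}: {row["meaning"]}'
        writer.writerow([front, back])
    return buffer.getvalue()


# Placeholder rows, invented for illustration only.
rows = [
    {"word": "word_a", "romanisation": "rom_a", "meaning": "meaning a", "pos": "grammar"},
    {"word": "word_b", "romanisation": "rom_b", "meaning": "meaning b", "pos": "noun"},
]

print(to_anki_tsv(rows), end="")
```

Saving the returned text to a UTF-8 file gives one card per line, with the Thai word on the front and the romanisation and meaning on the back.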
Sources & licences
The thesis (link) is, as far as I can tell, in the public domain.
Lexitron v2 (link) NECTEC licence.
Wiktionary (link) is licensed under CC BY-SA 4.0 (Attribution-ShareAlike 4.0 International).
Volubilis v. 25.2 (link), also under CC BY-SA 4.0.
The Royal Institute Dictionary 1999 is also under NECTEC licence.
"This product is created by the adaptation of LEXiTRON developed by NECTEC."
This frequency list is shared under CC BY-SA 4.0, including the mention above as work derivative from a NECTEC production.
Links
If you have suggestions, the sheet is now not only public, but open for comments. However, if you disagree with some of the meanings, you should probably take it up with the authors of the corresponding dictionary. I welcome any constructive criticism.
The blog will be open for comments (registered GitHub users).
TLDR
A Thai word frequency list of ~20k words used in the primary and secondary school textbooks, with various dimensions to cut and slice custom lists.