Thai Wikipedia analysis
We processed a September 2025 dump of Thai Wikipedia to produce a frequency list based on a relatively neutral corpus. Throughout this blog, the resulting list is referred to as the 'thwiki' list. The corpus comes to roughly 400,000 articles and more than 150 million tokens. We processed it so you don't have to.
Sourcing
a big file
- https://dumps.wikimedia.org/thwiki
- update 20250901
- 460 MB bz2 archive, decompressed to a single 3 GB XML file
- contains high-level site info and 593,089 pages
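A file of this size is easier to stream than to load whole. The following is a minimal sketch, not the script we actually ran: it reads the compressed dump page by page using only the Python standard library. The function names and the file name in the usage note are illustrative assumptions.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the XML namespace prefix: '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, namespace, text) for each <page> element, one at a time."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, ns, text = "", "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "ns":
                ns = child.text or ""
            elif name == "text":
                # If a page holds several revisions, the last one read wins.
                text = child.text or ""
        yield title, ns, text
        elem.clear()  # release the page's content; the full XML is about 3 GB


def count_pages(path):
    """Count pages in a .xml.bz2 dump without decompressing it to disk."""
    with bz2.open(path, "rb") as stream:
        return sum(1 for _ in iter_pages(stream))
```

Usage would look like `count_pages("thwiki-20250901-pages-articles.xml.bz2")`, where the file name is an assumption about how the downloaded archive is called.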
thwiki as a corpus?
Assumptions:
- it is balanced: it is not a wiki where article- or stub-generating bots are active, and there are not many stubs for places, plants, animals, persons, companies, chemicals, etc.;
- it covers most practical aspects of the language, mostly in a semi-formal register.
The process
'Design' Decisions
- no history log
- no discussion thread
- no deleted pages
- strip all technical markers, and meta info (e.g. title/heading/etc.)
- remove any words/segments not in Thai:
  - including units (km, kg, etc.); this might be temporary
  - any Latin characters
  - words in Pali, Sanskrit, Khmer, written Chinese characters, etc.
- keep any segment even if it is not in the dictionary, but cut off at 5 occurrences (might need to raise this to 10; traditional corpus work uses 3-5);
- proper names etc. are likely to be cut off, or to have very low occurrence counts, so leave them in;
- at least in the first pass, do not attempt to remove infoboxes and category links;
  - these will increase the occurrences of certain words, but we feel it reflects actual increased usage.
  - we note that it might slightly skew the frequency for these words
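- repeat-word mark ๆ (mai yamok): three options were considered: (1) count the repeated form as a distinct word, (2) remove the symbol and count the word once, (3) count the word twice. Option 3 was rejected.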
- keep only the most recent revision of an article, but regardless of status
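The "Thai only" rule can be sketched with a regex over the Thai Unicode block (U+0E00 to U+0E7F). This is an illustration under assumptions, not our exact filter: the function names are made up, and the character ranges chosen here also drop Thai digits and the ฿ sign, which is stricter than a plain "inside the Thai block" test.

```python
import re

# Thai consonants, vowels and tone marks: U+0E01-U+0E3A and U+0E40-U+0E4E.
# Deliberately excluded: ฿ (U+0E3F), Thai digits (U+0E50-U+0E59) and the
# punctuation marks at U+0E4F and U+0E5A-U+0E5B.
THAI_WORD = re.compile(r"[\u0E01-\u0E3A\u0E40-\u0E4E]+")


def is_thai_word(token):
    """True if every character of the token is a Thai letter or mark."""
    return THAI_WORD.fullmatch(token) is not None


def keep_thai(tokens):
    """Drop units, Latin words, numbers, Chinese characters, mixed tokens, etc."""
    return [t for t in tokens if is_thai_word(t)]


print(keep_thai(["ภาษา", "km", "ไทย", "2025", "๑๒", "漢字"]))
```

Note that a script-based filter like this cannot tell Pali or Sanskrit words apart from Thai ones when they are written in Thai script; those would need a separate check.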
headers
Headers are not given a special weight, as they would be in e.g. Google PageRank. On the other hand, if some section titles appear frequently (think of "Overview", "Plot", "References"), the words in them are counted more often. They are therefore counted like any other text. The skew is likely marginal.
Why a cut-off of 5?
Paul Nation and extensive reading
Paul Nation and his colleagues' research indicates that 95-98% vocabulary coverage is required for extensive reading.
At 95% coverage: This translates to about one unknown word in every 20 running words. Research suggests that this level is adequate for gaining a basic understanding and guessing unknown words from context.
At 98% coverage: This means about one unknown word in every 50 running words. At this higher level, the density of unknown words is low enough that comprehension is more fluent and less interrupted. Guessing from context becomes more reliable, and the reader can focus more on the meaning of the text.
Summary: 95% for understanding and new acquisitions from context, 98% for comfortable reading.
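The "one unknown word in every N running words" figures follow directly from the coverage percentage. A tiny sketch of the arithmetic (the function name is ours, for illustration only):

```python
def running_words_per_unknown(coverage_percent):
    """Average number of running words per unknown word at a given coverage."""
    unknown_percent = 100 - coverage_percent
    if unknown_percent <= 0:
        raise ValueError("coverage must be below 100%")
    return 100 / unknown_percent


for coverage in (95, 98):
    print(coverage, running_words_per_unknown(coverage))
```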
Standards for frequency lists
When reducing corpora to frequency lists, linguists generally apply a cut-off of 3 to 5 occurrences. But to compare two frequency rankings, a cut-off of 5 is the standard (5+ occurrences are needed for the chi-squared test).
As our ultimate goal is to compare the thwiki and the แจ่มไพบูลย์/แรช rankings, we used a cut-off of 5 right from the start for thwiki.
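To make the chi-squared remark concrete, here is a minimal sketch of a Pearson chi-squared statistic for one word across two corpora, using a 2x2 table (the word versus all other words, corpus A versus corpus B). A common rule of thumb asks for an expected count of at least 5 in every cell, which is where a cut-off of 5 comes from. The function name and the numbers in the example are illustrative, not taken from our data.

```python
def chi_squared_2x2(count_a, total_a, count_b, total_b):
    """Return (chi2, smallest expected cell count) for one word in two corpora.

    Assumes the word occurs at least once overall and is not the only word,
    otherwise an expected count would be zero.
    """
    observed = [
        [count_a, total_a - count_a],  # corpus A: this word, all other words
        [count_b, total_b - count_b],  # corpus B: this word, all other words
    ]
    row_totals = [total_a, total_b]
    col_totals = [count_a + count_b,
                  (total_a - count_a) + (total_b - count_b)]
    grand_total = total_a + total_b

    chi2 = 0.0
    min_expected = float("inf")
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            min_expected = min(min_expected, expected)
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2, min_expected


# Made-up counts: 30 hits in 100 words versus 10 hits in 100 words.
chi2, min_expected = chi_squared_2x2(30, 100, 10, 100)
print(chi2, min_expected >= 5)
```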
Considering extensive reading and cut-off for the แจ่มไพบูลย์/แรช Frequency List for Thai Learners:
- rank 11,557 is the first word with 5 occurrences, rank 12,316 the last;
- 95% of cumulative frequency spans ranks 3,856-4,479;
- 98% spans ranks 6,495-8,361.
Conclusion: using a cut-off of 5 gives us roughly 12k words to compare with thwiki, and 12k is equivalent to 99% coverage of the textbooks.
Soft side
We used the newmm engine of PyThaiNLP, which, as far as we understand, is an enhanced and more accurate version of the dictionary-based, co-location, maximising algorithm used by Dr. Tantong Champaiboon in her thesis.
Python scripts and regexes were used for the pre- and post-processing.
We used the dictionary that comes with the algorithm; we are still working on superdict.
Excel was used for some final post-processing.
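The tokenise-and-count step can be sketched as follows. It assumes PyThaiNLP is installed and calls `word_tokenize` with `engine="newmm"` and its default dictionary. The wrapper and function names are ours, and the tokenizer is passed as a parameter so the counting logic can be tried with any splitter.

```python
from collections import Counter


def newmm_tokenize(text):
    """Segment Thai text with PyThaiNLP's newmm engine (default dictionary)."""
    # Imported lazily so the counting code below runs without PyThaiNLP.
    from pythainlp.tokenize import word_tokenize
    return word_tokenize(text, engine="newmm", keep_whitespace=False)


def count_tokens(texts, tokenize=newmm_tokenize):
    """Accumulate token counts over an iterable of article texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts
```

With PyThaiNLP available, `count_tokens(article_texts)` returns a `Counter` whose `most_common()` method gives the raw frequency ranking.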
The results
- Number of articles: 395,944
- Total tokens: 152,815,652
- Raw dictionary size: 2,414,272 (tokens)
After removing non-Thai characters and Arabic numerals, the clean dictionary size is 189,101.
At this stage, it still contains a few entries with spaces, tabs and/or non-breaking spaces, as well as punctuation and Thai numerals.
These can be eliminated in Excel: 1. sort on word; 2. delete the offending rows; 3. sort by count, descending.
After the Excel clean-up, the entries are no longer tokens, but words.
Count: 184,760 distinct words, for a total of 88,478,626 running words.
113,854 words have 5 or fewer occurrences and fall under the cut-off (with fewer than 5, it would have been ~109k).
After cut-off:
count 70,906
total words 88,283,877
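The cut-off step itself is simple. A minimal sketch, assuming the counts live in a `collections.Counter` and that, as in the figures above, words with 5 or fewer occurrences are dropped (function names and the toy counts are ours):

```python
from collections import Counter


def apply_cutoff(counts, cutoff=5):
    """Keep only words that occur more than `cutoff` times."""
    return Counter({word: n for word, n in counts.items() if n > cutoff})


def summarize(counts):
    """Return (number of distinct words, total running words)."""
    return len(counts), sum(counts.values())


# Toy counts, for illustration only.
toy = Counter({"ก": 10, "ข": 5, "ค": 6, "ง": 1})
print(summarize(toy), summarize(apply_cutoff(toy)))
```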
The list contains 585 repeat-words.
The ranks at which cumulative frequency enters and leaves the 95% and 98% bands:
| rank | word | count | raw freq | cum. freq | position |
|---|---|---|---|---|---|
| 12265 | เชี่ยน | 481 | 0.00% | 95.00% | FIRST 95 |
| 14377 | นายพราน | 361 | 0.00% | 95.99% | LAST 95 |
| 21818 | ลัสปัลมัส | 152 | 0.00% | 98.00% | FIRST 98 |
| 30434 | เก็บเสียง | 67 | 0.00% | 98.99% | LAST 98 |
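A table like this can be derived from the counts in a few lines. The sketch below assumes that "FIRST 95" and "LAST 95" mark the first and last rank whose cumulative frequency, truncated to a whole percent, equals 95 (and likewise for 98). Integer arithmetic is used to avoid rounding surprises at the band edges. The function names and the toy counts are illustrative.

```python
from collections import Counter


def ranked_rows(counts):
    """Return (rank, word, count, cumulative_count) rows, most frequent first."""
    rows, running = [], 0
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        running += count
        rows.append((rank, word, count, running))
    return rows


def band(rows, percent):
    """Return (first_rank, last_rank) where cumulative coverage truncates to
    `percent`, or None if no rank falls in that band."""
    total = rows[-1][3]  # the last cumulative count is the corpus total
    ranks = [rank for rank, _, _, cum in rows if cum * 100 // total == percent]
    return (ranks[0], ranks[-1]) if ranks else None


# Toy counts, for illustration only.
toy = Counter({"a": 950, "b": 4, "c": 3, "d": 23, "e": 20})
rows = ranked_rows(toy)
print(band(rows, 95), band(rows, 99))
```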
Further work
We are working on a comparison of the rankings in the 20k frequency list and in the one obtained from Wikipedia. Stay tuned.