Via a contact at CCTV I got hold of a couple of thousand subtitle files for films, dramas and TV shows. I ran them through my programs and got a very nice list of words and frequencies. Around 60,000,000 words in total were parsed. Once I was done, the same friend told me that a team from a Belgian university had done something similar as part of an academic study. They harvested some 33,500,000 words.
I found that study and compared the two lists, and for the first 2,000-3,000 words or so there is a close to exact match. A word can be a few places up or down in the list, but it is the same word, and the difference is seldom more than 5 places up or down. The Belgian team used native Chinese speakers to verify their lists and compared them to other sources etc. For the remaining words there is still a close similarity, but the variation can be a few hundred places up or down. The conclusion is that this frequency list matches real-life usage of Chinese much better than the traditional ones. Their report on the study is in fact very interesting reading and can be found here: http://expsy.ugent.be/subtlex-ch/Cai%20&%20Brysbaert%202010%20Plos%20One.pdf
I am at Shanghai Airport right now and the bandwidth does not allow me to upload the files as vocab lists, but when I am back in Japan I will upload them and make them public.
I am also on my way to doing the same harvesting on newspapers. I am up to 15,000,000 words so far. I checked the frequencies I have found so far against those lists as well. As expected, most of the daily-usage words are the same and have almost the same frequency. The big difference is that in newspapers there is much more talk about countries, company names, disasters, finance etc. I will make lists of those as well, since they are perfect complements to the subtitle lists. There is not a large number of differences, and those words are still highly frequent, just in another context.
One interesting note here is that most current Chinese frequency lists are based on literature. They do not in fact mirror the daily usage of words as well as films or internet sources do.
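
For anyone who wants to try the same kind of comparison between two frequency lists, here is a rough Python sketch of the idea. It is not the program I actually used; the function names `rank_words` and `rank_shifts` are made up for illustration, and the word counts are invented toy numbers rather than figures from either list. It ranks the words in each list by frequency and reports how many places each of the top words moves between the two.

```python
from collections import Counter


def rank_words(freqs):
    """Map each word to its 1-based rank, most frequent first.

    Ties are broken by the word itself so the ranking is deterministic.
    """
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ordered, start=1)}


def rank_shifts(freqs_a, freqs_b, top_n):
    """For the top_n words of list A, return how far each moves in list B.

    A positive value means the word sits lower down in list B.
    Words missing from list B are skipped.
    """
    ranks_a = rank_words(freqs_a)
    ranks_b = rank_words(freqs_b)
    return {
        word: ranks_b[word] - rank
        for word, rank in ranks_a.items()
        if rank <= top_n and word in ranks_b
    }


# Invented toy counts, only to show the mechanics.
mine = Counter({"的": 500, "我": 400, "你": 300, "是": 200, "了": 100})
theirs = Counter({"的": 900, "你": 700, "我": 650, "了": 300, "是": 250})

shifts = rank_shifts(mine, theirs, top_n=5)
largest = max(abs(s) for s in shifts.values())
print(shifts)
print("largest shift:", largest)
```

With real lists you would load the counts from the two files and set `top_n` to something like 3000, then look at how many words move more than a handful of places.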