As mentioned in the "Context centric learning" thread: http://www.skritter.com/forum/topic?id=92990307&comments=23 I have uploaded the program for anyone who might like to play with the code. I got an unexpected trip back to Japan with a 28-hour stopover, so I took the time to get the code. It is totally free, open source; do whatever you like with it.
What it does is something like this:
- Crawl any Chinese site and collect all the links on the site (a rough sketch of this step follows the list).
- Machine-parse all the Chinese text and create popup translations for all the words, or just for the words that I do not know.
- Import my Skritter word list and use it when parsing.
- Store all the words found on the pages to build a "most used" word list.
etc.
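To give an idea of the crawl-and-collect step, here is a minimal sketch in Python. The original program is in an older language, so the function names, library choices, and the breadth-first strategy here are my assumptions, not the actual implementation:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Gathers every href found in anchor tags on one page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, limit=100):
    """Breadth-first crawl collecting up to `limit` same-site URLs
    (100 is the program's default link limit mentioned below)."""
    host = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    while queue and len(seen) < limit:
        url = queue.popleft()
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except OSError:
            continue  # skip unreachable pages; the original has little error handling too
        collector = LinkCollector()
        collector.feed(page)
        for href in collector.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
                if len(seen) >= limit:
                    break
    return seen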
The output of the scan can look like this: https://docs.google.com/leaf?id=0B-BfzC-4_dxeY2RiMmFkYTYtNzBjNy00Y2QxLWI2OWUtMDBlZTI4MDhjNGUw&hl=en
Download it, save it as HTML and open it. Note: the file uses a small JavaScript (totally safe) to show the popup translations, so you will probably have to allow that in your browser.
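The idea behind the annotated output can be sketched like this in Python: wrap each unknown word in a tag that shows its translation on hover. The real file uses a JavaScript popup; the plain title attribute below is a simplification I chose for the sketch, and the translations dict is a made-up example:

import html

def annotate(text, translations):
    """Wrap each word from `translations` (word -> gloss) in a hover
    tooltip; everything else is passed through escaped."""
    words = sorted(translations, key=len, reverse=True)  # prefer longest match
    out, i = [], 0
    while i < len(text):
        for word in words:
            if text.startswith(word, i):
                gloss = html.escape(translations[word])
                out.append(f'<span title="{gloss}" style="border-bottom: 1px dotted">{word}</span>')
                i += len(word)
                break
        else:
            out.append(html.escape(text[i]))
            i += 1
    return "".join(out)

# e.g. annotate("我喜欢中文", {"喜欢": "to like", "中文": "Chinese language"})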
For the source:
https://docs.google.com/leaf?id=0B-BfzC-4_dxeZTdmMDBiYTYtN2Q3OC00YzczLTgwN2MtNGJlOWVhY2EzNjZl&hl=en&authkey=CKK57rMK
For the complete installation package:
https://docs.google.com/leaf?id=0B-BfzC-4_dxeNDdmNTdjYzUtMDgxZS00YWY2LThhY2EtNzg5YWYzZWUzYWMz&hl=en&authkey=CKj75PwG
This has not been tested on any other PC, so there might be problems with the installation.
The code is only some very old tinkering on my side to test different ideas, with very little error-handling code and so on. It can handle simplified and traditional Chinese, and it would be relatively easy to get it to work with Japanese as well.
To use it:
- Install and run the program.
- Export your Skritter list, open it in Notepad and save it in Unicode format. It needs to be exactly Unicode! (See the import sketch after this list.)
- Click the button "Import Skritter list" and find your file.
- Ready to go. Type any URL into the textbox or choose one of mine from the dropdown, then click the "start" button. By default the limit is set to 100 links to follow; to change it, just replace 100 with any other number. There is also a textbox containing the number 4: this limits the search to words at most 4 characters long. Change it to any length you like to search for longer words as well. Longer words = longer parse time.
- The crawl is usually fast. Once it is done you can single-click any link to do a new search from that link, or double-click to load that page only. In most cases, however, I like to scan all the links, so I click "get all URL:s". That will take some time! Just let it run in the background. What it does is fetch the text from each page and scan it for words. Words I know are simply printed out, and the ones I do not know get a popup translation. I can also choose to translate the whole text, search only for known words that are 2 or more characters long, etc. The output files end up in the "parsed" subfolder of the app folder. Each file name contains the number of words in the text, the number of words I knew, and the percentage I knew (a sketch of this counting step follows the list). The document contains a link to the original text.
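For anyone curious, here is roughly what the import and the known-word count could look like in Python. Two notes: Notepad's "Unicode" option saves UTF-16, which is what the decoding below assumes, and the greedy longest-match segmentation is my guess at how the parser works, not a description of the actual code:

def load_word_list(path):
    """Read the exported Skritter list; Notepad's "Unicode" option
    writes UTF-16 with a BOM, which this codec decodes directly."""
    with open(path, encoding="utf-16") as f:
        return {line.strip() for line in f if line.strip()}

def coverage(text, known, max_len=4):
    """Greedy longest-match segmentation capped at `max_len` characters
    (the textbox value); returns (total words, words I knew)."""
    total = hits = 0
    i = 0
    while i < len(text):
        for length in range(max_len, 0, -1):
            if text[i:i + length] in known:
                total += 1
                hits += 1
                i += length
                break
        else:
            # lone character not on the list: count it as an unknown
            # word only if it is a CJK character
            if "\u4e00" <= text[i] <= "\u9fff":
                total += 1
            i += 1
    return total, hits

# The numbers in the output file name would then come from something like:
# total, hits = coverage(page_text, load_word_list("skritter.txt"))
# name = f"{total}_{hits}_{round(100 * hits / total)}.html"  # hypothetical naming scheme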
The code is old and ugly, but the idea as such is OK. At the same time I get word counts for the most used words on the sites I scan, which helps me pick new words to learn. Maybe someone out there can help convert the idea to a modern programming language and clean it up. I do not have the time, so that is why I still use this old one.
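The "most used" tally is the simplest part; a self-contained sketch of it, again assuming the longest-match idea above, could be:

from collections import Counter

def word_counts(texts, words, max_len=4):
    """Tally how often each dictionary word appears across all the
    scanned pages, using longest-match segmentation."""
    counts = Counter()
    for text in texts:
        i = 0
        while i < len(text):
            for length in range(max_len, 0, -1):
                candidate = text[i:i + length]
                if candidate in words:
                    counts[candidate] += 1
                    i += length
                    break
            else:
                i += 1
    return counts

# counts.most_common(50) then lists the fifty most frequent words,
# a handy starting point for picking new words to learn.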