As mentioned in the "Context centric learning" thread: http://www.skritter.com/forum/topic?id=92990307&comments=23 I have uploaded the program for anyone who might like to play with the code. I got an unexpected trip back to Japan with a 28-hour stopover, so I took the time to get the code. It is totally free, open source; do whatever you like with it.
What it does is something like this:
- Crawl any Chinese site and collect all the links on the site (a rough sketch of this step follows the list).
- Machine-parse all the Chinese text and create popup translations for all the words, or just for the words that I do not know.
- Import my Skritter word list and use it when parsing.
- Store all the words found on the pages to build a "most used" word list.
etc.
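To give an idea of the crawl-and-collect step, here is a minimal sketch in Python. The original program is in an older language, so the function names, library choices, and the breadth-first strategy here are my assumptions, not the actual implementation:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Gathers every href found in anchor tags on one page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, limit=100):
    """Breadth-first crawl collecting up to `limit` same-site URLs
    (100 is the program's default link limit mentioned below)."""
    host = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    while queue and len(seen) < limit:
        url = queue.popleft()
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except OSError:
            continue  # skip unreachable pages; the original has little error handling too
        collector = LinkCollector()
        collector.feed(page)
        for href in collector.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
                if len(seen) >= limit:
                    break
    return seen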
The output of the scan can look like this: https://docs.google.com/leaf?id=0B-BfzC-4_dxeY2RiMmFkYTYtNzBjNy00Y2QxLWI2OWUtMDBlZTI4MDhjNGUw&hl=en
Download it, save it as HTML and open it. Note: the file uses a small JavaScript (totally safe) to show the popup translations, so you will probably have to allow that in your browser.
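The idea behind the annotated output can be sketched like this in Python: wrap each unknown word in a tag that shows its translation on hover. The real file uses a JavaScript popup; the plain title attribute below is a simplification I chose for the sketch, and the translations dict is a made-up example:

import html

def annotate(text, translations):
    """Wrap each word from `translations` (word -> gloss) in a hover
    tooltip; everything else is passed through escaped."""
    words = sorted(translations, key=len, reverse=True)  # prefer longest match
    out, i = [], 0
    while i < len(text):
        for word in words:
            if text.startswith(word, i):
                gloss = html.escape(translations[word])
                out.append(f'<span title="{gloss}" style="border-bottom: 1px dotted">{word}</span>')
                i += len(word)
                break
        else:
            out.append(html.escape(text[i]))
            i += 1
    return "".join(out)

# e.g. annotate("我喜欢中文", {"喜欢": "to like", "中文": "Chinese language"})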
For the source:
https://docs.google.com/leaf?id=0B-BfzC-4_dxeZTdmMDBiYTYtN2Q3OC00YzczLTgwN2MtNGJlOWVhY2EzNjZl&hl=en&authkey=CKK57rMK
For the complete installation package:
https://docs.google.com/leaf?id=0B-BfzC-4_dxeNDdmNTdjYzUtMDgxZS00YWY2LThhY2EtNzg5YWYzZWUzYWMz&hl=en&authkey=CKj75PwG
This has not been tested on any other PC, so there might be problems with the installation.
The code is only some very old tinkering on my side to test different ideas, with very little error-handling code and so on. It can handle simplified and traditional Chinese, and it would be relatively easy to get it to work with Japanese as well.
To use it:
- Install and run the program.
- Export your Skritter list, open it in Notepad and save it in Unicode format. It needs to be exactly Unicode! (See the import sketch after this list.)
- Click the button "Import Skritter list" and find your file.
- Ready to go. Type any URL into the textbox or choose one of mine from the dropdown, then click the "start" button. By default the limit is set to 100 links to follow; to change it, just replace 100 with any other number. There is also a textbox containing the number 4: this limits the search to words at most 4 characters long. Change it to any length you like to search for longer words as well. Longer words = longer parse time.
- The crawl is usually fast. Once it is done you can single-click any link to do a new search from that link, or double-click to load that page only. In most cases, however, I like to scan all the links, so I click "get all URL:s". That will take some time! Just let it run in the background. What it does is fetch the text from each page and scan it for words. Words I know are simply printed out, and the ones I do not know get a popup translation. I can also choose to translate the whole text, search only for known words that are 2 or more characters long, etc. The output files end up in the "parsed" subfolder of the app folder. Each file name contains the number of words in the text, the number of words I knew, and the percentage I knew (a sketch of this counting step follows the list). The document contains a link to the original text.
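For anyone curious, here is roughly what the import and the known-word count could look like in Python. Two notes: Notepad's "Unicode" option saves UTF-16, which is what the decoding below assumes, and the greedy longest-match segmentation is my guess at how the parser works, not a description of the actual code:

def load_word_list(path):
    """Read the exported Skritter list; Notepad's "Unicode" option
    writes UTF-16 with a BOM, which this codec decodes directly."""
    with open(path, encoding="utf-16") as f:
        return {line.strip() for line in f if line.strip()}

def coverage(text, known, max_len=4):
    """Greedy longest-match segmentation capped at `max_len` characters
    (the textbox value); returns (total words, words I knew)."""
    total = hits = 0
    i = 0
    while i < len(text):
        for length in range(max_len, 0, -1):
            if text[i:i + length] in known:
                total += 1
                hits += 1
                i += length
                break
        else:
            # lone character not on the list: count it as an unknown
            # word only if it is a CJK character
            if "\u4e00" <= text[i] <= "\u9fff":
                total += 1
            i += 1
    return total, hits

# The numbers in the output file name would then come from something like:
# total, hits = coverage(page_text, load_word_list("skritter.txt"))
# name = f"{total}_{hits}_{round(100 * hits / total)}.html"  # hypothetical naming scheme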
The code is old and ugly, but the idea as such is OK. At the same time I get word counts for the most used words on the sites I scan, which helps me pick new words to learn. Maybe someone out there can help convert the idea to a modern programming language and clean it up. I do not have the time, so that is why I still use this old one.
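The "most used" tally is the simplest part; a self-contained sketch of it, again assuming the longest-match idea above, could be:

from collections import Counter

def word_counts(texts, words, max_len=4):
    """Tally how often each dictionary word appears across all the
    scanned pages, using longest-match segmentation."""
    counts = Counter()
    for text in texts:
        i = 0
        while i < len(text):
            for length in range(max_len, 0, -1):
                candidate = text[i:i + length]
                if candidate in words:
                    counts[candidate] += 1
                    i += length
                    break
            else:
                i += 1
    return counts

# counts.most_common(50) then lists the fifty most frequent words,
# a handy starting point for picking new words to learn.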