How do I make an online dictionary?

gangleri2001

Hi there,

Lately I've been wondering whether I could make an online Chinese-Catalan-Chinese dictionary. I'm sick of having to use the paper version, or of finding "strange" English words in the available English-Chinese dictionaries that I then have to look up in Catalan-English dictionaries.

I personally know the person who made the paper Catalan-Chinese dictionary, and I think I could try to get him to join the project of making a digital one. The problem is: I've no idea how to start. Sure, I'll have to build a monstrous database, but the requirements of Chinese (pinyin plus other romanization systems, simplified and traditional characters, yitis, chengyus, etc.) make everything harder.

I've been thinking about what it should look like, and I've come to this conclusion:

- Simplified and traditional characters (where applicable) in all entries.
- Yitis of the same character (where applicable) in the one-character entries only.
- Pinyin search as the priority, but also offer tables for Wade-Giles and other transcription methods (e.g. zhuyin).
- Java Applet to recognize handwritten characters.
- Independent entries for chengyus, xiehouyus and other expressions.
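
One piece of the pinyin-first search in the list above can be sketched on its own: turning tone-marked pinyin into a plain key, so that "nǐ hǎo" and "ni hao" find the same entry. This is only a minimal sketch in Python (the thread names no language for this part). The function name `pinyin_search_key` and the choice to write ü as "v" are assumptions, not something the thread specifies.

```python
import unicodedata


def pinyin_search_key(text: str) -> str:
    """Reduce tone-marked pinyin to a plain lowercase key for searching."""
    # Decompose accented letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Keep the u/ü distinction (lu vs. lü) by writing ü as "v".
    # This is a common keyboard convention, chosen here as an assumption.
    decomposed = decomposed.replace("u\u0308", "v")
    # Drop the remaining combining marks, i.e. the tone marks.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Collapse whitespace so differently spaced input gives the same key.
    return " ".join(stripped.split())
```

Storing this key next to the tone-marked pinyin would let the search match on the key while the entry page still shows the proper tone marks.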

Now the question is: how do I start from scratch? Because I've absolutely NO IDEA how to make an online dictionary, much less a CHINESE one.
 
Edit: I think I misunderstood.
 
Your main problem here isn't going to be development of the code (if you plan on building everything from scratch), but rather acquisition of the required data. Collecting it by hand would probably take too long, so you might have to purchase an existing dictionary data file from somewhere and begin your work by complementing it with your own data. Or maybe, if you're lucky, you could find a public domain dictionary file somewhere? Maybe stuff like that exists, who knows.

Development of the software should be simple. You need a database (likely MySQL or MSSQL) and knowledge of PHP (it would be cheaper for you to develop in PHP rather than ASP, CF or even JSP). You would pretty much only have to build a search function and a word-display page. What else does a dictionary need?
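
To make "a database plus a search function" concrete, here is a minimal sketch. It uses Python and SQLite purely as stand-ins for the PHP and MySQL/MSSQL mentioned above. The table name `entries`, its columns, and the sample rows in the usage note are all assumptions for illustration.

```python
import sqlite3
from typing import List, Tuple


def open_dictionary(path: str = ":memory:") -> sqlite3.Connection:
    """Create (if needed) a one-table dictionary and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS entries (
            id          INTEGER PRIMARY KEY,
            simplified  TEXT NOT NULL,
            traditional TEXT NOT NULL,
            pinyin      TEXT NOT NULL,  -- tone-less search key, e.g. 'zi dian'
            catalan     TEXT NOT NULL   -- assumed to be stored in lowercase
        )
        """
    )
    # Indexes keep lookups fast once the table holds many thousands of rows.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_pinyin ON entries (pinyin)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_catalan ON entries (catalan)")
    return conn


def search(conn: sqlite3.Connection, term: str) -> List[Tuple[str, str, str, str]]:
    """Exact-match search across characters, pinyin and Catalan headwords."""
    term = term.strip().lower()
    rows = conn.execute(
        "SELECT simplified, traditional, pinyin, catalan FROM entries "
        "WHERE simplified = ? OR traditional = ? OR pinyin = ? OR catalan = ?",
        (term, term, term, term),
    )
    return rows.fetchall()
```

With a row such as `("字典", "字典", "zi dian", "diccionari")` inserted, `search(conn, "Diccionari")` returns that row, and the word-display page would simply render it.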

I'm not sure about the handwritten characters though. Why do you want to go that route? Can't all Chinese characters be represented on-screen in some fashion?
 
Pardon my wall-o-text.

My first question is: do you have any experience with projects like this? Judging by your not knowing where to even start, I'm going to go with no.

In that case, what warpus suggested should be your first priority. It would be pretty simple to get a word-for-word translator running. You'd need a dictionary file and some sort of database to store it in. A plain text file is not a good idea, since searching it is massively slow when the file is thousands of lines long. You also don't want to keep it all in memory, since it's a pretty big chunk of RAM, and it's not well organized unless you make a data structure for it.

Your basic search function will allow you to literally translate a word, no context at all. It will take the input data, whether it's in Catalan or some form of Chinese, and find the database entry for it. Then, depending on the "to" language asked, you print out the correct stored translation for that word.
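
The literal lookup described above could be sketched roughly as follows, again in Python as a stand-in. The `ENTRIES` list is hypothetical sample data kept in memory only to keep the sketch short (a real dictionary would query the database). The names `contains_cjk` and `translate` and the three field names are likewise assumptions.

```python
from typing import Optional

# Hypothetical sample rows; a real dictionary would query the database instead.
ENTRIES = [
    {"hanzi": "字典", "pinyin": "zi dian", "catalan": "diccionari"},
    {"hanzi": "书", "pinyin": "shu", "catalan": "llibre"},
]


def contains_cjk(text: str) -> bool:
    """True if any character is in the main CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def translate(word: str, to_lang: str) -> Optional[str]:
    """Literal, context-free lookup; to_lang is 'catalan', 'hanzi' or 'pinyin'."""
    if to_lang not in ("catalan", "hanzi", "pinyin"):
        raise ValueError(f"unknown target language: {to_lang}")
    word = word.strip().lower()
    # Characters can only match the hanzi field; Latin-script input
    # may be either pinyin or Catalan, so both fields are tried.
    source_fields = ("hanzi",) if contains_cjk(word) else ("pinyin", "catalan")
    for entry in ENTRIES:
        if any(entry[field] == word for field in source_fields):
            return entry[to_lang]
    return None
```

Here the input language is guessed from the script instead of being asked for, which is a design choice of this sketch; an explicit "from" selector would work just as well.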

Translating based on context will be a LOT harder. You need to know some advanced language processing to do that well. Quite outside of your skill level most likely.

Handwritten character analysis would also be fairly difficult. The most brute-force way I can think of doing it is taking the user input as an image, doing a comparison based on distance from your stored character images and basing your guess of what the user entered on that. It'd be pretty danged slow though, as you'd have to analyze every character image for the language. It's not so hard for a language like English where we've only got 26 characters plus a handful of punctuation marks, but I would think for Chinese you'd want to have something a little more efficient.
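
As a toy illustration of that brute-force idea, the sketch below compares a drawing against every stored template and ranks them by the number of differing pixels (Hamming distance). The 3x3 bitmaps in `TEMPLATES` are hypothetical stand-ins, since real characters would need far larger images. The distance measure is an assumption too, because the post above does not say which one to use.

```python
from typing import Dict, List, Tuple

Bitmap = Tuple[str, ...]  # rows of '#' (ink) and '.' (blank), all the same size

# Hypothetical 3x3 "templates"; real characters would need far larger bitmaps.
TEMPLATES: Dict[str, Bitmap] = {
    "一": ("...", "###", "..."),
    "十": (".#.", "###", ".#."),
    "口": ("###", "#.#", "###"),
}


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Number of pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def closest_matches(drawing: Bitmap, limit: int = 3) -> List[Tuple[str, int]]:
    """Rank every stored template by distance to the drawing (brute force)."""
    scored = [(char, hamming(drawing, tpl)) for char, tpl in TEMPLATES.items()]
    return sorted(scored, key=lambda pair: pair[1])[:limit]
```

Every lookup touches every template, which is exactly the cost the post warns about once there are thousands of characters.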

------------
Now, for how to go about with the database. My suggestion initially is to do a simple Catalan->romanized Chinese dictionary. You can refine your search functions on that dictionary, and it would be (I assume) fairly simple to create the database.
I would also suggest creating a data structure to hold each character of each of the writing methods for Chinese, just so you don't have to deal with the weird characters in-code. Store a small image or SVG for each character. Store a word's character makeup alongside the romanized version in the database. This way, when the user enters a word or character using that writing system, you can break it down to its component 'syllables', look for the entries that match those syllables and then put out the translation based on that. The reverse would also apply. When a user enters a Catalan word, you find its entry in the database and, if the user requested it, take the 'syllables', convert them to images, stitch them together and display as a single image.
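
A rough sketch of storing a word's "character makeup" alongside its romanization might look like this. It assumes one space-separated pinyin syllable per character, which does not hold for every word. The function names and the numbered-tone sample data used with it are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def character_makeup(hanzi: str, pinyin: str) -> List[Tuple[str, str]]:
    """Pair each character with its syllable, e.g. '字典' with 'zi4 dian3'."""
    syllables = pinyin.split()
    if len(syllables) != len(hanzi):
        raise ValueError("expected one syllable per character")
    return list(zip(hanzi, syllables))


def build_character_index(words: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each character to the words (keys of `words`) that contain it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for hanzi in words:
        # dict.fromkeys de-duplicates repeated characters but keeps their order.
        for char in dict.fromkeys(hanzi):
            index[char].append(hanzi)
    return dict(index)
```

For example, `character_makeup("字典", "zi4 dian3")` gives `[("字", "zi4"), ("典", "dian3")]`, and the index lets a search for a single character list every word that contains it.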


If you're serious about this, and you get that person to help you, I suggest you sit down and think hard about the design of this. Before you put in even a single line of code, make sure everything is written out on paper: what the database will look like, how you will represent Chinese characters, both in storage and in-code, what the UI will look like, and what kind of resources you would need for this. When you sit down to code, you should know the exact code you need to write down. The only thing left should be writing it, and testing it.
 
Pardon my wall-o-text.

My first question is: do you have any experience with projects like this? Judging by your not knowing where to even start, I'm going to go with no.

Not at all. But I think that a project like this requires someone who knows Chinese instead of someone who can program because, as you will see, Chinese is a very "special" language.

In that case, what warpus suggested should be your first priority. It would be pretty simple to get a word-for-word translator running. You'd need a dictionary file and some sort of database to store it in. A plain text file is not a good idea, since searching it is massively slow when the file is thousands of lines long. You also don't want to keep it all in memory, since it's a pretty big chunk of RAM, and it's not well organized unless you make a data structure for it.

I didn't mean that, I mean an online dictionary like this:

http://www.nciku.com/

Handwritten character analysis would also be fairly difficult. The most brute-force way I can think of doing it is taking the user input as an image, doing a comparison based on distance from your stored character images and basing your guess of what the user entered on that. It'd be pretty danged slow though, as you'd have to analyze every character image for the language. It's not so hard for a language like English where we've only got 26 characters plus a handful of punctuation marks, but I would think for Chinese you'd want to have something a little more efficient.

I didn't mean that kind of analysis, but applets like the one you can see at nciku:

nciku.png


I'd really like to know where I can get the source of one of these.

------------
Now, for how to go about with the database. My suggestion initially is to do a simple Catalan->romanized Chinese dictionary. You can refine your search functions on that dictionary, and it would be (I assume) fairly simple to create the database.
I would also suggest creating a data structure to hold each character of each of the writing methods for Chinese, just so you don't have to deal with the weird characters in-code. Store a small image or SVG for each character. Store a word's character makeup alongside the romanized version in the database. This way, when the user enters a word or character using that writing system, you can break it down to its component 'syllables', look for the entries that match those syllables and then put out the translation based on that. The reverse would also apply. When a user enters a Catalan word, you find its entry in the database and, if the user requested it, take the 'syllables', convert them to images, stitch them together and display as a single image.

Understood. Give priority to pinyin.

If you're serious about this, and you get that person to help you, I suggest you sit down and think hard about the design of this. Before you put in even a single line of code, make sure everything is written out on paper: what the database will look like, how you will represent Chinese characters, both in storage and in-code, what the UI will look like, and what kind of resources you would need for this. When you sit down to code, you should know the exact code you need to write down. The only thing left should be writing it, and testing it.

I'm already doing it.

BTW, if you really want to understand what I want to do, you better check nciku. Sorry, I'm a total n00b.
 
In that case I would first make a list of improvements I would like to make over existing functionality on that page (nciku)

What do you want to improve? Why haven't they made those improvements? Is it feasible?

For a project with a large database such as this, it is imperative that the underlying database table structure is designed well. You're going to need someone to look at a list of the things you want this thing to do (this might be best presented to a programmer/database person in the form of use cases/use case diagrams) and he/she is going to go: "Ahh! Okay! We're going to need such and such database structure for this thing". You're going to nod and smile. He/she is going to ask you questions and modify the table structure as necessary.
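
To give a feel for what "such and such database structure" might mean for the features listed in the first post, here is one possible table layout, sketched with Python and SQLite as stand-ins. Every table and column name is an assumption, and a real design would come out of the use-case discussion described above.

```python
import sqlite3
from typing import List

# One possible layout; all table and column names are hypothetical.
SCHEMA = """
CREATE TABLE characters (
    id          INTEGER PRIMARY KEY,
    simplified  TEXT NOT NULL,
    traditional TEXT NOT NULL
);
CREATE TABLE variants (            -- yitis of a character
    character_id INTEGER NOT NULL REFERENCES characters(id),
    form         TEXT NOT NULL
);
CREATE TABLE entries (
    id       INTEGER PRIMARY KEY,
    kind     TEXT NOT NULL CHECK (kind IN ('word', 'chengyu', 'xiehouyu')),
    headword TEXT NOT NULL,
    pinyin   TEXT NOT NULL
);
CREATE TABLE senses (              -- one row per Catalan translation
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    catalan  TEXT NOT NULL
);
"""


def create_database() -> sqlite3.Connection:
    """Build an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def senses_for(conn: sqlite3.Connection, headword: str) -> List[str]:
    """Return the Catalan senses of a headword in insertion order."""
    rows = conn.execute(
        "SELECT s.catalan FROM senses s JOIN entries e ON e.id = s.entry_id "
        "WHERE e.headword = ? ORDER BY s.rowid",
        (headword,),
    )
    return [row[0] for row in rows]
```

Splitting senses and variants into their own tables lets one entry carry several translations or several yitis without repeating the headword.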

Collecting all the data to fill this database with still sounds like one of the major obstacles in this project. How are you going to do that?

Only after you've overcome these difficulties can you actually start to plan the programming, the GUI, etc.
 
Not at all. But I think that a project like this requires someone who knows Chinese instead of someone who can program because, as you will see, Chinese is a very "special" language.
You will need someone who knows how these things work in-code. An idea is great, but if you want it to work you gotta have some way of writing it in a feasible manner.

I didn't mean that, I mean an online dictionary like this:

http://www.nciku.com/
So you pretty much need a word for word translator, not context based.

I didn't mean that kind of analysis, but applets like the one you can see at nciku:

nciku.png


I'd really like to know where I can get the source of one of these.

That looks like it's doing exactly what I described. It's finding a list of closest matches to whatever you draw.


I'm already doing it.

BTW, if you really want to understand what I want to do, you better check nciku. Sorry, I'm a total n00b.

You still need to sit down and plan it. Diving headfirst into a project with no planning ahead is a recipe for disaster. Take it from someone who's been there.
 
Not at all. But I think that a project like this requires someone who knows Chinese instead of someone who can program because, as you will see, Chinese is a very "special" language.

It would require both. There is no way somebody who can't program well could wing it. Also, the dev needs to understand localization issues that the language specialist will not know about, or he will go about solving this problem the wrong way.

If this is to be any good, you will need to put equal emphasis on both aspects. Contrary to what I have read so far, this would really be a large project. I'm pretty familiar with the situation, as I have worked with web applications that incorporated Japanese character recognition with mouse and tablet PC input.

Handwritten character analysis would also be fairly difficult. The most brute-force way I can think of doing it is taking the user input as an image, doing a comparison based on distance from your stored character images and basing your guess of what the user entered on that.


You would just use existing APIs, and you want to take the user input as ink, not images. For example, you could use the Microsoft handwriting recognition and ink APIs. You can use an interface such as Silverlight, and it's actually pretty simple. I've done it with Japanese using client-side scripting and very little code. That method has the best support for character input and analysis, although he might want to use something that has better cross-platform support.
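
To show what "ink, not images" means as data, here is a small sketch in Python. It is not the Microsoft API, or any real recognizer. It only shows ink as ordered pen strokes, scaled into a unit box, with a cheap filter on stroke count. The type names, the function names and the `STROKE_COUNTS` table are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Stroke = List[Point]  # one pen-down to pen-up movement
Ink = List[Stroke]    # a whole handwritten character


def normalize(ink: Ink) -> Ink:
    """Scale the ink into a unit box so size and position do not matter."""
    xs = [x for stroke in ink for x, _ in stroke]
    ys = [y for stroke in ink for _, y in stroke]
    if not xs:
        return []
    min_x, min_y = min(xs), min(ys)
    # Use one scale for both axes to preserve the aspect ratio.
    span = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    return [
        [((x - min_x) / span, (y - min_y) / span) for x, y in stroke]
        for stroke in ink
    ]


# Hypothetical table; a real system would load stroke counts from character data.
STROKE_COUNTS: Dict[str, int] = {"一": 1, "十": 2, "口": 3, "人": 2}


def candidates_by_stroke_count(ink: Ink) -> List[str]:
    """Cheap first filter: keep characters with the same number of strokes."""
    return [char for char, count in STROKE_COUNTS.items() if count == len(ink)]
```

Because ink keeps stroke order and count, a recognizer can discard most characters before doing any detailed comparison, which a flat image cannot offer.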
 
You would just use existing APIs, and you want to take the user input as ink, not images. For example, you could use the Microsoft handwriting recognition and ink APIs. You can use an interface such as Silverlight, and it's actually pretty simple. I've done it with Japanese using client-side scripting and very little code. That method has the best support for character input and analysis, although he might want to use something that has better cross-platform support.

You're completely right. For some reason it slipped my mind that there are likely APIs that already do this. The question with said APIs though is, like you said, cross-platform support, but also cost. They might not be free.
 
You're completely right. For some reason it slipped my mind that there are likely APIs that already do this. The question with said APIs though is, like you said, cross-platform support, but also cost. They might not be free.

Yes, that's why I was thinking about contacting nciku and buying theirs.
 
I just noticed that Google Translate has Catalan and Chinese as languages. It's possible you can use their API and make your own frontend for it.
 
:confused:

But what about information on the characters and stuff? I want to make a dictionary, not a computer translator.
 
Correct, but you can use Google's translation as a backend for your dictionary. It will know how to translate the words; all you would need is a way to tap into that.

It simplifies the work you have to do, in that you do not need to create a dictionary database; you merely need to access Google's translation service. Like I said, I believe they offer an API to do that. You can always build your own stuff on top of it, including character recognition.
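
If you went that way, it would probably be worth hiding the remote service behind a small interface of your own, so the rest of the site does not depend on one provider. The sketch below makes no real network call and does not use Google's actual API. `fake_backend`, the language codes and the caching wrapper are all hypothetical.

```python
from typing import Callable, Dict, Tuple

# A backend is any function (text, source_lang, target_lang) -> translation.
Backend = Callable[[str, str, str], str]


def cached(backend: Backend) -> Backend:
    """Wrap a backend so repeated lookups do not hit the service again."""
    cache: Dict[Tuple[str, str, str], str] = {}

    def lookup(text: str, source: str, target: str) -> str:
        key = (text, source, target)
        if key not in cache:
            cache[key] = backend(text, source, target)
        return cache[key]

    return lookup


def fake_backend(text: str, source: str, target: str) -> str:
    """Stand-in for a real translation service; no network call is made."""
    table = {("llibre", "ca", "zh"): "书"}
    return table.get((text, source, target), "")
```

Swapping `fake_backend` for a function that calls a real service would leave the frontend code unchanged, and the cache keeps repeated lookups from costing extra requests.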
 
Man, no offense, but this project is going to be such a huge fail..

Why? I've not even begun. This summer I'll begin.
 
Why? I've not even begun. This summer I'll begin.

1. you're doing it backwards

2.
Not at all. But I think that a project like this requires someone who knows Chinese instead of someone who can program

3. you don't understand how important it is to lay down a solid foundation during the planning stage of a project like this.. you have no expertise with database design and as such have ignored every single piece of advice given to you regarding that. you also know nothing about programming.

4. i've come across people like you, with ideas, with no experience as to how to put it into practice and the steps involved. it usually doesn't work out.
 
Aww don't discourage the guy ;)

gangleri2001: what warpus is saying is very true. If you want to have a snowball's chance in hell of doing this, and doing this right, then you'd better take a good long time to plan everything. You need to find someone with experience, and you also need to have this planned out to the point that your entire program is runnable on paper before you write as much as puts("Hello World");

Don't ignore advice given to you by others. You have no basis from which to ignore it, unless it is obviously bad ("You should write this whole thing in COBOL!")
 
1. you're doing it backwards

2.

3. you don't understand how important it is to lay down a solid foundation during the planning stage of a project like this.. you have no expertise with database design and as such have ignored every single piece of advice given to you regarding that. you also know nothing about programming.

4. i've come across people like you, with ideas, with no experience as to how to put it into practice and the steps involved. it usually doesn't work out.

You say that because you're overlooking that I have a programmer on board with expertise in project management, plus the guy who made the first Catalan-Chinese-Catalan paper dictionary. As you can see, I'm doing pretty well at getting people in ;)
 
I didn't mean to be too discouraging, I was drunk and have seen many projects fail for similar reasons before ;) Plus being honest about something like this is usually better than not saying anything.

Good to see you have a (hopefully) experienced programmer on board. Who's going to be designing the database though?
 
As I said before, we will begin this summer. All we have for now are vague ideas.
 