[Dev tool] Civilization 4 XML translation tool

@nightinggale: That looks wonderful! Good luck! You are years ahead of me in terms of programming.
Would you mind pointing me to the thread when you talk about this modcomp?

I may adapt the parser to work with that modcomp.

EDIT: I've read the edited part of the post. Don't you think it is possible to make the game use Unicode directly?
 
Would you mind pointing me to the thread when you talk about this modcomp?

I may adapt the parser to work with that modcomp.
I haven't written such a thread yet, but I will inform you when I do.

It looks like I edited my last post (last on last page) about a new problem and possible solution while you were writing a reply.
 
Do you have any news on the subject?
A new Korean player has made me want to work on this again.
 
Do you have any news on the subject?
I'm in the process of fixing my computer, meaning I'm temporarily without a compiler, hence no development at the moment :(

Also I seem to have lost my Russian translator. He became upset with the western media coverage of Ukraine and decided on a break and I haven't heard from him since.

A new Korean player has made me want to work on this again.
Korean is a bit of an issue. The game engine is hardcoded to single byte codepages and Korean is a 2 byte codepage.

http://msdn.microsoft.com/en-us/goglobal/bb964654

I haven't found a way around that issue, but I haven't been looking much into it. I want a proper 1 byte solution first.
 
Looking at the Korean patch that the player gave me, it looks like the XML files are encoded directly with Korean characters. It also looks like the GameFont TGA contains only a few Korean characters (three or so).

Link to the korean patch (for older rev):
https://www.dropbox.com/s/fwqusvo9gxfoa9q/rom-and_v2.0_ko v12.exe?dl=0

Yesterday I tried to understand your commits, and I think I managed to, BUT given what you said earlier in this topic, it might be easier to:
- Use a sylfaen.ttf encoded in windows-125x depending on the language to support (Russian, Korean...)
- Convert the strings in the XML from UTF-8 to the native encoding and change the encoding line according to the player's language. The XML header could be "encoding=iso8859-1", then changed to "korean-something". The characters would be unreadable in ISO but readable in "korean-...".
The process can be run automatically at each update.

This would let us work around this DLL mess!
 
I don't think that would work. Let's forget about gamefont for a moment as it appears to only be used for city billboards.

What happens now is that the game loads the XML files byte by byte, ignoring the encoding settings. Vanilla assumes the bytes are in the currently active codepage. Which one that is depends on the Windows settings (the non-Unicode thing), which means Koreans are assumed to be using CP949 regardless of what we do.

From my tests so far I have only been able to access the first 256 characters, which for two byte codepages like CP949 are actually just 128 characters, including control characters.
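
To illustrate why so few characters are reachable with single bytes, here is a minimal sketch of how a two byte codepage is usually walked. It assumes the CP949 lead byte range is 0x81-0xFE, as Microsoft documents for that codepage; the function names are hypothetical and nothing here is taken from the game or the DLL.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: in CP949 a byte in 0x81-0xFE starts a two byte character.
bool isCp949LeadByte(unsigned char byte)
{
    return byte >= 0x81 && byte <= 0xFE;
}

// Counts characters (not bytes) in a CP949 encoded buffer.
std::size_t countCp949Characters(const unsigned char* text, std::size_t length)
{
    assert(text != NULL || length == 0);
    std::size_t count = 0;
    for (std::size_t i = 0; i < length; ++i)
    {
        if (isCp949LeadByte(text[i]) && i + 1 < length)
        {
            ++i; // the trail byte belongs to the same character
        }
        ++count;
    }
    return count;
}
```

Code that treats every byte as one character, as the exe appears to do, would only ever see the bytes below 0x80 as complete characters.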

So far I haven't found any way around this, with or without DLL modifications. The messy part is in the EXE, not in the DLL. If the exe wasn't hardcoded to use a one byte codepage, we wouldn't have this problem to begin with. In fact the whole font issue is stupid because the exe appears to convert the input from codepage to unicode. If we were allowed to provide unicode, the entire font/encoding problem would go away.
 
OK, but how do you explain that Koreans could play in Korean?
 
This is a Korean translation for our mod, but from 200 revisions ago. A DLL is included, but not its sources. I have asked the Korean player for them but have had no answer yet. You can extract it with 7zip, or I can send you the content if you prefer.

Look at the XML encoding and the gamefonts.tga!
 
I would prefer a completely normal zip file. I do intend to examine this until I figure out how it works. I suspect the DLL source code will be needed to really figure what they did.

I think I will start coding a modcomp just with language support once I have figured this out. My current DLL code has been modified a number of times as I learned more details about how the game handles text. Starting on vanilla will likely be the best option for ending up with a clean implementation.
 
Here is the zipped patch.

Some findings:
- I use Debian (Linux) as my main OS. My text editor doesn't show the Korean characters, but I checked in a terminal and I can see them individually, so they are not "bypassed" through GameFont as is done for Russian.
- ks_c_5601-1987 is the old standard for Korean Hangul.
- There are only a few characters in GameFont, so everything must be handled by the fonts or by the DLL (for which we lack the source code).

Thank you for all the work you put into this.
 
You can view lots of different encodings if you open a text file in a web browser and as it turns out, the same goes for XML files.

The XML files are encoded with ks_c_5601-1987, which for most characters is the same as CP949. Presumably the game reads it as CP949 and the Koreans haven't encoded any of the different characters.

GameFont.tga seriously lacks characters. I suspect the billboards can't use Korean (names, production).

The normal sylfaen.ttf appears not to contain any Korean characters. It either has to be modded or removed. I haven't figured out which font the game uses if sylfaen.ttf is missing, but it does fall back to something, which supports more characters.

There is a full ks_c_5601-1987 character list here (likely not very useful at the moment): http://kikaku.itscj.ipsj.or.jp/ISO-IR/149.pdf
Surprisingly, this character set has full support for Russian in addition to Korean. There is some Japanese as well, but I have no idea if that set is complete, and it is not in alphabetical order.


The main question remains: how is the game able to display Korean? The only explanation I can think of is that the game somehow figures out that Korean can use two bytes per character and that it just works. If that is the case, then why did it fail when I tried Chinese and Japanese? I have a feeling I will be testing Asian fonts in the near future.

Also, the DLL change I have made, which converts from UTF-8 on load, will easily be able to convert to CP949 if needed. All it needs is a conversion table, and we can copy the ones we need from iconv.
 
This is a Korean translation for our mod, but from 200 revisions ago.
Can I get the precise revision?

The only thing I can think of right now is to recreate the setup the Koreans used. I did a quick test to see if just changing the windows codepage and XML encoding would enable the two byte characters. It didn't, which is consistent with my previous results.
 
Hi Nightinggale! Happy new year!

I have not worked on it yet, but I'm rewriting the XML parser with a simple GUI, the ability to launch it from any folder, and some new functions (for example, a button to automatically grab the latest translations from Transifex, or to extract and upload to Transifex).

CvModName.py defines the AND2 version. The indicated version is rev618, so I've isolated the corresponding files of the Korean patch (without the DLL).
Here is the original Korean patch for rev618.
Here are the original corresponding files of our rev618.

You can check any file revision on Afforess' server.
 
Hi Nightinggale!
I've dived into your code (branch New_Russian) and begun merging it into our DLL code, but I have a lot of difficulty understanding parts of it.
In addition to the questions here, could you help me with some specific questions about FontConversion.cpp?

Here's my current implementation of your code. I cannot get it to link at the moment.

In FontConversion.cpp

l 24 to 266
These are the tables from iconv. Could you confirm that they are pointers? When a character is read from the XML, is it loaded through the specific pointer for the codepage in use, so that the DLL can grab it by identifying the pointer associated with the character? Am I right?

l 291, 298 & 299
I don't understand the notation "&=" (pointer addresses?), nor where the values "0x7F" / "0x40" come from.

l 326
"iBuffer = conversion_table_gamefont[iBuffer];" The table "conversion_table..." is not defined anywhere in the DLL code. Where does it come from?

l470->510
I don't understand this part of the code at all. I suppose these codes refer to positions in the GameFont, but where can I find such numbers? How did you find them?

In CvEnums.h:

l 1156
In this file, there are a lot of variables (constants?). I guess they are manipulated by C++ functions in the DLL and then usable by Python, as the comments seem to indicate.

You defined "enum GameFontTypes" with several types. The previous code in FontConversion seems to define the offset + value of each character and initialize them before the main menu.
Then what are all these variables for? To be used in advisors, which cannot render fonts from the codepage but only from GameFont?

General
In Civ4, we have "CvInternalGlobals", which seems to make the link between Python and the DLL (from what I understand). When you prefix a call with "gc." it asks for CvInternalGlobals, but you refer to CvGlobals each time in your code. I don't know how to adapt that.

Sorry to bother you with all these questions.
Thank you.
 
l 24 to 266
These are the tables from iconv. Could you confirm that they are pointers? When a character is read from the XML, is it loaded through the specific pointer for the codepage in use, so that the DLL can grab it by identifying the pointer associated with the character? Am I right?
No. They are tables. The CodePage contains 256 characters using the IDs 0-255. The tables show which unicode ID each character has. For instance cp1252:
Code:
/* 0x80 */
0x20ac, 0xfffd, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021
The first is 0x80 or 128 in CP1252. However it is 0x20ac in unicode meaning whenever the XML text has a character ID 0x20ac, it should be stored as 128 in the string given to the exe file.

Some IDs are skipped, which means the characters use the same ID in the CodePage as in Unicode. This goes for 0xa0-0xff in cp1252.

l 291, 298 & 299
I don't understand the notation "&=" (pointer addresses?), nor where the values "0x7F" / "0x40" come from.
A += B is the same as A = A + B. Likewise A &= B is the same as A = A & B.

& is bitwise AND, meaning it goes through each bit in both variables and sets the resulting bit to 1 only if it's 1 in both inputs.

b0111 1111 = 0x7F
b1000 0000 = 0x80

In UTF-8, if the first bit is 0, the character uses just one byte. If the first bit is 1, then the rules for multi byte characters apply.
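
As a rough illustration of how such masks are used, here is a minimal UTF-8 decoding sketch. It is not the code from FontConversion.cpp: the function name is hypothetical, only 1, 2 and 3 byte sequences are handled, and there is no validation of the input.

```cpp
#include <cassert>
#include <cstddef>

// Decodes one UTF-8 character starting at text[pos] and advances pos past it.
// Returns 0xfffd for lead bytes this sketch does not handle.
unsigned int decodeUtf8(const unsigned char* text, std::size_t& pos)
{
    assert(text != NULL);
    unsigned int c = text[pos++];
    if ((c & 0x80) == 0)
    {
        return c; // first bit is 0: a plain one byte character
    }

    int extra = 0;
    if ((c & 0xE0) == 0xC0)
    {
        c &= 0x1F; // 110xxxxx: keep the low 5 bits, one more byte follows
        extra = 1;
    }
    else if ((c & 0xF0) == 0xE0)
    {
        c &= 0x0F; // 1110xxxx: keep the low 4 bits, two more bytes follow
        extra = 2;
    }
    else
    {
        return 0xfffd;
    }

    for (; extra > 0; --extra)
    {
        unsigned int next = text[pos++];
        next &= 0x3F;        // 10xxxxxx: keep the low 6 bits
        c = (c << 6) | next; // append them to the result
    }
    return c;
}
```

With this, the three bytes 0xE2 0x82 0xAC decode to 0x20ac, the Unicode ID from the cp1252 example above.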

l 326
"iBuffer = conversion_table_gamefont[iBuffer];" The table "conversion_table..." is not defined anywhere in the DLL code. Where does it come from?
line 268
Code:
static unsigned short conversion_table_unicode[128];
It is the unicode IDs of the 128-255 characters in the currently active CodePage. It is filled with data from the arrays from iconv before it's used. The characters 0-127 are standard ASCII characters and are the same in unicode as in all CodePages, hence no reason to make a conversion array.

Once conversion_table_unicode has been set, the code will use that one exclusively and not care for the iconv tables or which CodePage it is using.
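
The two steps can be sketched roughly like this. Only the array declaration comes from the actual code; both function names are hypothetical, the source table is assumed to hold all 128 entries with 0xfffd in unused slots, and the '?' fallback is an assumption as well.

```cpp
#include <cassert>
#include <cstddef>

// Unicode IDs of characters 128-255 in the currently active CodePage.
static unsigned short conversion_table_unicode[128];

// Hypothetical setup step: copy the table of the active CodePage once.
void setConversionTable(const unsigned short* source)
{
    assert(source != NULL);
    for (std::size_t i = 0; i < 128; ++i)
    {
        conversion_table_unicode[i] = source[i];
    }
}

// Hypothetical lookup: returns the CodePage byte for a Unicode ID,
// or '?' when the active CodePage has no such character.
unsigned char unicodeToCodePageByte(unsigned int unicode)
{
    if (unicode < 128)
    {
        return static_cast<unsigned char>(unicode); // ASCII needs no table
    }
    if (unicode != 0xfffd) // never match the unused slot marker
    {
        for (std::size_t i = 0; i < 128; ++i)
        {
            if (conversion_table_unicode[i] == unicode)
            {
                return static_cast<unsigned char>(128 + i);
            }
        }
    }
    return '?';
}
```

After the setup call, the lookup never needs to know which CodePage was chosen.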

l470->510
I don't understand this part of the code at all. I suppose these codes refer to positions in the GameFont, but where can I find such numbers? How did you find them?

In CvEnums.h:

l 1156
In this file, there are a lot of variables (constants?). I guess they are manipulated by C++ functions in the DLL and then usable by Python, as the comments seem to indicate.

You defined "enum GameFontTypes" with several types. The previous code in FontConversion seems to define the offset + value of each character and initialize them before the main menu.
Then what are all these variables for? To be used in advisors, which cannot render fonts from the codepage but only from GameFont?
GameFont IDs are rather hard to figure out. I have a heavily modified Domestic Advisor for colo, which loops through the IDs and displays whatever is stored. I have planned to make something similar for vanilla BTS, but never got around to actually doing it.

Line 469 states that it is vanilla characters. It's actually a fix for a vanilla bug. Line 80 and 90 in cp1252 will not appear on city billboards. Presumably the code can just be copied because a vanilla bug affects all mods. I can't remember precisely how I ended up with the numbers I wrote, but they are the numbers, which should make it display correctly.

The enum is used together with an array to tell the ID of the first yield, building, whatever. Vanilla hardcodes those, which limits the number of icons you can use in each category. Using custom values (which should be configurable from XML) gives the freedom to set a virtually unlimited number of icons in GameFont. The city billboards can handle 286 characters or something, but icons used in the city screen, pedia or whatever can be placed outside the range of the billboards. This may or may not be useful depending on the size of the mod. Remember that this code is written with multiple mods in mind, meaning it should be configurable to handle more or less whatever the XML modders throw at it.
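
The idea of an enum indexing an array of start IDs could look something like the sketch below. The enum members, the function names and the start values are all placeholders made up for illustration; the real GameFontTypes in CvEnums.h has its own members and the real IDs depend on the GameFont layout.

```cpp
#include <cassert>

// Placeholder categories, not the actual members of GameFontTypes.
enum GameFontTypes
{
    GAMEFONT_YIELD,
    GAMEFONT_BUILDING,
    GAMEFONT_BONUS,
    NUM_GAMEFONT_TYPES
};

// ID of the first icon in each category. The values are made up.
static int gamefont_first_id[NUM_GAMEFONT_TYPES] = { 1000, 2000, 3000 };

// Hypothetical setter, e.g. called while reading a value from XML.
void setGameFontFirstID(GameFontTypes type, int firstID)
{
    assert(type >= 0 && type < NUM_GAMEFONT_TYPES);
    gamefont_first_id[type] = firstID;
}

// Returns the GameFont ID of icon number 'index' within a category.
int getGameFontID(GameFontTypes type, int index)
{
    assert(type >= 0 && type < NUM_GAMEFONT_TYPES);
    assert(index >= 0);
    return gamefont_first_id[type] + index;
}
```

Because nothing but the array knows where a category starts, moving a category to make room for more icons only means changing one start value.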

In Civ4, we have "CvInternalGlobals", which seems to make the link between Python and the DLL (from what I understand). When you prefix a call with "gc." it asks for CvInternalGlobals, but you refer to CvGlobals each time in your code. I don't know how to adapt that.
Writing GC (it's uppercase in C++ and lowercase in python) is a shortcut for getting a pointer. GC goes to one class to get a pointer to another class or something like that (can't remember the specifics offhand). Storing the resulting pointer locally makes the code a little faster than looking it up each time. Most likely it makes no real difference. I just got picky with performance because... let's just say that after I joined M:C, performance has increased so much that it's hard to believe that it's mostly still the same code.
 
I have another problem. After I removed your language selection code and adapted some commands, I managed to compile and run the game. It works as usual for English, French and others, but Russian still does not work (??? everywhere).

I've figured out that even if I set my computer to Russian, codepage 1252 is still reported to Civilization, even though Russian is 1251.
Do you have any tips? Could I be using the wrong code, as you mentioned earlier in this thread that your computer forces 1250?
EDIT: I've managed to change the locale to Russian while still using my native language, but the codepage in use is 866 and not 1251 :(

Anyway, I would like to thank you again, because I've debugged my first program and learnt a huge amount about C++ in no time thanks to your explanations.
 
AAAAAAAAAAAAHHHHHHHHHHHHHHHHHHHHHHHHHHHHH :)
 
In fact, the picture is there to show Russian support! The main screen is downloadable in the creative thread :)
 