[Dev tool] Civilization 4 XML translation tool

Maybe I'm a bit of a utopian but, billboards aside, shouldn't there be a better method to support Unicode?

A summary of what we know:

Text import process.
- The executable first lists all the text related to the tags at the start of the game. (CvXMLLoadUtility::read())
- It imports all the language text, or English by default, into a cached file (memory?)
- By hijacking the import, we can write the files in UTF8 and convert the Unicode strings to the local codepage during reading, so the game stores local codepage strings.
- All the standard functions render these local CP strings without the need to change anything else in the code.
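
A minimal sketch of the conversion step described above, written as standalone Python 3 rather than as DLL code. The helper name `utf8_to_local_codepage` and the choice of cp1251 (Cyrillic) are assumptions made for illustration; nothing here is part of the game's API.

```python
# Hypothetical helper, not part of the Civ4 DLL or Python API.
def utf8_to_local_codepage(raw_utf8, codepage="cp1251"):
    """Decode UTF-8 bytes and re-encode them in a single-byte codepage.

    Characters the codepage cannot represent are replaced by '?'.
    """
    text = raw_utf8.decode("utf-8")
    return text.encode(codepage, errors="replace")


russian = "Йод".encode("utf-8")          # what the XML file would contain
local = utf8_to_local_codepage(russian)  # what the game would store
print(len(russian), len(local))          # 6 bytes in UTF-8, 3 bytes in cp1251
```

The `errors="replace"` choice mirrors the "?" the game shows for characters it cannot handle; a character outside the target codepage is lost at this step.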

Other info:
- The game renders the CityBillboard (name and current production) graphically, so there is no hope or magic for this.
- Standard text IS NOT rendered by the graphical engine, as we can write any character from the local codepage.
- Base files are written in iso8859-1, which is pretty much limited to US and Western European languages.

Now, let's push this a little further.
WHAT IF... *drum roll* we rewrote the way it works within the known limits of the game?
- Store EVERY character as a numeric code (as in the old ISO8859-1 method), for example & #1049;, in the XML (we just need to modify the XML parser a bit to do this easily).
- The game starts and imports these number sequences (we remove the conversion for now).
- Then we create new functions to convert & #1049; into a Unicode char, rewrite getText as getUnicodeText and replace all the calls in the DLL code (and in Python, as Python 2.4 supports Unicode). If the game outputs ?, that does not necessarily mean that it doesn't know which char it is; maybe it doesn't know which one to render. "& # 1 0 4 9" is just a sequence of ASCII chars.
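
The conversion step in the last bullet could look roughly like this. It is a standalone Python 3 sketch, not DLL code; `decode_numeric_refs` is a made-up name, and it assumes the references are stored in the usual unspaced decimal form (`&#1049;`), not the spaced form typed in this post.

```python
import re

# Matches decimal numeric character references such as "&#1049;".
_NUMERIC_REF = re.compile(r"&#(\d+);")


def decode_numeric_refs(text):
    """Replace every decimal reference with the character it stands for."""
    return _NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)


print(decode_numeric_refs("&#1049;&#1086;&#1076;"))  # Йод
```

The input is pure ASCII all the way through the import, and only becomes a Unicode string at the point where the reference is decoded.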

It shouldn't be impossible (if it works at all lol).

Now look at this (Chinese) or this (Japanese):


How can Chinese, Japanese and Korean patches exist using 2 bytes? We are missing something here.
 
- Base files are written in iso8859-1, which is pretty much limited to US and Western European languages.
Wrong, though that is what you get from reading the forum. It is actually CP1252, which for most characters is the same as iso8859-1. They are not 100% identical though.
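
To see exactly where the two encodings differ, a quick check with Python 3's built-in codecs (a standalone sketch, nothing to do with the game's own Python) lists every byte value that decodes differently:

```python
# Collect every byte value that CP1252 and ISO 8859-1 decode differently.
# Bytes that CP1252 leaves undefined are decoded as U+FFFD by errors="replace".
differing = [
    b for b in range(256)
    if bytes([b]).decode("cp1252", errors="replace") != bytes([b]).decode("latin-1")
]

print(hex(differing[0]), hex(differing[-1]), len(differing))
print(repr(b"\x80".decode("cp1252")), repr(b"\x80".decode("latin-1")))
```

The differences are confined to the range 0x80 to 0x9F, where CP1252 puts printable characters such as the euro sign and curly quotes and iso8859-1 has control codes.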

How can Chinese, Japanese and Korean patches exist using 2 bytes? We are missing something here.
Do you have a URL to those patches? I have been looking around, but I never actually found a download link for them. Maybe we can figure out how to do it ourselves, but if we figure out how they did it, it could be much faster.
 
It is actually CP1252, which for most characters is the same as iso8859-1. They are not 100% identical though.
You're right, the XML heading is misleading.

Do you have a URL to those patches? I have been looking around, but I never actually found a download link for them. Maybe we can figure out how to do it ourselves, but if we figure out how they did it, it could be much faster.
I have the Korean patch and a Japanese patch for a really old version. I have not found a 3C Chinese patch (I don't know why 3C, but the Russians also called their patch 3C).
None of these came with the source code, of course.
Do you think you can grab the DLL output on getText calls with your wonderful C++ skills? :D

Links:
Japanese patch for RoM 2.71
Korean patch for AND rev618.
The same original files from the same revision as the Korean patch (to compare).

The full source code of our mod at revision 618 is here.
 
Korean patch, CvAppInterface.py, line 58
Code:
	# ENABLE_CJK
	sys.setdefaultencoding('utf-8')

EDIT: Wooh! This plus a pinch of brain sauce fixed the last Python problem I encountered. Thank you, Koreans!
 
Korean patch, CvAppInterface.py, line 58
Code:
	# ENABLE_CJK
	sys.setdefaultencoding('utf-8')

EDIT: Wooh! This plus a pinch of brain sauce fixed the last Python problem I encountered. Thank you, Koreans!
:eek:

Looking up this function, I gather that it should not be used. Instead, people should use utf-8 (isn't that what we want?).

The question is whether we need to do something similar in the DLL. I once tried to do that, but.... that didn't go well. The main problem is that the compiler is from 2003 and I tried using a function which was introduced in 2005. For some reason, finding online documentation on how to code without functions introduced 10 years ago is a near-impossible task.

Also I tried looking into the Korean patch, but sourceforge had problems at the time and I couldn't reach the svn server. It appears to be working well now and I might give it another go really soon.
 
If Python is directly UTF8 enabled, we can try to use a similar thing in the DLL. That could be hard, but there is hope!

For the moment, that fix goes way beyond my expectations, as it also enables accents in Dynamic Civ Names in French :D
 
BREAKING NEWS:
CIV4ArtDefines_Misc.xml
Code:
		<MiscArtInfo>
			<Type>CITY_BILLBOARDS</Type>
			<Path>None</Path>
			<!-- positive scale: city billboards use fonts from GameFont.tga -->
			<!-- negative scale: GFC billboards (uses the interface font) -->
			<fScale>-1.0</fScale>
			<NIF>None</NIF>
			<KFM>None</KFM>
		</MiscArtInfo>

EDIT: I forced a text to render and here it is:


We still need to implement Unicode support in the game, but we've never gotten this far before!

EDIT2: Tralalaioula :) (just remove GetGameFontString() before writing production and city name).


We need to keep pushing !!
 
My intuition tells me that this is the key:
"DllExport bool GetChildXmlValByName(wchar* pszVal, const TCHAR* szName, wchar* pszDefault = NULL);"

If we managed to convert UTF8 to wchar or something like that... ?
 
That looks awesome. Totally close to getting something useful. I assume you had a Russian locale when you made that screenshot.

If Python is directly UTF8 enabled, we can try to use a similar thing in the DLL.
I did some research and made an interesting discovery. I have been using the function setlocale(), yet the documentation says:
If you provide a code page value of UTF-7 or UTF-8, setlocale will fail, returning NULL.
Back to the idea of having an XML file providing info on the available translations. My idea is to have an int telling which CodePage the chosen translation needs.

However I have been having problems with setlocale(). I can't remember the details, but it wasn't working correctly.

I just discovered that _setmbcp() sets two byte CodePages, while setlocale() only works with one byte CodePages.

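Maybe we can switch CodePage with code like this: (untested sketch; it belongs inside a function, and 1252 is only a placeholder)
Code:
#include <locale.h>   // setlocale
#include <mbctype.h>  // _setmbcp
#include <stdio.h>    // sprintf
int iCodePage = 1252; // placeholder: the real value would come from the XML
char szLocale[16];
sprintf(szLocale, ".%d", iCodePage); // setlocale() accepts ".<code page>"
if (setlocale(LC_ALL, szLocale) == NULL)
{
    _setmbcp(iCodePage); // returns 0 on success, -1 on failure
}
setlocale returns NULL if it fails, which will make the code try to use the same CodePage, but in two byte mode. If that one fails too (_setmbcp returns -1), then we have a problem.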
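
To make the one byte versus two byte distinction concrete, here is a small standalone Python 3 sketch that encodes one sample character per code page and counts the bytes. The helper name `encoded_length` and the sample characters are only for illustration; the code pages are the usual Windows ones (1252 Western, 1251 Cyrillic, 932 Japanese, 949 Korean, 936 Simplified Chinese).

```python
def encoded_length(char, codepage):
    """Return how many bytes one character occupies in the given code page."""
    return len(char.encode(codepage))


samples = [
    ("cp1252", "é"),   # Western European
    ("cp1251", "Й"),   # Cyrillic
    ("cp932", "日"),    # Japanese
    ("cp949", "한"),    # Korean
    ("cp936", "中"),    # Simplified Chinese
]
for codepage, char in samples:
    print(codepage, encoded_length(char, codepage))
```

The first two take one byte per character and the CJK code pages take two, while plain ASCII stays at one byte in all of them.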
 
My intuition tells me that this is the key:
"DllExport bool GetChildXmlValByName(wchar* pszVal, const TCHAR* szName, wchar* pszDefault = NULL);"

If we managed to convert UTF8 to wchar or something like that... ?
I already have that in UTF8, but the exe expects to get text delivered in the locale. In other words that function is pointless.

Besides, szName is the name of the language, meaning it will look for "<English>", "<Russian>" or whatever. All of those are in ASCII, which means it will work in all locales.
 
That is a lead indeed. The other one I've found is (you might have a better understanding of the question than me):
- CvWString is a wide string, so it should be able to encode more than 2 bytes (theoretically unlimited).
- UTF8 is also encoded on 1 to 4 bytes (for some special chars).
- If we manage to read UTF8 directly from the XML and convert it to a wide string, I think we should get results.
Unfortunately, I'm still trying to convert it.
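
As a rough illustration of the byte counts involved, the following standalone Python 3 sketch decodes UTF-8 and lists the 16-bit units a Windows wide string would hold (wchar_t is 16 bits wide on Windows). The function name `utf8_to_wide_units` is made up for this example, and it only mimics the idea of a UTF-8 to wide conversion; it is not the DLL code.

```python
def utf8_to_wide_units(raw_utf8):
    """Decode UTF-8 bytes and return the 16-bit code units of the result."""
    encoded = raw_utf8.decode("utf-8").encode("utf-16-le")
    return [
        int.from_bytes(encoded[i:i + 2], "little")
        for i in range(0, len(encoded), 2)
    ]


for char in ("A", "é", "Й", "한"):
    raw = char.encode("utf-8")
    # character, UTF-8 byte count, wide code units
    print(char, len(raw), [hex(unit) for unit in utf8_to_wide_units(raw)])
```

Each of these characters needs between one and three bytes in UTF-8 but exactly one 16-bit unit as a wide character.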

EDIT: Why do you assume that the exe expects text in the locale? If Python can use UTF8 directly from the executable, maybe there is a way to just inject UTF8 into the exe. Maybe the executable doesn't care about text encoding.

The fact is that when the XML is encoded in ISO8859-1/W1252, the code is made to understand that encoding and it works. UTF8 is not native, so we need to either convert it to something the code can use or adapt the code.

EDIT2: THAT ?
Code:
#ifndef USE_RAPID_XML
			wchar buf[2048];
			int iNumWritten = MultiByteToWideChar(CP_UTF8, 0, szTextVal.c_str(), -1, buf, 1000);
			FAssertMsg(iNumWritten < 2048, "UTF8 text too long, increase buffer size");

			// if the conversion fails, fall back to using the read wide string directly
			if (iNumWritten <= 0)
			{
#endif
 
I'm sure it does as it is already included in the default code (in my previous post).
 
We have a partial boost 1.32. There is only one (or was it two?) boost DLL file, whereas full boost has a whole lot more (I think it was over 20). This means we might not have access to a function even if it is in boost.

I tried compiling boost-thread.dll at some point and it is quite hard to do. Even worse, it turned out that the only place the game will look for it is next to the exe file. There is code to make it look elsewhere, like next to the mod DLL file, but for some reason it wants to locate the boost DLL used by our DLL even before it executes the first line of DLL code. You could compile the library in a static version, which would put the boost code inside our DLL file. However, I never managed to get that to work. I didn't try that hard though, as I discarded the whole threading idea when I learned that I could only save 0.4 seconds when waiting for next turn :wallbash:
(not precisely what I wanted for a near one minute wait)

In short: boost is worth looking into, but it might not be easy to get to work.

EDIT2: THAT ?
Code:
#ifndef USE_RAPID_XML
			wchar buf[2048];
			int iNumWritten = MultiByteToWideChar(CP_UTF8, 0, szTextVal.c_str(), -1, buf, 1000);
			FAssertMsg(iNumWritten < 2048, "UTF8 text too long, increase buffer size");

			// if the conversion fails, fall back to using the read wide string directly
			if (iNumWritten <= 0)
			{
#endif
The first line is an if statement for the preprocessor. If USE_RAPID_XML is defined, then everything until #endif will be ignored by the compiler, meaning it will accept code that can't be compiled.

Interestingly enough, I can't find anything about USE_RAPID_XML in the code. However, I do have MultiByteToWideChar() in code that is compiled. That should answer the question of whether it is available.
 
I pasted the RAPIDXML thing, but I have removed it from my tests. So far, I have only managed to get blank lines, and my debugger doesn't want to stop at breakpoints anymore (I don't know why), so I can't check whether the string is broken before or after being eaten by the executable.
I managed to get small blank lines and also blank lines with the correct size :D
 
After some research, it seems the Gamebryo engine 2.0 (which is used by Civ4) is limited to UCS-2 (Unicode on 2 bytes) and ASCII. That means we can encode strings to handle Korean, Japanese, etc., but getting UTF8 does not seem so realistic.
The documentation here is for the newest version which, I assume, uses a more full-featured engine.
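
Assuming the engine really is limited to 2-byte characters as suspected above, a quick way to check whether a given translation would fit is to test that every character is at or below U+FFFF. This is a standalone Python 3 sketch with a made-up helper name, `fits_in_ucs2`, not anything taken from the engine.

```python
def fits_in_ucs2(text):
    """True if every character lies in the Basic Multilingual Plane (<= U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_in_ucs2("서울 東京 Москва"))  # everyday Korean, Japanese, Russian
print(fits_in_ucs2("\U00020000"))       # a rare CJK Extension B ideograph
```

Everyday Korean, Japanese, Chinese and Cyrillic text passes this check; only rare characters above U+FFFF would be out of reach.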

I think that, for the moment, we should set support for Asian languages using our current method as a goal.

Python handles UTF8, so all we have to do is inject 2-byte chars into the executable. Maybe we should take inspiration from the Python method the Koreans used!
 
Korean patch, CvAppInterface.py, line 58
Code:
	# ENABLE_CJK
	sys.setdefaultencoding('utf-8')

EDIT: Wooh! This plus a pinch of brain sauce fixed the last Python problem I encountered. Thank you, Koreans!
I can't get this to work. If I add it to init():, all I get is
AttributeError: 'module' object has no attribute 'setdefaultencoding'
:cry:

Did you manage to add this? If so, what did you do?
 