[Dev tool] Civilization 4 XML translation tool

dbkblk · Jan 23, 2015

Maybe I'm a bit utopian, but billboards aside, shouldn't there be a better method to support Unicode?

A summary of what we know:
Text import process:
- At the start of the game, the executable first lists all the text associated with the tags (CvXMLLoadUtility::read()).
- It imports all the text for the language, or English by default, into a cached file (or memory?).
- By hijacking the import, we can write the files in UTF8 and convert the Unicode strings to the local codepage during reading, so the game stores local-codepage strings.
- All the standard functions render these local-codepage strings without the need to change anything else in the code.
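As a rough illustration of the conversion step in the list above (not the actual DLL code, which is C++), here is a minimal Python sketch. The function name `utf8_to_codepage` and the choice of `cp1251` as the local codepage are assumptions made for the example only.

```python
def utf8_to_codepage(raw: bytes, codepage: str = "cp1251") -> bytes:
    """Decode UTF-8 bytes and re-encode them in a single-byte codepage."""
    text = raw.decode("utf-8")
    # Characters that do not exist in the target codepage become '?'.
    return text.encode(codepage, errors="replace")


russian = "Москва".encode("utf-8")   # 2 bytes per Cyrillic letter in UTF-8
converted = utf8_to_codepage(russian)
print(len(russian), len(converted))  # byte counts before and after
```

The `errors="replace"` choice mirrors the "?" the game shows for characters it cannot represent; a stricter import could raise an error instead.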

Other info:
- The game renders the CityBillboard (name and current production) graphically, so no hope or magic there.
- Standard text IS NOT rendered by the graphical engine, as we can write any character from the local codepage.
- Base files are written in ISO8859-1, which is pretty much limited to US and Western European languages.

Now, let's push this a little further.
WHAT IF... *drum roll* we rewrote the way it works within the known limits of the game?
- Store EVERY character as a numeric code (like the old ISO8859-1 method), for example & #1049;, in the XML (we just need to modify the XML parser a bit to do this easily).
- The game starts and imports these number sequences (we remove the conversion for now).
- Then we create new functions to convert & #1049; into a Unicode character, rewrite getText as getUnicodeText, and replace all the calls in the DLL code (and in Python, since 2.4 supports Unicode). If the game outputs ?, that does not necessarily mean it doesn't know which character it is; maybe it just doesn't know which one to render. "& # 1 0 4 9" is just a sequence of ASCII characters.
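The numeric-code idea in the list above can be sketched in Python as follows. This is only an illustration of the round trip, not the proposed getUnicodeText; the helper names `encode_numeric_refs` and `decode_numeric_refs` are made up for the example. The encoding side relies on the standard `xmlcharrefreplace` error handler.

```python
import re

_NUMERIC_REF = re.compile(r"&#(\d+);")


def encode_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def decode_numeric_refs(text: str) -> str:
    """Replace every decimal reference such as &#1049; with its character."""
    return _NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)


stored = encode_numeric_refs("Йорк")   # pure ASCII, safe in any codepage
restored = decode_numeric_refs(stored)
print(stored, restored)
```

The stored form is plain ASCII, so it survives any single-byte codepage; the open question from the post remains whether the executable can render the decoded character.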

It shouldn't be impossible (if it works at all lol).

Now look at this (chinese) or this (japanese):

How can Chinese, Japanese, and Korean patches exist using 2 bytes? We are missing something here.

Nightinggale · Jan 23, 2015

dbkblk said:
- Base files are written in ISO8859-1, which is pretty much limited to US and Western European languages.

Wrong, though that is what you get from reading the forum. It is actually CP1252, which for most characters is the same as iso8859-1. They are not 100% identical though.
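To make the difference concrete, here is a small Python sketch that lists the byte values the two encodings decode differently. It uses Python's built-in `cp1252` and `latin-1` codecs; the function name `differing_bytes` is just for this example.

```python
def differing_bytes():
    """Return the byte values that CP1252 and ISO-8859-1 decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        latin1 = raw.decode("latin-1")
        # Python's cp1252 codec leaves a few bytes undefined; treat those as None.
        try:
            cp1252 = raw.decode("cp1252")
        except UnicodeDecodeError:
            cp1252 = None
        if cp1252 != latin1:
            diffs.append(value)
    return diffs


print([hex(b) for b in differing_bytes()])
print(b"\x80".decode("cp1252"))  # the euro sign lives at 0x80 in CP1252
```

The differences are confined to the 0x80 to 0x9F range, where ISO-8859-1 has control codes and CP1252 has printable characters such as the euro sign and curly quotes.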

dbkblk said:
How can Chinese, Japanese, and Korean patches exist using 2 bytes? We are missing something here.

Do you have a URL for those patches? I have been looking around, but I never actually found a download link for them. Maybe we can figure out how to do it ourselves, but if we figure out how they did it, it could be much faster.

dbkblk · Jan 23, 2015

Nightinggale said:
It is actually CP1252, which for most characters is the same as iso8859-1. They are not 100% identical though.

You're right, the XML heading is misleading.

Nightinggale said:
Do you have a URL for those patches? I have been looking around, but I never actually found a download link for them. Maybe we can figure out how to do it ourselves, but if we figure out how they did it, it could be much faster.

I have the Korean patch and a Japanese patch for a really old version. I have not found a 3C Chinese patch (I don't know why 3C, but the Russians also called their patch 3C).
None of these came with source code, of course.
Do you think you can grab the DLL output on getText calls with your wonderful C++ skills?

Links:
Japanese patch for RoM 2.71
Korean patch for AND rev618.
The same original files from the same revision as the Korean patch (to compare).

The full source code of our mod at revision 618 is here.

dbkblk · Jan 23, 2015

Korean patch, CvAppInterface.py, line 58

Code:

	# ENABLE_CJK
	sys.setdefaultencoding('utf-8')

EDIT: Wooh! This plus a pinch of brain sauce fixed the last Python problem I encountered. Thank you, Koreans!

Nightinggale · Jan 23, 2015

dbkblk said:
Korean patch, CvAppInterface.py, line 58

Code:

# ENABLE_CJK
sys.setdefaultencoding('utf-8')

EDIT: Wooh! This plus a pinch of brain sauce fixed the last Python problem I encountered. Thank you, Koreans!

Looking up this function I get that it should not be used. Instead people should use utf-8 (isn't that what we want?).

The question is whether we need to do something similar in the DLL. I once tried to do that, but.... that didn't go well. The main problem is that the compiler is from 2003 and I tried using a function that was introduced in 2005. For some reason, finding online documentation on how to code without functions introduced 10 years ago is a near-impossible task.

Also I tried looking into the Korean patch, but sourceforge had problems at the time and I couldn't reach the svn server. It appears to be working well now and I might give it another go really soon.

dbkblk · Jan 23, 2015

If Python is directly UTF8 enabled, we can try to use a similar thing in the DLL. That could be hard, but there is hope!

For the moment, that fix goes way beyond my expectations, as it also enables accents in Dynamic Civ Names in French.

dbkblk · Jan 23, 2015

BREAKING NEWS:
CIV4ArtDefines_Misc.xml

Code:

		<MiscArtInfo>
			<Type>CITY_BILLBOARDS</Type>
			<Path>None</Path>
			<!-- positive scale: city billboards use fonts from GameFont.tga -->
			<!-- negative scale: GFC billboards (uses the interface font) -->
			<fScale>-1.0</fScale>
			<NIF>None</NIF>
			<KFM>None</KFM>
		</MiscArtInfo>

EDIT: I forced a text to render and here it is:

We still need to implement Unicode support in the game, but we have never gotten this far before!

EDIT2: Tralalaioula

(just remove GetGameFontString() before writing production and city name).

We need to keep pushing !!

dbkblk · Jan 23, 2015

My intuition tells me that this is the key:
"DllExport bool GetChildXmlValByName(wchar* pszVal, const TCHAR* szName, wchar* pszDefault = NULL);"

If we managed to convert UTF8 to wchar or something like that... ?

Nightinggale · Jan 23, 2015

That looks awesome. Totally close to getting something useful. I assume you have Russian locale when you made that screenshot.

dbkblk said:
If Python is directly UTF8 enabled, we can try to use a similar thing in the DLL.

I did some research and made an interesting discovery. I have been using the function setlocale(), yet the documentation says:

If you provide a code page value of UTF-7 or UTF-8, setlocale will fail, returning NULL.

Back to the idea of having an XML file providing info on available translations. My idea is to have an int telling which CodePage the chosen translation needs.

However I have been having problems with setlocale(). I can't remember the details, but it wasn't working correctly.

I just discovered that _setmbcp() sets two byte CodePages, while setlocale() only works with one byte CodePages.

Maybe we can switch CodePage with code like this: (pseudo code. Arguments are somewhat complex)

Code:

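#include <locale.h>   // setlocale
#include <mbctype.h>  // _setmbcp
#include <stdio.h>    // sprintf
void setCodePage(int iCodePage) // iCodePage: value read from the XML
{
    char szLocale[16];
    sprintf(szLocale, ".%d", iCodePage); // setlocale takes ".<codepage>"
    if (setlocale(LC_ALL, szLocale) == NULL)
        _setmbcp(iCodePage); // fall back to the multibyte code page
}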

setlocale returns NULL if it fails, which will make the code try the same CodePage again, but in two-byte mode. If _setmbcp fails too (it returns -1 rather than NULL), then we have a problem.

Nightinggale · Jan 23, 2015

dbkblk said:
My intuition tells me that this is the key:
"DllExport bool GetChildXmlValByName(wchar* pszVal, const TCHAR* szName, wchar* pszDefault = NULL);"

If we managed to convert UTF8 to wchar or something like that... ?

I already have that in UTF8, but the exe expects to get text delivered in the locale. In other words that function is pointless.

Besides, szName is the name of the language, meaning it will look for "<English>", "<Russian>" or whatever. All of those are in ASCII, which means it will work in every locale.

dbkblk · Jan 23, 2015

That is indeed a lead. The other one I've found is (you might understand the question better than I do):
- CvWString is a wide string, so it should be able to encode more than 2 bytes (theoretically unlimited).
- UTF8 is also encoded on 1 to 4 bytes (4 for some special chars).
- If we manage to read UTF8 directly from the XML and convert it to a wide string, I think we should get results.
Unfortunately, I'm still trying to convert it.
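To put numbers on the byte counts in the list above, here is a small Python sketch that reports how many UTF-8 bytes and how many 16-bit wide units a character needs. It assumes the wide string holds 16-bit units, as wchar_t does on Windows; `encoded_sizes` is a name invented for the example.

```python
def encoded_sizes(ch: str):
    """Return (UTF-8 byte count, UTF-16 code unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    # Each UTF-16 code unit is 2 bytes, like a 16-bit wchar_t on Windows.
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units


for ch in ("A", "é", "Й", "한", "😀"):
    print(repr(ch), encoded_sizes(ch))
```

Under that assumption, Cyrillic and CJK characters fit in a single wide unit even though they take 2 or 3 bytes in UTF-8; only characters outside the basic plane need two units.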

EDIT: Why do you assume that the exe expects text in the locale? If Python can use UTF8 directly from the executable, maybe there is a way to just inject UTF8 into the exe. Maybe the executable doesn't care about text encoding.

The fact is that when the XML is encoded in ISO8859-1/Windows-1252, the code is made to understand that encoding and it works. UTF8 is not native, so we need to either convert it to something the code can use or adapt the code.

EDIT2: THAT ?

Code:

#ifndef USE_RAPID_XML
			wchar buf[2048];
			int iNumWritten = MultiByteToWideChar(CP_UTF8, 0, szTextVal.c_str(), -1, buf, 1000);
			FAssertMsg(iNumWritten < 2048, "UTF8 text too long, increase buffer size");

			// if the conversion fails, fall back to using the read wide string directly
			if (iNumWritten <= 0)
			{
#endif

Nightinggale · Jan 23, 2015

UTF-8 is CP65001 in Windows (which is just a way of specifying UTF-8 in the legacy codepage stuff)

If we get that to work and we only store XML in UTF-8, then we will not have to convert. However MultiByteToWideChar() might be needed. I hope our antique compiler can handle that function.

dbkblk · Jan 23, 2015

I'm sure it does, as it is already used in the default code (in my previous post).

dbkblk · Jan 23, 2015

Another lead to follow: http://www.boost.org/doc/libs/1_32_0/libs/serialization/doc/codecvt.html
Boost 1.32 (the version we use) includes this.

Nightinggale · Jan 23, 2015

We have a partial boost 1.32. There is only one (or was it two?) boost DLL file, whereas full boost has a whole lot more (I think it was over 20). This means we might not have access to a function even if it is in boost. I tried compiling boost-thread.dll at some point and it is quite hard to do. Even worse, it turned out that the only place the game will look for it is next to the exe file. There is code to make it look elsewhere, like next to the mod DLL file, but for some reason it wants to locate the boost DLL used by our DLL even before it executes the first line of DLL code. You could compile the library as a static version, which would put the boost code inside our DLL file. However, I never managed to get that to work. I didn't try that hard, though, as I discarded the whole threading idea when I learned that I could only save 0.4 seconds when waiting for the next turn :wallbash:

(not precisely what I wanted for a near one minute wait)

In short: boost is worth looking into, but it might not be easy to get to work.

dbkblk said:

EDIT2: THAT ?

Code:

#ifndef USE_RAPID_XML
			wchar buf[2048];
			int iNumWritten = MultiByteToWideChar(CP_UTF8, 0, szTextVal.c_str(), -1, buf, 1000);
			FAssertMsg(iNumWritten < 2048, "UTF8 text too long, increase buffer size");

			// if the conversion fails, fall back to using the read wide string directly
			if (iNumWritten <= 0)
			{
#endif

The first line is an #ifndef directive for the preprocessor. If USE_RAPID_XML is defined, then everything until #endif is skipped by the compiler, meaning that section may contain code that would not compile.

Interestingly enough, I can't find anything about USE_RAPID_XML in the code. However, I do have MultiByteToWideChar() in code that is compiled. That should answer the question of whether it is available.

dbkblk · Jan 23, 2015

I've pasted the RAPIDXML thing, but I have removed it from my tests. So far, I have only managed to get blank lines, and my debugger doesn't want to stop at breakpoints anymore (I don't know why), so I can't check whether the string is broken before or after being eaten by the executable.
I managed to get small blank lines and also blank lines with the correct size.

dbkblk · Jan 23, 2015

After some research, it seems the Gamebryo engine 2.0 (which is used by Civ4) is limited to UCS-2 (Unicode on 2 bytes) and ASCII. That means we can encode strings to handle Korean, Japanese, etc., but getting UTF8 does not seem so realistic.
The documentation here is for the newest version, which I assume uses a more full-featured engine.
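If the 2-byte limit described above holds, a translation is only usable when every character fits in one 16-bit unit. A minimal Python sketch of such a check follows; `fits_in_two_bytes` is a hypothetical helper for a translation tool, not something the engine provides.

```python
def fits_in_two_bytes(text: str) -> bool:
    """True if every character has a code point that fits in 16 bits."""
    return all(ord(c) <= 0xFFFF for c in text)


samples = ["한국어", "日本語", "Москва", "😀"]
for sample in samples:
    print(sample, fits_in_two_bytes(sample))
```

Korean, Japanese, and Russian text pass this check, which is consistent with the existence of the CJK patches; only characters beyond the basic plane, such as emoji, would be rejected.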

I think that, for the moment, our goal should be to support Asian languages using our current method.

Python handles UTF8, so all we have to do is inject 2-byte chars into the executable. Maybe we should take inspiration from the Python method the Koreans used!

Nightinggale · Jan 23, 2015

dbkblk said:
Korean patch, CvAppInterface.py, line 58

Code:

# ENABLE_CJK
sys.setdefaultencoding('utf-8')

EDIT: Wooh! This plus a pinch of brain sauce fixed the last Python problem I encountered. Thank you, Koreans!

I can't get this to work. If I add it to init():, all I get is

AttributeError: 'module' object has no attribute 'setdefaultencoding'

Did you manage to add this? If so, what did you do?

dbkblk · Jan 23, 2015

I try to keep the git repository up-to-date here: https://github.com/dbkblk/and2_sources_unicode
I've just put this in init() in "Rise of Mankind - A New Dawn/Assets/Python/EntryPoints/CvAppInterface.py"

EDIT: Maybe you'll need to import this file from the base game. I have a copy of your repo, but this file doesn't exist there. Colonization isn't installed, so I can't check.

Nightinggale · Jan 23, 2015

Problem solved.... sort of. Moving to a different computer made the game accept the new code with no problems whatsoever. The problem-causing computer has been acting up lately and, to be honest, reinstalling Windows is on the todo list. I just never had any problems with Colonization before.
