[Dev tool] Civilization 4 XML translation tool

Afforess · Feb 22, 2015

Nightinggale said:
The code seems to rely on CP_ACP, which is the code page Windows is currently set to use. For all we know, it only works if you switch to the Japanese locale.

Good question. In fact I find all the modifications a bit odd.

CvGameTextMgr.cpp: minor corrections, such as year before date. Nothing important.
exitingToMainMenu() takes a wide char pointer as argument. From what I can tell it is only used with NULL pointers.
CvDefines: gives it a unique ID for finding other players on GameSpy
CvInitCore: some wide strings are used as wide strings instead of converting to strings
CvString.h: the convert function is added, but appears not to be called

That's about it. Where is the magic that enables the in-game non-Latin characters?

Could be mostly EXE changes.

dbkblk · Feb 22, 2015

Yeah. That is a bit disappointing. It doesn't seem to do anything more than use the WideCharTo... blabla function. Our method seems more adaptable. I'll try to extend our approach to two-byte chars using the iconv tables.

EDIT: @Afforess: I used WinMerge to compare, but it cannot generate a diff. Here's the original source code: https://dl.dropboxusercontent.com/u/369241/origina_bts3.19_source.7z

Nightinggale · Feb 22, 2015

Afforess said:
Could be mostly EXE changes.

This brings up an interesting question. Is the exe modified? If so, I would like to see it. I'm not sure I would be able to do anything with it, but I would still like to try.

dbkblk · Feb 22, 2015

EDIT: Here it is.

dbkblk · Feb 22, 2015

I haven't worked on Asian support for the last few hours but, to ease the conversion process, we could use MultiByteToWideChar to convert a char to another one instead of using the double conversion table.

dbkblk · Feb 24, 2015

@Nightinggale: Have you found anything useful about the executable? With Google Translate, I found some info on a Japanese website saying that the English executable doesn't support double-byte characters, but I guess they were talking about the DLL.

EDIT: I've got the Japanese version in a VM (don't ask me how), so if you need some files I can send them to you.

Nightinggale · Feb 24, 2015

dbkblk said:
@Nightinggale: Have you found anything useful about the executable?

I have been busy with other tasks so far, which means I haven't even investigated this yet.

dbkblk said:
With Google Translate, I found some info on a Japanese website saying that the English executable doesn't support double-byte characters, but I guess they were talking about the DLL.

I suspect they were in fact talking about the exe. My debug code has yet to display multi-byte characters, and that is with Python speaking directly to the exe. This exe file just got a whole lot more interesting.

If the exe is in fact modified, then we could (at least in theory) inject code into the exe using the DLL. There are two problems with that approach though. Not only have I never done something like that before, it would also trigger a Steam anti-cheat system and crash the game, meaning it would only work with the disc version.

Also, where does this leave me with the Colonization exe?

I wonder if we could copy the modified exe into the standard game directory, call it Civ4BeyondSwordTwoByte.exe, and then make an alias to it with the mod argument to start the right mod. I don't think Steam would like that either though.

dbkblk · Feb 24, 2015

In fact, I don't really understand the problem here. If we convert multi-byte text to wchar before the executable does anything, then all it has to do is handle wchar? I haven't progressed on the code yet, but I have difficulty understanding how a 2-byte char is stored inside a wchar.

EDIT: Here are the exe and DLL from the Chinese patch. Strangely enough, the executable doesn't have the same size as the one from the Japanese patch (which is good news, as the Chinese patch is unofficial).

dbkblk · Feb 24, 2015

I'm trying to adapt the iconv function for Korean. In fact, the problem comes before ConvertString: when "아브라함" (Korean) is inside "szTextVal", it becomes "ì•„ë¸Œë¼í•¨".
I cannot process the string if the input is already incorrect. Any ideas?

Nightinggale · Feb 24, 2015

dbkblk said:
when "아브라함" (Korean) is inside "szTextVal", it becomes "ì•„ë¸Œë¼í•¨".

Both yes and no. When writing to a string (or char array), 아브라함 has a specific sequence of bytes. When those bytes are read using CodePage 1252, it becomes ì•„ë¸Œë¼í•¨. If you read the same string as UTF-8, it reverts back to 아브라함.

The problem isn't the string, it's your display, which is using the wrong encoding.

dbkblk · Feb 24, 2015

Nightinggale said:
Both yes and no. When writing to a string (or char array), 아브라함 has a specific sequence of bytes. When those bytes are read using CodePage 1252, it becomes ì•„ë¸Œë¼í•¨. If you read the same string as UTF-8, it reverts back to 아브라함.

In fact, I'm starting Civ in debug mode and my computer's setting for non-Unicode programs is Korean (chcp = 949). I think Visual Studio always outputs in CP1252.
That said, when I get the int value of the first char, it gives me 236, which is "ì".
Maybe the XML parser is reading the file as 1252 even though I'm using 949?

EDIT: In the original iconv function, the string is passed in as a pointer, while here szTextVal is a standard string. I guess the first two values of the string make up only one char.
EDIT2: The answer was right under my nose: the wrong string has twice as many chars as the Korean one.
EDIT3: I bypassed the limitation by checking c2 = szTextVal.data()[i+1] when c1 > 127, and I've managed to get the Unicode number for the Korean char. But no display yet.
EDIT4: False alarm. The output code isn't the right one.
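
The length mismatch from EDIT2 can be checked with a few lines of C. This is only a sketch, assuming the string holds the raw UTF-8 bytes from the XML file; `utf8_count_code_points` and `kAbrahamUtf8` are hypothetical names, not part of the Civ4 SDK or iconv.

```c
#include <assert.h>
#include <stddef.h>

/* "아브라함" written out as its UTF-8 bytes (12 bytes plus the terminator). */
const char kAbrahamUtf8[] =
    "\xEC\x95\x84" "\xEB\xB8\x8C" "\xEB\x9D\xBC" "\xED\x95\xA8";

/* Hypothetical helper: count Unicode code points in a NUL-terminated UTF-8
   string by skipping continuation bytes (bit pattern 10xxxxxx).
   Assumes the input is well-formed UTF-8. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

Indexing the string one `char` at a time walks over all 12 bytes, while the helper reports 4 characters, so a byte-wise view shows more "chars" than the Korean text has.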

dbkblk · Feb 24, 2015

Here are the decoded Unicode values for this string:
Korean string: 아브라함
아 : 50500
브 : 48652
라 : 46972
함 : 54632

String: ì•„ë¸Œë¼í•¨
ì : 236
• : 8226
„ : 8222
ë : 235
¸ : 184
Œ : 338
ë : 235
¼ : 188
í : 237
• : 8226
¨ : 268

What I'm trying to understand is the relation between the two.

Nightinggale · Feb 24, 2015

You messed up the encoding and wrote Unicode values. When it comes to the values of each byte, Unicode code points are different from UTF-8 :crazyeye:

Also, you read the bytes as CP1252, not Unicode.

Korean string: 아브라함 (UTF-8)
아 : ~~50500~~ EC 95 84
브 : ~~48652~~ EB B8 8C
라 : ~~46972~~ EB 9D BC
함 : ~~54632~~ ED 95 A8

String: ì•„ë¸Œë¼í•¨ (in CP1252)
ì : 236 EC
• : ~~8226~~ 95
„ : ~~8222~~ 84
ë : 235 EB
¸ : 184 B8
Œ : ~~338~~ 8C
ë : 235 EB
9D (unprintable on my screen)
¼ : 188 BC
í : 237 ED
• : ~~8226~~ 95
¨ : ~~268~~ A8

They look like they match perfectly to me.
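
The match can be reproduced mechanically. The sketch below reads the UTF-8 bytes one at a time as Windows-1252 and reports the code point each byte would be displayed as; `read_as_cp1252` and `kCp1252High` are hypothetical names used for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Windows-1252 code points for bytes 0x80..0x9F. A value of 0 marks a byte
   with no character assigned in CP1252 (0x81, 0x8D, 0x8F, 0x90, 0x9D). */
const unsigned short kCp1252High[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical helper: interpret each of the n input bytes as one CP1252
   character and store its Unicode code point in out[i] (0 if unassigned). */
void read_as_cp1252(const unsigned char *bytes, size_t n, unsigned int *out)
{
    size_t i;
    assert(bytes != NULL && out != NULL);
    for (i = 0; i < n; ++i) {
        if (bytes[i] >= 0x80 && bytes[i] <= 0x9F)
            out[i] = kCp1252High[bytes[i] - 0x80];
        else
            out[i] = bytes[i]; /* ASCII and 0xA0..0xFF keep their value */
    }
}
```

Fed the bytes EC 95 84, this yields 236, 8226 and 8222, which are the numbers from the earlier list. Byte 9D has no CP1252 character, which fits it being unprintable.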

dbkblk · Feb 24, 2015

OK. But why are these chars said to use two bytes if they take three bytes? I guess Unicode encodes Korean in three bytes and we have to bring them down to two.

Nightinggale · Feb 24, 2015

dbkblk said:
OK. But why are these chars said to use two bytes if they take three bytes? I guess Unicode encodes Korean in three bytes and we have to bring them down to two.

That is one of the great issues with UTF-8. Asian characters tend to use 3 bytes in UTF-8, while they use 2 bytes in code pages or similar local character sets.

I wrote some code to read UTF-8 and store the unicode ID (the real unicode ID). Technically it converts from UTF-8 to UTF-32. Then the iconv tables are used to convert that number back into a single byte using a codepage encoding. If we are to use a two byte codepage (assuming the exe accepts it), the UTF-32 has to be converted into that codepage.
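
For readers following along, the UTF-8 to UTF-32 step could look roughly like the sketch below. It is not the code from the mod; `utf8_decode_one` is a hypothetical name, and the decoder is simplified (it does not reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>

/* Decode one UTF-8 sequence starting at s, with n bytes available.
   Stores the code point in *out and returns the number of bytes consumed,
   or 0 if the sequence is truncated or malformed. */
size_t utf8_decode_one(const unsigned char *s, size_t n, unsigned long *out)
{
    size_t len, i;
    unsigned long cp;

    assert(s != NULL && out != NULL);
    if (n == 0)
        return 0;

    /* The lead byte tells how long the sequence is. */
    if (s[0] < 0x80)                { len = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return 0; /* stray continuation byte or invalid lead byte */

    if (n < len)
        return 0;

    /* Each continuation byte contributes its low 6 bits. */
    for (i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *out = cp;
    return len;
}
```

With the bytes EC 95 84 as input, the function consumes 3 bytes and yields 0xC544 (50500), the code point listed for 아 earlier in the thread. That number is what would then be looked up in a code page table.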

dbkblk · Feb 24, 2015

I've tried to use the iconv tables for Hangul to get the right chars, but I assumed the input was two bytes per char, so it cannot work. I'll try with three.

EDIT: WAIT! There is something I can't understand. The iconv function to convert UTF-8 to 949 only handles two bytes, so how is the third one supposed to be handled?

Code:

static int
cp949_mbtowc (conv_t conv, ucs4_t *pwc, const unsigned char *s, int n)
{
  unsigned char c = *s;
  /* Code set 0 (ASCII) */
  if (c < 0x80)
    return ascii_mbtowc(conv,pwc,s,n);
  /* UHC part 1 */
  if (c >= 0x81 && c <= 0xa0)
    return uhc_1_mbtowc(conv,pwc,s,n);
  if (c >= 0xa1 && c < 0xff) {
    if (n < 2)
      return RET_TOOFEW(0);
    {
      unsigned char c2 = s[1];
      if (c2 < 0xa1)
        /* UHC part 2 */
        return uhc_2_mbtowc(conv,pwc,s,n);
      else if (c2 < 0xff && !(c == 0xa2 && c2 == 0xe8)) {
        /* Code set 1 (KS C 5601-1992, now KS X 1001:1998) */
        unsigned char buf[2];
        int ret;
        buf[0] = c-0x80; buf[1] = c2-0x80;
        ret = ksc5601_mbtowc(conv,pwc,buf,2);
        if (ret != RET_ILSEQ)
          return ret;
        /* User-defined characters */
        if (c == 0xc9) {
          *pwc = 0xe000 + (c2 - 0xa1);
          return 2;
        }
        if (c == 0xfe) {
          *pwc = 0xe05e + (c2 - 0xa1);
          return 2;
        }
      }
    }
  }
  return RET_ILSEQ;
}

Code:

static int
uhc_1_mbtowc (conv_t conv, ucs4_t *pwc, const unsigned char *s, int n)
{
  unsigned char c1 = s[0];
  if ((c1 >= 0x81 && c1 <= 0xa0)) {
    if (n >= 2) {
      unsigned char c2 = s[1];
      if ((c2 >= 0x41 && c2 < 0x5b) || (c2 >= 0x61 && c2 < 0x7b) || (c2 >= 0x81 && c2 < 0xff)) {
        unsigned int row = c1 - 0x81;
        unsigned int col = c2 - (c2 >= 0x81 ? 0x4d : c2 >= 0x61 ? 0x47 : 0x41);
        unsigned int i = 178 * row + col;
        if (i < 5696) {
          *pwc = (ucs4_t) (uhc_1_2uni_main_page81[2*row+(col>=89?1:0)] + uhc_1_2uni_page81[i]);
          return 2;
        }
      }
      return RET_ILSEQ;
    }
    return RET_TOOFEW(0);
  }
  return RET_ILSEQ;
}

Code:

static int
uhc_2_mbtowc (conv_t conv, ucs4_t *pwc, const unsigned char *s, int n)
{
  unsigned char c1 = s[0];
  if ((c1 >= 0xa1 && c1 <= 0xc6)) {
    if (n >= 2) {
      unsigned char c2 = s[1];
      if ((c2 >= 0x41 && c2 < 0x5b) || (c2 >= 0x61 && c2 < 0x7b) || (c2 >= 0x81 && c2 < 0xa1)) {
        unsigned int row = c1 - 0xa1;
        unsigned int col = c2 - (c2 >= 0x81 ? 0x4d : c2 >= 0x61 ? 0x47 : 0x41);
        unsigned int i = 84 * row + col;
        if (i < 3126) {
          *pwc = (ucs4_t) (uhc_2_2uni_main_pagea1[2*row+(col>=42?1:0)] + uhc_2_2uni_pagea1[i]);
          return 2;
        }
      }
      return RET_ILSEQ;
    }
    return RET_TOOFEW(0);
  }
  return RET_ILSEQ;
}

Code:

static int
ksc5601_mbtowc (conv_t conv, ucs4_t *pwc, const unsigned char *s, int n)
{
  unsigned char c1 = s[0];
  if ((c1 >= 0x21 && c1 <= 0x2c) || (c1 >= 0x30 && c1 <= 0x48) || (c1 >= 0x4a && c1 <= 0x7d)) {
    if (n >= 2) {
      unsigned char c2 = s[1];
      if (c2 >= 0x21 && c2 < 0x7f) {
        unsigned int i = 94 * (c1 - 0x21) + (c2 - 0x21);
        unsigned short wc = 0xfffd;
        if (i < 1410) {
          if (i < 1115)
            wc = ksc5601_2uni_page21[i];
        } else if (i < 3854) {
          if (i < 3760)
            wc = ksc5601_2uni_page30[i-1410];
        } else {
          if (i < 8742)
            wc = ksc5601_2uni_page4a[i-3854];
        }
        if (wc != 0xfffd) {
          *pwc = (ucs4_t) wc;
          return 2;
        }
      }
      return RET_ILSEQ;
    }
    return RET_TOOFEW(0);
  }
  return RET_ILSEQ;
}
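
One reading of the functions above: `mbtowc` stands for "multibyte to wide char", so `cp949_mbtowc` takes CP949 bytes as input and produces a Unicode value in `*pwc`. CP949 characters are at most two bytes long, which would explain why no third byte is ever read; the three-byte sequences belong to UTF-8, which needs its own decoding step. The sketch below mirrors only the length logic of the quoted function. `cp949_char_length` and `cp949_count_chars` are hypothetical helpers, and trail-byte validation is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes a CP949 character occupies, judged from its lead byte c,
   following the ranges in the quoted cp949_mbtowc:
   1 for ASCII, 2 for lead bytes 0x81..0xFE, 0 if c cannot start a char. */
int cp949_char_length(unsigned char c)
{
    if (c < 0x80)
        return 1;
    if (c >= 0x81 && c < 0xFF)
        return 2;
    return 0;
}

/* Count the characters in a CP949 buffer of n bytes.
   Returns (size_t)-1 on an invalid lead byte or a truncated pair. */
size_t cp949_count_chars(const unsigned char *s, size_t n)
{
    size_t i = 0, count = 0;
    assert(s != NULL);
    while (i < n) {
        int len = cp949_char_length(s[i]);
        if (len == 0 || (size_t)len > n - i)
            return (size_t)-1;
        i += (size_t)len;
        ++count;
    }
    return count;
}
```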

dbkblk · Feb 25, 2015

I'm still trying to unravel Korean's mysteries. I'm looking for the first char (Unicode C544) and I'm surprised I can't find it in the CP949 table. Look here. There is a jump between C543 and C546. Poor me.

EDIT: Useless post, it's in BEC6 (CP949).
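
As a worked example of how BEC6 is looked up, the sketch below repeats the index arithmetic from the quoted `cp949_mbtowc` and `ksc5601_mbtowc`. `ksc5601_table_index` is a hypothetical helper, and it skips the row-range checks that the real function performs.

```c
#include <assert.h>
#include <stddef.h>

/* Table index for a two-byte CP949 sequence s[0], s[1] in the code set 1
   range (both bytes 0xA1..0xFE). Returns -1 outside that range. */
int ksc5601_table_index(const unsigned char *s)
{
    unsigned char c1, c2;
    assert(s != NULL);
    if (s[0] < 0xA1 || s[0] == 0xFF || s[1] < 0xA1 || s[1] == 0xFF)
        return -1;
    /* cp949_mbtowc subtracts 0x80 from both bytes before the lookup. */
    c1 = (unsigned char)(s[0] - 0x80);
    c2 = (unsigned char)(s[1] - 0x80);
    /* ksc5601_mbtowc then treats the pair as row and column of a 94x94 grid. */
    return 94 * (c1 - 0x21) + (c2 - 0x21);
}
```

For BE C6 this gives 94 * 29 + 37 = 2763. That falls in the `1410 <= i < 3854` branch, so the value would be read from `ksc5601_2uni_page30[2763 - 1410]`, that is, slot 1353.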

Milliopolis · Feb 25, 2015

@dbkblk
Hello, I found you at the civ4 wiki Napoleon server.
I can tell you some things about the Japanese version of civ4, but I'm not good at programming.
So I know what is troubling you, but not how to solve the problem.

Multiwiki distributes an XML patch for the language locale problem in Japanese, to play the English version.
http://civ4multi.info/patch/
This patch will help you.

The unofficial Chinese patch is an appropriation of the Japanese patch.
Hisuipanda has managed to Japanize the Steam version by using the unofficial Chinese patch.
http://twilog.org/hisuipanda/date-140311
http://twilog.org/hisuipanda/date-140312

dbkblk · Feb 25, 2015

Milliopolis said:
Multiwiki distributes an XML patch for the language locale problem in Japanese, to play the English version.
http://civ4multi.info/patch/
This patch will help you.

In fact, we're trying to make the game read UTF-8 so we can save Japanese in text files and convert the text on the fly, when the executable runs, to display it on a Japanese PC. The method works for 1-byte languages but not for 2-byte ones (which includes Japanese).
The patch you provided contains a fix for the Japanese version to display English, but no Japanese chars. We want to do the reverse: inject Japanese into the European version!

Milliopolis said:
The unofficial Chinese patch is an appropriation of the Japanese patch.
Hisuipanda has managed to Japanize the Steam version by using the unofficial Chinese patch.
http://twilog.org/hisuipanda/date-140311
http://twilog.org/hisuipanda/date-140312

I'll try to get in touch with him! Thank you for your concern, it is a pleasure to get some help from Japanese people, Milliopolis.

Nightinggale · Feb 25, 2015

Milliopolis said:
Multiwiki distributes an XML patch for the language locale problem in Japanese, to play the English version.
http://civ4multi.info/patch/
This patch will help you.

Sadly no. This patch only contains XML text files, and they contain only English. Looks like they are just the vanilla strings we already have.

However, I extracted http://civ4multi.info/forums/ from the installer. I wonder if there are people there who know a bit more about what is going on with the text encoding.
