[Dev tool] Civilization 4 XML translation tool

The code seems to rely on CP_ACP, which is the ANSI code page Windows is currently set to. For all we know, it only works if you switch to a Japanese locale.


Good question. In fact I find all the modifications a bit odd.

CvGameTextMgr.cpp: minor corrections, such as year before date. Nothing important.
exitingToMainMenu() takes a wide char pointer as argument. From what I can tell it is only used with NULL pointers.
CvDefines: gives it a unique ID for finding other players on GameSpy
CvInitCore: some wide strings are used as wide strings instead of converting to strings
CvString.h: the convert function is added, but appears not to be called

That's about it. Where is the magic that enables non-Latin characters in-game? :confused:

Could be mostly EXE changes.
 
Yeah. That is a bit disappointing. It doesn't seem to do anything more than call the WideCharTo... blabla function. Our method seems more adaptive. I'll try to extend our approach to two-byte chars using the iconv tables.

EDIT: @Afforess: I used WinMerge to compare, but it cannot generate a diff. Here's the original source code: https://dl.dropboxusercontent.com/u/369241/origina_bts3.19_source.7z
 
I haven't worked on Asian support for the last few hours but, to ease the conversion process, we could use MultiByteToWideChar to convert from one encoding to the other instead of using the double conversion table.
 
@Nightinggale: Have you found anything useful about the executable? Using Google Translate, I found some information on a Japanese website saying that the English executable doesn't support double-byte characters, but I guess they were talking about the DLL.

EDIT: I've got the Japanese version in a VM (don't ask me how ;)), so if you need some files I can send them to you.
 
@Nightinggale: Have you found anything useful about the executable?
I have been busy with other tasks so far, which means I haven't even investigated this yet.

Using Google Translate, I found some information on a Japanese website saying that the English executable doesn't support double-byte characters, but I guess they were talking about the DLL.
I suspect they were in fact talking about the exe. My debug code has yet to display multi-byte characters, and that is with Python speaking directly to the exe. This exe file just got a whole lot more interesting.

If the exe is in fact modified, then we could (at least in theory) inject code into the exe using the DLL. There are two problems with that approach, though. Not only have I never done something like that before, it would also trigger Steam's anti-cheat system and crash the game, meaning it would only work with the disc version.

Also where does this leave me with the Colonization exe :(

I wonder if we could copy the modified exe into the standard game directory, call it Civ4BeyondSwordTwoByte.exe, and then make a shortcut to it with the mod argument to start the right mod. I don't think Steam would like that either, though.
 
In fact, I don't really understand the problem here. If we convert multi-byte text to wchar before the executable does anything, then all it has to do is handle wchar, right? I haven't made progress on the code yet, but I have difficulty understanding how a 2-byte character is stored inside a wchar.

EDIT: Here is the exe and dll from the Chinese patch. Strangely enough, the executable isn't the same size as the one from the Japanese patch (which is good news, as the Chinese patch is unofficial).
 
I'm trying to adapt the iconv function for Korean. In fact, the problem comes before ConvertString: when "아브라함" (Korean) is inside "szTextVal", it becomes "ì•„ë¸Œë¼í•¨".
I cannot process the string if it is not correct in the input. Any ideas?
 
when "아브라함" (Korean) is inside "szTextVal", it becomes "ì•„ë¸Œë¼í•¨".
Both yes and no. When written to a string (or char array), 아브라함 has a specific sequence of bytes. When those bytes are read using code page 1252, it becomes ì•„ë¸Œë¼í•¨. If you read the same bytes as UTF-8, it reverts back to 아브라함.

The problem isn't the string, it's your display, which is using the wrong encoding.
 
Both yes and no. When written to a string (or char array), 아브라함 has a specific sequence of bytes. When those bytes are read using code page 1252, it becomes ì•„ë¸Œë¼í•¨. If you read the same bytes as UTF-8, it reverts back to 아브라함.
In fact, I'm starting Civ in debug mode and my computer's language for non-Unicode programs is set to Korean (chcp = 949). I think Visual Studio always outputs in CP1252.
That said, when I get the int value of the first char, it gives me 236, which is "ì".
Maybe the XML parser is reading the file as CP1252 even though I'm using 949?

EDIT: In the original iconv function, the string is passed in as a pointer, while here szTextVal is a standard string. I guess the first two values of the string make up a single char.
EDIT2: The answer was right under my nose: the wrong string has twice as many chars as the Korean one.
EDIT3: I bypassed the limitation by checking c2 = szTextValue.data()[i+1] when c1 > 127, and I've managed to get the Unicode number for the Korean char. But no display yet.
EDIT4: False alarm. The output code isn't the right one.
 
Here are the decoded Unicode values for this string:
Korean string: 아브라함
아 : 50500
브 : 48652
라 : 46972
함 : 54632

String: ì•„ë¸Œë¼í•¨
ì : 236
• : 8226
„ : 8222
ë : 235
¸ : 184
Π: 338
ë : 235
¼ : 188
í : 237
• : 8226
¨ : 168

What I'm trying to understand is the relation between the two.
 
You mixed up the encodings and wrote Unicode values. When it comes to the values of each byte, Unicode is different from UTF-8 :crazyeye:

Also, you read the bytes as CP1252, not as Unicode.

Korean string: 아브라함 (UTF-8)
아 : 50500 EC 95 84
브 : 48652 EB B8 8C
라 : 46972 EB 9D BC
함 : 54632 ED 95 A8

String: ì•„ë¸Œë¼í•¨ (in CP1252)
ì : 236 EC
• : 8226 95
„ : 8222 84
ë : 235 EB
¸ : 184 B8
Π: 338 8C
ë : 235 EB
 9D (unprintable on my screen)
¼ : 188 BC
í : 237 ED
• : 8226 95
¨ : 168 A8

They look like they match perfectly to me.
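
To make the second table reproducible, here is a minimal sketch of what a CP1252 reader does with each byte. The helper names cp1252_to_unicode and cp1252_view are made up for this post; the 0x80..0x9F mapping is the published CP1252 table, with 0 standing in for the five positions it leaves undefined (which is why 9D shows nothing).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unicode code points for CP1252 bytes 0x80..0x9F. A 0 marks the five
   positions CP1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D). */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical helper: the code point a CP1252 reader shows for one byte.
   Outside 0x80..0x9F the byte value and the code point are the same. */
uint32_t cp1252_to_unicode(unsigned char byte)
{
    if (byte < 0x80 || byte >= 0xA0)
        return byte;
    return cp1252_high[byte - 0x80];
}

/* Hypothetical helper: view a whole byte buffer the way CP1252 would. */
void cp1252_view(const unsigned char *bytes, size_t n, uint32_t *out)
{
    assert(bytes != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = cp1252_to_unicode(bytes[i]);
}
```

Feeding it the UTF-8 bytes EC 95 84 gives 236, 8226, 8222, which is exactly the "ì•„" from the list. Nothing is lost in the string itself; only the interpretation differs.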
 
OK. But why are these chars said to use two bytes if they take three bytes? I guess Unicode encodes Korean in three bytes and we have to bring them down to two.
 
OK. But why are these chars said to use two bytes if they take three bytes? I guess Unicode encodes Korean in three bytes and we have to bring them down to two.
That is one of the great issues with UTF-8. Asian characters tend to use 3 bytes in UTF-8, while they use 2 bytes in code pages or similar local character sets.

I wrote some code to read UTF-8 and store the unicode ID (the real unicode ID). Technically it converts from UTF-8 to UTF-32. Then the iconv tables are used to convert that number back into a single byte using a codepage encoding. If we are to use a two byte codepage (assuming the exe accepts it), the UTF-32 has to be converted into that codepage.
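
For anyone following along, the UTF-8 to UTF-32 step can be sketched roughly like this. It is not the code from the mod, just a standalone illustration, and the names utf8_decode and utf8_first_code_point are invented for this post. The lead byte tells how many bytes belong to the character: 0xEC matches the 1110xxxx pattern, so three bytes, which is the case for everything from U+0800 to U+FFFF and therefore for all Hangul syllables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode one UTF-8 sequence starting at s, with n bytes available.
   Stores the code point in *cp and returns the number of bytes consumed,
   or 0 if the sequence is truncated or invalid. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point allowed for each sequence length (rejects overlongs). */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len;
    uint32_t value;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }          /* 0xxxxxxx: ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; value = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; value = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; value = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (n < len)
        return 0;
    for (size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80)  /* continuation bytes look like 10xxxxxx */
            return 0;
        value = (value << 6) | (s[i] & 0x3F);
    }
    if (value < min_for_len[len] || value > 0x10FFFF ||
        (value >= 0xD800 && value <= 0xDFFF))
        return 0;
    *cp = value;
    return len;
}

/* Convenience wrapper: first code point of a C string, U+FFFD on error. */
uint32_t utf8_first_code_point(const char *text)
{
    uint32_t cp = 0xFFFD;
    assert(text != NULL);
    if (utf8_decode((const unsigned char *)text, strlen(text), &cp) == 0)
        return 0xFFFD;
    return cp;
}
```

With the bytes from the table above, EC 95 84 decodes to 50500 (U+C544). That number is what then has to be looked up in a code page table, single-byte or double-byte.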
 
I've tried to use the iconv tables for Hangul to get the right chars, but I assumed they were two bytes each, so it cannot work. I'll try with three.

EDIT: WAIT! There is something I can't understand. The iconv function to convert UTF8 to 949 only handles two chars, so how is the third one supposed to be handled?
Code:
static int
cp949_mbtowc (conv_t conv, ucs4_t *pwc, const unsigned char *s, int n)
{
  unsigned char c = *s;
  /* Code set 0 (ASCII) */
  if (c < 0x80)
    return ascii_mbtowc(conv,pwc,s,n);
  /* UHC part 1 */
  if (c >= 0x81 && c <= 0xa0)
    return uhc_1_mbtowc(conv,pwc,s,n);
  if (c >= 0xa1 && c < 0xff) {
    if (n < 2)
      return RET_TOOFEW(0);
    {
      unsigned char c2 = s[1];
      if (c2 < 0xa1)
        /* UHC part 2 */
        return uhc_2_mbtowc(conv,pwc,s,n);
      else if (c2 < 0xff && !(c == 0xa2 && c2 == 0xe8)) {
        /* Code set 1 (KS C 5601-1992, now KS X 1001:1998) */
        unsigned char buf[2];
        int ret;
        buf[0] = c-0x80; buf[1] = c2-0x80;
        ret = ksc5601_mbtowc(conv,pwc,buf,2);
        if (ret != RET_ILSEQ)
          return ret;
        /* User-defined characters */
        if (c == 0xc9) {
          *pwc = 0xe000 + (c2 - 0xa1);
          return 2;
        }
        if (c == 0xfe) {
          *pwc = 0xe05e + (c2 - 0xa1);
          return 2;
        }
      }
    }
  }
  return RET_ILSEQ;
}
Code:
static int
uhc_1_mbtowc (conv_t conv, ucs4_t *pwc, const unsigned char *s, int n)
{
  unsigned char c1 = s[0];
  if ((c1 >= 0x81 && c1 <= 0xa0)) {
    if (n >= 2) {
      unsigned char c2 = s[1];
      if ((c2 >= 0x41 && c2 < 0x5b) || (c2 >= 0x61 && c2 < 0x7b) || (c2 >= 0x81 && c2 < 0xff)) {
        unsigned int row = c1 - 0x81;
        unsigned int col = c2 - (c2 >= 0x81 ? 0x4d : c2 >= 0x61 ? 0x47 : 0x41);
        unsigned int i = 178 * row + col;
        if (i < 5696) {
          *pwc = (ucs4_t) (uhc_1_2uni_main_page81[2*row+(col>=89?1:0)] + uhc_1_2uni_page81[i]);
          return 2;
        }
      }
      return RET_ILSEQ;
    }
    return RET_TOOFEW(0);
  }
  return RET_ILSEQ;
}
Code:
static int
uhc_2_mbtowc (conv_t conv, ucs4_t *pwc, const unsigned char *s, int n)
{
  unsigned char c1 = s[0];
  if ((c1 >= 0xa1 && c1 <= 0xc6)) {
    if (n >= 2) {
      unsigned char c2 = s[1];
      if ((c2 >= 0x41 && c2 < 0x5b) || (c2 >= 0x61 && c2 < 0x7b) || (c2 >= 0x81 && c2 < 0xa1)) {
        unsigned int row = c1 - 0xa1;
        unsigned int col = c2 - (c2 >= 0x81 ? 0x4d : c2 >= 0x61 ? 0x47 : 0x41);
        unsigned int i = 84 * row + col;
        if (i < 3126) {
          *pwc = (ucs4_t) (uhc_2_2uni_main_pagea1[2*row+(col>=42?1:0)] + uhc_2_2uni_pagea1[i]);
          return 2;
        }
      }
      return RET_ILSEQ;
    }
    return RET_TOOFEW(0);
  }
  return RET_ILSEQ;
}
Code:
static int
ksc5601_mbtowc (conv_t conv, ucs4_t *pwc, const unsigned char *s, int n)
{
  unsigned char c1 = s[0];
  if ((c1 >= 0x21 && c1 <= 0x2c) || (c1 >= 0x30 && c1 <= 0x48) || (c1 >= 0x4a && c1 <= 0x7d)) {
    if (n >= 2) {
      unsigned char c2 = s[1];
      if (c2 >= 0x21 && c2 < 0x7f) {
        unsigned int i = 94 * (c1 - 0x21) + (c2 - 0x21);
        unsigned short wc = 0xfffd;
        if (i < 1410) {
          if (i < 1115)
            wc = ksc5601_2uni_page21[i];
        } else if (i < 3854) {
          if (i < 3760)
            wc = ksc5601_2uni_page30[i-1410];
        } else {
          if (i < 8742)
            wc = ksc5601_2uni_page4a[i-3854];
        }
        if (wc != 0xfffd) {
          *pwc = (ucs4_t) wc;
          return 2;
        }
      }
      return RET_ILSEQ;
    }
    return RET_TOOFEW(0);
  }
  return RET_ILSEQ;
}
 
I'm still trying to unravel Korean's mysteries. I'm looking for the first char (Unicode C544) and I'm surprised I can't find it in the CP949 table. Look here. There is a jump between C543 and C546. Poor me.

EDIT: Useless post, it's at BEC6 (CP949).
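
For reference, here is a small sketch of how BE C6 travels through the iconv functions posted earlier. It copies only the range checks and the index arithmetic from cp949_mbtowc and ksc5601_mbtowc; the enum and the function names cp949_classify and ksx1001_index are made up for illustration, and the lookup tables themselves are not reproduced, so this only shows where in the tables the char should sit, not the value stored there.

```c
#include <assert.h>
#include <stddef.h>

/* Branches of cp949_mbtowc, named here for illustration only. */
enum cp949_area {
    CP949_ASCII, CP949_UHC_1, CP949_UHC_2, CP949_KSX1001, CP949_INVALID
};

/* Mirror the range checks of cp949_mbtowc for the raw CP949 bytes at s. */
enum cp949_area cp949_classify(const unsigned char *s)
{
    assert(s != NULL);
    unsigned char c = s[0];
    if (c < 0x80)
        return CP949_ASCII;
    if (c >= 0x81 && c <= 0xa0)
        return CP949_UHC_1;
    if (c >= 0xa1 && c < 0xff) {
        unsigned char c2 = s[1];
        if (c2 < 0xa1)
            return CP949_UHC_2;
        if (c2 < 0xff && !(c == 0xa2 && c2 == 0xe8))
            return CP949_KSX1001;
    }
    return CP949_INVALID;
}

/* Flat index computed by ksc5601_mbtowc: both bytes lose 0x80, then
   94 cells per row. Returns -1 if the bytes never reach that table. */
int ksx1001_index(const unsigned char *s)
{
    assert(s != NULL);
    if (cp949_classify(s) != CP949_KSX1001)
        return -1;
    unsigned char c1 = (unsigned char)(s[0] - 0x80);
    unsigned char c2 = (unsigned char)(s[1] - 0x80);
    if (!((c1 >= 0x21 && c1 <= 0x2c) || (c1 >= 0x30 && c1 <= 0x48) ||
          (c1 >= 0x4a && c1 <= 0x7d)))
        return -1;
    if (c2 < 0x21 || c2 >= 0x7f)
        return -1;
    return 94 * (c1 - 0x21) + (c2 - 0x21);
}
```

For BE C6 the bytes become 3E 46, the index is 94 * 29 + 37 = 2763, and since that falls between 1410 and 3854 the lookup would be ksc5601_2uni_page30[2763 - 1410], i.e. entry 1353. Note that these functions take CP949 bytes as input, two per character, so there is no third byte at this stage; the three-byte form only exists on the UTF-8 side.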
 
@dbkblk
Hello, I found you through the civ4 wiki Napoleon server.
I can give you some information about the Japanese version of civ4, but I'm not good at programming.
So I know what is troubling you, but not how to solve the problem.

Multiwiki distributes an XML patch for the language locale problem in Japanese when playing the English version.
http://civ4multi.info/patch/
This patch will help you.

The unofficial Chinese patch is derived from the Japanese patch.
Hisuipanda has managed to Japanize the Steam version using the unofficial Chinese patch.
http://twilog.org/hisuipanda/date-140311
http://twilog.org/hisuipanda/date-140312
 
Multiwiki distributes an XML patch for the language locale problem in Japanese when playing the English version.
http://civ4multi.info/patch/
This patch will help you.
In fact, we're trying to make the game read UTF-8, so that we can save Japanese in the text files and convert the text on the fly when the executable runs, to display it on a Japanese PC. The method works for 1-byte languages but not for 2-byte ones (which include Japanese).
The patch you provided contains a fix for the Japanese version to display English, but no Japanese chars. We want to do the reverse: inject Japanese into the European version!

The unofficial Chinese patch is derived from the Japanese patch.
Hisuipanda has managed to Japanize the Steam version using the unofficial Chinese patch.
http://twilog.org/hisuipanda/date-140311
http://twilog.org/hisuipanda/date-140312
I'll try to get in touch with him! Thank you for your concern, Milliopolis. It is a pleasure to get some help from Japanese people.
 
Multiwiki distributes an XML patch for the language locale problem in Japanese when playing the English version.
http://civ4multi.info/patch/
This patch will help you.
Sadly no. This patch only contains XML text files, and they contain only English. Looks like they are just the vanilla strings we already have :(

However, I extracted http://civ4multi.info/forums/ from the installer. I wonder if there are people there who know a bit more about what is going on with the text encoding.
 