
[Dev tool] Civilization 4 XML translation tool

Discussion in 'Rise of Mankind: A New Dawn' started by dbkblk, Jun 2, 2014.

  1. Afforess

    Afforess The White Wizard

    Joined:
    Jul 31, 2007
    Messages:
    12,239
    Location:
    Austin, Texas
    Could be mostly EXE changes.
     
  2. dbkblk

    dbkblk Emperor

    Joined:
    Oct 28, 2005
    Messages:
    1,781
    Location:
    France
    Yeah, that is a bit disappointing. It doesn't seem to do anything more than call the WideCharTo... blabla function. Our method seems more adaptable. I'll try to extend our approach to two-byte chars using the iconv tables.

    EDIT: @Afforess: I used WinMerge to compare, but it cannot generate a diff. Here's the original source code: https://dl.dropboxusercontent.com/u/369241/origina_bts3.19_source.7z
     
  3. Nightinggale

    Nightinggale Deity

    Joined:
    Feb 2, 2009
    Messages:
    4,272
    This brings up an interesting question. Is the exe modified? If so, I would like to see it. I'm not sure I would be able to do anything with it, but I would still like to try.
     
  4. dbkblk

    dbkblk Emperor

    Joined:
    Oct 28, 2005
    Messages:
    1,781
    Location:
    France
  5. dbkblk

    dbkblk Emperor

    Joined:
    Oct 28, 2005
    Messages:
    1,781
    Location:
    France
    I haven't worked on Asian support for the last few hours but, to ease the conversion process, we could use MultiByteToWideChar to convert each char directly instead of using the double conversion table.
     
  6. dbkblk

    dbkblk Emperor

    Joined:
    Oct 28, 2005
    Messages:
    1,781
    Location:
    France
    @Nightinggale: Have you found anything useful about the executable? With Google Translate, I found some info on a Japanese website saying that the English executable doesn't support double-byte characters, but I guess they were talking about the DLL.

    EDIT: I've got the Japanese version in a VM (don't ask me how ;)) so if you need some files I can send them to you.
     
  7. Nightinggale

    Nightinggale Deity

    Joined:
    Feb 2, 2009
    Messages:
    4,272
    I have been busy with other tasks so far, which means I haven't even investigated this yet.

    I suspect they were in fact talking about the exe. My debug code has yet to display multi byte characters and that is with python speaking directly to the exe. This exe file just got a whole lot more interesting.

    If the exe is in fact modified, then we could (at least in theory) inject code into the exe using the DLL. There are two problems with that approach though. Not only have I never done anything like that before, it would also trigger Steam's anti-cheat system and crash the game, meaning it would only work with the disc version.

    Also where does this leave me with the Colonization exe :(

    I wonder if we could copy the modified exe into the standard game directory, call it Civ4BeyondSwordTwoByte.exe, and then make a shortcut to it with the mod argument to start the right mod. I don't think Steam would like that either though.
     
  8. dbkblk

    dbkblk Emperor

    Joined:
    Oct 28, 2005
    Messages:
    1,781
    Location:
    France
    In fact, I don't really understand the problem here. If we convert multibyte strings to wchar before the executable does anything, then all it has to do is handle wchar? I haven't progressed on the code yet, but I have difficulty understanding how a two-byte char is stored inside a wchar.

    EDIT: Here are the exe and DLL from the Chinese patch. Strangely enough, the executable doesn't have the same size as the one from the Japanese patch (which is good news, as the Chinese patch is unofficial).
     
  9. dbkblk

    dbkblk Emperor

    Joined:
    Oct 28, 2005
    Messages:
    1,781
    Location:
    France
    I'm trying to adapt the iconv function for Korean. In fact, the problem occurs before ConvertString: when "아브라함" (Korean) is stored in "szTextVal", it becomes "아브라함".
    I cannot process the string if the input is already incorrect. Any ideas?
     
  10. Nightinggale

    Nightinggale Deity

    Joined:
    Feb 2, 2009
    Messages:
    4,272
    Both yes and no. When writing to a string (or char array), 아브라함 has a specific sequence of bytes. When those bytes are read using CodePage 1252, it becomes 아브라함. If you read the same string as UTF-8, it reverts back to 아브라함.

    The problem isn't the string, it's your display, which is using the wrong encoding.
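    As a rough illustration of the point above, the sketch below shows what a CP1252 reader displays for each byte of a buffer. The function names `cp1252_to_codepoint` and `view_as_cp1252` are hypothetical and are not taken from the mod's source; the table covers the 0x80..0x9F range of Windows-1252, with 0 standing in for the five unassigned bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Code points for CP1252 bytes 0x80..0x9F; 0 marks the five unassigned bytes. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical helper: the code point a CP1252 reader shows for one byte
   (0 if the byte is unassigned in CP1252). */
uint32_t cp1252_to_codepoint(unsigned char b)
{
    if (b >= 0x80 && b <= 0x9F)
        return cp1252_high[b - 0x80];
    return b; /* 0x00..0x7F and 0xA0..0xFF keep their numeric value */
}

/* Hypothetical helper: fill out[] with what CP1252 displays for each of
   the n bytes in s. The bytes themselves are never modified. */
void view_as_cp1252(const unsigned char *s, size_t n, uint32_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = cp1252_to_codepoint(s[i]);
}
```

    Feeding it the three UTF-8 bytes of 아 (EC 95 84) gives three separate CP1252 characters, which is why one Korean syllable appears as three symbols in a CP1252 debugger view even though the stored bytes are intact.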
     
  11. dbkblk

    dbkblk Emperor

    Joined:
    Oct 28, 2005
    Messages:
    1,781
    Location:
    France
    In fact, I'm starting Civ in Debug mode and my computer's language for non-Unicode programs is set to Korean (chcp = 949). I think Visual Studio always outputs in cp1252.
    That said, when I get the int of the first char, it gives me 236, which is "ì".
    Maybe the XML parser is reading the file as 1252 even though I'm using 949?

    EDIT: In the original iconv function, the string is passed in as a pointer, while here szTextVal is a standard string. I guess the first two values of the string form a single char.
    EDIT2: The answer is right under my nose, as the wrong string has twice as many chars as the Korean one.
    EDIT3: I bypassed the limitation by checking c2 = szTextValue.data()[i+1] when c1 > 127, and I've managed to get the Unicode number for the Korean char. But no display yet.
    EDIT4: False alarm. The output code isn't the right one.
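    The "c1 > 127, then also read c2" idea from EDIT3 can be sketched as below. This is only a minimal sketch, assuming the buffer really is in a double-byte code page such as CP949 (lead bytes 0x81..0xFE, as in the iconv code quoted later in the thread). The name `dbcs_char_count` is hypothetical. The same pairing rule does not hold for UTF-8 input, where each of these Korean syllables takes three bytes.

```c
#include <stddef.h>

/* Hypothetical helper: count characters in a CP949-style buffer.
   Bytes below 0x80 stand alone; bytes 0x81..0xFE open a two-byte pair.
   Returns -1 if a lead byte is invalid or its trail byte is missing. */
long dbcs_char_count(const unsigned char *s, size_t n)
{
    long count = 0;
    size_t i = 0;

    while (i < n) {
        unsigned char c1 = s[i];
        if (c1 < 0x80) {
            i += 1;                 /* single-byte (ASCII) character */
        } else if (c1 >= 0x81 && c1 <= 0xFE) {
            if (i + 1 >= n)
                return -1;          /* truncated pair */
            i += 2;                 /* lead byte plus trail byte c2 */
        } else {
            return -1;              /* 0x80 and 0xFF are not lead bytes */
        }
        count++;
    }
    return count;
}
```

    For example, the three bytes 41 BE C6 ("A" followed by one CP949 pair) count as two characters, while a lone BE at the end of a buffer is reported as an error.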
     
  12. dbkblk

    dbkblk Emperor

    Joined:
    Oct 28, 2005
    Messages:
    1,781
    Location:
    France
    Here are the decoded Unicode values for this string:
    Korean string: 아브라함
    아 : 50500
    브 : 48652
    라 : 46972
    함 : 54632

    String: 아브라함
    ì : 236
    • : 8226
    „ : 8222
    ë : 235
    ¸ : 184
    Π: 338
    ë : 235
    ¼ : 188
    í : 237
    • : 8226
    ¨ : 168

    What I'm trying to understand is the relation between the two.
     
  13. Nightinggale

    Nightinggale Deity

    Joined:
    Feb 2, 2009
    Messages:
    4,272
    You mixed up the encodings and wrote Unicode code points. The byte values of a UTF-8 string are different from its Unicode code points :crazyeye:

    Also, you read the string as CP1252, not Unicode.

    Korean string: 아브라함 (UTF-8)
    아 : 50500 EC 95 84
    브 : 48652 EB B8 8C
    라 : 46972 EB 9D BC
    함 : 54632 ED 95 A8

    String: 아브라함 (in CP1252)
    ì : 236 EC
    • : 8226 95
    „ : 8222 84
    ë : 235 EB
    ¸ : 184 B8
    Π: 338 8C
    ë : 235 EB
     9D (unprintable on my screen)
    ¼ : 188 BC
    í : 237 ED
    • : 8226 95
    ¨ : 168 A8

    They look like they match perfectly to me.
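    The byte columns in the first table can be reproduced from the code points with a standard UTF-8 encoder. The sketch below is a generic encoder and not code from the mod; the name `utf8_encode` is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8 into out[4].
   Returns the number of bytes written, or 0 if cp cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not valid scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                       /* beyond the Unicode range */
}
```

    Encoding 50500 (U+C544, 아) with it yields the three bytes EC 95 84, the same bytes that CP1252 then displays as three separate characters.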
     
  14. dbkblk

    dbkblk Emperor

    Joined:
    Oct 28, 2005
    Messages:
    1,781
    Location:
    France
    OK. But why are these chars said to use two bytes if they take three bytes? I guess UTF-8 encodes Korean in three bytes and we have to convert them to two.
     
  15. Nightinggale

    Nightinggale Deity

    Joined:
    Feb 2, 2009
    Messages:
    4,272
    That is one of the great issues with UTF-8. Asian characters tend to use 3 bytes, while they use 2 bytes in codepages or similar local character sets.

    I wrote some code to read UTF-8 and store the unicode ID (the real unicode ID). Technically it converts from UTF-8 to UTF-32. Then the iconv tables are used to convert that number back into a single byte using a codepage encoding. If we are to use a two byte codepage (assuming the exe accepts it), the UTF-32 has to be converted into that codepage.
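    The first half of that pipeline (UTF-8 in, one real Unicode ID per character out) might look roughly like the sketch below. It is not the actual code from the mod: `utf8_decode` and `utf8_to_utf32` are hypothetical names, and the codepage lookup that would follow is left out because it depends on the iconv tables.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (n bytes available).
   Stores the code point in *cp and returns the number of bytes consumed,
   or 0 if the input is truncated or malformed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len;
    uint32_t value, min;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; value = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; value = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; value = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                   /* stray continuation or invalid lead byte */
    }
    if (n < len)
        return 0;
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* continuation bytes must be 10xxxxxx */
        value = (value << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and out-of-range values. */
    if (value < min || value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF))
        return 0;
    *cp = value;
    return len;
}

/* Hypothetical helper: decode a whole buffer into out[] (capacity cap).
   Returns the number of code points, or -1 on error or overflow. */
long utf8_to_utf32(const unsigned char *s, size_t n, uint32_t *out, size_t cap)
{
    size_t i = 0, count = 0;

    while (i < n) {
        uint32_t cp;
        size_t used = utf8_decode(s + i, n - i, &cp);
        if (used == 0 || count == cap)
            return -1;
        out[count++] = cp;
        i += used;
    }
    return (long)count;
}
```

    Run on the twelve bytes of 아브라함, this gives four code points (50500, 48652, 46972, 54632), each of which would then need a lookup in a two-byte codepage table instead of a single-byte one.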
     
  16. dbkblk

    dbkblk Emperor

    Joined:
    Oct 28, 2005
    Messages:
    1,781
    Location:
    France
    I've tried to use the iconv tables for Hangul to get the right chars, but I assumed the input was on two bytes, so it cannot work. I'll try with three.

    EDIT: WAIT! There is something I can't understand. The iconv function to convert UTF8 to 949 only handles two chars, so how is the third one supposed to be handled?
    Code:
    static int
    cp949_mbtowc (conv_t conv, ucs4_t *pwc, const unsigned char *s, int n)
    {
      unsigned char c = *s;
      /* Code set 0 (ASCII) */
      if (c < 0x80)
        return ascii_mbtowc(conv,pwc,s,n);
      /* UHC part 1 */
      if (c >= 0x81 && c <= 0xa0)
        return uhc_1_mbtowc(conv,pwc,s,n);
      if (c >= 0xa1 && c < 0xff) {
        if (n < 2)
          return RET_TOOFEW(0);
        {
          unsigned char c2 = s[1];
          if (c2 < 0xa1)
            /* UHC part 2 */
            return uhc_2_mbtowc(conv,pwc,s,n);
          else if (c2 < 0xff && !(c == 0xa2 && c2 == 0xe8)) {
            /* Code set 1 (KS C 5601-1992, now KS X 1001:1998) */
            unsigned char buf[2];
            int ret;
            buf[0] = c-0x80; buf[1] = c2-0x80;
            ret = ksc5601_mbtowc(conv,pwc,buf,2);
            if (ret != RET_ILSEQ)
              return ret;
            /* User-defined characters */
            if (c == 0xc9) {
              *pwc = 0xe000 + (c2 - 0xa1);
              return 2;
            }
            if (c == 0xfe) {
              *pwc = 0xe05e + (c2 - 0xa1);
              return 2;
            }
          }
        }
      }
      return RET_ILSEQ;
    }
    Code:
    static int
    uhc_1_mbtowc (conv_t conv, ucs4_t *pwc, const unsigned char *s, int n)
    {
      unsigned char c1 = s[0];
      if ((c1 >= 0x81 && c1 <= 0xa0)) {
        if (n >= 2) {
          unsigned char c2 = s[1];
          if ((c2 >= 0x41 && c2 < 0x5b) || (c2 >= 0x61 && c2 < 0x7b) || (c2 >= 0x81 && c2 < 0xff)) {
            unsigned int row = c1 - 0x81;
            unsigned int col = c2 - (c2 >= 0x81 ? 0x4d : c2 >= 0x61 ? 0x47 : 0x41);
            unsigned int i = 178 * row + col;
            if (i < 5696) {
              *pwc = (ucs4_t) (uhc_1_2uni_main_page81[2*row+(col>=89?1:0)] + uhc_1_2uni_page81[i]);
              return 2;
            }
          }
          return RET_ILSEQ;
        }
        return RET_TOOFEW(0);
      }
      return RET_ILSEQ;
    }
    Code:
    static int
    uhc_2_mbtowc (conv_t conv, ucs4_t *pwc, const unsigned char *s, int n)
    {
      unsigned char c1 = s[0];
      if ((c1 >= 0xa1 && c1 <= 0xc6)) {
        if (n >= 2) {
          unsigned char c2 = s[1];
          if ((c2 >= 0x41 && c2 < 0x5b) || (c2 >= 0x61 && c2 < 0x7b) || (c2 >= 0x81 && c2 < 0xa1)) {
            unsigned int row = c1 - 0xa1;
            unsigned int col = c2 - (c2 >= 0x81 ? 0x4d : c2 >= 0x61 ? 0x47 : 0x41);
            unsigned int i = 84 * row + col;
            if (i < 3126) {
              *pwc = (ucs4_t) (uhc_2_2uni_main_pagea1[2*row+(col>=42?1:0)] + uhc_2_2uni_pagea1[i]);
              return 2;
            }
          }
          return RET_ILSEQ;
        }
        return RET_TOOFEW(0);
      }
      return RET_ILSEQ;
    }
    Code:
    static int
    ksc5601_mbtowc (conv_t conv, ucs4_t *pwc, const unsigned char *s, int n)
    {
      unsigned char c1 = s[0];
      if ((c1 >= 0x21 && c1 <= 0x2c) || (c1 >= 0x30 && c1 <= 0x48) || (c1 >= 0x4a && c1 <= 0x7d)) {
        if (n >= 2) {
          unsigned char c2 = s[1];
          if (c2 >= 0x21 && c2 < 0x7f) {
            unsigned int i = 94 * (c1 - 0x21) + (c2 - 0x21);
            unsigned short wc = 0xfffd;
            if (i < 1410) {
              if (i < 1115)
                wc = ksc5601_2uni_page21[i];
            } else if (i < 3854) {
              if (i < 3760)
                wc = ksc5601_2uni_page30[i-1410];
            } else {
              if (i < 8742)
                wc = ksc5601_2uni_page4a[i-3854];
            }
            if (wc != 0xfffd) {
              *pwc = (ucs4_t) wc;
              return 2;
            }
          }
          return RET_ILSEQ;
        }
        return RET_TOOFEW(0);
      }
      return RET_ILSEQ;
    }
     
  17. dbkblk

    dbkblk Emperor

    Joined:
    Oct 28, 2005
    Messages:
    1,781
    Location:
    France
    I'm still trying to unravel the mysteries of Korean. I'm looking for the first char (Unicode C544) and I'm surprised I can't find it in the CP949 table. Look here. There is a jump between C543 and C546. Poor me.

    EDIT: Useless post, it's in BEC6 (CP949).
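    To see where the pair BE C6 goes in the iconv code quoted earlier, the sketch below repeats only the range checks and index arithmetic of the "Code set 1" branch of cp949_mbtowc and of ksc5601_mbtowc. The function name `cp949_ksc5601_index` is hypothetical, and the real conversion tables are not included, so it reports the table index but not the resulting Unicode value.

```c
/* Hypothetical helper: for a CP949 pair (c, c2) that falls in the
   KS C 5601 branch of the quoted cp949_mbtowc, return the index i that
   ksc5601_mbtowc computes before its table lookup. Returns -1 if the
   pair does not take that branch. */
int cp949_ksc5601_index(unsigned char c, unsigned char c2)
{
    unsigned char b1, b2;

    if (!(c >= 0xa1 && c < 0xff))
        return -1;                  /* lead byte outside 0xA1..0xFE */
    if (!(c2 >= 0xa1 && c2 < 0xff))
        return -1;                  /* c2 < 0xA1 is the UHC part 2 branch */
    if (c == 0xa2 && c2 == 0xe8)
        return -1;                  /* pair excluded in the quoted code */

    b1 = (unsigned char)(c - 0x80); /* same shift as buf[0] = c-0x80 */
    b2 = (unsigned char)(c2 - 0x80);
    if (!((b1 >= 0x21 && b1 <= 0x2c) || (b1 >= 0x30 && b1 <= 0x48) ||
          (b1 >= 0x4a && b1 <= 0x7d)))
        return -1;                  /* row not covered by ksc5601_mbtowc */

    return 94 * (b1 - 0x21) + (b2 - 0x21);
}
```

    For BE C6 the shifted bytes are 3E 46, so i = 94 * 29 + 37 = 2763. That lies between 1410 and 3760, which means the lookup would read ksc5601_2uni_page30[1353] in the quoted ksc5601_mbtowc.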
     
  18. Milliopolis

    Milliopolis Chieftain

    Joined:
    Dec 30, 2013
    Messages:
    9
    @dbkblk
    Hello, I found you via the civ4 wiki Napoleon server.
    I can tell you some information about the civ4 Japanese version, but I'm not good at programming.
    So, I know what is troubling you but not how to solve the problem.

    Multiwiki distributes an XML patch for the language locale problem that Japanese players have when playing the English version.
    http://civ4multi.info/patch/
    This patch will help you.

    The unofficial Chinese patch is adapted from the Japanese patch.
    Hisuipanda has managed to Japanize the Steam version by using the unofficial Chinese patch.
    http://twilog.org/hisuipanda/date-140311
    http://twilog.org/hisuipanda/date-140312
     
  19. dbkblk

    dbkblk Emperor

    Joined:
    Oct 28, 2005
    Messages:
    1,781
    Location:
    France
    In fact, we're trying to make the game read UTF-8 so we can save Japanese in text files and convert the text on the fly when the executable runs, to display it on a Japanese PC. The method works for 1-byte languages but not for 2-byte ones (which include Japanese).
    The patch you provided contains a fix for the Japanese version to display English, but no Japanese chars. We want to do the reverse: inject Japanese into the European version!

    I'll try to get in touch with him! Thank you for your concern, Milliopolis. It is a pleasure to get some help from Japanese people.
     
  20. Nightinggale

    Nightinggale Deity

    Joined:
    Feb 2, 2009
    Messages:
    4,272
    Sadly no. This patch only contains XML text files, and they contain only English. It looks like they are just the vanilla strings we already have :(

    However, I extracted http://civ4multi.info/forums/ from the installer. I wonder if there are people there who know a bit more about what is going on with the text encoding.
     
