Accents and XML Errors: A Plea to All Modders

Danapoppa

I hope y'all will pardon my posting in this forum. I'm not a modder (yet) and this is not exactly a tutorial, but it does contain information that may be of use to modders, and I hope you will all refer to it. If I've missed my mark, then I pray a kind moderator will remove it to a more appropriate location.


BACKGROUND:

I'm running Civ4 on a computer that runs a Japanese version of Windows XP. This is pretty much unavoidable. I bought my computer in Japan, and don't really feel like shelling out the extra bucks for an English OS. And as a native speaker of English, I'd just as soon not play Civ4 in Japanese.

As some of you may know, computers tend to encode text in Japanese (and indeed certain other languages) using two bytes per character. This doesn't affect common ASCII characters, such as the ones you're reading now. But certain two-byte combinations are co-opted by the OS to represent Japanese characters, and so cannot be used as they are in most Western-language OSes.

Why should you care about this? Well, among the co-opted combinations are a number of strings commonly found in XML files, particularly those which include foreign language translations. For example, the TAM_Eras.xml file included with The Ancient Mediterranean modpack contains the string

<Text>Antiquité</Text>

as its French translation of the English phrase "Stone Age." And it just so happens that the combination of an acute-accented "e" followed by the less-than sign is used to represent some obscure Japanese character or another. So when the XML parser gets to that point, it chokes and stops loading the file. Civ4 gamely tries to run the mod, but text errors abound, making the mod unplayable.
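For the curious, here is a minimal Python sketch of one way this can go wrong. It assumes the file stores the accented "e" as the single byte 0xE9 (as ISO 8859-1 does) and that the parser walks the text using the lead-byte ranges of Windows code page 932 (0x81-0x9F and 0xE0-0xFC), letting each lead byte swallow the byte after it. Whether Civ4's parser behaves exactly this way is a guess, and the function names are made up for illustration.

```python
def is_sjis_lead_byte(b: int) -> bool:
    """True if b falls in a lead-byte range of Windows code page 932."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_like_dbcs_scanner(data: bytes) -> list:
    """Split bytes into 'characters' the way a naive double-byte-aware
    scanner might: a lead byte always takes the following byte with it."""
    units = []
    i = 0
    while i < len(data):
        step = 2 if is_sjis_lead_byte(data[i]) and i + 1 < len(data) else 1
        units.append(data[i:i + step])
        i += step
    return units


# "\u00e9" is the acute-accented e; latin-1 stores it as the byte 0xE9.
line = "<Text>Antiquit\u00e9</Text>".encode("latin-1")
units = split_like_dbcs_scanner(line)

# The "<" of the closing tag ends up glued to byte 0xE9, so this scanner
# never sees a standalone "<" to begin "</Text>".
print(units)
```

Under these assumptions the only standalone "<" left in the line is the one that opens the first tag, which would explain why the closing tag goes missing.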

I've found one surefire way around this problem, and that is to replace the problematic characters with the appropriate character codes. For example, the accented e in the above string could (and indeed should) be represented by the following string of characters:

<Text>Antiquit&#233;</Text>

By going into the XML files and making these changes manually, I can fix the mod so the XML parser will load it without problem -- and even if I wished to run my copy of Civ4 in French, this change would presumably allow it to display the French translation for "Stone Age" correctly.
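If you'd rather not look the codes up by hand, Python's built-in "xmlcharrefreplace" error handler can generate them. This is only a minimal sketch; `to_char_refs` is a name invented here, not part of any Civ4 tool.

```python
def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a numeric character
    reference, leaving plain ASCII (including the tags) untouched."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# "\u00e9" is the acute-accented e from the example above.
print(to_char_refs("<Text>Antiquit\u00e9</Text>"))
# prints: <Text>Antiquit&#233;</Text>
```

The result is pure ASCII, so it should read the same no matter which encoding the OS assumes.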


WHY SHOULD YOU CARE?

By now you may be wondering why you've bothered reading this far. Well, you should care, if you really want people everywhere to play your mods.

I can state from experience (having manually repaired all the mods that shipped with Civ4 Complete) that fixing a mod this way involves a fair bit of slog-work. Often several different files need to be fixed; fortunately the XML parser tells us which ones contain errors.

But since the parser only reports the first error it encounters in any given file, there's no way of knowing how many errors of what type I'll need to fix going in. I may know about the acute-accented "e" in the French translation in line 16, but how about the umlauted "u" in the German translation in line 1256? So I have two choices: either (A) eyeball the whole file looking for all possible errors, or (B) fix the one I know about and load the mod again to see if the XML parser will choke on another error.

As I'm sure you can imagine, this takes a bit of time. It's not uncommon to have to spend an hour fixing accented characters before I can start playing a new mod. Not quite the entertainment I was hoping for! And it's not just once: every time you release an update to your mod, I'll have to go through and fix the same errors again, and again, and again.

Yes, I can probably cut down on the slog by creating a macro that will scan a batch of files and fix all accented characters automatically. But while that will fix the problem for me, what about all the other folks who would love to play your mod and don't possess my degree of kung fu? Most likely they'll download and install it, say "Bah, it's broken!" and never give it the chance it deserves. Are you prepared to write them off?
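For what it's worth, such a macro might look something like the Python sketch below. It assumes the mod's XML files are saved in ISO 8859-1 (Western European); if a file is actually in some other encoding, such as UTF-8, this would produce the wrong codes, so pass a different `source_encoding` in that case. The function names are hypothetical, and because the script overwrites files in place it should only be run on a copy of the mod.

```python
from pathlib import Path


def fix_xml_file(path: Path, source_encoding: str = "latin-1") -> bool:
    """Rewrite one file so every non-ASCII character becomes a numeric
    character reference. Returns True if the file was changed."""
    raw = path.read_bytes()
    if all(b < 0x80 for b in raw):
        return False  # already pure ASCII, nothing to do
    text = raw.decode(source_encoding)  # ASSUMPTION: encoding is known
    path.write_bytes(text.encode("ascii", errors="xmlcharrefreplace"))
    return True


def fix_mod_folder(folder: Path) -> list:
    """Fix every .xml file under folder; return the paths that changed."""
    return [p for p in sorted(folder.rglob("*.xml")) if fix_xml_file(p)]
```

Calling `fix_mod_folder(Path("copy_of_my_mod"))` (a made-up folder name) would return the list of files it had to touch, which doubles as a report of where the accents were hiding.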


WHAT NEEDS TO BE DONE?

If the XML files in your mod contain foreign language translations (French, Spanish, German, and Italian are the most common culprits) then they need to be fixed so they can be parsed by systems which use two-byte encodings. Accented characters should be replaced by the appropriate numeric character codes (such as &#233; for an acute-accented "e"), which are the same codes used to ensure correct display by HTML browsers. The list I usually refer to is this one right here:

http://www.w3.org/MarkUp/html-spec/html-spec_13.html

Ideally these codes should be used to represent all accented characters (and special characters) in the XML files. In practice, however, it's generally only necessary to replace characters that come directly before a less-than sign. (At least, that's what works for me. I can't be certain how systems running in other languages are affected.)
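To find those spots without eyeballing the whole file, something like the following Python sketch could help. It assumes a single-byte encoding in which any byte of 0x80 or above is an accented or special character; `find_trouble_spots` and the sample data are invented for illustration.

```python
import re

# Any byte >= 0x80 that sits directly before a "<".
HIGH_BYTE_BEFORE_LT = re.compile(rb"[\x80-\xff](?=<)")


def find_trouble_spots(data: bytes) -> list:
    """Return (line_number, byte_value) for each high byte right before "<"."""
    spots = []
    for match in HIGH_BYTE_BEFORE_LT.finditer(data):
        line_no = data.count(b"\n", 0, match.start()) + 1
        spots.append((line_no, match.group()[0]))
    return spots


sample = (
    b"<Text>Antiquit\xe9</Text>\n"  # accented e right before "<": flagged
    b"<Text>M\xfcnchen</Text>\n"    # umlaut in mid-word: not flagged
    b"<Text>Caf\xe9</Text>\n"       # accented e right before "<": flagged
)
for line_no, byte in find_trouble_spots(sample):
    print(f"line {line_no}: byte 0x{byte:02X} directly before '<'")
```

For a real file, read it with `Path(...).read_bytes()` and pass the result in; every hit comes back at once, with its line number, rather than one per launch of the game.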

I hope you can agree it would be a good idea if this problem were fixed at the source. That way the fix only needs to be applied once. Then it will be fixed for everybody, and for future versions.

To maximize the audience for your mod, please consider adopting this fix and making it part of your mod creation workflow.

And keep up the great work!
 
Well, interesting post. I feel your concern. I may consider doing that with my mod, but I'm currently swamped with other (maybe more crucial) work on it. I do have some of those accents in the text of my mod, though not very many. I did note, though, that it has English-only text support. Many of the French, German, Italian, and Spanish tags are left empty anyhow.

It's not a top priority of mine, but I'll consider it if I can get more stuff translated properly.
 
Thanks for taking an interest. Yes, accents can cause problems even if they're in the English, though usually (at least as far as I know) only if they come immediately before the bracket that starts the closing tag.

Since posting the original message I've discovered that it's not just accents that cause the problem. There are other special characters that cause the same problem, including the ellipsis (when you represent "..." as one character rather than as three periods) and quotation marks (the fancy opening and closing ones, not the one you get by hitting the quotation mark button on your keyboard).

The problem with these special characters is they're not listed in the URL I posted, which happens to be the list for HTML 2.0. Codes for these characters were added for HTML 4.01, and presumably those codes can be used in the XML files. But I haven't yet tested to make sure that the game will correctly display them.
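For reference, the numeric codes for these characters can be worked out with a few lines of Python. The sketch below assumes the files were saved in Windows-1252, where the ellipsis and the fancy quotes are single bytes; `describe` is a made-up helper name. Numeric codes are used because XML itself predefines only five named entities (lt, gt, amp, apos, and quot). Whether the game displays the results correctly is still untested.

```python
def describe(char: str) -> tuple:
    """Return (byte value in Windows-1252, numeric character reference)."""
    cp1252_byte = char.encode("cp1252")[0]
    ref = char.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
    return cp1252_byte, ref


for name, char in [
    ("ellipsis", "\u2026"),
    ("opening double quote", "\u201c"),
    ("closing double quote", "\u201d"),
]:
    byte, ref = describe(char)
    print(f"{name}: byte 0x{byte:02X} -> {ref}")
```

All three bytes (0x85, 0x93, 0x94) fall in the 0x81-0x9F lead-byte range of the Japanese code page, which would be consistent with their causing the same trouble as the accents.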
 
You're talking about files added by modders (not the files originally included in the commercial version of CIV4), right?

I'm glad you posted that link as I ran into some problems getting the correct umlauts to display for historical German squadron names and certain city names in my mod. The files affected would be the unitInfos and map xml files in which I've added such names. These are the ones you are concerned about right?

I haven't included any other foreign text translations (yet...)
 
I was surprised to see that in the BTS XML files, special characters aren't encoded in Unicode, but in ISO 8859-1 (at least in the German release). Was this corrected for the Japanese release of the expansion?

Have you tried fiddling with the computer's settings? There is an option to choose which encoding non-Unicode programs should use by default, and the standard used by Civ4 is Western European, i.e. ISO 8859-1.

The option can be found in 'Control Panel - Regional and Language Options - Advanced' (translation to Japanese is left as an exercise for the reader). I don't know if the character encoding is deselected (search for it in the 'Code page conversion tables') or if it is simply the case that a Japanese one is used by default (changing the default language to English and then restarting might help). If this doesn't work or isn't an option for you, I found an official tool to circumvent the system's standard encoding for certain programs:

http://www.microsoft.com/globaldev/tools/apploc.mspx

I haven't tested it myself, but it sounds as if this could solve your problem.


These are some workarounds that should work, but of course they do not invalidate your point that Unicode should be preferred wherever possible.
 
You're talking about files added by modders (not the files originally included in the commercial version of CIV4), right?

Yes, I'm talking about files added by modders. Although as a matter of fact, some of the files originally included with the game also had this problem. Not just mods (Greek World, Chinese Unification, Afterworld, Next War, and RFC, to be precise), but even some of the XML files for unmodded Warlords and BtS caused the XML parser to choke on startup.

But I'm speaking to modders here, because I know there's very little chance Firaxis will release a patch at this point. ;)

I'm glad you posted that link as I ran into some problems getting the correct umlauts to display for historical German squadron names and certain city names in my mod. The files affected would be the unitInfos and map xml files in which I've added such names. These are the ones you are concerned about right?

Right. Any XML file containing text to be displayed by the game could be affected. I hope the character codes work for you.

I was surprised to see that in the BTS XML files, special characters aren't encoded in Unicode, but in ISO 8859-1 (at least in the German release). Was this corrected for the Japanese release of the expansion?

I can't speak for the Japanese release. I'm playing the Civ4 Complete package (I think it's a UK release); I just happen to be playing it on a Japanese OS.

Have you tried fiddling with the computer's settings? There is an option to choose which encoding non-Unicode programs should use by default, and the standard used by Civ4 is Western European, i.e. ISO 8859-1.

I know about that setting, but I'm hesitant to change it as Civ4 isn't the only thing I do with my computer (though it'd be nice if it was). In most cases I'd like to use Japanese as the default. And I'm not sure that changing that setting will affect how the XML parser operates. It'd be interesting to test it; maybe I'll do so if I find a moment.

Even if that could fix the problem for me, I'm also concerned about other users of two-byte systems who might not find their way to this thread. As a matter of principle, it would be best if the character codes were used in the XML files to ensure compatibility everywhere.

Oddly enough, I noticed that there were indeed some character codes in some of the XML files (both those that shipped with the game and those included with mods). Even some of the files that caused the XML parser to choke used character codes in some places; they just weren't consistent. So either someone was aware that accents could cause problems with parsing, or they used an XML editor that automatically replaced the accented characters with character codes (and later changes were made with a different editor or something).
 