[Dev tool] Civilization 4 XML translation tool

dbkblk


This tool is intended to be used by developers

DESCRIPTION
This tool allows modders to easily sort and clean XML text files in Civ 4. It can:
- Export language tags to UTF-8 (for import into a translation platform)
- Import the exported files back
- Sort all tags by category
- Clean XML files of redundant tags (language values identical to English, empty values, etc.)
- Remove a specific language from files
- Find unused text tags
- Provide semi-automatic support for encoding / decoding non-latin1 languages

By extension, it allows you to:
- Easily merge ANY XML text file
- Easily merge or extract new languages from XML
- Merge the base game text into the mod game text
- Convert UTF-8 Russian to Civ 4 format

It must be copied into the same folder as the XML text files.
NOTES:
- The sorting function needs the "_xml_parser.config" file. It is an XML file with a different extension. See readme.
- To support semi-automatic encoding/decoding for non-latin1 languages, please set up the file "_xml_charsets.config". See readme.
- It also works with Colonization.

More details on how the functions work are available here:
https://github.com/dbkblk/civ4_xml_parser/blob/master/README.md

CHANGELOG
v1.0
- Find unused tags (needs the game path and the source DLL path). See readme.
- Support for charset conversion tables (useful for any non-latin1 language).

Bugfixes:
- Updating subtags created a new tag instead.
- Tag list is now case-insensitive.
- Fixed exporting function.

v0.9
- Remove a specific language
- Find unused tags in files (still experimental)
- Compiled with MSVC2010 (instead of MSVC2013) and Qt 5.3.1 static, so it is usable on XP and Linux (Wine)
- Config file renamed from "_categories.parse" to "_xml_parser.config"

v0.8
- Advanced import / export (convert all buggy characters to Civ 4 format)
- Clean files from non-useful tags
- Sort tags by categories
- Support Cyrillic language

REQUIREMENTS
-> To run the tool, you will need the MSVC 2010 Redistributable: Download here (if you already have the latest version of the launcher, you should be fine)
-> Compiled with MSVC2010 & Qt 5.3.1 static (Qt Creator)

https://github.com/dbkblk/civ4_xml_parser/releases/download/v1.0/civ4_xml_parser_v1.0.7z

Readme and source code:
GITHUB LINK
 
Updated to 0.8
- Advanced import / export (convert all buggy characters to Civ 4 format)
- Clean files from non-useful tags
- Sort tags by categories
- Support Cyrillic language

Note for devs: I've finally managed to reduce the computing time of the sorting function from 7 hours to... 5min :D
 
This appears to be a great tool, but I'm having problems running it :(

First I had an error about not being a win32 application. I "solved" this by switching to the 64 bit system, though I would like the 32 bit system to handle this as well.

The 64 bit system complains that MSVCP120.dll is missing.

Next I'm looking into compiling it myself. Without any project file or makefile, it took a bit of trial and error. It looks like it needs Qt 5 and it doesn't like my Qt 4, so I guess I will have to replace that one. I decided to give feedback on this even though I intend to keep trying to compile it myself. After all, if I fail to run the exe, other people will fail as well.

Apart from the setbacks, I totally love what this tool claims to be able to do. I mod the Colonization mod Medieval Conquest, and the other day a Russian showed up who wants to translate it to Russian. Great, but it means that I had to start researching how to use Cyrillic in-game. I see that it requires a modified GameFont.tga. However, I haven't seen anywhere that says how it should be modified. It would be very nice if this thread could explain that. Even better if there is a working one from which a part could be copied into another one.

-> Detect unused tag based on python and xml files in «Assets»
That approach will report tags as unused if they are only used by the DLL itself. While searching through the source files would likely be the most correct, it would lead to problems as we don't know where the source files are relative to Assets.

Plan B: We know where the compiled DLL is. The DLL has the strings compiled into it, meaning they will be available in plain text if you open it with something like a hex editor. If the translation tool opens the DLL as a char array, it should be able to move through it and locate all strings used in the DLL source code.
 
First I had an error about not being a win32 application. I "solved" this by switching to the 64 bit system, though I would like the 32 bit system to handle this as well.
The 64 bit system complains that MSVCP120.dll is missing.

I'm working with Qt 5 compiled statically with MSVC2013, so you need to install the MSVC 2013 Redistributable (I forgot to mention the requirements). Note that I use VirtualBox to test my programs on 32-bit Windows 7 while I work on 64-bit Windows 8.1.

You can download it here (that's the version the installer downloads)

Next I'm looking into compiling it myself. Without any project file or makefile, it took a bit of trial and error. It looks like it needs Qt 5 and it doesn't like my Qt 4, so I guess I will have to replace that one. I decided to give feedback on this even though I intend to keep trying to compile it myself. After all, if I fail to run the exe, other people will fail as well.
It is compiled with Qt 5.3.0 and I'm pretty sure I use some Qt 5 exclusive functions, so you will have difficulties.


Apart from the setbacks, I totally love what this tool claims to be able to do. I mod the Colonization mod Medieval Conquest, and the other day a Russian showed up who wants to translate it to Russian. Great, but it means that I had to start researching how to use Cyrillic in-game. I see that it requires a modified GameFont.tga. However, I haven't seen anywhere that says how it should be modified. It would be very nice if this thread could explain that. Even better if there is a working one from which a part could be copied into another one.

I'm currently writing a batch script to easily switch between Russian and Latin. This is temporary because I will include that directly in the launcher later. Watch the next commits of the mod :)


That approach will report tags as unused if they are only used by the DLL itself. While searching through the source files would likely be the most correct, it would lead to problems as we don't know where the source files are relative to Assets.

Plan B: We know where the compiled DLL is. The DLL has the strings compiled into it, meaning they will be available in plain text if you open it with something like a hex editor. If the translation tool opens the DLL as a char array, it should be able to move through it and locate all strings used in the DLL source code.

Thank you for pointing me to this. I'm not sure I can code this easily because this function has to:
- Look in the mod files
- Look in the base game files
- Look in the source code.
This sounds pretty difficult, but it could be very rewarding as it would improve mod loading. I will focus on the launcher for now, but I will come back to this later.

Thanks again for the feedback :)
 
I made a proof of concept to pull string names from the compiled DLL. It goes through the DLL byte by byte and whenever it encounters TXT_KEY_, it prints that string to stdout. I admit it isn't the prettiest code that I have written, but it gets the job done, and I didn't bother to clean it up as it will have to be modified anyway if it is used for anything.

The good news is that it appears to work just fine. I get a whole lot of strings mentioned by the DLL source code without actually looking through all the source code itself. I get 1033 strings and I didn't check if they were all correct, but they look like it. Also I didn't check for duplicates.
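
For readers without the attachment, here is a minimal sketch of the same idea in standard C++. It is not the attached read.cpp: the function names `extractTags` and `readBinaryFile` are made up for illustration, and it assumes tag names consist only of uppercase letters, digits and underscores. It only finds narrow (one byte per character) strings, so anything stored as a wide string would be missed.

```cpp
#include <fstream>
#include <iterator>
#include <set>
#include <string>

// Collect every run of tag characters that starts with "TXT_KEY_" in a byte buffer.
// A std::set removes duplicates and keeps the result sorted.
std::set<std::string> extractTags(const std::string& bytes) {
    const std::string prefix = "TXT_KEY_";
    std::set<std::string> tags;
    std::size_t pos = 0;
    while ((pos = bytes.find(prefix, pos)) != std::string::npos) {
        std::size_t end = pos + prefix.size();
        // Assumption: a tag name only contains A-Z, 0-9 and '_'.
        while (end < bytes.size()) {
            const char c = bytes[end];
            const bool tagChar = (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || c == '_';
            if (!tagChar) break;
            ++end;
        }
        tags.insert(bytes.substr(pos, end - pos));
        pos = end;  // continue searching after this tag
    }
    return tags;
}

// Read a whole file (for example the compiled DLL) as raw bytes.
// Returns an empty string if the file cannot be opened.
std::string readBinaryFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Calling `extractTags(readBinaryFile(dllPath))` and printing each element gives one tag per line without duplicates. Format strings such as `TXT_KEY_SOMETHING_%d` come out truncated at the `%`, so they show up as prefixes rather than complete tags.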
 

Attachments

  • read.cpp.txt
    1 KB · Views: 277
Thank you for the code! I'm not used to coding in standard C++ as I started 3 months ago directly with Qt.
Do you think there are any base game strings inside .py/.xml files that are used outside the DLL? I mean, do I need to check the base game files + DLL + mod files, or just DLL + mod files?
 
If the goal is to make this tool work with every single mod, then the base game would have to be checked too. The game reads all vanilla files and then overrides them with the files from the mod. This means the game will use the vanilla file for anything not mentioned in the mod, and in reality most mods are a mix. Imagine a mod which only modifies units. The Python file with the strings for the main menu will not be in the mod, and hence the vanilla Python files will determine how the main menu is displayed.
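
A rough sketch of that override rule, assuming files are keyed by their path relative to Assets. The `FileSet` type and the `effectiveFiles` name are invented for this example, and it ignores modules and any per-file merging the game may do.

```cpp
#include <map>
#include <string>

// Key: file path relative to Assets. Value: full path of the copy to load.
using FileSet = std::map<std::string, std::string>;

// Start from the vanilla files, then let every mod file replace
// the vanilla file with the same relative path.
FileSet effectiveFiles(const FileSet& vanilla, const FileSet& mod) {
    FileSet result = vanilla;
    for (const auto& entry : mod) {
        result[entry.first] = entry.second;
    }
    return result;
}
```

A tool that wants to know which Python and XML files can reference a tag would then scan the values of the returned map rather than the mod folder alone.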

I don't know about modules inside the mod. I never used those.
 
Yes, my goal is to provide a generic tool for Civ4. I will try to find a way to do that.
I will try to work on it next week!

The parser needs to know where the game files are located. I'll try a registry approach like I did in the launcher; then, if the path is wrong, just ask the user for the path (it's a bit tedious for a console tool).
 
Automatically finding the vanilla files appears to be quite tricky. I see a number of potential problems:
  • the registry approach could fail
  • the user could have installed in a non-default location (looking in all known default locations isn't foolproof)
  • it could be a Warlords mod
  • it could be Colonization (this is my goal, remember ;))
  • the command line is annoying to use for paths
I think the easiest solution is to require a txt file inside Assets which contains the vanilla path. Sure, it's a bit of extra user setup, but automatic detection appears to be time-consuming and possibly unreliable.

Once reading from the txt file works, it can be expanded to make a guess at the path if the path file is missing and then ask the user if that is the right path. As the automated path detection is then an add-on, it isn't critical to get correct detection every time. That means it can try the registry path to BtS, as that will presumably give the correct path in most cases.

Adding an argument to the command line telling which version you intend to use could also be an option.

The path detection system could also be postponed or skipped entirely as it is more of a "nice to have" than a critical component. Acting correctly on the found files seems more important than finding the files automatically if the user can locate them manually.
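
To illustrate the "path file first, guess second" order, here is a small sketch. The function name `resolveGamePath` and the idea of passing the guess in as a plain string are assumptions for this example; where the guess comes from (registry, default locations) is left out.

```cpp
#include <fstream>
#include <string>

// Return the first non-empty line of pathFile.
// If the file is missing or contains no usable line, return fallbackGuess,
// which the caller should confirm with the user before relying on it.
std::string resolveGamePath(const std::string& pathFile,
                            const std::string& fallbackGuess) {
    std::ifstream in(pathFile);
    std::string line;
    while (std::getline(in, line)) {
        // Strip trailing carriage returns, spaces and tabs (Windows line endings).
        while (!line.empty() &&
               (line.back() == '\r' || line.back() == ' ' || line.back() == '\t')) {
            line.pop_back();
        }
        if (!line.empty()) {
            return line;
        }
    }
    return fallbackGuess;
}
```

Because the guess is only a fallback, a wrong guess costs the user one confirmation prompt rather than a failed run.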
 
I think the easiest solution is to require a txt file inside Assets which contains the vanilla path. Sure, it's a bit of extra user setup, but automatic detection appears to be time-consuming and possibly unreliable.

Thank you for the idea! The best option is simply to ask the user to put the path in _categories.parse. No more files, just one extra easy step!

I'm still working on Russian and Polish at the moment. I guess my work will interest you as it is also related to the parser.
 
@nightinggale: Here is the 0.9 preview version. It adds automatic language tag detection for export and the ability to remove a specific language.

It also adds the "Find unused tags" option, but it is experimental. I used and edited your code to find tags in the DLL. I still have to understand all the details, but I've managed to inject tags into the code to grab the DLL output.
To use it, add the game path to the config file. However, I have a problem: around 2500 tags out of the 20500 in the XML files are not found, and some are really important, like "TXT_KEY_MAIN_MENU_LAUNCH". I wonder whether they aren't in the DLL, aren't being found, or are in another DLL.

The config file has been renamed from "_categories.parse" to "_xml_parser.config", which makes more sense.

The latest code is here.

EDIT: Wait, i've found a bug.
 
Updated to v0.9:

Features:
- Remove a specific language
- Find unused tags in files (still experimental)
- Compiled with MSVC2010 (instead of MSVC2013) and Qt 5.3.1 static
- Config file renamed from "_categories.parse" to "_xml_parser.config"

Requirements:
- MSVC2010 http://www.microsoft.com/en-US/download/details.aspx?id=5555
 
The find unused tags function is quite tricky because some of the tags are built on the fly in the code. I've tried to read the DLL or the source code directly, but the "BUG" tags and the interface tags are written in-game. Basically, I can't just read the code and find unused tags. I'm wondering how to handle that. The only ideas that came to me are an exception list... OR reading the base game text files and using them as exceptions.
 
On the fly, do you mean like this?
Code:
CvWString::format(L"TXT_KEY_HIDDEN_STRING_%d", myInt);
Yeah the code I wrote would detect TXT_KEY_HIDDEN_STRING_ as a tag, but not TXT_KEY_HIDDEN_STRING_0, TXT_KEY_HIDDEN_STRING_1...

The only thing I can think of is to note missing strings mentioned in the DLL. For each of those:
Code:
A is the string from DLL
X = length of string A
for each unused string B do:
  if first X characters in B == A
     store B in a list of possibly used strings
I don't really like that approach, but it is the best I can think of right now.
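
A possible C++ version of that pseudocode, as a sketch only. The name `possiblyUsed` is made up, and the loops are swapped compared to the pseudocode so that each unused tag is reported at most once even if several DLL strings match it.

```cpp
#include <string>
#include <vector>

// dllStrings: strings found in the DLL that match no complete tag,
//             e.g. "TXT_KEY_HIDDEN_STRING_" from a format string.
// unusedTags: tags from the XML files that nothing references directly.
// Returns the unused tags that start with one of the DLL strings.
std::vector<std::string> possiblyUsed(const std::vector<std::string>& dllStrings,
                                      const std::vector<std::string>& unusedTags) {
    std::vector<std::string> result;
    for (const std::string& tag : unusedTags) {
        for (const std::string& prefix : dllStrings) {
            // compare() looks only at the first prefix.size() characters of tag.
            if (tag.compare(0, prefix.size(), prefix) == 0) {
                result.push_back(tag);
                break;  // one match is enough for this tag
            }
        }
    }
    return result;
}
```

An empty DLL string would match every tag, so empty entries should be filtered out before calling this.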
 
Assuming the base game texts aren't duplicated, I've chosen to list the base game texts and automatically add them as duplicates.
 
I tried using this on Colonization vanilla files, but it says "Language not supported" when I try to export. If I export a single language I get French, German, Italian and Spanish. Looks like the unsupported language is English. The file format is the same as in BTS and I see no reason why English should be different from the other languages :confused:
 
@dbkblk: after your update some text has been reverted probably to the original text in BTS. For example when voting for civics resolutions you get the message you're voting for "universal suffrage" or "free speech" which are not actual civics anymore.
 
OK. I've slowed down a bit for a few days... but I will fix that! I still have some bugs to fix.
 
The parser is now updated to v1.0. This is meant to be the final version. New versions will only be bugfixes (except if you convince me to add a new feature ;)).

Changelog:
- Find unused tags (needs the game path and the source DLL path). See readme.
- Support for charset conversion tables (useful for any non-latin1 language). See readme.

Bugfixes:
- Updating subtags created a new tag instead.
- Tag list is now case-insensitive.
- Fixed exporting function.

@Nightinggale: Sorry for the delay, I stopped development for some weeks. This bug is now fixed. While adding the new automatic language detection, I forgot to update the export feature. It was also buggy for Civ 4, but I didn't see that coming.
 
@Nightinggale: Sorry for the delay, I stopped development for some weeks.
I haven't been overly active either. What bugs me most is that I didn't report my findings and plans here :blush:

I'm getting a whole lot closer to making a mod which supports multiple languages with multiple character sets using just one set of files.

- Support for charset conversion tables (useful for any non-latin1 language). See readme.
I have learned that the game isn't using ISO 8859-1 (aka latin1) like most people claim. It is in fact using windows-1250, which is designed to be like latin1. They aren't 100% alike though. Sadly it is more complex than this.

Having the alphabet in GameFont is a bit confusing, because for the most part, sylfaen.ttf is used. In fact, the only place I have found so far where GameFont is used is city billboards. They are size 3, meaning GameFont_75.tga appears to never use the font part.

The game doesn't support Unicode and will use the locale instead. This means the character set used depends on the language settings in Windows. If the computer is set to Russian, the game will use windows-1251 instead of 1250. This means that, say, character 198 will no longer display Ć. Instead it will display Ж.

My progress so far:
  • I expect the XML files to be written in UTF-8
  • strings are read as strings instead of wide strings for 100% UTF-8 compatibility; they are wide strings when given to the game engine
  • the UTF-8 bytes are converted into a single int containing the Unicode ID
  • GetACP() is used to figure out which character set (code page) the game is actively using
  • conversion tables are copied from iconv
  • a conversion table maps from the game font to GameFont IDs
  • the GameFont conversion table setup is automated as much as possible
  • as much code as possible is placed inside a new file for easier code movement between mods
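
The UTF-8 step in the list above (turning the bytes of one character into a single int holding the Unicode ID) could look roughly like this. It is a simplified sketch, not the code from the mod: `decodeUtf8` is a made-up name, malformed input is replaced by U+FFFD, and overlong encodings and surrogates are not rejected.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into a list of Unicode code points.
std::vector<std::uint32_t> decodeUtf8(const std::string& s) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)               { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06)  { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E)  { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E)  { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

        bool ok = (i + extra < s.size());  // enough bytes left?
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;       // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Each resulting code point can then be looked up in the conversion table for whichever code page GetACP() reports.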

I added conversion support for all 1 byte fonts listed on this page: http://msdn.microsoft.com/en-us/goglobal/bb964654
The font is missing characters for the middle east and far east languages. It appears that just removing sylfaen.ttf will force the game to use another font, which looks worse, but it has support for more languages. I have only looked at this briefly as it isn't the main issue.

I looked into the two-byte fonts, but the exe file is hardcoded to just one byte. I have no idea what that means for supporting languages not listed on that page.


The part where I am right now is the billboards and the table for converting those. There are just two places where strings are given to the billboards and the plan is to insert a function, which takes the string as argument and returns the converted string. The conversion appears to be a bit complex though and I decided to make a table for quick access at runtime.

I want to make a simple interface where people can add characters to GameFont.tga and then state that GameFont ID 8484 is Unicode 1040 (or whatever). Nothing about character sets or anything. The game will figure that out automatically. I think the easiest way to do this is to add a new XML file. That way nothing is hardcoded in the DLL itself and people can add support for a new language without even being able to compile.

Right now that interface is missing and the only source for GameFont IDs is how they are in vanilla, which is windows-1250. The best example of this is likely the Unicode character 8364 (€). Windows-1250 has that one at 128, meaning GameFont ID 128 displays €. Windows-1251 has that one at 136. This means that when using 1251, whenever 136 is printed on a billboard, GameFont ID 128 is used. There is no special code for either win-1251 or €. It is simply a matter of the code detecting that it is the same Unicode character.
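
The € example can be sketched as a two-step lookup: byte in the active code page to Unicode, then Unicode to GameFont ID. Everything below is illustrative: the type and function names are invented, and the tables are tiny excerpts containing only the values mentioned in this thread, not complete code pages.

```cpp
#include <map>

// One code page: byte value -> Unicode code point.
using CodePage = std::map<unsigned, unsigned>;
// Reverse lookup for the font: Unicode code point -> GameFont ID.
using FontIndex = std::map<unsigned, unsigned>;

// Invert the layout the font follows so it can be searched by Unicode value.
FontIndex buildFontIndex(const CodePage& fontLayout) {
    FontIndex index;
    for (const auto& entry : fontLayout) {
        index[entry.second] = entry.first;
    }
    return index;
}

// Translate a byte from the active code page into a GameFont ID.
// Bytes with no known mapping are returned unchanged.
unsigned toGameFontId(unsigned byte, const CodePage& active, const FontIndex& font) {
    const auto cp = active.find(byte);
    if (cp == active.end()) return byte;
    const auto id = font.find(cp->second);
    return id == font.end() ? byte : id->second;
}

// Excerpts only: the euro sign in the vanilla font layout and in windows-1251,
// plus the windows-1251 position of Ж (Unicode 1046).
CodePage vanillaLayoutExcerpt() { return {{128, 8364}}; }
CodePage windows1251Excerpt()   { return {{136, 8364}, {198, 1046}}; }
```

With these excerpts, `toGameFontId(136, windows1251Excerpt(), buildFontIndex(vanillaLayoutExcerpt()))` yields 128, matching the description above. Adding a character to GameFont.tga would then only require one more entry in the font index.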

My plan is to eventually release a modcomp for Civ4BTS containing all this as well as other GameFont features, such as an ingame window to get the GameFont IDs.

EDIT:
I just realized GameFont is using windows-1252, not 1250. However, my computer insists on using 1250, which appears to be fine too for some languages. English is fully supported by all character sets, even Greek and Hebrew.

However, it does reveal that I made a bug in the conversion (no big surprise that I didn't catch it as it is mostly untested right now). I need to fix that, but more importantly I need to figure out a way to get the game to use the correct character set. I have already added an XML file where people can add new languages without recompiling. I now want to extend that one to set which character set each translation uses. Then, on startup or when the language is changed, the game should force itself to use the character set specified in XML. That way we can not only be sure that the correct character set is used, but also allow all translations on all computers. Currently Russian is only available if Windows uses windows-1251.

I will also have to improve my debug display. It didn't catch that the font and GameFont.tga are out of sync :(

I would actually consider it a vanilla bug that the game expects windows to use windows-1252. It's a really common bug from before unicode was widely used and one of the main reasons why programmers are told to use unicode today.
 