Translator(s) needed

Looks like it is supposed to, except at the top, but that is a simple copy and paste.
Other thing is, when doing something like this put it in CODE wrap
Code:
(see attached, just click on it) and thx for trying, everything helps no matter how big or small.:)

I tell you what, i will make a Text file just for YOU for Russian ok: (See attached 2)

OK look at attached 2 it said some will be lost because of the UTF coding??
You specified the wrong encoding by using the header with the Latin1 encoding. UTF-8 needs to be used here.
But from some tests I ran today, it seems the XML parser used by Civ ignores the encoding tag and does not properly read the text into the Unicode strings used internally.
I am investigating whether I can get it to yield the unchanged raw UTF-8 encoded text and then convert it myself to the internal wide-character Unicode.

Edit: Success, reading in the raw string as a narrow string and then calling the Windows API to convert it seems to work. But I still need to write that code properly and investigate how far that affects the existing texts (I guess they will all need to be converted to UTF-8 wherever ä, ö or similar are used directly).
 
I thought something was wrong. But the coding changed it to work that way, I bet A LOT of the TXT is incorrect then.
 
Latin1, aka ISO-8859-1, is an 8-bit character set that contains most of the characters used by Western European languages, and all of the translations we currently have use only those characters. Most texts even use only the 7-bit subset that is equivalent to ASCII and use HTML encoding for the other characters. Setting those to ISO-8859-1 is fine (and it seems the XML parser in Civ ignores that encoding setting anyway and always reads with that encoding).

With the new code I am adding, the texts will be read with that parser but afterwards reinterpreted as UTF-8. For all texts that only use 7-bit ASCII that will work fine, as UTF-8 is compatible with ASCII. But all text files that use äöü or similar directly instead of the HTML encoding (&#234; or similar) will need to be loaded and saved as UTF-8 (without BOM) with a good editor like Notepad++.
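
To make the three cases above concrete, here is a minimal Python sketch that sorts a file's raw bytes into "fine as is" or "needs re-saving". It is only an illustration: the function name `classify_text_bytes` and its return labels are made up here and are not part of the mod's code.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def classify_text_bytes(raw):
    """Classify raw file bytes for a loader that reinterprets them as UTF-8.

    Hypothetical helper for illustration only.
    """
    if raw.startswith(UTF8_BOM):
        # UTF-8 with a byte order mark: should be re-saved without BOM.
        return "has-bom"
    try:
        raw.decode("ascii")
        # Pure 7-bit ASCII (including HTML-style codes such as &#228;)
        # is byte-for-byte identical in Latin-1 and UTF-8.
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Typically a Latin-1 file with raw umlauts: re-save as UTF-8.
        return "needs-resave"


print(classify_text_bytes("Käse".encode("latin-1")))
```

A lone Latin-1 `ä` is the byte 0xE4, which UTF-8 treats as the start of a three-byte sequence, so such files usually fail the UTF-8 check rather than slipping through silently.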
 
there are 17456 lines of text, then.
my modding team may try to help by providing a Polish translation, if we get access to SVN :).

we recently translated a mod with about 60 000 lines of text with only a bare knowledge of the language (the mod was Russian, most of us didn't know Russian), so this would be much easier. still, it took us about 2 years, so I wouldn't expect us to be done in 2-3 months, but rather longer.
 
That's the great thing AIAndy did with the NEW text coding. Even if you only translate a few entries here and there, it will still work: the default is English of course, but if you add, for instance, Polish to some of the text, then those entries will show up in Polish. At least that's what I understand.
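
As a rough illustration of that fallback (a sketch only; `lookup_text`, the dictionary layout and the sample entries are made up here and are not the actual C2C code):

```python
def lookup_text(entry, language, default="English"):
    """Return the translation for `language`, or the default-language text.

    `entry` is a hypothetical dict mapping language names to strings.
    """
    translated = entry.get(language)
    if translated:
        return translated
    # Missing or empty translation: fall back to the default language.
    return entry[default]


# Hypothetical sample entries, one translated and one not.
warrior = {"English": "Warrior", "Polish": "Wojownik"}
archer = {"English": "Archer"}

print(lookup_text(warrior, "Polish"))
print(lookup_text(archer, "Polish"))
```

With a lookup like this, a partly translated language still gives a usable game, which is why small contributions are worth adding.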

As far as SVN authorization goes, I'd like to see quite a few text documents before that. I just want to be careful is all, ok. :)
 
Well, it seems I was too early in calling it a success. The text is properly read in and passed to the display functions, but it is just displayed as invisible text. That may be an issue with the engine itself and the way it messes with the fonts to get symbols to display. It is also possible that the Russian version of Civ4 does not have that problem.

I will commit the code that actually reads it in properly to the SVN, and then someone with the Russian Civ4 can test whether the display works.

The old translation mods just seem to map additional characters into the 1-byte code range and use a different font, which is not something we can do for C2C.
An option would be that someone adds the Cyrillic characters to the game font TGAs; then I could map the standard Unicode range for those characters to the positions in which Civ inserts the symbols, and they would probably be displayed fine.

EDIT: I found a program for rendering TrueType fonts as font TGAs, so it is probably not that difficult to include the Cyrillic letters in the game font TGA.
 
is it possible to change the DLL to support Unicode fonts? (any font format other than the standard *.fnt, i.e. ttf)... well, it seems the default font is one of them, so... is it possible for the DLL to make the game use any encoding, or is it an engine-only matter?

if you want a TGA font with all Latin+ and all Cyrillic characters, it would be more than 256 glyphs for sure, so isn't it better to use UTF-8 than to create our own encoding? (does TGA support basic compression? a TGA font with the full Unicode spectrum would be a very big image; at least the empty characters could be heavily compressed as single-color parts of the image) i still think it would be better to use a ttf font and just stick to the Unicode standard (as we use UTF-8, not some ANSI).

anyway, a TGA font is also an option, but it would be a problem to modify it if some characters weren't included (i.e. if we want Cyrillic, do we include only Russian, or only Bulgarian, or the full Cyrillic character set? same for Latin+. also, do we want an Arabic translation?). we may use a fully Unicode-compliant TGA font too, but it would look very funny :P. if we stick to ttf, we just need to provide any font instead of sylfaen and use it (most fonts have full Latin+, Cyrillic and a bit more, and some of them are royalty-free)
 
we just need to provide any font instead of sylfaen and use it
This does not work for solving the utf-8 problem.

You can already just take any .ttf font you like and play your civ with that font. The replacement is in the theme files in bts, so font replacement is easily possible, it just does not help with the utf-8 topic. (am playing civ4c2c under linux with font substitution to 'open sans', just because it looks better)

Oh, and as you mention Arabic... for that, bidirectional text must also be supported, as well as the correct combining of characters. So that's yet another step further.
 
edit: @aiandy: have some problems following your thoughts :) do I understand you correctly that you are thinking of treating text like "blahblah[small-polish-stroked-L]blah" the way it's done with [ICON_UHAPPY], i.e. getting this sign out of the .tga ?
 
Let me give a longer explanation and start with how text is treated in Civ4.

Internally Civ4 uses unicode wide characters so in principle unicode is already supported here.
The first step is that it is read from the XML. That is done by an MS parser that is included via the exe. The first obstacle here was that any encoding setting in the XML was ignored. Instead it just reads the text as Latin-1 and the wide character string into which you read has 0 as every second byte. I solved that now by reading into a single byte string and then interpreting that as UTF-8 afterwards, writing the result into a wide character string.
That unicode string is then stored in the EXE part of the code. When you retrieve it later it is processed first and some text is replaced by parameters or symbols that are mapped into high unicode numbers. At that point any unicode you put in earlier gets back out fine and you get the symbols as unicode in addition with some replacement routines in the DLL that are called.
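
The read-in fix described above can be sketched in a few lines of Python (the real code lives in the C++ DLL and uses the Windows API; `reinterpret_as_utf8` is a made-up name used only for this illustration):

```python
def reinterpret_as_utf8(misread):
    """Undo a Latin-1 misread of text that was really UTF-8.

    In the misread string every character holds exactly one original byte
    (code points 0..255), so encoding it back as Latin-1 recovers the raw
    bytes, which can then be decoded as UTF-8.
    """
    raw = misread.encode("latin-1")
    return raw.decode("utf-8")


original = "Привет"
# Simulate what a parser does when it ignores the encoding tag and
# treats every byte as one Latin-1 character.
misread = original.encode("utf-8").decode("latin-1")

print(len(original), len(misread))
print(reinterpret_as_utf8(misread) == original)
```

Each Cyrillic letter takes two bytes in UTF-8, so the misread string is twice as long as the original until it is reinterpreted.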

So up to that point all is fine.
Now you pass it into Python which gets it as unicode string so still all is fine. But when you pass it to one of the widgets that are in the exe now, then it seems like any wide character that has 0 as first byte is retrieved from the TTF while e.g. the cyrillic unicode only appears as blank. Any of the symbols that are mapped from the TGA font into unicode are displayed fine though.

So my suggested solution now is to add the cyrillic characters (or any others we need) to the TGA font and when the text is read from the XML change anything that refers to the cyrillic unicode to refer to the respective characters in the TGA instead.
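
A minimal sketch of that remapping idea, written in Python for brevity. The base position `TGA_CYRILLIC_BASE` is purely an assumption for illustration; the real positions depend on where the glyphs end up in the game font TGA and on how Civ numbers its symbols. The range below covers only the basic Russian letters А..я (U+0410 to U+044F), so Ё/ё and other Cyrillic letters would need extra entries.

```python
# Hypothetical first position of the Cyrillic glyphs in the symbol range.
TGA_CYRILLIC_BASE = 0xE000
CYRILLIC_FIRST = 0x0410  # А
CYRILLIC_LAST = 0x044F   # я


def remap_to_tga(text):
    """Replace basic Cyrillic code points by assumed TGA glyph positions."""
    out = []
    for ch in text:
        cp = ord(ch)
        if CYRILLIC_FIRST <= cp <= CYRILLIC_LAST:
            out.append(chr(TGA_CYRILLIC_BASE + (cp - CYRILLIC_FIRST)))
        else:
            # Latin-1 and everything else is left for the normal font path.
            out.append(ch)
    return "".join(out)


print([hex(ord(c)) for c in remap_to_tga("Aб")])
```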
 
But when you pass it to one of the widgets that are in the exe now, then it seems like any wide character that has 0 as first byte is retrieved from the TTF while e.g. the cyrillic unicode only appears as blank.
:rolleyes: sounds like a quick and dirty solution from them *chuckle*
Are there parts of the game which are un-affected by this behaviour, or does it all go through those widgets?

Thanks a lot for the explanation, Andy, now I understand it a bit better - and also why it maybe could work with the russian civ.
 
so all ASCII characters would come from the sylfaen font and all non-ASCII characters from the TGA symbol font?
how many characters would need to be included, and wouldn't that be a lot of code? I propose one code for all symbols, using the Unicode glyph number.
 
so all ASCII characters would come from the sylfaen font and all non-ASCII characters from the TGA symbol font?
Actually all latin-1 characters, not only the ASCII ones.

how many characters would need to be included, and wouldn't that be a lot of code? I propose one code for all symbols, using the Unicode glyph number.
Not sure what you mean with code here.
All characters that are not in Latin-1 but needed for the translations would need to be added to the TGA.
The input text in the text XML would use normal UTF-8 and the mapping would then happen in the XML reading code.
 
now this latest change did mess up quite a lot of the German entries. The perfect challenge to finally get all the files done the right way :). As soon as I have finished adapting and programming my little editor, this will proceed much faster than until now.

Still, it shows another issue with the Civpedia: it should sort the entries differently. The current sorting shows its weakness: it just takes the first letter of each entry and builds the index out of these letters.

Better (especially for German) would be:
- ignoring case, so sorting "g" together with "G" and so on (aside from the fact that lowercase initial letters should not occur at all)
- ignoring the umlauts Ää / Öö / Üü, treating these letters just like Aa / Oo / Uu, as is common in dictionaries too (the ß / & szlig; would count as "ss").

At the moment, the additional index page entries clutter the selection, as can be seen in the screenshot.
 

Attachments

  • Civpedia-Index-German.jpg
now this latest change did mess up quite a lot of the German entries. The perfect challenge to finally get all the files done the right way :). As soon as I have finished adapting and programming my little editor, this will proceed much faster than until now.
It seems like my previous assumption was not entirely correct. Using ä and the like directly and then storing it as UTF-8 works but the usage of ä or the equivalent HTML code without storing it as UTF-8 doesn't. Neither does the HTML code as UTF-8.
So I guess I need to change that code some more to keep backward compatibility.
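
One possible shape for such a backward compatible read, sketched in Python under the assumption that old texts arrive as Latin-1 bytes (for example after the parser has already resolved an HTML code to a single byte). `decode_text` is a hypothetical name, and this is not necessarily how the DLL ends up doing it:

```python
def decode_text(raw):
    """Decode raw text bytes as UTF-8, falling back to Latin-1.

    Hypothetical helper: new-style files are valid UTF-8, old-style files
    with raw Latin-1 bytes fail strict UTF-8 decoding and take the fallback.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        return raw.decode("latin-1")


print(decode_text("Käse".encode("utf-8")))
print(decode_text("Käse".encode("latin-1")))
```

The fallback is a heuristic: a few Latin-1 byte pairs (such as the bytes for "Ã¤") happen to be valid UTF-8 and would be read as UTF-8, which is rare in real text but worth keeping in mind.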

Still, it shows another issue with the Civpedia: it should sort the entries differently. The current sorting shows its weakness: it just takes the first letter of each entry and builds the index out of these letters.

Better (especially for German) would be:
- ignoring case, so sorting "g" together with "G" and so on (aside from the fact that lowercase initial letters should not occur at all)
- ignoring the umlauts Ää / Öö / Üü, treating these letters just like Aa / Oo / Uu, as is common in dictionaries too (the ß / & szlig; would count as "ss").

At the moment, the additional index page entries clutter the selection, as can be seen in the screenshot.
The Pedia currently uses the standard Python sort/compare which sorts by byte value and not by the usual dictionary ways.
Also some of the pedia parts don't capitalize the first letter properly.
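
A sort key along the lines requested could look like the sketch below. It assumes a modern Python 3 with `str.casefold` and `unicodedata`; the Python bundled with the game is much older, so this only shows the idea and is not drop-in pedia code. `pedia_sort_key` is a made-up name.

```python
import unicodedata


def pedia_sort_key(name):
    """Dictionary-style key: ignore case, umlauts and accents; ß sorts as ss."""
    folded = name.casefold()  # lowercases and turns ß into ss
    decomposed = unicodedata.normalize("NFKD", folded)  # ä -> a + combining dots
    # Drop the combining marks so that ä sorts like a, é like e, and so on.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


entries = ["Zug", "Ägypten", "Apfel"]
print(sorted(entries))                      # plain code point order
print(sorted(entries, key=pedia_sort_key))  # dictionary-style order
```

The index letters could then be taken from the first character of the key (uppercased) instead of the first character of the raw entry, which would also merge the È/É style entries of other languages.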
 
I am still working on a sort of converting tool (integrated in my translation GUI), just need a bit more time until it works, then we will get rid of those oddities anyway. So don't worry about backwards compatibility for German, it's better to have a clean source anyway.
If I am too slow, it could always be done with search/replace to the "right" letter version.

I know the civpedia works this way, but is there a possibility to change it in Python? the lowercase letters should not be a problem, as there should be no lowercase letters at the beginning; it's just about treating Ää/Öö/Üü like AaOoUu, and maybe the ß as ss too.
I think for the other languages, È and É and similar are sorted the same way, so this would be a bit more complex *scratchhead*
 
I am still working on a sort of converting tool (integrated in my translation GUI), just need a bit more time until it works, then we will get rid of those oddities anyway. So don't worry about backwards compatibility for German, it's better to have a clean source anyway.
If I am too slow, it could always be done with search/replace to the "right" letter version.
While that would be possible, I would prefer a backward compatible way so we don't have to change all of it and especially text we add from other mod comps will still work.

I know the civpedia works this way, but is there a possibility to change it in Python? the lowercase letters should not be a problem, as there should be no lowercase letters at the beginning; it's just about treating Ää/Öö/Üü like AaOoUu, and maybe the ß as ss too.
I think for the other languages, È and É and similar are sorted the same way, so this would be a bit more complex *scratchhead*
There are special comparison functions that I think rely on the OS for that kind of thing.
 
Which coding version would you prefer in the raw texts, AIAndy?
Ä or & #196; (without the space char of course) or & Auml; ?

it's just because I ran into some conversion issues and want to "do it right", so I made an array of all non-[A-Za-z0-9] chars currently in use. Aside from ISO-8859-1 substitutions I had some strange matches, which are easy to deal with:
"’" - & #8217;
"…" - & #8230;
"′" - & #8242;
"×" - & #x0D7;
"ā" - & #x101;
"–" - & #x2013;
"É" - & #xC9;
 
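
For checking a list like the one above, Python's standard `html.unescape` resolves both decimal and hexadecimal character references, and the reverse direction is a one-liner. This is only a sketch for a converter tool; `escape_non_ascii` is a made-up helper name and the references are written here without the extra space.

```python
import html


def escape_non_ascii(text):
    """Write every non-ASCII character as a decimal character reference."""
    return "".join(ch if ord(ch) < 128 else "&#%d;" % ord(ch) for ch in text)


references = ["&#8217;", "&#8230;", "&#8242;", "&#x0D7;",
              "&#x101;", "&#x2013;", "&#xC9;"]
for ref in references:
    # Print each reference next to the character it stands for.
    print(ref, html.unescape(ref))

print(escape_non_ascii("Ägypten"))
```

Text escaped this way is pure ASCII, so it reads the same whether the file is treated as Latin-1 or as UTF-8.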