XML text cleaner

j_mie6 · May 12, 2012

Well yes, but then the program would complain. That accent is still in the XML and the program can't load the file. The only way around it is to ignore the file or to remove that accent.

Still working on fixing all this stuff, btw.

dacubz145 · May 12, 2012

I would just remove it then, I don't see why it matters. If the pedia tag doesn't have the accent, it won't matter as long as the actual text still does. I'd say remove it.

j_mie6 · May 12, 2012

No, that's not the point lol. In your art defines, the actual path name of the Palácio building contains an accent. The program will correct this and your art is now in the wrong place.

The stuff regarding text is all fine, no harm anywhere, but paths need to be accent-free or set to ignore.
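For the path side of this, a minimal Python 3 sketch of stripping accents from a path name might look like the following. The helper name `strip_accents` is made up for illustration and is not part of the cleaner. It only handles characters that decompose into a base letter plus an accent mark, so letters such as ø or ß pass through unchanged.

```python
import unicodedata


def strip_accents(name):
    """Return name with combining accent marks removed (hypothetical helper)."""
    # NFKD splits a letter like 'á' into 'a' followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", name)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Pal\u00e1cio do Planalto"))  # Palacio do Planalto
```

The art file on disk and the entry in the art defines would both need the same renamed path for this to help.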

After noting that the program doesn't fix a file if it can load it, I realised some files with acceptable accents would be left alone. So I made it process all of them, and as it checks each character individually it takes a loooooooooong time. Guess I will leave acceptable accents in.

This setup also means that the program will only run slowly the first time it's used, as after that, conflicting accents should no longer appear in the code.
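The "only fix what fails to load" idea can be sketched roughly like this in Python 3. It assumes the standard library's `xml.etree.ElementTree` as a stand-in for whatever parser the cleaner really uses, and `needs_fixing` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def needs_fixing(raw_bytes):
    """Return True if the parser cannot load this XML (hypothetical helper)."""
    try:
        ET.fromstring(raw_bytes)
    except ET.ParseError:
        # A raw cp1252 accent byte is not valid UTF-8, so the parser gives up here.
        return True
    return False


# An accent stored as a raw cp1252 byte versus the same accent as an HTML code.
print(needs_fixing(b"<Text>h\xe9llo</Text>"))
print(needs_fixing(b"<Text>h&#233;llo</Text>"))
```

Files that already load are skipped, which is why only the first run over a mod should be slow.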

The_J · May 12, 2012

j_mie6 said:

this simple program will convert all the characters

very fast too

Code:

# -*- coding: cp1252 -*-
# Read the whole file; in Python 2 this gives cp1252 bytes, one per character.
with open("Assyria_CIV4GameText.xml", "r") as fileread:
    string = fileread.read()

accentMap = {"À" : "À", "Á" : "Á", "Â" : "Â", "Ã" : "Ã", "Ä" : "Ä", "Å" : "Å", "Æ" : "Æ", "Ç" : "Ç", "È" : "È",\
             "É" : "É", "Ê" : "Ê", "Ë" : "Ë", "Ì" : "Ì", "Í" : "Í", "Î" : "Î", "Ï" : "Ï", "Ð" : "Ð", "Ñ" : "Ñ",\
             "Ò" : "Ò", "Ó" : "Ó", "Ô" : "Ô", "Õ" : "Õ", "Ö" : "Ö", "×" : "×", "Ø" : "Ø", "Ù" : "Ù", "Ú" : "Ú",\
             "Û" : "Û", "Ü" : "Ü", "Ý" : "Ý", "Þ" : "Þ", "ß" : "ß", "à" : "à", "á" : "á", "â" : "â", "ã" : "ã",\
             "ä" : "ä", "å" : "å", "æ" : "æ", "ç" : "ç", "è" : "è", "é" : "é", "ê" : "ê", "ë" : "ë", "ì" : "ì",\
             "í" : "í", "î" : "î", "ï" : "ï", "ð" : "ð", "ñ" : "ñ", "ò" : "ò", "ó" : "ó", "ô" : "ô", "õ" : "õ",\
             "ö" : "ö", "ø" : "ø", "ù" : "ù", "ú" : "ú", "û" : "û", "ü" : "ü", "ý" : "ý", "þ" : "þ", "ÿ" : "ÿ"}

# Swap each mapped character for its HTML notation and leave the rest alone.
nstring = "".join(accentMap.get(char, char) for char in string)

# Write the converted text back over the original file.
with open("Assyria_CIV4GameText.xml", "w") as filewrite:
    filewrite.write(nstring)

now to implement this into my cleaner somehow...

edit: of course the second character in each of those pairs is the html notation but you can't see it

btw might I suggest that the name of the building's art file is changed to not have an á in it: Palácio do Planalto

It's easier to convert these characters into their HTML equivalents ^^.

Code:

def htmlconvert(string):
    # Replace every character above the ASCII range with its numeric HTML code.
    newString = ""
    for char in string:
        if ord(char) > 127:
            newString = newString + '&#' + str(ord(char)) + ';'
        else:
            newString = newString + char
    return newString
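For anyone on Python 3, roughly the same conversion is available through the built-in `xmlcharrefreplace` error handler. This is only a sketch under the assumption that the text has already been decoded to `str`; `html_convert` is a made-up name and not part of the posted code.

```python
def html_convert(text):
    """Replace every non-ASCII character with its numeric HTML code."""
    # 'xmlcharrefreplace' turns each character ASCII cannot encode into &#NNN;.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# The files discussed here are cp1252, so decode the raw bytes first.
raw = b"h\xe9llo"
print(html_convert(raw.decode("cp1252")))  # h&#233;llo
```

No dictionary is involved, so characters nobody thought of in advance are handled as well.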

j_mie6 · May 12, 2012

What are the advantages of this method over mine? I'm guessing it doesn't need a coding declaration?

dacubz145 · May 12, 2012

Ohh, I didn't know it was in my art defines. Yeah, that can be removed no problem, I just need to change it in the building's art defines as well.

EDIT: also, by the looks of it, J's way looks shorter, and you definitely want to make it as fast as possible, but idk Python so I can be wrong.

j_mie6 · May 12, 2012

I am still looking to convert it to C++ though for even more speed, as it does take a long time when going through 355 massive files.

Plus, The_J's code is shorter as it only deals with the string and not the files themselves, but that accounts for only 6 or 7 lines more...
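Before moving to C++, some of the time might be recoverable in Python itself by skipping clean files and replacing all accents in one pass instead of building the string character by character. A rough Python 3 sketch, assuming the files are cp1252 and using the hypothetical name `fix_text`:

```python
import re

# Matches any single character outside the ASCII range.
_NON_ASCII = re.compile(r"[^\x00-\x7f]")


def fix_text(raw, encoding="cp1252"):
    """Return raw with every non-ASCII character replaced by &#NNN; (sketch)."""
    if raw.isascii():
        # Fast path: nothing to fix, so hand the bytes back untouched.
        return raw
    # Note: a byte that is undefined in cp1252 would raise UnicodeDecodeError here.
    text = raw.decode(encoding)
    fixed = _NON_ASCII.sub(lambda m: "&#%d;" % ord(m.group()), text)
    return fixed.encode("ascii")


print(fix_text(b"h\xe9llo"))  # b'h&#233;llo'
```

Whether this gets anywhere near the C++ speed on 355 large files is untested; it only removes the per-character string concatenation.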

The_J · May 12, 2012

Right, right. In my full code, the reading + writing of the file is handled elsewhere.

j_mie6 said:
What are the advantages of this method over mine? I'm guessing it doesn't need a coding declaration?

Think it still needs one.
The advantage is that you don't need a dictionary, and will catch even stuff which you never thought about. No need to care about what characters are in there, you can't even forget one, the code will deal with it, no matter what.

j_mie6 · May 12, 2012

Hmmm ok, though apart from ÷ I think I have all of the HTML special chars in the dictionary.

j_mie6 · May 13, 2012

OK, in the interest of speed: if the program says a file is incompatible, you then click the button on the GUI which fixes all files to use the HTML codes. The way this will hopefully be faster is that the code being run is something different:

Code:

os.system("C:\\Users\\User\\Documents\\Programming\\C++\\fixer.exe")

This runs a C++ application that will fix the files way, way faster. Just gotta wait till my friend gets back from climbing to ask him how I would do the script in C++.

Of course the Translator will automatically run The_J's code when it is translating a mod's text.

j_mie6 · May 13, 2012

OK, so I have C++ code that replaces the accents in a given string with the HTML code... (I can't work out how to use a method similar to The_J's yet, to cast the character without the map.)
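On doing it without the map: for most accents the raw cp1252 byte value already is the number that goes in the HTML code, which is why a plain numeric cast works in The_J's version. In C++ that would typically mean casting the `char` to `unsigned char` before printing it as a number. The Python 3 sketch below, with made-up names, shows where that shortcut holds and where it does not.

```python
def entity_for_byte(byte_value):
    """Return the HTML code for one cp1252 byte (hypothetical helper)."""
    ch = bytes([byte_value]).decode("cp1252")
    return "&#%d;" % ord(ch)


# Collect the bytes whose raw value is NOT the number used in the HTML code.
mismatched = []
for b in range(0x80, 0x100):
    try:
        ch = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        continue  # this byte is undefined in cp1252
    if ord(ch) != b:
        mismatched.append(b)

print(entity_for_byte(0xE9))  # &#233;
print(entity_for_byte(0x8C))  # &#338;
print([hex(b) for b in mismatched])
```

Every mismatch sits between 0x80 and 0x9F, so only characters from that block would still need a lookup.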

just need to work on opening the files etc.

Code:

#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <sstream>
using namespace std;

/*
   Made by J_mie6 with help from The_J
   This program is designed to be run via the XML cleaner to fix incompatible files;
   Originally made in python this program ran very slowly and hopefully this C++ version
   will drastically increase the "fixing" of these files containing characters that the XML
   parser cannot handle (thus stopping the program from working properly!)
*/

// Maps each accented character to the number used in its HTML code.
map<string, int> values;

void setValues()
{
     values["À"] = 192; values["Á"] = 193; values["Â"] = 194; values["Ã"] = 195; values["Ä"] = 196; values["Å"] = 197; values["Æ"] = 198; values["Ç"] = 199; values["È"] = 200;
     values["É"] = 201; values["Ê"] = 202; values["Ë"] = 203; values["Ì"] = 204; values["Í"] = 205; values["Î"] = 206; values["Ï"] = 207; values["Ð"] = 208; values["Ñ"] = 209;
     values["Ò"] = 210; values["Ó"] = 211; values["Ô"] = 212; values["Õ"] = 213; values["Ö"] = 214; values["×"] = 215; values["Ø"] = 216; values["Ù"] = 217; values["Ú"] = 218;
     values["Û"] = 219; values["Ü"] = 220; values["Ý"] = 221; values["Þ"] = 222; values["ß"] = 223; values["à"] = 224; values["á"] = 225; values["â"] = 226; values["ã"] = 227;
     values["ä"] = 228; values["å"] = 229; values["æ"] = 230; values["ç"] = 231; values["è"] = 232; values["é"] = 233; values["ê"] = 234; values["ë"] = 235; values["ì"] = 236;
     values["í"] = 237; values["î"] = 238; values["ï"] = 239; values["ð"] = 240; values["ñ"] = 241; values["ò"] = 242; values["ó"] = 243; values["ô"] = 244; values["õ"] = 245;
     values["ö"] = 246; values["ø"] = 248; values["ù"] = 249; values["ú"] = 250; values["û"] = 251; values["ü"] = 252; values["ý"] = 253; values["þ"] = 254; values["ÿ"] = 255;
}

string convertStringToInt(string value)
{
       stringstream ss;
       ss << values[value];
       return ss.str();
}

string getHtmlValue(string character)
{
       return "&#" + convertStringToInt(character) + ";";
}

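int main ()
{
    setValues();

    // Quick test: swap the accented character at index 1 for its HTML code.
    // This assumes the source file is saved as cp1252, so "é" is a single byte.
    string base = "héllo";
    string html = getHtmlValue("é");

    string str = base;
    str.replace(1, 1, html);

    cout << str << endl;
    cin.get();  // keep the console window open
    return 0;
}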

j_mie6 · May 13, 2012

update:

The C++ program is finished aside from getting the input from Python (which is just a big string of filenames that will be checked). C++ can't find them itself, and I am reluctant to write the names into a file made by Python to then be read by C++; I'd rather Python 'piped' them in. However, I can't work out how C++ receives the pipe itself yet.
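On the Python side, piping the names in could look roughly like the sketch below (Python 3.7+). A tiny Python child process stands in for the C++ program so the example runs anywhere; `pipe_filenames` and the file names are made up. On the C++ side the names would then arrive on standard input, for example read one per line with `getline(cin, line)`.

```python
import subprocess
import sys


def pipe_filenames(command, filenames):
    """Run command and feed it the filenames on stdin, one per line (sketch)."""
    result = subprocess.run(
        command,
        input="\n".join(filenames) + "\n",
        text=True,
        capture_output=True,
        check=True,
    )
    return result.stdout


# Stand-in for the C++ program: echoes every name it receives on stdin.
child = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    print('Correcting: ' + line.strip())",
]
out = pipe_filenames(child, ["a.xml", "b.xml"])
print(out)
```

Because the names travel through the pipe and not the command line, their total length is not limited by the argument size.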

Code:

#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <vector>
#include <sstream>
#include <stdio.h>
#include <stdlib.h>
using namespace std;

/*
   Made by J_mie6 with help from The_J
   This program is designed to be run via the XML cleaner to fix incompatible files;
   Originally made in python this program ran very slowly and hopefully this C++ version
   will drastically increase the "fixing" of these files containing characters that the XML
   parser cannot handle (thus stopping the program from working properly!)
*/

// Maps each special character to the number used in its HTML code.
map<char, int> values;

void setValues()
{
     values['À'] = 192; values['Á'] = 193; values['Â'] = 194; values['Ã'] = 195; values['Ä'] = 196; values['Å'] = 197; values['Æ'] = 198; values['Ç'] = 199; values['È'] = 200;
     values['É'] = 201; values['Ê'] = 202; values['Ë'] = 203; values['Ì'] = 204; values['Í'] = 205; values['Î'] = 206; values['Ï'] = 207; values['Ð'] = 208; values['Ñ'] = 209;
     values['Ò'] = 210; values['Ó'] = 211; values['Ô'] = 212; values['Õ'] = 213; values['Ö'] = 214; values['×'] = 215; values['Ø'] = 216; values['Ù'] = 217; values['Ú'] = 218;
     values['Û'] = 219; values['Ü'] = 220; values['Ý'] = 221; values['Þ'] = 222; values['ß'] = 223; values['à'] = 224; values['á'] = 225; values['â'] = 226; values['ã'] = 227;
     values['ä'] = 228; values['å'] = 229; values['æ'] = 230; values['ç'] = 231; values['è'] = 232; values['é'] = 233; values['ê'] = 234; values['ë'] = 235; values['ì'] = 236;
     values['í'] = 237; values['î'] = 238; values['ï'] = 239; values['ð'] = 240; values['ñ'] = 241; values['ò'] = 242; values['ó'] = 243; values['ô'] = 244; values['õ'] = 245;
     values['ö'] = 246; values['ø'] = 248; values['ù'] = 249; values['ú'] = 250; values['û'] = 251; values['ü'] = 252; values['ý'] = 253; values['þ'] = 254; values['ÿ'] = 255;
     values['¡'] = 161; values['¿'] = 191; values['÷'] = 247; values['Œ'] = 338; values['œ'] = 339; values['Š'] = 352; values['š'] = 353; values['Ÿ'] = 376; values['ƒ'] = 402;
}

string convertCharToInt(char value)
{
       stringstream ss;
       ss << values[value];
       return ss.str();
}

string getHtmlValue(char character)
{
       return "&#" + convertCharToInt(character) + ";";
}

bool isCharSpecial(char c)
{
     return values.count(c)>0;
}

string getTextFromFile(const char* filename)
{
    vector<string> text;
    string line;
    ifstream textstream (filename);
    while (getline(textstream, line)) {
          text.push_back(line + "\n");
    } 
    textstream.close();
    string alltext;
    for (int i=0; i < text.size(); i++){
        alltext += text[i];
    }
    return alltext;
}

void writeTextToFile(const char* filename, string data)
{
     ofstream file;
     file.open (filename);
     file << data;
     file.close();

}

vector <string> splitInput(string data, char separator)
{
     istringstream ss( data );
     vector <string> vData;
     string x;
     // Test getline itself rather than eof(), and skip empty entries
     // such as the one left behind by a trailing separator.
     while (getline( ss, x, separator ))
     {
           if (!x.empty())
               vData.push_back(x);
     }

     return vData;
}

int main (int argc, char *argv[])
{
    // The comma-separated list of filenames arrives as the first argument.
    if (argc < 2)
    {
        cerr << "Usage: fixer file1,file2,..." << endl;
        return 1;
    }
    string files = argv[1];

    vector <string> vFiles = splitInput(files, ',');
    setValues();

    for (size_t x=0; x<vFiles.size(); x++)
    {
        const char* filename = vFiles[x].c_str();
        cout << "Correcting: "<< filename << endl;
        string str = getTextFromFile(filename);

        for (size_t i=0; i<str.length(); i++)
        {
            if (isCharSpecial(str[i]))
            {
                string html = getHtmlValue(str[i]);
                str.replace(i, 1, html);
                i += html.length() - 1;  // skip past the code just inserted
            }
        }

        writeTextToFile(filename, str);
    }
    cin.get();  // keep the console window open
    return 0;
}

then after this is finished I hook it up to the GUI as well as The_J's script and try and get your files working...

Edit: well, we got the arguments to send over, however Windows fails to send all of your file names over as the string exceeds the size restriction!!! (And that is with them being sent as a single string.) This is why you gotta merge them.
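Another way around the size restriction would be to send the names in several smaller batches, one program call per batch. A rough Python sketch follows. The name `batch_filenames` is made up, and the default of 8000 characters is an assumption chosen to stay under the 8191-character command-line limit of cmd.exe, which `os.system` goes through on Windows. Whether the receiving program expects a trailing separator is left to the caller.

```python
def batch_filenames(filenames, max_chars=8000, separator=","):
    """Join filenames into strings of at most max_chars each (sketch)."""
    batches, current, length = [], [], 0
    for name in filenames:
        # Cost of adding this name, including a separator if one is needed.
        added = len(name) + (len(separator) if current else 0)
        if current and length + added > max_chars:
            # Current batch is full: store it and start a new one.
            batches.append(separator.join(current))
            current, length = [], 0
            added = len(name)
        current.append(name)
        length += added
    if current:
        batches.append(separator.join(current))
    return batches


print(batch_filenames(["aa", "bb", "cc"], max_chars=5))  # ['aa,bb', 'cc']
```

A single name longer than the limit still ends up alone in its own batch, so very long paths would need separate handling.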

Still, after testing, the C++ version works much faster, and that is with C++'s slow printing telling you what file it is checking. Edit: in fact it runs all the files (bar the 280+ text files) in less than a minute.

would you mind merging all your text files into like 10 or so and sending them back to me so I can continue?

dacubz145 · Jun 16, 2012

quick thought

Does this check for things in []? When you have [tab] or [dot], you do not want that translated.

j_mie6 · Jun 17, 2012

It doesn't actually. That makes sense, I will implement this when I have got the next version!
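One possible shape for that, as a Python sketch: split the text on bracketed tokens and only run the translation (or any other change) on the parts outside them. `apply_outside_brackets` is a hypothetical name, and `str.upper` merely stands in for the real translation step.

```python
import re

# One capturing group, so re.split keeps the bracketed tokens in the result.
_TAG = re.compile(r"(\[[^\[\]]*\])")


def apply_outside_brackets(text, transform):
    """Apply transform to everything except [bracketed] tokens (sketch)."""
    parts = _TAG.split(text)
    # Even indexes are ordinary text, odd indexes are the bracketed tokens.
    return "".join(
        part if i % 2 else transform(part) for i, part in enumerate(parts)
    )


print(apply_outside_brackets("go [tab] now[dot]", str.upper))  # GO [tab] NOW[dot]
```

Nested brackets are not handled by this pattern, so text containing them would need a closer look.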
