Wednesday, April 14, 2010

Using Google to translate resource files – code example

Often there is a need to create resource files in foreign languages for code-testing purposes. The languages that I typically like to test are:

  • Hebrew
  • Arabic
  • Simplified Chinese
  • Spanish
  • French
  • German – text is often 2x longer than English (what you get with preciseness in expression)
  • Hindi

If the code/css works for all of the above then you are likely safe for other languages. There are two pieces of code and one manual process (cut and paste – Google usually makes it hard to automate the capture of the translation).

Converting the Resx to an Html page

We put the Resx content up almost as is: just drop the comments and wrap the items in <html> and <body> tags.

// Requires: using System.IO; using System.Xml;
private void WriteHtml(FileInfo infile, FileInfo outHtml)
{
    XmlDocument sourceResx = new XmlDocument();
    sourceResx.Load(infile.FullName);
    XmlDocument xHtml = new XmlDocument();
    xHtml.LoadXml("<html><body/></html>");
    XmlNode body = xHtml.SelectSingleNode("//body");
    // Remove the comments; walk backwards so removals do not disturb the remaining indexes.
    XmlNodeList list = sourceResx.SelectNodes("//data[@name]/comment");
    for (int i = list.Count - 1; i >= 0; i--)
        list[i].ParentNode.RemoveChild(list[i]);
    // Copy every <data> item into the <body> of the page.
    list = sourceResx.SelectNodes("//data[@name]");
    foreach (XmlNode node in list)
    {
        body.AppendChild(xHtml.ImportNode(node, true));
    }
    xHtml.Save(outHtml.FullName);
}
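
To make the shape of the generated page easier to picture, here is a minimal sketch of the same transformation working on strings instead of files. The class name ResxToHtmlSketch and the method ToHtml are made up for this illustration; only the XmlDocument calls come from the code above.

```csharp
using System.Xml;

// Hypothetical helper: the same idea as WriteHtml, but string in, string out.
public static class ResxToHtmlSketch
{
    public static string ToHtml(string resxXml)
    {
        XmlDocument source = new XmlDocument();
        source.LoadXml(resxXml);

        XmlDocument html = new XmlDocument();
        html.LoadXml("<html><body/></html>");
        XmlNode body = html.SelectSingleNode("//body");

        foreach (XmlNode data in source.SelectNodes("//data[@name]"))
        {
            // Import a deep copy, then drop its <comment> child if it has one.
            XmlNode copy = html.ImportNode(data, true);
            XmlNode comment = copy.SelectSingleNode("comment");
            if (comment != null)
                copy.RemoveChild(comment);
            body.AppendChild(copy);
        }
        return html.OuterXml;
    }
}
```

Calling ResxToHtmlSketch.ToHtml with a one-item Resx fragment should give back a page whose <body> holds the <data> item and its <value>, with the <comment> gone.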

Getting the Translations

Next, we copy this to a website. I copied an example to one of my sites, http://reddwarfdogs.com/ContentPage.html, if you want to see what the output looks like.

Next, go to http://translate.google.com, enter the URL, and pick the desired translation. Once the translation is shown, I usually view the page source and copy and paste it into a file named after the CultureInfo name, with .htm as the extension (the next code sample assumes this has been done and looks for the files in the current directory). So we would have files like

  • he-IL.htm
  • ar.htm
  • es.htm

Creating the translated Resx files from the .htm files

We can now return to the world of code processing.

  • We use the HtmlAgilityPack library (from CodePlex) to turn the html into valid Xml, which makes processing a lot easier. Before we do that we:
    • Add a meta tag to identify the file as UTF-8 (if you forget to do this, you may get a lot of ???????? appearing instead).
  • Once we have valid Xml, we remove the original phrases that Google leaves in the html alongside the translations.
  • We then load a copy of the original Resx file and walk it, replacing each <value> with the one from the translation.
  • Finally, we save the result to an appropriately named file.

The code:

// Requires: using System; using System.IO; using System.Xml; plus a reference to HtmlAgilityPack.
void CreateTranslatedResx(FileInfo sourceFile)
{
    // Strip only the extension, so a "." in a folder name cannot truncate the path.
    string baseName = Path.Combine(sourceFile.DirectoryName,
        Path.GetFileNameWithoutExtension(sourceFile.Name));
    DirectoryInfo source = new DirectoryInfo(Environment.CurrentDirectory);
    foreach (FileInfo fi in source.GetFiles("*.htm"))
    {
        // "*.htm" can also match ".html" files on Windows, so check the extension.
        if (fi.Extension != ".htm")
            continue;
        // Start from a fresh copy of the original for every culture.
        XmlDocument dom = new XmlDocument();
        dom.Load(sourceFile.FullName);
        // Add the encoding declaration if it is missing.
        string txt = File.ReadAllText(fi.FullName);
        if (!txt.Contains("utf-8"))
        {
            txt = txt.Replace("<html>", "<html><meta http-equiv='Content-Type' content='text/html; charset=utf-8'>");
            File.WriteAllText(fi.FullName, txt);
        }
        // Let HtmlAgilityPack rewrite the html as well-formed Xml.
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(txt);
        doc.OptionOutputAsXml = true;
        doc.Save("temp.xml");
        string culture = Path.GetFileNameWithoutExtension(fi.Name);
        XmlDocument htmDom = new XmlDocument();
        htmDom.LoadXml(File.ReadAllText("temp.xml"));
        // Drop the spans holding the original (untranslated) text.
        XmlNodeList list = htmDom.SelectNodes("//span[@class='google-src-text']");
        for (int i = list.Count - 1; i >= 0; i--)
            list[i].ParentNode.RemoveChild(list[i]);
        // Copy each translated <value> into the matching <data> item.
        foreach (XmlNode node in htmDom.SelectNodes("//data[@name]"))
        {
            XmlNode oldNode = dom.SelectSingleNode(
                string.Format("//data[@name='{0}']", node.Attributes["name"].Value));
            XmlNode newValue = node.SelectSingleNode("value");
            if (oldNode == null || newValue == null)
                continue; // skip items without a match
            // Note: this also strips any genuine question marks from the text.
            oldNode.SelectSingleNode("value").InnerText = newValue.InnerText.Replace("?", string.Empty);
        }
        dom.Save(String.Format("{0}.{1}.Resx", baseName, culture));
    }
}
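
The span-removal step is the least obvious part, so here is a small sketch of it in isolation. The class StripSourceTextSketch and the sample markup used below are assumptions for illustration: the markup only approximates what the translated page might contain, based on the google-src-text class that the code above looks for. Unlike the code above, the sketch also trims the result.

```csharp
using System.Xml;

// Hypothetical helper: removes the original-text spans and returns the translation.
public static class StripSourceTextSketch
{
    public static string TranslatedValue(string dataXml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(dataXml);

        // Spans with this class hold the original phrase; remove them all.
        XmlNodeList originals = doc.SelectNodes("//span[@class='google-src-text']");
        for (int i = originals.Count - 1; i >= 0; i--)
            originals[i].ParentNode.RemoveChild(originals[i]);

        // Whatever text is left inside <value> is the translation.
        return doc.SelectSingleNode("//value").InnerText.Trim();
    }
}
```

Given a <data> item whose <value> contains both a google-src-text span with "Hello" and the translated text "Hola", TranslatedValue should return just "Hola".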

Conclusion

That’s it! The main things that can go wrong are:

  • Not saving with the correct CultureInfo name (What is the code for Welsh and Yiddish?)
  • Not saving the HTML from Google as UTF-8
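
For the first point, one way to look up a culture name is to ask .NET itself. The sketch below searches the cultures known to the runtime by their English name; the class CultureLookupSketch and the method FindCultureNames are names invented here, and which cultures come back depends on the operating system and .NET version.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: lists culture names whose English name contains the given text.
public static class CultureLookupSketch
{
    public static List<string> FindCultureNames(string englishNamePart)
    {
        List<string> names = new List<string>();
        foreach (CultureInfo ci in CultureInfo.GetCultures(CultureTypes.AllCultures))
        {
            // Case-insensitive match against names such as "Spanish (Spain)".
            if (ci.EnglishName.IndexOf(englishNamePart, StringComparison.OrdinalIgnoreCase) >= 0)
                names.Add(ci.Name);
        }
        return names;
    }
}
```

Calling FindCultureNames("Welsh") or FindCultureNames("Yiddish") lists the matching names, if the runtime knows those cultures, which can then be used as the .htm file names.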

Again, this is done only for testing purposes. Read Google's terms of use before shipping the files with the application or exposing them on the real internet.
