failed to translate html file with MS document translator #156

Closed
laf-1226 opened this issue Jul 5, 2021 · 5 comments

Comments

laf-1226 commented Jul 5, 2021

We translated some HTML files with MS Document Translator. All of the HTML files are well-formed, but translation of some of them failed with the error message: Error while processing document: xxxx.html Object reference not set to an instance of an object.

Here is an example file that failed to translate.
Because the sample file is somewhat complicated, we created a smaller one that reproduces the error.

  • xxxxxx
  • if

  • size is really big (for example, text length>5000 chars), this file will fail with MS document translator.
    While if we put some text between

    and

  • , the file will be translated successfully.

    We don't know why, but we wonder whether something is wrong with the code below:

        private static void AddNodes(HtmlNode rootnode, ref List<HtmlNode> nodes)
        {
            string[] DNTList = { "script", "#text", "code", "col", "colgroup", "embed", "em", "#comment", "image", "map", "media", "meta", "source", "xml"};  //DNT - Do Not Translate - these nodes are skipped.
            HtmlNode child = rootnode;
            while (child != rootnode.LastChild)
            {
                if (!DNTList.Contains(child.Name.ToLowerInvariant())) {
                    if (child.InnerHtml.Length > maxRequestSize)
                    {
                        AddNodes(child.FirstChild, ref nodes);
                    }
                    else
                    {
                        if (child.InnerHtml.Trim().Length != 0) nodes.Add(child);
                    }
                }
                child = child.NextSibling;
            }
        }
    

    Sorry, I was unable to upload the sample files, either in HTML or in .docx format.
    Has anyone run into a similar issue with MS Document Translator? Does anyone know how to fix it? Many thanks!
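
One possible reading of the quoted loop, sketched under assumptions: the cursor starts at the node passed in and follows `NextSibling`, while the exit test compares it with that node's own `LastChild`. A sibling is never equal to that child, so the loop can only end when both sides are null. The `Node` class below is a hypothetical stand-in for HtmlAgilityPack's `HtmlNode` (assuming `LastChild` and `NextSibling` return null when absent), and the element names are illustrative only.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for HtmlNode; only the members the loop touches.
class Node
{
    public string Name;
    public Node NextSibling;
    public readonly List<Node> Children = new List<Node>();
    public Node(string name) { Name = name; }
    public Node FirstChild => Children.Count > 0 ? Children[0] : null;
    public Node LastChild => Children.Count > 0 ? Children[Children.Count - 1] : null;

    // Appends a child and links it to the previous sibling.
    public Node Add(Node child)
    {
        if (Children.Count > 0) Children[Children.Count - 1].NextSibling = child;
        Children.Add(child);
        return this;
    }
}

static class LoopSketch
{
    // Same traversal shape as the quoted AddNodes: start at the node itself,
    // follow NextSibling, stop when the cursor equals start.LastChild.
    public static int Walk(Node start)
    {
        int visited = 0;
        Node child = start;
        while (child != start.LastChild)
        {
            // Dereferences null once the cursor has run past the last sibling.
            string name = child.Name.ToLowerInvariant();
            visited++;
            child = child.NextSibling;
        }
        return visited;
    }

    public static bool Throws(Node start)
    {
        try { Walk(start); return false; }
        catch (NullReferenceException) { return true; }
    }

    static void Main()
    {
        // Recursion would start at an element that has children of its own.
        var failing = new Node("ul").Add(new Node("li").Add(new Node("#text")));
        // Recursion would start at a childless text node, so LastChild is null.
        var working = new Node("ul").Add(new Node("#text"))
                                    .Add(new Node("li").Add(new Node("#text")));

        Console.WriteLine(Throws(failing.FirstChild)); // True
        Console.WriteLine(Throws(working.FirstChild)); // False
    }
}
```

If this reading is right, it would explain why adding text in front of the first nested element makes the failure disappear: the recursive call then starts at a text node whose `LastChild` is null, so the loop stops cleanly at the end of the sibling list.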

  • chriswendt1 (Member) commented Jul 5, 2021

    Hi @laf-1226 ,
    Thank you for reporting the error.
    There is a newer, and I think better, document translation utility in the /DocumentTranslation project in this same repository.
    Can you give it a try and see whether this works for your files?
    Please let me know if you cannot migrate to the newer utility.
    If that also fails, please attach a sample document. Maybe try renaming the file as a .txt file and then attach.

    laf-1226 (Author) commented Jul 6, 2021

    Hi Chris,

    Thanks a lot for your reply.
    We confirmed that we are using Document Translator version 2.9.3, which seems to be the latest version. The code in this ticket is also from version 2.9.3.
    Could you let us know where to get the better document translation utility, if there is a newer version?

    I attached the sample files as you suggested:
    File_BSN_AP2.txt
    new_failed.txt
    new_works.txt
    The first two files produce the error message; the third one is machine translated successfully.

    Thanks again,
    aifang

    chriswendt1 (Member)

    Thanks, aifang, for attaching the files. I will look at that tomorrow.
    By the newer utility I meant this: https://github.com/MicrosoftTranslator/DocumentTranslation.

    laf-1226 (Author) commented Jul 6, 2021

    Thank you, Chris!
    We made a change to the following code. In
    private static void AddNodes(HtmlNode rootnode, ref List<HtmlNode> nodes)
    we changed
    while (child != rootnode.LastChild)
    to
    while (child != null)

    After rebuilding, Document Translator can handle the HTML files that failed before (e.g. the 2 sample files).
    Although it works for those files, we are not sure whether this change is correct or whether it will introduce other issues. Could you help check it? Many thanks!

    chriswendt1 added a commit that referenced this issue Jul 7, 2021
    chriswendt1 (Member)

    The code was not at all prepared for a single element being larger than maxRequestSize.
    In my test, your code would not get the element translated either; it would just avoid the failure.
    In version 2.9.4 I made changes to this function as well as BreakSentences. You may want to take both. I also updated the binaries.
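
A minimal sketch of why the `while (child != null)` change can avoid the exception and still leave an oversized element untranslated. It is not the 2.9.4 code, which is not shown in this thread. `Node` is a hypothetical stand-in for `HtmlNode`, its `InnerHtml` counts text only, the do-not-translate list is abbreviated, and the `MaxRequestSize` value is made up for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for HtmlNode; InnerHtml is simplified to text only.
class Node
{
    public string Name;
    public string Text;
    public Node NextSibling;
    public readonly List<Node> Children = new List<Node>();
    public Node(string name, string text = "") { Name = name; Text = text; }
    public Node FirstChild => Children.Count > 0 ? Children[0] : null;
    public string InnerHtml =>
        Name == "#text" ? Text : string.Concat(Children.Select(c => c.InnerHtml));

    // Appends a child and links it to the previous sibling.
    public Node Add(Node child)
    {
        if (Children.Count > 0) Children[Children.Count - 1].NextSibling = child;
        Children.Add(child);
        return this;
    }
}

static class PatchSketch
{
    const int MaxRequestSize = 10; // illustrative value, not the real limit
    static readonly string[] DNTList = { "script", "#text", "#comment" }; // abbreviated

    // The quoted AddNodes with only the loop condition changed.
    public static void AddNodes(Node start, List<Node> nodes)
    {
        Node child = start;
        while (child != null) // change proposed in the thread
        {
            if (!DNTList.Contains(child.Name.ToLowerInvariant()))
            {
                if (child.InnerHtml.Length > MaxRequestSize)
                    AddNodes(child.FirstChild, nodes); // descend into oversized node
                else if (child.InnerHtml.Trim().Length != 0)
                    nodes.Add(child);
            }
            child = child.NextSibling;
        }
    }

    static void Main()
    {
        var body = new Node("body")
            .Add(new Node("p").Add(new Node("#text", "short")))
            .Add(new Node("p").Add(new Node("#text", new string('x', 50))));

        var nodes = new List<Node>();
        AddNodes(body, nodes);

        // Only the short paragraph is collected; the long one is skipped.
        Console.WriteLine(nodes.Count); // 1
    }
}
```

In this sketch the oversized paragraph never reaches the list: its only child is a text node, and text nodes are on the do-not-translate list, so the recursion finds nothing to add. Splitting such an element into smaller pieces would need extra logic beyond the loop condition.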
