xgettext: rerun with UTF-8 encoding and/or properly process failures #14

armijnhemel · 2024-03-15T14:07:35Z

Something to consider is to rerun xgettext with different parameters in case it fails. The xgettext manual says:

By default the input files are assumed to be in ASCII.

Sometimes this will lead to incorrect results (or no results at all), and xgettext might need to be rerun with a different option. One example where xgettext fails is util-linux/fdisk.c from a recent BusyBox:

$ xgettext --omit-header --extract-all --no-wrap fdisk.c
xgettext: Non-ASCII string at fdisk.c:333.
          Please specify the source encoding through --from-code.

The culprit here is actually this sequence:

    "\x80" "Old Minix",        /* Minix 1.4a and earlier */

where xgettext thinks this might be some UTF-8 character (but, of course, it is not a valid sequence). No output file is generated in this case.

https://git.busybox.net/busybox/tree/util-linux/fdisk.c?h=1_35_stable

Another example is the attached file (lineedit.c from BusyBox, zipped) where I have replaced a string on line 893.

$ xgettext --omit-header --extract-all --no-wrap lineedit.c
xgettext: Non-ASCII string at lineedit.c:893.
          Please specify the source encoding through --from-code.

and no output file will be created.

When using the --from-code parameter the string will not be correctly extracted, but an output file will be created:

$ xgettext --omit-header --extract-all --no-wrap --from-code=UTF-8 lineedit.c
lineedit.c:442: warning: internationalized messages should not contain the '\r' escape sequence
lineedit.c:893: warning: The following msgid contains non-ASCII characters.
                         This will cause problems to translators who use a character encoding
                         different from yours. Consider using a pure ASCII msgid instead.
                         ë
lineedit.c:893: invalid multibyte sequence
lineedit.c:893: invalid multibyte sequence
lineedit.c:893: invalid multibyte sequence
lineedit.c:893: invalid multibyte sequence

It is not ideal, but better than getting no data at all. This could use some refinement.

Please note that this isn't true for all languages according to the xgettext manual:

       --from-code=NAME
              encoding of input files (except for Python, Tcl, Glade)

pombredanne · 2024-03-15T14:13:28Z

Thanks! can you attach the problematic files? (with a URL so we can track where they are from)

pombredanne · 2024-03-15T14:13:53Z

Please note that this isn't true for all languages according to the xgettext manual:

What a mess :|

armijnhemel · 2024-03-15T14:22:07Z

Thanks! can you attach the problematic files? (with a URL so we can track where they are from)

lineedit.c.zip

I thought I had added it. The attached file is not the original, but slightly modified.

armijnhemel · 2024-03-15T14:22:31Z

Please note that this isn't true for all languages according to the xgettext manual:

What a mess :|

Not really as these are (I guess) treated as utf-8 by default?

armijnhemel · 2024-03-15T19:21:53Z

So, the solution for this seems to be to actually not pass --omit-header to xgettext.

armijnhemel · 2024-03-15T22:33:58Z

Of course, not every non-ASCII file will be in UTF-8, so some additional heuristics might be needed to find the right encoding to translate from. Alternatively, first try UTF-8, then Latin-1, then others. That will catch many instances.

If the file is not UTF-8, it might be that you would need to translate (encode/decode, etc.) the extracted string before further processing.
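A minimal sketch of the "first UTF-8, then Latin-1" order, assuming the raw bytes of the file (or of an extracted string) are available in Python. The names `guess_encoding` and `to_utf8` are made up for illustration. One caveat: Latin-1 assigns a character to every byte value, so decoding with it never fails; it can only serve as the last catch-all, and a successful decode does not prove the guess is right.

```python
# Candidate encodings, tried in order. Latin-1 accepts any byte sequence,
# so anything listed after it would never be reached.
CANDIDATES = ("utf-8", "latin-1")


def guess_encoding(data, candidates=CANDIDATES):
    """Return the first candidate that decodes data without error, else None."""
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


def to_utf8(data, candidates=CANDIDATES):
    """Re-encode data as UTF-8 using the guessed source encoding."""
    encoding = guess_encoding(data, candidates)
    if encoding is None:
        return None
    return data.decode(encoding).encode("utf-8")
```

For the fdisk.c case, `guess_encoding(b"\x80Old Minix")` falls through to `"latin-1"`, because a lone 0x80 byte is not valid UTF-8.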

pombredanne · 2024-03-15T22:42:10Z

I have proper encoding detection here: https://github.com/nexB/typecode/blob/92feb7be3a87c1b541e7034c3f9797c96bc52305/src/typecode/magic2.py#L294, or something else here: https://github.com/nexB/scancode-toolkit/blob/c80e502c06639c18e2ea606d63f2ac09f89230c1/src/textcode/analysis.py#L251, which we could use at some point.

pombredanne · 2024-03-16T07:37:31Z

Actually to get a proper UTF-8 output, you need to fool xgettext with a unicode character as explained in https://bloomfield.online/posts/generate-utf-8-dictionaries-using-gettext/ by @peter-bloomfield ... but rather than to add an extra file, adding a dummy header copyright or similar with unicode content seems to be enough. 🤪

xgettext --copyright-holder="ø" --extract-all --no-wrap --output=foo.po --from-code=UTF-8 ...

It will complain about non-UTF-8 content, but will keep on trucking.

armijnhemel · 2024-03-16T12:29:59Z

Actually to get a proper UTF-8 output, you need to fool xgettext with a unicode character as explained in https://bloomfield.online/posts/generate-utf-8-dictionaries-using-gettext/ by @peter-bloomfield ... but rather than to add an extra file, adding a dummy header copyright or similar with unicode content seems to be enough. 🤪

xgettext --copyright-holder="ø" --extract-all --no-wrap --output=foo.po --from-code=UTF-8 ...

It will complain about non-UTF-8 content, but will keep on trucking.

Just dropping --omit-header would have been enough.

I ran the following two commands on the example I provided:

$ xgettext --copyright-holder="ø"  --extract-all --no-wrap --output=bar.po  --from-code=UTF-8 lineedit.c
$ xgettext  --extract-all --no-wrap --output=foo.po  --from-code=UTF-8 lineedit.c

and then ran diff on the outputs:

$ diff -u foo.po bar.po 
--- foo.po	2024-03-16 13:27:54.506623896 +0100
+++ bar.po	2024-03-16 13:27:31.994449031 +0100
@@ -1,5 +1,5 @@
 # SOME DESCRIPTIVE TITLE.
-# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
+# Copyright (C) YEAR ø
 # This file is distributed under the same license as the PACKAGE package.
 # FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
 #

armijnhemel · 2024-03-19T13:39:58Z

I am using this version of xgettext btw.

$ xgettext --version
xgettext (GNU gettext-tools) 0.22
Copyright (C) 1995-2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Ulrich Drepper.

pombredanne · 2024-04-29T12:58:05Z

re:

Just dropping --omit-header would have been enough.

But then it does not work if I do not know the encoding: it will create a plain ASCII and NOT a UTF-8 encoded .po file.

Using this file foo.c.zip and xgettext 0.21:

$ xgettext  --extract-all --no-wrap --output=no-copyright-with-utf.po  --from-code=UTF-8 foo.c 
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
$ xgettext  --extract-all --no-wrap --output=no-copyright-no-utf.po   foo.c 
xgettext: Non-ASCII string at foo.c:3.
          Please specify the source encoding through --from-code.
pombreda@computer4:~/tmp/xg/chardet-main/tests/iso-8859-2-slovene$ xgettext --copyright-holder="ø"  --extract-all --no-wrap --output=copyright-no-utf.po  foo.c 
xgettext: Non-ASCII string at foo.c:3.
          Please specify the source encoding through --from-code.
$ xgettext --copyright-holder="ø"  --extract-all --no-wrap --output=copyright-with-utf.po  --from-code=UTF-8 foo.c 
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
$ file *
copyright-with-utf.po:    GNU gettext message catalogue, Unicode text, UTF-8 text, with very long lines (399)
foo.c:                    ISO-8859 text, with very long lines (799)
foo.c.zip:                Zip archive data, at least v2.0 to extract, compression method=deflate
no-copyright-with-utf.po: GNU gettext message catalogue, ASCII text, with very long lines (399)

$ xgettext --omit-header  --extract-all --no-wrap --output=omit-with-utf.po  --from-code=UTF-8 foo.c 
foo.c:3: warning: The following msgid contains non-ASCII characters.
                  This will cause problems to translators who use a character encoding
                  different from yours. Consider using a pure ASCII msgid instead.
                  LJUBLJANA ? Zavod RS za zaposlovanje je na svojih spletnih straneh vzpostavil novo rubriko Skupaj do zaposlitve, v okviru katere bodo uporabnikom na voljo primeri dobrih praks oz. uspe�nih zgodb brezposelnih oseb, iskalcev zaposlitve in delodajalcev iz vse Slovenije. Trenutno so v rubriki objavljene tri zgodbe. Svoje izku�nje so javnosti zaupali Branko Ileni�, Darja Avgu�tin�i� in Mojca Rupert.
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
$ file *
copyright-with-utf.po:    GNU gettext message catalogue, Unicode text, UTF-8 text, with very long lines (399)
foo.c:                    ISO-8859 text, with very long lines (799)
foo.c.zip:                Zip archive data, at least v2.0 to extract, compression method=deflate
no-copyright-with-utf.po: GNU gettext message catalogue, ASCII text, with very long lines (399)
omit-with-utf.po:         GNU gettext message catalogue, ASCII text, with very long lines (399)

Hence using a fake copyright holder works to get UTF-8 output. All other modes can parse non-UTF-8 input BUT will return some random ASCII-like encoding.
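To check this from code rather than with `file`, a rough stand-in could look like the sketch below. It is only an approximation of what `file` reports, and `classify` is a hypothetical name; it looks at the raw bytes of a generated .po file and nothing else.

```python
def classify(data: bytes) -> str:
    """Label raw .po bytes as 'ASCII', 'UTF-8' or 'unknown' (rough heuristic)."""
    if data.isascii():
        # Pure 7-bit content, like no-copyright-with-utf.po above.
        return "ASCII"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Non-ASCII bytes that are not valid UTF-8 (e.g. raw ISO-8859 text).
        return "unknown"
    return "UTF-8"
```

For example, `classify(open("copyright-with-utf.po", "rb").read())` would be expected to return `"UTF-8"` for the file listed above, since the "ø" in the header is stored as a multi-byte UTF-8 sequence.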

pombredanne added a commit that referenced this issue Mar 15, 2024

Call xgettext with UTF-8 and parse lines

d691562

Reference: #13 Reference: #14 Reported-by: Armijn Hemel @armijnhemel Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne mentioned this issue Mar 15, 2024

Improve xgettext handlings #16

Merged

pombredanne added a commit that referenced this issue Mar 16, 2024

Force xgettext to return UTF-8

812faf9

Reference: #14 Reported-by: Armijn Hemel @armijnhemel Signed-off-by: Philippe Ombredanne <[email protected]>
