xgettext: rerun with UTF-8 encoding and/or properly process failures #14
Comments
Thanks! Can you attach the problematic files? (With a URL, so we can track where they are from.)
What a mess :|
I thought I had added it. The attached file is not the original, but slightly modified.
Not really, as these are (I guess) treated as UTF-8 by default?
So, the solution for this seems to be to actually not pass
Of course, not every non-ASCII file will be in UTF-8, so some additional metrics might be needed to find the right encoding to translate from. Alternatively, first try UTF-8, then Latin-1, then others. That will catch many instances. If the file is not UTF-8, you might need to translate (encode/decode, etc.) the extracted string before further processing.
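The "first try UTF-8, then Latin-1" idea could be sketched roughly as below. This is only an illustration, not what the project does: `decode_with_fallback` and its default candidate list are made-up names for this example. One thing to keep in mind is that Latin-1 can decode any byte sequence, so it has to come last; anything listed after it would never be tried.

```python
def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Decode bytes with the first candidate encoding that accepts them.

    Hypothetical helper for illustration. Returns (text, encoding_used).
    Latin-1 maps every byte to a character and never fails, so it only
    makes sense as the last candidate.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # Not valid in this encoding, try the next candidate.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def to_utf8(data, encodings=("utf-8", "latin-1")):
    """Re-encode extracted bytes as UTF-8 for further processing."""
    text, _ = decode_with_fallback(data, encodings)
    return text.encode("utf-8")
```

A successful Latin-1 decode does not prove the file really was Latin-1, only that no better candidate matched, which is why proper detection would still be preferable.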
I have proper encoding detection here: https://github.com/nexB/typecode/blob/92feb7be3a87c1b541e7034c3f9797c96bc52305/src/typecode/magic2.py#L294, or something else here: https://github.com/nexB/scancode-toolkit/blob/c80e502c06639c18e2ea606d63f2ac09f89230c1/src/textcode/analysis.py#L251, which we could use at some point.
Reference: #13 Reference: #14 Reported-by: Armijn Hemel @armijnhemel Signed-off-by: Philippe Ombredanne <[email protected]>
Actually, to get proper UTF-8 output, you need to fool xgettext with a Unicode character, as explained in https://bloomfield.online/posts/generate-utf-8-dictionaries-using-gettext/ by @peter-bloomfield ... but rather than adding an extra file, adding a dummy header copyright or similar with Unicode content seems to be enough. 🤪
It will complain about non-UTF-8 content, but will keep on trucking.
Reference: #14 Reported-by: Armijn Hemel @armijnhemel Signed-off-by: Philippe Ombredanne <[email protected]>
Just dropping I ran the following two commands on the example I provided:
and then ran
I am using this version of
re:
But then it does not work if I do not know the encoding, or it will create a plain ASCII and NOT a UTF-8 encoded .po file. Using this file foo.c.zip and xgettext 0.21:
Hence why using a fake copyright works to get UTF-8 output. All other modes can parse non-UTF-8 input BUT will return some random ASCII-like encoding.
Something to consider is to rerun xgettext with different parameters in case it fails. The xgettext manual says:
Sometimes this will lead to incorrect results (or no results at all), and xgettext might need to be rerun with a different option. One example where this fails is util-linux/fdisk.c from a recent BusyBox:
The culprit here is actually this sequence:
where xgettext thinks this might be some UTF-8 character (but, of course, it is not a valid sequence). No output file is generated in this case.
https://git.busybox.net/busybox/tree/util-linux/fdisk.c?h=1_35_stable
Another example is the attached file (lineedit.c from BusyBox, zipped), where I have replaced a string on line 893.
and no output file will be created.
When using the --from-code parameter, the string will not be correctly extracted, but an output file will be created:
It is not ideal, but better than getting no data at all. This could use some refinement.
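A rerun strategy along these lines could look like the sketch below. It is a rough illustration under stated assumptions, not the project's implementation: `build_command` and `extract_with_fallback` are hypothetical names, and the order of attempts (no `--from-code`, then UTF-8, then ISO-8859-1) is just one possible choice. Only the xgettext options `-o` and `--from-code` are used. The `run` parameter exists so the logic can be exercised without xgettext installed.

```python
import os
import subprocess


def build_command(source, output, from_code=None):
    """Build an xgettext command line; from_code=None omits --from-code."""
    cmd = ["xgettext", "-o", output]
    if from_code is not None:
        cmd.append("--from-code=" + from_code)
    cmd.append(source)
    return cmd


def extract_with_fallback(source, output,
                          encodings=(None, "UTF-8", "ISO-8859-1"),
                          run=subprocess.run):
    """Try each --from-code setting in turn until an output file appears.

    Returns the setting that worked (None means no --from-code was passed).
    Raises RuntimeError if every attempt fails. Assumption: an attempt
    counts as successful when xgettext exits with 0 and the output exists.
    """
    for from_code in encodings:
        # Remove leftovers so an old file is not mistaken for success.
        if os.path.exists(output):
            os.remove(output)
        result = run(build_command(source, output, from_code),
                     capture_output=True)
        if result.returncode == 0 and os.path.exists(output):
            return from_code
    raise RuntimeError("xgettext produced no output for " + source)
```

For example, `extract_with_fallback("lineedit.c", "lineedit.po")` would return the first setting that yields a .po file. A missing output file can also simply mean that the source contains no extractable strings, so a caller may want to treat the final error as "no data" rather than as a hard failure.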
Please note that this isn't true for all languages according to the xgettext manual: