python-format flag incorrectly forced on all extractors #35

wichert · 2013-07-27T19:51:34Z

During extraction the results from an extractor plugin are used to create a message which is then added to the catalog. When the Message instance is created it can set the python-format flag if the text matches a specific regular expression. That means that a text like this:

This is a text about 50% of all users.

will be flagged as a python-format string even it does not originate from python code, or from a piece of python code that will never use formatting. This is breaking our translations in a pretty bad way currently.

As far as I can see the extractor is the only thing that should decide the flags for a message. The Message class itself should never try to guess flags or force them upon plugins, especially since extractors have no way to override this behaviour.

The text was updated successfully, but these errors were encountered:

mitsuhiko · 2013-07-27T20:51:38Z

That is correct. The extractor interface currently does not provide ways to hand out flags. Not sure yet how to properly fix it, but the place to fix it is definitely the extractor interface. I suppose the correct thing to do would be to add a new optional last item to the message tuple. If it's missing the old guessing behavior kicks in, otherwise the flags are accepted from the extractor.

lelit · 2014-01-10T19:28:33Z

Not sure if this is related, but the current heuristic used to determine if a message is actually a python-format does not recognize strftime() format codes: such format strings should not be marked with the python-format flag because otherwise polint tools wrongly mark translations as defective.

JonathanRRogers · 2014-02-11T02:40:56Z

Issues with the heuristic detection of Python format strings are separate from this.

wichert · 2014-02-11T02:45:05Z

For what it's worth I have been working the last few days adding a extraction system to lingua, which fixes this amongst a few other things.

JonathanRRogers · 2014-02-11T02:52:13Z

By "fix", do you mean you extended Babel's extractor interface? If so, could you contribute that babel? I was about to do so, but don't want to duplicate effort. I'm not sure if lingua is meant to be an extension to Babel or an alternative.

wichert · 2014-02-11T02:56:00Z

I fixed this purely from lingua's point of view: lingua no longer uses Babel at all anymore now. Instead it now has its own extraction command (pot-extract, command line-compatible with xgettext), and uses a richer API for its own extractors. It is able to run Babel plugins for backwards compatibility to support file formats lingua does not support itself, but obviously that comes with the API limitations Babel plugins have.

lelit · 2014-02-11T08:48:13Z

Sigh, a pity the waste of resources in duplicating code bases... :-\

Help babel do the right thing, or help lingua? Both does the wrong thing wrt strftime() formats, but at least babel already works with Python 3 while lingua currently does not...

wichert · 2014-02-11T13:57:51Z

Lingua in git works perfectly on Python 3. It needs a few last changes and a bit more documentation before I can make a release, but that should happen in the next few days.

wichert · 2014-02-11T14:04:12Z

@lelit can you explain why you think lingua does the wrong thing with strftime formats? If it does something wrong I would like to fix that.

lelit · 2014-02-11T15:27:51Z

I tried lingua this morning, from a fresh clone from the git repo: at least this line prevents Python 3 to even load the module, maybe it's one the few last changes, dunno, maybe you have a different branch for Py3?

Wrt strftime format, both projects implement the same heuristic to determine if a string contains %x placeholders, but its not adeguate for strftime, because the translated format may contain different set of them (think about %p, ie "AM"/"PM" which is used by englishmen, while other languages, italian to mention one, does not use it and instead prefer %H for 24-hour-based hours): the usual %m/%d/%y %I:%M %p becomes %d/%m/%y %H:%M, and that %d will be recognised as a "c-format" placeholder.

IMHO, a more correct way, although not perfect, would be to check if all %-placeholders in string belongs to the possible strfmt() codes.

In other words, strftime() formats shouldn't be marked as "python-format", otherwise most polints I know will wrongly emit alerts about non-matching translations.

wichert · 2014-02-11T20:00:45Z

Ah, the last commit indeed broke Python 3 support. That was stupid, and I’ll fix that immediately.

The strftime thing is interesting, and something I had never run into before. Luckily it is quite easy to fix. I have a local fix which I’ll push as soon as I have an internet connection again.

For what it is worth I do think Babel’s use of the “python-format” flag is incorrect: as I see it this is really c-format. Python format as defined in the Python documentation is {..} notation, something which Babel does not support currently. This is also the naming used by xgettex.

lelit · 2014-02-11T21:09:29Z

I saw your fixes on lingua... I will try it out again :-) But is there any reason to keep two different implementation of the extraction machinery? I mean, is there any chance that lingua and Babel cooperate in this area, instead of wasting precious hours duplicating each other code?

I can agree on what you state about python-format and c-format confusion, but then I do not see the point of python-format at all, unless you restrict it to {digit}, ie positional arguments, and not generically to {name} placeholder. AFAICT, the c-format marker purpose is only to let polints to check that the translated string contains the same set of placeholders as the original string (assuming it is possible to change the order, generically depending on the actual function using the string, for example the fprintf(3) manpage mentions the syntax %m$ with m indicating the index of the argument to use...); when using named placeholders, is there any need for a special marker?

Thank you!

wichert · 2014-02-11T21:27:15Z

I see no particular reason that to have two implementations. lingua was originally created because Babel's python extractor did not support the i18n syntax user by Pyramid (via translationstring) and Zope (via zope.i18nmessageid), and I needed a new plugin for ZPT files. Unfortunately Babel has not been actively maintained for a long time and bug reports were ignored, which led to a growing desire to drop our dependency on, which is what I worked on over the last few days. I suspect this is also why polib was created instead of using Babel.

As I see it I Babel currently tries to do too many things: a Python API for CLDR, a pot/po-handling framework, a gettext-implementation and an implementation of PO/MO file format. Perhaps it would be useful to split Babel up into separate pieces, each with a single responsibility. This is the direction I wanted to take with lingua: I want it do be nothing other than a tool to extract messages from source code. I haven't made up my mind if it should include a tool to compile PO files, when msgfmt already does that very well. And yes, removing the polint tool currently in lingua is on the to-do list :)

I do see a point of having a python-format: polint tools can use that to check all placeholders from the msgid are also present in the msgstr, perhaps ignoring changes in format specifier. For example if you translate Hello {name} with Beste {naam} the translation is obviously flawed and would result in your software suddenly raising exceptions, but you need to know the text is python-format to be able to do that check.

wichert · 2014-02-11T21:28:54Z

For what its worth I do use Babel for its CLDR API - it is much, much nicer than the CLDR wrappers in zone.i18n (even if that is setting the bar low :) ).

lelit · 2014-02-11T21:39:33Z

I do see a point of having a python-format: polint tools can use that
to check all placeholders from the msgid are also present in the
msgstr

Yes, right, I missed the obvious checks on typos.

During extraction, Message instances can be created with the "python-format" flag, indicating that the message string contains Python percent-formatting placeholders. To avoid setting the flag erroneously because the string source is not Python code or otherwise is not expected to contain such placeholders, the extractor interface must be extended to allow extractor functions to indicate which flags are valid. Fixes python-babel#35

Changaco · 2016-09-01T11:13:03Z

As far as I can see the extractor is the only thing that should decide the flags for a message.

I disagree. For example, how would the python extractor know whether the flag should be python-format or python-brace-format (#333)? It can only guess, and not any better than the Message class can.

xgettext has a --flag argument that's supposed to handle this (doc). Calling xgettext -o - --flag=_:1:python-format file.py, with file.py containing calls like _("Some message"), results in all those messages being flagged with python-format.

During extraction, Message instances can be created with the "python-format" flag, indicating that the message string contains Python percent-formatting placeholders. To avoid setting the flag erroneously because the string source is not Python code or otherwise is not expected to contain such placeholders, the extractor interface must be extended to allow extractor functions to indicate which flags are valid. Fixes python-babel#35

lelit · 2018-08-07T08:31:22Z

Is there a particular reason why the PYTHON_FORMAT regexp accepts whitespaces after the initial % sign?
That's the main reason why wichert's example at the start (that is, This is a text about 50% of all users.) is being marked with python-format...
For now I replaced that with a FULLWIDTH PERCENT SIGN, to not trigger an error at compile time, but I wonder if the optional spaces should be removed from the regexp instead.

wichert · 2018-08-07T08:37:14Z

Because Python allows it:

Python 2.7.15 (default, Jun 17 2018, 12:46:58) 
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '% d' % 1
' 1'

Note the leading whitespace in the result, which is significant here. This is inherited from C, which also allows it:

#include <stdio.h>
#include <stdlib.h>

int main() {
	printf("% d\n", 1);
	return 0;
}

produces:

wichert@notos /tmp $ make x
cc     x.c   -o x
wichert@notos /tmp $ ./x
 1

lelit · 2018-08-07T08:45:59Z

Oh, didn't know about that! Thanks!

During extraction, Message instances can be created with the "python-format" flag, indicating that the message string contains Python percent-formatting placeholders. To avoid setting the flag erroneously because the string source is not Python code or otherwise is not expected to contain such placeholders, the extractor interface must be extended to allow extractor functions to indicate which flags are valid. Fixes python-babel#35

JonathanRRogers mentioned this issue Feb 11, 2014

Don't determine python-format flag implicitly #80

Open

sils added the (2) in progress label Sep 24, 2015

akx mentioned this issue Jan 22, 2016

New style python-brace-format translations #333

Open

akx added the area/messages label Feb 2, 2018

akx mentioned this issue Apr 18, 2018

Languages that have !=2 plural forms have wrong "python-format" checks #570

Open

KurzedMetal mentioned this issue Jul 12, 2019

Prepare project to use Zanata for translations remyroy/CDDA-Game-Launcher#362

Merged

emlys mentioned this issue Dec 8, 2021

sphinx-intl update adds python-format flag to messages containing % sphinx-doc/sphinx-intl#58

Open

ljodal linked a pull request Jan 31, 2022 that will close this issue

Allow extractors to provide message flags #838

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

python-format flag incorrectly forced on all extractors #35

python-format flag incorrectly forced on all extractors #35

wichert commented Jul 27, 2013

mitsuhiko commented Jul 27, 2013

lelit commented Jan 10, 2014

JonathanRRogers commented Feb 11, 2014

wichert commented Feb 11, 2014

JonathanRRogers commented Feb 11, 2014

wichert commented Feb 11, 2014

lelit commented Feb 11, 2014

wichert commented Feb 11, 2014

wichert commented Feb 11, 2014

lelit commented Feb 11, 2014

wichert commented Feb 11, 2014

lelit commented Feb 11, 2014

wichert commented Feb 11, 2014

wichert commented Feb 11, 2014

lelit commented Feb 11, 2014

Changaco commented Sep 1, 2016

lelit commented Aug 7, 2018

wichert commented Aug 7, 2018

lelit commented Aug 7, 2018

python-format flag incorrectly forced on all extractors #35

python-format flag incorrectly forced on all extractors #35

Comments

wichert commented Jul 27, 2013

mitsuhiko commented Jul 27, 2013

lelit commented Jan 10, 2014

JonathanRRogers commented Feb 11, 2014

wichert commented Feb 11, 2014

JonathanRRogers commented Feb 11, 2014

wichert commented Feb 11, 2014

lelit commented Feb 11, 2014

wichert commented Feb 11, 2014

wichert commented Feb 11, 2014

lelit commented Feb 11, 2014

wichert commented Feb 11, 2014

lelit commented Feb 11, 2014

wichert commented Feb 11, 2014

wichert commented Feb 11, 2014

lelit commented Feb 11, 2014

Changaco commented Sep 1, 2016

lelit commented Aug 7, 2018

wichert commented Aug 7, 2018

lelit commented Aug 7, 2018