[* ACTION REQUIRED *] Choosing a Core Syntax #499
Comments
(chair hat off) My stack rank: 3a > 1a > 2a
I prefer 3a because:
(chair hat on) Please generally follow the above format in your own responses, although you are welcome to write as long or short as needed in the "I prefer"/"I hate" section or to comment on others' responses. |
Commentary in no special order:
So I think my ranking would be: 3a == 2a > 1a I'm torn on multi-line code blocks vs single-line code blocks. However, I see the advantage of single-line blocks being only beginner-delimited, so if we go single-line code, that seems reasonable. Also, regarding the sigil to use... Another option could be |
1a > 3a > 2a I really like the first-order simplicity of 1a, where all the text is outside curly braces, and all the special stuff is inside them. I honestly think that this is the best or least-worst option when considering translators who end up working directly with messages in MF2 syntax, rather than through any purpose-built tooling. The only place where 1a nests braces is within the I think 2a is two syntaxes in a trenchcoat. I think it'll lull many developers to getting familiar-ish with the first single-pattern syntax while ignoring the second, and make it that much more likely to not pluralize messages that really ought to be pluralized, i.e. use
A little bit of 2a's excessive complexity could be trimmed off by dropping the terminal I think 3a is better than 1a in differentiating statements from placeholders by using the
Otherwise, conceptually, 1a and 3a are pretty close to each other. When comparing the formats, the message I've mostly been staring at is the simple-selector message, because I figure that'll be the most common not-simple message, and therefore it matters the most. And for that, the minified single-line syntax provides the most differentiation:
My greatest stumbling block when reading the above is catching that for 2a the first I don't really mind the character length of 1a being greater than the others. What's most important to me is the amount of stuff I need to mentally remember and track, and that to me is minimal in 1a. One "ha ha, only serious" possibility for 3a would be to use the already escaped
My understanding is that we're now picking a general syntax direction so that we have a baseline from which to consider further questions, such as:
In addition to the negatives I've listed above, I dislike 2a because it forces us to decide about all these things at the same time, rather than allowing us to make stepwise progress. Both of the above concerns may be addressed later within a syntax based on either 1a or 3a. If we try to pick 2a, we'll need to resolve not just the general syntax direction but also at least the above issues before being able to make that decision. |
1a > 3a > 2a For me 1a makes a lot of sense, although I'd revisit and simplify the whitespace quoting:
I do prefer the concept of 2a over 3a, explicitly entering code mode. However, I dislike the code mode syntax as per Eemeli's reasoning above. I also have a hard time parsing that as a human, especially the I do like option 3a in the sense that using a sigil instead of braces would work for me and could result in a cleaner look. However, I'd suggest to not abbreviate away the |
I agree that these are the two currently most impactful axes of decisions. (With the first one being a proxy for the decision between the triple-layer and dual-layer models.) And I think they are, in fact, the direction that we're looking for. It seems, however, that you'd like to consider them afterwards, which leaves me puzzled. What is the direction that we're picking now in this case? I only thought about it today, but I wonder if it would be more helpful to vote on a matrix representing the above two questions. 1a and 3a cover one cell (trim whitespace, separate statements), 2a covers another (don't trim whitespace, group statements in a preamble), and we'd need to define what the other cells entail.
Arguably, the same point can be made about 2a. It can also evolve to incorporate ideas from 1a/3a, most notably, the dual-layer model. (A longer comment is coming soon.) |
I find it amusing that you're unhappy with 3a because it removes
@stasm noted:
I don't think it would. We need a syntax. The various whitespace and organizational hiccups are inherently entangled with the syntax. We can pretend to discuss them separately, but the choices we make depend, fundamentally, on details of the message grammar. I'll point out again what I've said elsewhere: the whitespace problem can be done away with by requiring the code-internal pattern to be quoted. So:
Some syntaxes can have code preambles (blocks) and others won't make sense with them. But each is a variation of a given syntax. Let's pick one so that we can go back to what works in this WG: concrete changes to the ABNF. |
I struggle to cast a definitive vote because there are many latent decisions in each of the proposals. Voting is polarizing, and I tend to think that the final solution should instead try to combine multiple good ideas from each of the proposals. Furthermore, I'm optimistic in that I think that the final syntax can be derived from any of the 3 currently discussed ones. My priorities are:
If we're voting on the general approach, or the mental model for what happens to variant patterns (see the illustration in #496 (comment)), then 1a/3a > 2a.
If we're voting on specific look & feel of the syntax, then 2a > 3a > 1a.
If I could extend the comparison table with subjective reasons, I'd list the following pros and cons: In 1a:
In 2a:
In 3a:
I think the final solution should combine most of the pluses from the above list. In essence, this boils down to combining the code-mode block from 2a (in form of a preamble) with the dual-layer model for variants from 3a. For instance:
|
@stasm I'm unclear in your final example, where does the match block end? Can anything come after "things.", and how do we know which is which? It looks like you're out of code mode there, but still within the conceptual match block, which is confusing to me. |
Either 1a or 3a is good. I think there will be an issue with forced multiline translation strings, as @eemeli pointed out. |
I think we're picking a general shape for the syntax, with a specific form to start, from which we may iterate further. I would classify the choices as:
That is a better syntax in many ways, but it does require escaping errant |
The problem with |
I agree that I think double-sigils is a better guard against the need to do elaborate escapes. I don't personally agree that including Perhaps:
1a's version:
2a's version:
|
1a > 3a > 2a I like how statements are fully enclosed in 1a, like 3a requires more context when reading the message. For example, I don't like that Not a fan of 2a because there are even more contexts to keep track of. I can be in text mode, code mode, or text-embedded-in-code mode. In addition, it has the same problem as 3a where I need to keep track of where statements start and end. |
This is a good observation. It is less pronounced in the current Once we break up statements and variant keys visually, like in To fix the different arity of
|
(following @eemeli suggestion to comment directly in the issue) For me 3a > 1a > 2a 3a and 1a are more or less on the same level. I find the My main concern with 2a is the amount of syntax keywords that could be interpreted as localizable text, either by human or machine. One note about alternative sigils: some are surprisingly painful to use on international keyboard layouts (e.g. |
Still a couple of days to share your opinion before the teleconference. So far I have a matrix that looks like the table below. This doesn't include @stasm's input, because his comment has two different stack rankings. Recall that we are not formally voting: the stack rank merely informs our discussion. However, if you don't put in your ranking and/or comments and you aren't able to join the teleconference, we'll have to proceed without your input!
(NV == no vote) |
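For anyone tallying the responses, the stack-rank notation used in this thread (for example 3a > 1a >> 2a or 3a == 2a > 1a) can be read mechanically. The following is only a rough Python sketch: the function name `parse_stack_rank` is hypothetical, and it assumes that `>` and `>>` both separate tiers (so the strength of a `>>` preference is discarded) and that `=` and `==` both mark a tie.

```python
import re


def parse_stack_rank(rank: str) -> list[list[str]]:
    """Split a stack-rank string into ordered tiers of tied options.

    Assumption: '>' and '>>' are treated the same, so the extra emphasis
    of '>>' is lost. '=' and '==' both mean the options are tied.
    """
    tiers = re.split(r">+", rank)
    return [[option.strip() for option in re.split(r"=+", tier)] for tier in tiers]


if __name__ == "__main__":
    # Rankings copied from comments in this thread.
    for rank in ["3a > 1a > 2a", "3a == 2a > 1a", "1a >> 3a > 2a", "1a = 2a = 3a"]:
        print(rank, "->", parse_stack_rank(rank))
```

This only normalizes the notation. It does not weigh or count anything, which matches the note above that the stack rank informs the discussion rather than deciding it.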
Please use 3a > 2a = 1a as my stackrank. The direction of both 1a and 3a is the right one when it comes to allowing unquoted patterns in variants. At the same time, I don't think they are good enough in their current form.
I also have a strong preference towards evolving the current |
@stasm Thanks! I've updated the summary table.
I think that it would be hard to do changes piecemeal. But it's also important to recognize that the proposals (or their variants) pretty much recycle the lower-level parts of our current ABNF, e.g. I agree that our syntax is slightly sigil happy. I think this is partly an outgrowth of allowing operand-free expressions:
@eemeli has suggested that we could drop the expression = "{" [s] operand [s annotation] [s] "}"
operand = variable / literal / blank
blank = "$_"
annotation = (function *(s option)) / reserved / private-use |
1a >> 3a > 2a I agree with @eemeli's reasoning in #499 (comment) , which strengthened my preexisting preferences. Enclosing all special semantics like 3a is dragged down by too many special syntax forms ( |
Actually, that doesn't look bad to me (modulo #483 etc.)—intuitively, unnamed input to a function can be either a variable, literal text (quoted or unquoted), or absent. |
Stack rank: 2a > why are we reopening multiple longstanding decisions to solve one problem, seeing as we can avoid it?
Notice that 1a wraps every
I disagree, and this is overstated. 2a is basically our current syntax, but non-simple messages are wrapped in If 2a is so terribly bad, then it means that our current syntax is bad, and we should blame the authors of our current syntax. That means us. That also invalidates our decisions and reasoning for the last 1.5 years. I think our current syntax is fine, and not terribly bad.
No, actually, it's the opposite. 2a is only about adding
@vdelau Agreed. This is my personal ideal preference, too. However, I have only been discussing 2a as it starts from where we as a group have arrived so far, solves the problem at hand, and doesn't reopen any further decisions. However, if reopening any number of decisions is fair game, then yes, the EM proposal (ca. Jan 2022) provides elegant syntax that I quite like (and Annex 3 offers slight twists for people who like character shaving). It predates the concepts of
Also, However, the implication of reopening more than 1 decision is what worries me the most about the discussions here and over the last month. It took us months to go from EM/EZ/SM proposals of Jan 2022 to a somewhat stable syntax in July 2022 in time for an ICU4J preview implementation. We started with 3/3 proposals all using sigils and 2/3 starting all messages in code mode and 2/3 delimiting patterns in non-simple messages, and arrived at our current syntax, and were okay for a year. Why the surge in interest to reconsider everything, and all at once? And why just to solve simple messages in text mode? I know our process requires unanimous consent in order to overturn previously made decisions, and attempting to do it all at once is a tall order. Only a few people are acknowledging that, and I have yet to hear a satisfying explanation of why this is all happening so fast and furious. Regardless of the outcome, this does call into question our group's ability to understand and stick to its own decisions months or years down the line without the urge to reconsider it all. Why can't it happen again -- that we want to redesign everything based on a simple requirement change request -- if it has already happened before? What were our reasons for our previous decisions? Guiding principles inspiring those reasons? What has changed about our thinking now? Are we able to precisely describe principles guiding our thoughts so that we are clear in the future, or do we just rinse & repeat some number of months down the road?
Thanks @aphillips. I personally think that requiring code-internal patterns to be quoted would solve a lot, though not all, of problems. I don't discern consensus there, either.
This concern for non-simple messages is the consequence of wanting to have simple messages start in text mode. Options 1a and 3a may reduce the concern somewhat because they introduce sigils, but the consequence of that is that users need to worry about escaping more sigils. If the decision is between declaring that any text around non-simple messages is invalid (and having implementations reject their naive attempts to do so) vs. giving more sigils for users to worry about escaping, I think more sigil escaping is a much worse problem. It is an error-prone user experience, and it forces users to think about the relationship of sigils to any host syntax they embed messages within. @stasm I appreciate you defining your values, because as a group, we need to do that, both for evaluating options and for long-term logical consistency. I wonder if stack ranking values would be more useful than scrutinizing each side of every syntax tradeoff?
@sffc Yep, Lisps are known for minimal, unambiguous syntax, and some dialects reduce the noise well. In cherry-picked small examples, Lisps might seem verbose, ex: (or (= shifted-epact 0)
(and (= shifted-epact 1)
(< 10 (mod g-year 19)))) And you end up cleaning up syntactic clutter by combining things without loss of clarity, ex: a series of (let [year (gregorian-year-from-fixed date)
prior-days (- date (gregorian-new-year year))
correction ...]
...) Among things previously discussed, the EM proposal syntax (above) comes closest to this set of design principles, followed by our current syntax, followed by 2a, and then 1a and 3a are furthest. In this regard, I'm not excited by 2a, but compared to 1a/3a, 2a makes me facepalm fewer times.
@stasm I like your sentiment behind this, and how about instead: making the minimal amount of change possible and avoiding dragging in other topics, if we can avoid it? Because I think we can avoid it. And also solidifying our decisions with clear guiding principles? We've almost gone full circle on some topics in the last 1.5 years, it feels Ouija board-esque. I'm worried about unwittingly ending up designing a Homermobile. Beyond designing a Homermobile, the thing that keeps me up at night is the thought of having to support it for the many developers potentially making the same design-induced mistakes across a very large company, and the many more orders of magnitude of end users that would deal with poorer experiences as a result. |
My stack rank: 2a > 1a > 3a I don't feel strongly about the I can live with 3 or 1 if they didn't have the "magic space trimming". I know we talk syntax now, but things are related. By making small decisions on bits and pieces we will end up with something that does not work well together. Not wrapping the message part in selectors also forces us to do more "gymnastics" to try to detect the end of the message. So I can probably live with 1, if it is mandatory to "quote" the pattern (in the complex case only). I find 3 very hard to read, especially once it gets on one line. |
Something I commented on PR #496, but too late, so it probably went under the radar. I've been trying to think more like an HTML developer, and I also checked again the DOM localization proposal and the Google Soy format (which is kind of a templating language). And I think that the "automatic trimming of spaces" will also hurt people used to HTML. Let's say I do this: <style>
.foo { white-space: pre; }
#bar { white-space: pre-wrap; }
</style>
...
<p>
Hello world one!
</p>
<p space="preserve"> Hello world two! </p>
<p class="foo"> Hello world three! </p>
<p id="bar"> Hello world four! </p> This will render with a space in front of the first message, and preserves all spaces for messages 2, 3 and 4. Now I am asked to internationalize this and prepare for translation. Using DOM localization. So I do: <style>
.foo { white-space: pre; }
#bar { white-space: pre-wrap; }
</style>
...
<p l10n="msg1">
Hello world one!
</p>
<p l10n="msg2" space="preserve"> Hello world two!</p>
<p l10n="msg3" class="foo"> Hello world three!</p>
<p l10n="msg4" id="bar"> Hello world four!</p> and the "message catalog" (might even be extracted automatically, {
"msg1": "Hello world one!",
"msg2": " Hello world two!",
"msg3": " Hello world three!",
"msg4": " Hello world four!"
} One would expect everything to render 100% the same. But IF the messages automatically go through MF2, the spaces in msg2, 3, and 4 are trimmed (by MF2). So it is one of those where "ah, this looks familiar", but then I am hurt by it because it really isn't the same. Yes, the answer is "if you want your spaces wrap the message in But why should I be hurt by that and forced to fix it? That is the reason why I am arguing for WYSIWYG, both in simple mode and in complex mode. **Note:** I chose JSON to store the strings instead of the properties-like format in the proposal so as not to introduce another layer of unknown behavior with the message catalog (I don't know if the proposed TLDR: trimming will actually hurt people familiar with the HTML behavior. |
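To make the concern above concrete, here is a minimal Python sketch of the same catalog, with the fourth key written as msg4 so that the keys are unique. Both `format_trimmed` and `format_verbatim` are hypothetical stand-ins, not part of any MF2 implementation: the first assumes a syntax that trims pattern-exterior whitespace, the second assumes WYSIWYG handling.

```python
# Hypothetical sketch: neither function is part of any MF2 implementation.
catalog = {
    "msg1": "Hello world one!",
    "msg2": " Hello world two!",
    "msg3": " Hello world three!",
    "msg4": " Hello world four!",
}


def format_trimmed(pattern: str) -> str:
    """Assumed behavior of a syntax that trims pattern-exterior whitespace."""
    return pattern.strip()


def format_verbatim(pattern: str) -> str:
    """Assumed WYSIWYG behavior: the pattern is returned unchanged."""
    return pattern


def changed_by_trimming(messages: dict[str, str]) -> list[str]:
    """Return the keys whose output differs between the two behaviors."""
    return [
        key
        for key, pattern in messages.items()
        if format_trimmed(pattern) != format_verbatim(pattern)
    ]


if __name__ == "__main__":
    print(changed_by_trimming(catalog))
```

Under these assumptions only msg1 survives unchanged. The other three lose the leading space that the `white-space` styling or `space="preserve"` was meant to keep.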
1a = 2a = 3a
English keywords over sigils (1a, 2a)
These statements allow recognition over recall. I think the sigil characters rely more on the user memorizing the syntax, which might make it less accessible to new users.
Variety of enclosing characters (3a)
3a's varied syntax makes it easy to distinguish
All-encompassing code mode rather than code statement (2a)
3a repeats I prefer 2a's all-encompassing code mode if I make two assumptions:
If this user is working with patterns that start and end with Linked with @stasm's point on mental models, the use-case above would see 2a treated like a 2-layer model in the majority of cases - i.e. start in code mode using Other thoughts:
|
If we introduce a "preamble" for all statements to live in, then I think the Here's another way of spelling my final example (all sigils TBD):
I agree that
I agree that leaving an unbalanced
I agree with you that trimming the spaces in case of simple messages is a tripping hazard. @eemeli observed that we could delegate the exact handling to the host format. I.e. Java properties would trim, while JSON wouldn't. If a translator puts a space in front of the translation in OTOH, I think trimming in variant patterns is similarly aligned with the original intent, because the syntax itself suggests that we put a space after the variant key. I realize that I'm advocating for an inconsistent behavior between simple patterns and variant patterns, but in my talking to people outside the WG this seemed to be the least surprising behavior. In fact, here's what I heard (rephrased):
|
Consideration of those topics is on the table irrespective of the general syntax choice we're currently making, just as they've always been. Were we to choose 2a, we'd still need to consider each of the above as we're starting in text mode, not delimiting simple patterns, and introducing new sigils for entering and exiting code mode.
The short answer here is "paradigm shift". To allow for simple messages without delimiters, we need to start in text rather than code. Previously, our wrapping syntax was built on the expectation of starting in code, and now we're not doing that. We've changed a key premise, and now we need to build up the structures around patterns again. Thankfully, we do not currently need to look at what's happening within patterns, the data model, or the message formatting, as each of those is kept constant: It's only the syntax wrapping the patterns that we're reconsidering. So a vast majority of the work we've done and the choices we've made so far continue to be fully valid and supported.
Actually, our process does not require unanimous consent. If opposition to a choice is sustained, our chair may call for a ballot to resolve the deadlock. |
I have updated the ranking table for comments up to here. |
3a > 1a >> 2a I prefer 3a over 1a due to what feels like unnecessary verbosity in the syntax but don't feel too strongly about this. 2a I feel is just overly complex from a DX perspective overall. |
0 > 2a >> 1a > 3a I know that 0 is off the table. I don't understand why, since we arrived at that last year after significant discussion. It wasn't my original favorite, but I came around to it after listening to the arguments. I strongly prefer enclosing user-visible text with visible syntax, ideally always and consistently, from my experience with ICU MessageFormat. I have extended that format, or worked with contributors to extend it, several times. I have reimplemented its parser and formatter. And then I got to work for years with developers, localization product managers, and translators at and for Google to document it, explain it, and trouble-shoot messages written in it. The simpler and the more consistent the better. A sense of "messages have always been mostly text with a sprinkle of placeholders" needs to take a step behind making it work reliably. One of the problems has of course been inconsistent use and trimming of white space. Always enclosing user-visible text eliminates that completely and elegantly. Also, when you consider white space, don't limit yourselves to ASCII. Ideographic space and no-break spaces can sneak in but may or may not be just as intentional as ASCII space and line feed. |
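The point about non-ASCII whitespace is easy to demonstrate. The sketch below is illustrative only and says nothing about what MF2 would or should trim: `trim_ascii` and `trim_unicode` are hypothetical helpers built on Python's `str.strip`, which with no argument removes every character Python classifies as whitespace, including U+00A0 NO-BREAK SPACE and U+3000 IDEOGRAPHIC SPACE.

```python
# Illustrative only: which characters a trimming rule covers is a design
# choice, and these helpers are hypothetical, not MF2 behavior.
ASCII_WHITESPACE = " \t\r\n"


def trim_ascii(text: str) -> str:
    """Trim only ASCII space, tab, carriage return and line feed."""
    return text.strip(ASCII_WHITESPACE)


def trim_unicode(text: str) -> str:
    """Trim everything Python's str.strip() treats as whitespace."""
    return text.strip()


samples = {
    "ascii space": " text ",
    "no-break space": "\u00a0text\u00a0",
    "ideographic space": "\u3000text\u3000",
}

if __name__ == "__main__":
    for name, value in samples.items():
        print(name, repr(trim_ascii(value)), repr(trim_unicode(value)))
```

With an ASCII-only rule the no-break and ideographic spaces survive; with the broader rule they are removed, whether or not the translator put them there on purpose.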
This doesn't answer much, but it raises questions. It also doesn't address higher level technical issues of whether pulling in multiple other issues is truly necessary, or the higher level group question of why are we doing this now, and when will overturning our decisions happen yet again? It is... unsettling. And disappointing. To put it mildly.
As far as the "paradigm shift", don't consider me included in deciding that because I feel excluded from this recent push. And if it will cause the amount of complexity in other parts of the syntax & user experience, like it seems it would for all the reasons above, then I definitely don't agree that the benefits outweigh the costs. Look, I get that optional delimiters & the other topics all being brought into discussion lead to options that start to look like the EZ proposal that @eemeli coauthored. If the way we use our process is to repeat, lather & rinse until we end up with that, something seems broken. It doesn't make sense, and the implications for usage concerns and mistakes have me worried. |
Closing this issue per the discussion in the 2023-10-23 teleconference. The consensus was to adopt "2a with additional ugliness" to be followed immediately by a discussion of PEWS. That discussion will produce changes which might include removing the ugliness foisted on 2a or adopting one of the other syntaxes in an iterative way. |
I agree with @markusicu that it makes things cleaner to just start in code mode. I also agree with @eemeli that it's a paradigm shift to start in text mode instead of code mode. It sounds like the committee is moving in that direction, though, so I'd advise some patience and humility when arriving at the new syntax. |
In other words: my overall feeling is that we should start in code mode (option 0); it looks a little strange at first but it's an easy mental model to grok and everything flows elegantly from there. However, if the committee wishes to start in text mode, in my opinion, I think it's worth fully embracing it and designing an equally elegant syntax (something in the direction of 1a) rather than "well it's text mode except when it's not" (option 2a). That is, "0 > 1a > 2a". I don't see a path toward a good outcome of starting in text mode without setting the committee back several more quarters. |
One more observation: No one as far as I can tell really loves 2a, and many people strongly dislike it. In the first vote, most people who preferred 2a actually preferred 0. Option 2a is just an ugly middle ground. Speaking personally, I don't want to see this committee land on such a solution. I would rather have an elegant solution with tradeoffs than a grotesque solution that nobody loves. (sorry for the multiple replies) |
@sffc (and @markusicu). Thank you for your comments. Please note that this thread is closed. Comments about syntax should be directed to #474 or to the pending update of the pattern-exterior-whitespace design doc. Option 0 is cleaner if we only consider the world of message format messages, not the world of localizable messages that formattable messages live within or the feedback from the larger community. The group has fairly solid consensus to start in text mode. 2a-with-additional-ugliness was "chosen" to enable this group to make progress on the "elegant solution with tradeoffs". The primary problem is that there is a schism in the group.
Note that it is always possible to quote the pattern or the whitespace. Note that non-variant patterns (i.e. simple messages such as Different syntax options can be applied to any of these in a quest for elegance. Generally speaking, options like 1a, 3a and the like are having to deal with the need to describe how best to distinguish code from text in cases where the pattern is unquoted. This group's next step is to directly tackle the problem of pattern exterior whitespace for patterns. If we can achieve a consensus that allows unquoted patterns, we are likely to adopt a syntax designed for that (i.e. based on 1a or 3a or @stasm's predicate block proposal). If we achieve a consensus that says all variant patterns must be quoted, we'll likely move to beautify 2a (which is designed for that).
I disagree. If we can resolve the code/pattern boundary issue (either by quoting patterns, or choosing a trimming strategy for unquoted patterns [which includes an option of never trimming]) then we are well positioned to deliver all of the remaining details. If this group cannot compromise on this issue, I will be forced to use the official voting mechanism in our process. I am most strenuously seeking to avoid that, as I think such a step would produce undesirable outcomes. |
Per our discussion in the 2023-10-16 teleconference, we have narrowed the candidates for the syntax to three. These are described in this document
Following @aphillips's comment below as a general template, please stack rank your choice for the syntax to use in modifying the ABNF. Please respond before the group teleconference on Monday, 23 October 2023. Responses after that time will be ignored.
Any syntax that we choose is still subject to specific modifications using the normal group consensus process.
Of particular note, option 2a might be changed to have a non-enclosing "starter" sigil or starter character sequence instead of the enclosing sequence shown in the document. Similarly, option 3a uses a sigil % / %[ which is subject to change.
Important
"Voting" or stack ranking will be used to inform a rough consensus discussion. This is not a "winner take all" type of exercise.
We are most interested in making a good technical choice. Spend the time to elucidate your reasoning.