3.2.2 Approaches Considered (proposed rewrite)

r12a edited this page Jul 25, 2018 · 2 revisions

[still in development!]

There are some general principles and issues that apply to all approaches.

For each of the approaches outlined below, it can probably be safely assumed that first-strong detection is a useful heuristic to apply most of the time. This is described in the Unicode Bidirectional Algorithm, and involves looking for the first strongly-directional character in the string and then assuming that it represents the appropriate base direction for the string. Most of the concerns circle around situations where the first strong character in the string doesn’t indicate the appropriate base direction. All but the first of the use cases in section 3.2.1 represent instances of this kind.
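As a rough illustration of the heuristic just described, the following Python sketch scans a string for its first strongly directional character using the standard `unicodedata` module. The function name `first_strong_direction` and its `default` parameter are assumptions made for this example, not part of any specification, and the sketch deliberately ignores markup and isolate controls, which are discussed later.

```python
import unicodedata

def first_strong_direction(text, default=None):
    """Return "ltr" or "rtl" based on the first strongly directional character.

    Returns `default` when the string contains no strong character.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left (AL = Arabic letter)
            return "rtl"
    return default

# Hebrew word written with escapes so the source code stays unambiguous.
HEBREW = "\u05D1\u05D9\u05E0\u05DC\u05D0\u05D5\u05DE\u05D9"

print(first_strong_direction(HEBREW + " bidi"))  # first strong character is Hebrew
print(first_strong_direction("bidi " + HEBREW))  # first strong character is Latin
```

The second call is the problematic kind of case: the heuristic reports LTR even if the string was authored in, and intended for, an RTL context.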

In all cases, there are two essential factors:

  • the producer must be able to correctly associate direction with those strings that need information about base direction (far from obvious)
  • consumers must have the correct expectations about how to decode information about base direction.

In each case, the producer of the string must determine whether some special action is needed to correctly identify the string’s base direction. Machines are unable to ascertain when such action is needed (if they were, you wouldn’t be reading this, because it would be possible to automatically ascertain the appropriate base direction at the point of consumption). At some point, human intervention is necessary.

If the producer is a human creating strings in an HTML form, it may not be possible to detect situations where a directional override is needed.

  1. it may be possible for the application to detect the need for a marker if the user overrides the default direction, eg. by changing the base direction of a form during input using shortcut keys or menu selections. It is possible, however, that users typing small amounts of data will often not bother to do that consistently.
  2. the user may be completely unaware that there is a problem. If a person types use case #2 into an HTML form and the form's base direction is set to RTL, they would not need to add RLM/LRM to make the string look correct at that time (because the base direction is already set by the HTML), but outside of that context the string would look incorrect unless it was presented wrapped in a RTL base direction. Similarly, strings scraped from a web page that has dir=rtl set in the html element would not normally have or need an RLM/LRM character at the start of the string in HTML.

If the producer is a human creating strings in an HTML form, they could theoretically add an RLM or LRM character when creating a string in order to signal its directionality. It is hard to imagine users doing so, however, for a number of reasons:

  1. the user may not know about RLM/LRM characters, and if they do and they feel motivated to input them, they may be unable to input them, especially on mobile devices which normally do not have those characters available on the keyboard.
  2. it may be possible for the application to detect the need for a marker if the user overrides the default direction, eg. by changing the base direction of a form during input using shortcut keys. It is likely, however, that users typing small amounts of data will often not bother to do that consistently.
  3. the user may be completely unaware that there is a problem. If a person types use case #2 into an HTML form and the form's base direction is set to RTL, they would not need to add RLM/LRM to make the string 'look correct' for themselves, but outside of that context the string would look incorrect unless the RLM had been added. Similarly, strings scraped from a web page that has dir=rtl set in the html element would not normally have or need an RLM/LRM character at the start of the string in HTML.

If the producer is a human creating (or modifying) data in a structured environment, it could be argued that metadata constructs can often provide for more consistent and easier to manage labelling of strings which need it.

Whichever approach is used, the consumer needs to be aware of how to decode the relevant information.

If the assumption holds that the default way to indicate the overall base direction of a string is to rely on the first strong character, it must equally hold that consumers of those strings expect to decode that information by applying first-strong heuristics. Where first-strong is insufficient, whatever tactic is used to counteract the default must either change the first strong character, or pass information about the appropriate base direction by some other means which is understood by the consumer. The latter would require agreement on an interchange format between producer and consumer in order to function.

Although first-strong heuristics are not universally applied, it may be easiest to rely on consumers using them for general string handling. Applications that don’t use first-strong heuristics and are not part of an alternative interchange format probably don’t handle bidi text well anyway. That said, first-strong is not the only heuristic used to detect the appropriate base direction for a string. Notable examples are Twitter and Facebook, which currently use different default heuristics for guessing the base direction of text: neither uses just simple first-strong detection, and one uses a completely different method.

Another general issue has to do with strings that begin with markup. See use case #3 above. Arrangements that depend upon first-strong detection need to skip the markup while looking for the first strong character; otherwise the string will almost invariably be detected as LTR. This applies, however, only if the markup is going to be parsed as markup by the consumer. If the string were, say, some example code text, then the appropriate base direction of the string would indeed be LTR. If it is appropriate to skip the markup, the consumer must understand the syntax of the markup, and must ignore attribute values or their equivalent.

It may be, however, that the markup itself indicates the appropriate base direction for the string. This would apply for use case #4 above, which starts with an element that contains dir=‘rtl’, but whose content begins with a strong LTR character. This requires that the consumer understand not only the syntax, but also the semantics of the markup.

Another issue related to markup is that the consumer needs to check whether the markup begins and ends the string, containing everything else; or whether the string simply starts with a bit of inline markup, which would not be valid for interpreting the base direction of the string as a whole.
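The three markup considerations above (skipping markup, honouring a dir attribute, and checking that an element really wraps the whole string) can be sketched as follows. This is a simplified illustration that assumes the markup is HTML and that the consumer will parse it as markup. The names `direction_of_markup_string`, `_EventCollector`, `_wrapper_dir` and `_first_strong` are hypothetical, and the sketch does not attempt to deal robustly with malformed markup or with every HTML parsing rule.

```python
from html.parser import HTMLParser
import unicodedata

class _EventCollector(HTMLParser):
    """Records tags and text separately so markup characters are never inspected."""

    def __init__(self):
        super().__init__()
        # Each event: ("start", tag, dir_value) | ("end", tag, None) | ("text", data, None)
        self.events = []

    def handle_starttag(self, tag, attrs):
        value = dict(attrs).get("dir") or ""
        self.events.append(("start", tag, value.lower()))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, None))

    def handle_data(self, data):
        self.events.append(("text", data, None))

def _first_strong(text):
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None

def _wrapper_dir(events):
    """Return the dir value of an element enclosing the whole string, else None."""
    if len(events) < 2 or events[0][0] != "start" or events[-1][0] != "end":
        return None
    if events[0][1] != events[-1][1]:
        return None
    depth = 0
    for kind, _, _ in events[:-1]:
        if kind == "start":
            depth += 1
        elif kind == "end":
            depth -= 1
        if depth == 0:
            return None  # the first element closed early: inline markup only
    return events[0][2] if events[0][2] in ("ltr", "rtl") else None

def direction_of_markup_string(source):
    parser = _EventCollector()
    parser.feed(source)
    parser.close()
    declared = _wrapper_dir(parser.events)
    if declared:
        return declared  # semantics of the markup take precedence
    text = "".join(data for kind, data, _ in parser.events if kind == "text")
    return _first_strong(text)  # tag names and attribute values are skipped

HEBREW = "\u05D1\u05D9"
print(direction_of_markup_string('<span dir="rtl">bidi ' + HEBREW + "</span>"))
print(direction_of_markup_string("<b>bidi</b> " + HEBREW))
```

Note how much knowledge of the markup vocabulary even this small sketch requires of the consumer.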

First-strong (only)

recommended? no

pros:

  • where it is reliable, information about direction can be obtained without any changes to the string

cons:

  • the base direction applied is unreliable, because the first strong character is not always indicative of the necessary base direction for the string
  • the direction of any HTML string bounded by an element with a dir attribute cannot be detected by first-strong heuristics, since dir isolates the content of the element
  • the same goes for strings that begin with RLI, etc and end with PDI
  • it’s not clear how to establish whether markup at the start of a string should be considered when checking for first-strong characters
  • consumers need to know the semantics of any markup vocabulary used if embedded markup contains the directional information

to note:

  • the consumer must know to check the string for first-strong heuristics
  • the consumer needs to skip characters at the start of the string that have no strong directional property, as well as internal isolated sequences
  • if no directional character is found in the string, there must be an agreement on the default direction
  • if a string is bounded by markup (eg. ) the directionality of the characters in the markup must be ignored when checking for the first-strong character if, and only if, the markup is going to be handled as markup by the consumer; if this is, say, just some example code, then the direction of the markup characters counts; it’s not clear how to tell the difference
  • if a string is bounded by markup with directional information (eg. ..) which indicates the base direction to be used, the directional properties of the characters in the string must be ignored
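The notes about skipping isolated sequences and agreeing on a default can be sketched as below. The function name `first_strong_outside_isolates` and the choice of "ltr" as the default are assumptions for illustration only. The skipping logic follows the general idea of ignoring everything between an isolate initiator (LRI, RLI, FSI) and its matching PDI.

```python
import unicodedata

ISOLATE_INITIATORS = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
PDI = "\u2069"                                        # POP DIRECTIONAL ISOLATE

def first_strong_outside_isolates(text, default="ltr"):
    """First-strong detection that ignores isolated sequences.

    `default` is the agreed fallback when no strong character is found.
    """
    depth = 0  # nesting depth of open isolates
    for ch in text:
        if ch in ISOLATE_INITIATORS:
            depth += 1
        elif ch == PDI:
            depth = max(depth - 1, 0)  # an unmatched PDI is ignored
        elif depth == 0:
            bidi_class = unicodedata.bidirectional(ch)
            if bidi_class == "L":
                return "ltr"
            if bidi_class in ("R", "AL"):
                return "rtl"
    return default

RLI = "\u2067"
ALEF_BET = "\u05D0\u05D1"

# A string wholly wrapped in RLI ... PDI exposes no strong character at all,
# so only the agreed default is left to decide the direction.
print(first_strong_outside_isolates(RLI + ALEF_BET + PDI))
print(first_strong_outside_isolates(RLI + "abc" + PDI + " " + ALEF_BET))
```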

First-strong + RLM/LRM

recommended? no

pros:

  • it provides a reliable way of indicating base direction, as long as the producer can reliably apply markers
  • if correctly applied, the consumer needs to do nothing beyond first-strong heuristics

cons:

  • this approach changes the identity and content of the string
  • it is not clear that the producer of a string would always apply RLM/LRM when appropriate; a machine is not able to identify cases where those characters would be needed, and humans may not know how to do it
  • consumers wanting to remove the RLM/LRM marker are not able to determine when the RLM/LRM is a marker added by the producer, and when it was part of the original string

to note:

  • the issues related to parsing and understanding markup, and knowing whether or not markup is functional or decorative, apply here
  • producers must ensure that they do not accumulate markers

In cases where a first-strong heuristic would fail (eg. a string that needs an overall direction of RTL, but whose initial strong characters are LTR such as use case #2 above), it is possible to imagine that inserting an invisible strongly directional character of the correct type at the beginning of the string would fix the problem. The most appropriate characters to use would be U+200F RIGHT-TO-LEFT MARK or U+200E LEFT-TO-RIGHT MARK, referred to here as RLM and LRM, respectively.

Thus the in-memory encoding of use case #2 would be (where we show the invisible RLM character as a code point value between angle brackets): “<200F>bidi בינלאומי”

The RLM/LRM here plays the role of a semantic marker. It is not functional in any way. It simply passes a message to the consumer to indicate the base direction that should be established for the string in its target context. Once that message has been acted upon by the consumer, it could be discarded.

Note that the consumer still needs to apply a first-strong heuristic to pick up the appropriate direction. The preference is that RLM/LRM markers are only attached to strings when needed to convey the required base direction.
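A minimal sketch of the effect, assuming a consumer that applies a plain first-strong scan (the helper name `first_strong_direction` is hypothetical): prefixing the string with RLM changes what the consumer detects, without the consumer changing its logic.

```python
import unicodedata

RLM = "\u200F"  # RIGHT-TO-LEFT MARK, bidi class R
LRM = "\u200E"  # LEFT-TO-RIGHT MARK, bidi class L

def first_strong_direction(text):
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None

HEBREW = "\u05D1\u05D9\u05E0\u05DC\u05D0\u05D5\u05DE\u05D9"
original = "bidi " + HEBREW  # starts with strong LTR characters
marked = RLM + original      # invisible strong RTL character comes first

print(first_strong_direction(original))
print(first_strong_direction(marked))
```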

It is hard to imagine users adding RLM/LRM characters at the start of a string for a number of reasons, not least because the user may not know about RLM/LRM characters, and if they do and they feel motivated to use them, they may be unable to input them, especially on mobile devices which normally do not have those characters available on the keyboard. The issues described earlier related to how a producer identifies strings needing a marker are very much relevant here.

On the other hand, consumers that already use first-strong heuristics to decode the base direction don’t need to change.

A key issue with this approach is that it changes the value of the string. Changing the value of a string changes its identity as well as its length, so may for example impact string searches, comparisons, pointer positions, etc.
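The following short sketch illustrates the kinds of side effects meant here. The variable names are chosen for the example only.

```python
RLM = "\u200F"
HEBREW = "\u05D1\u05D9\u05E0\u05DC\u05D0\u05D5\u05DE\u05D9"

original = "bidi " + HEBREW
marked = RLM + original

print(marked == original)                           # equality comparison now fails
print(len(original), len(marked))                   # length grows by one code point
print(original.find("bidi"), marked.find("bidi"))   # offsets into the string shift
print(marked.startswith("bidi"))                    # prefix matching is affected

# Stripping the marker restores the value, but only on the assumption that
# the RLM was added as a marker and was not part of the original string.
print(marked.lstrip(RLM) == original)
```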

When dealing with strings that begin with markup, such as use cases #3 or #4, similar considerations apply as for the more standard use of first-strong heuristics. The markup itself will usually contain strong LTR characters, and so it must be ignored.

If directional information is contained in markup that will be parsed as such by the consumer (for example, dir=rtl in HTML), the producer of the string needs to understand that markup in order to set or not set an RLM/LRM character as appropriate. If the producer always adds RLM/LRM to the start of such strings, the consumer is expected to know that. If the producer relies instead on the markup being understood, the consumer is expected to understand the markup.

The producer of a string should not automatically apply RLM or LRM to the start of the string, but should test whether it is needed. For example, if there's already an RLM in the text, there is no need to add another. If the context is correctly conveyed by first-strong heuristics, there is no need to add additional characters either. Note, however, that testing whether supplementary directional information of this kind is needed is only possible if the producer has access, and knows that it has access, to the original context of the string. Many document formats are generated from data stored away from the original context. For example, the catalog of books in the original example above is disconnected from the user inputting the bidirectional text.
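A producer-side check along these lines might look like the sketch below. It assumes the producer already knows the intended base direction from the original context, which is exactly the part that cannot be automated. The function name `add_marker_if_needed` and its parameters are hypothetical.

```python
import unicodedata

RLM = "\u200F"
LRM = "\u200E"

def first_strong_direction(text):
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None

def add_marker_if_needed(text, intended):
    """Prefix RLM/LRM only when first-strong would not yield `intended`.

    `intended` ("ltr" or "rtl") must come from the original context.
    """
    if intended not in ("ltr", "rtl"):
        raise ValueError("intended must be 'ltr' or 'rtl'")
    if first_strong_direction(text) == intended:
        return text  # already conveyed, including by an existing marker
    return (RLM if intended == "rtl" else LRM) + text

HEBREW = "\u05D1\u05D9\u05E0\u05DC\u05D0\u05D5\u05DE\u05D9"
once = add_marker_if_needed("bidi " + HEBREW, "rtl")
twice = add_marker_if_needed(once, "rtl")
print(once == twice)  # repeated application does not accumulate markers
```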

Paired formatting characters

recommended? no

pros:

  • none

cons:

  • isolating formatting characters must be used, but they are not yet well supported by consumers
  • consumers that use first-strong heuristics, rather than recognising this approach, would fail
  • Unicode limits for embedding levels may be exceeded

Metadata

recommended? yes

pros:

  • simple, effective & efficient
  • doesn’t affect the content of the string
  • no need to parse the string or know how to interpret it

cons:

  • out-of-band information needs to be associated with and kept with strings

to note:

  • best used only where necessary, and rely on first-strong heuristics otherwise
  • producers need to know when to attach metadata because first-strong doesn’t work
  • it must be possible to associate metadata with any string, but it may also be useful to additionally set a default for all strings
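As a sketch of what this could look like in practice, the example below carries direction in a field alongside each string. The field names `value`, `dir` and `default_dir`, and the function `resolve_direction`, are invented for illustration and are not taken from any particular format. The precedence shown (per-string metadata, then a default for all strings, then first-strong, then LTR) is likewise an assumption.

```python
import json
import unicodedata

def first_strong_direction(text):
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None

def resolve_direction(item, default_dir=None):
    """Pick a base direction for one string record without altering its value."""
    explicit = item.get("dir")
    if explicit in ("ltr", "rtl"):
        return explicit                      # per-string metadata wins
    if default_dir in ("ltr", "rtl"):
        return default_dir                   # default set for all strings
    return first_strong_direction(item["value"]) or "ltr"

HEBREW = "\u05D1\u05D9\u05E0\u05DC\u05D0\u05D5\u05DE\u05D9"
document = {
    "default_dir": None,
    "strings": [
        {"value": "bidi " + HEBREW, "dir": "rtl"},  # first-strong would fail here
        {"value": HEBREW + " bidi"},                # first-strong is sufficient
    ],
}

# Serialising and parsing leaves the string values untouched.
round_tripped = json.loads(json.dumps(document))
for item in round_tripped["strings"]:
    print(resolve_direction(item, round_tripped["default_dir"]))
```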

Script tags

recommended? yes (provisionally, pending deeper investigation), but only when the metadata approach above is not possible

pros:

  • no need to change the string
  • no need to inspect the string
  • no complications when dealing with markup in strings

cons:

  • only works where it is possible to associate separate language metadata with each string (eg. JSON-LD, RDF, etc)
  • some scripts in archaic use switch between LTR and RTL according to the preference of the author or the context of the content; the language tag is unable to handle non-default approaches for such strings, but this is expected to be an edge-case
  • new script tags may be coined, and these will need to be added to the lists used by consumers

to note:

  • may be more efficient to assume a default, in the absence of a script subtag, and use first-strong heuristics in non-problematic cases

The W3C Internationalization Working Group recommends that formats and applications should associate dedicated metadata relating to base text direction with strings wherever possible. In cases where that is not possible due to legacy constraints, but where language metadata CAN be associated with each string (eg. in JSON-LD, RDF, etc), it may be possible to use the language metadata as a fallback method of identifying the direction for a string.

Note, however, that this is ONLY appropriate when declaring information about the OVERALL base direction to be associated with a string. We do NOT recommend generalised use of language data to indicate text direction, especially within strings, since the usage patterns are not interchangeable.

Note, secondly, that language information must use BCP 47 subtags, and that the tag that carries the information should be the script subtag, not the language subtag. For example, Azeri may be written LTR (with the Latin or Cyrillic scripts) or RTL (with the Arabic script). Therefore, the subtag az is insufficient to clarify intended direction. A language tag such as az-Arab, however, can generally be relied upon to indicate that the overall base direction should be RTL. Furthermore, there are many strings which are not language-specific, such as MAC addresses, but which absolutely need to be associated with the correct base direction for correct consumption.

The expected way in which this information is used is as follows. It may be reasonable to assume a default of LTR for all strings unless marked with a script subtag that indicates RTL. Any string that needs to have an overall base direction of RTL should be labelled for language by the producer using a script subtag. If a script subtag exists, the consumer would check the script against a list of script subtags that indicate a RTL base direction, and if found would take appropriate action.
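The consumer-side check described above can be sketched as follows. The set `RTL_SCRIPTS` is a short, non-exhaustive illustration rather than an authoritative list, and the function names are hypothetical. The subtag parsing is simplified: it looks for a four-letter alphabetic subtag after the primary language subtag and stops at any singleton (extension or private-use) subtag.

```python
# Illustrative subset only; a real consumer would maintain a fuller list.
RTL_SCRIPTS = {"arab", "hebr", "syrc", "thaa", "nkoo", "adlm"}

def script_subtag(language_tag):
    """Return the script subtag of a BCP 47 tag (eg. "Arab"), or None."""
    subtags = language_tag.split("-")
    if len(subtags[0]) == 1:
        return None  # private-use or grandfathered tag, eg. "x-..."
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            break    # singleton: extension or private-use sequence follows
        if len(subtag) == 4 and subtag.isascii() and subtag.isalpha():
            return subtag.title()
    return None

def direction_from_language_tag(language_tag):
    """Assume LTR unless the script subtag indicates an RTL script."""
    script = script_subtag(language_tag)
    if script and script.lower() in RTL_SCRIPTS:
        return "rtl"
    return "ltr"

for tag in ("az", "az-Latn", "az-Arab", "he-Hebr-IL"):
    print(tag, direction_from_language_tag(tag))
```

With this arrangement a tag such as az on its own falls back to the LTR default, so the producer has to supply the script subtag for any string that needs RTL.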

The list of script subtags may be added to in future. In that case, any subtags that indicate a default RTL direction need to be added to the lists used by the consumers of the strings.

It is perhaps possible to limit the use of script subtag metadata to situations where first-strong heuristics are expected to fail - provided that such cases can be identified, and appropriate action taken by the producer (not always reliable). Consumers would then need to use first-strong heuristics in the absence of a script subtag in order to identify the appropriate base direction.
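A consumer following this variant might combine the two mechanisms roughly as in the sketch below. It assumes that a script subtag, when present, always takes precedence, and that first-strong is consulted only when the tag has no script subtag. The names `base_direction`, `_script_subtag` and `_first_strong` are hypothetical, and `RTL_SCRIPTS` is again only an illustrative subset.

```python
import unicodedata

RTL_SCRIPTS = {"arab", "hebr", "syrc", "thaa", "nkoo", "adlm"}  # illustrative subset

def _script_subtag(language_tag):
    subtags = language_tag.split("-")
    if len(subtags[0]) == 1:
        return None
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            break
        if len(subtag) == 4 and subtag.isascii() and subtag.isalpha():
            return subtag.lower()
    return None

def _first_strong(text):
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None

def base_direction(text, language_tag=None, default="ltr"):
    script = _script_subtag(language_tag) if language_tag else None
    if script is not None:
        return "rtl" if script in RTL_SCRIPTS else "ltr"  # producer's label wins
    return _first_strong(text) or default                 # non-problematic cases

HEBREW = "\u05D1\u05D9\u05E0\u05DC\u05D0\u05D5\u05DE\u05D9"
print(base_direction("bidi " + HEBREW, "he-Hebr"))  # labelled by the producer
print(base_direction("bidi " + HEBREW, "he"))       # unlabelled: first-strong applies
print(base_direction(HEBREW + " bidi"))             # no tag at all
```

The second call shows the residual risk: if the producer fails to label a problematic string, the consumer silently falls back to a first-strong result that may be wrong.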

This avoids the issues associated with first-strong detection when the first-strong character is not indicative of the necessary base direction for the string, and avoids issues relating to the interpretation of markup.

Note that a string that begins with markup that sets a language for the string text content (eg. ) is not problematic here, since that language declaration is not expected to play into the setting of the base direction.

There are some rare situations where the base direction cannot necessarily be identified from the script subtag, but these are really limited to archaic usage of text. For example, Japanese and Chinese text prior to World War 2 was often written RTL, rather than LTR. Languages such as those written using Egyptian Hieroglyphs, or the Tifinagh Berber script, could formerly be written either LTR or RTL; however, the default for scholastic research tends to be LTR.