Rewrite Internationalization section. #641
b2f3fdb looks much improved. Much depends on the outcome of our discussions, of course.
@aphillips wrote:
You might also find the more human readable version easier to read through:
Alright folks, after about 4 weeks of work and 5 complete rewrites, I think we're getting close to fixing all the i18n issues in the Verifiable Credentials Data Model specification! :P
A number of us from the VCWG, JSON-LD WG, I18N WG, IETF langtag community, and random friends on the Internet got together and worked through a variety of the concerns here and elsewhere. The result was a new PR against the string-meta spec:
A complete rewrite of this PR (again), human-readable version here:
Please review and see if you agree with the latest, noting that the thing that is really time critical is the Verifiable Credentials spec (this PR). We have time to fine tune the string-meta spec (and of course, fix JSON-LD, RDF, etc.).
/cc @aphillips @r12a @dlongley @burnburn @chaals @iherman @gkellogg and @pchampin.
I think some of the explanations and examples should be changed. Examples should generally show a meaningful case, and tagging plain Arabic text as rtl isn't necessary.
</p>

<pre class="example nohighlight" title="Arabic text with a base direction of right-to-left">
Pure Arabic text already has strong directionality and doesn't really need this. The sort of example that does is one where the text that starts the string runs in the opposite direction to the string itself:
מאַזל-טאָוו W3C, how do we handle this now?
is a left-to-right text in English (for people whose English includes a sprinkling of Yiddish, written as it often is in the Hebrew alphabet). But without being tagged ltr, it can end up with the Yiddish text and the question mark swapping places. Similarly, untagged right-to-left strings such as
בינלאומי!
bidi בינלאומי
are broken: the exclamation mark should be on the left, and the word "bidi" should be on the right, as they would be if the entire string were tagged as RTL.
My understanding (which has changed many times on this issue, so I wouldn't be surprised if it needs more updating) is that the primary reason for adding "direction" is to handle cases where the language tag is incorrect or missing but there is direction metadata that is correct. There is apparently a common pattern where the wrong language tag gets assigned to some data but the direction is correct and should be given preference. Because of this situation, "direction" is being added here to capture this otherwise correct metadata that would be dropped.
For example, the text is Arabic and is marked with an incorrect language tag of "en", but the "direction" is correctly "rtl".
I realize it would be odd to add an example with this erroneous information, but isn't that the point of the feature? We don't need it otherwise.
Chaals is correct to point out that the direction information is not essential for a single Arabic word, such as the one used in the current example.
If the word were "Revoked!", it would be necessary for the consumer to apply an RTL base direction so that the exclamation mark appears on the correct side of the Arabic word. However, if the consumer is using first-strong heuristics to detect direction (which we recommend), then it can determine the direction from that, since the first letter in the Arabic word is RTL.
If, however, the first letter in an Arabic translation is not RTL, then the base direction needs to be indicated, since otherwise the heuristics will make the wrong decision. So an example with some initial text in Latin script would be a stronger example.
In string-meta we use:
HTML و CSS: تصميم و إنشاء مواقع الويب
which means
HTML and CSS: Designing and Creating Websites
Btw, there's another issue which is common when dealing with examples of bidi text written in English: how do you show the Arabic text in the example? The way a user would see it, if correctly displayed, is with "CSS و HTML" to the right of the remaining Arabic text. To achieve that in the example, you'd need to use markup under the hood. However, it may then be less clear to the casual reader of the spec that the string begins with H. It's not a problem with an easy solution, especially in Arabic, where overriding the text to show the order of characters in memory produces strange joining behaviour (which is why we use Hebrew for that).
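The first-strong heuristic discussed in this thread can be illustrated with a minimal Python sketch. The function name `first_strong_direction` and the `default` parameter are hypothetical, chosen for illustration. The sketch is a simplification: it scans for the first character whose Unicode bidi class is L, R, or AL, and ignores the isolate-skipping rules of the full Unicode Bidirectional Algorithm.

```python
import unicodedata

# Bidi classes treated as "strong" by the first-strong heuristic.
STRONG_LTR = {"L"}
STRONG_RTL = {"R", "AL"}


def first_strong_direction(text, default="ltr"):
    """Guess a base direction from the first strongly directional character.

    Hypothetical helper for illustration only; returns `default` when the
    string contains no strong character (digits, spaces, punctuation only).
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in STRONG_LTR:
            return "ltr"
        if bidi_class in STRONG_RTL:
            return "rtl"
    return default


# Starts with a Hebrew letter, so the heuristic picks RTL, as intended.
print(first_strong_direction("בינלאומי!"))

# Arabic book title that starts with Latin "HTML": the heuristic picks LTR,
# which is the wrong answer, so explicit direction metadata is needed.
print(first_strong_direction("HTML و CSS: تصميم و إنشاء مواقع الويب"))
```

The second call shows why the "HTML و CSS" title is a stronger example than a single Arabic word: the guess is wrong precisely because the string opens with Latin-script text.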
I slightly disagree with @r12a here. The base direction is needed for other reasons besides just the bidi algorithm. For example, layout, text alignment, or bullet placement in a list might depend on the base direction. Sometimes we need base direction information even when it doesn't affect the text composition.
It's of course helpful to choose examples in which setting the direction has some observable effect, and this always means mixed-direction text. The best examples start (as ours does) with a strongly "anti-directional" character and end (as ours does not) with punctuation (? is ideal); even better if one can get a number and some enclosing punctuation (parentheses).
@aphillips, if you're referring to the first para in my comment about a single Arabic word, I agree that you'd need an overall base direction to be applied by the consumer if it wasn't inserted into some other text, but the first-strong (FS) heuristics would get you that base direction.
Btw, a caution about ?: Arabic text would use ؟ U+061F ARABIC QUESTION MARK, which has the bidi category AL (Right-to-left Arabic). In short, it will end up in the right place when dropped into an LTR context if it follows Arabic letters. ? could be used with Hebrew examples, however, since Hebrew uses the European question mark.
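The difference between the two question marks can be checked against the Unicode Character Database. The snippet below is a small sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for this example.

```python
import unicodedata


def describe(ch):
    """Return (code point, character name, bidi class) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))


# European question mark: a neutral character (class ON), so its position
# depends on the surrounding text and the base direction.
print(describe("?"))

# Arabic question mark: a strong right-to-left character (class AL), so it
# stays with the Arabic letters that precede it.
print(describe("\u061F"))

# A Hebrew letter, for comparison (class R).
print(describe("\u05D0"))
```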
following key-value pair:

<pre>
"nameHtml": "<span dir="rtl" lang="ar">HTML و CSS: تصميم و إنشاء مواقع الويب</span>"
This text is a more meaningful example, because if you don't have the base direction it comes out wrong.
Reverted to this example in 37daadb.
index.html
Outdated
<pre>
"nameHtml": "<span dir="rtl" lang="ar">HTML و CSS: تصميم و إنشاء مواقع الويب</span>"
The next example expresses the English text "revoked" in the Arabic
instead of
expresses the English text "revoked" in the Arabic
how about something like
uses a term in Arabic
Fixed in 905fec0.
index.html
Outdated
<code>nameHtml</code> property is of type <code>rdf:HTML</code>, then software
agents have sufficient information to deterministically identify the text
direction of the language.
Utilization of the design pattern above assumes that the JSON-LD
This paragraph is confusing especially if you're not automatically thinking about JSON-LD. I'll have a think about how to make it clearer...
Attempted to do this in 9956dfa.
I also agree that discouraging HTML content is generally the right thing to do... (which makes me think we should be asking the HTML and Security communities to provide some guidance around how to sanitise HTML input and why...)
index.html
Outdated
<pre class="example nohighlight" title="Expressing natural language text as English">
"statusReason": {
"value": "<span class="highlight">Revoked</span>",
"lang: "<code>en</code>"
"lang: "<code>en</code>"
->
"lang": "<code>en</code>"
Fixed in 77e1d9f.
index.html
Outdated
"id": "did:example:c276e12ec21ebfeb1f712ebc6f1",
"name": [{
"value": "Example University",
"language": "en"
These examples use the alias "language" instead of "lang". Which should it be?
Replaced all "language" with "lang" in ce371e3.
specification and significantly diminish its value as a standard.
There are a number of internationalization considerations implementers
are advised to be aware of when publishing data described in this specification.
As with any web standards or protocols implementation, ignoring
As with any web standards or protocols implementation, ignoring
-> As with implementation of any web standard or protocol, ignoring
Fixed in f0b4f50.
index.html
Outdated
<p>
Implementers are strongly advised to read the Strings on the Web: Language and
Direction Metadata document [[STRING-META]] published by the W3C
the Strings on the Web: Language and
Direction Metadata document
->
the <em>Strings on the Web: Language and
Direction Metadata</em> document
I just published an update of the Strings on the Web doc in /TR at https://www.w3.org/TR/string-meta/. You may want to point to that location, instead of the ED. I'll try to update regularly using echidna when we add new text to the ED.
index.html
Outdated
</p>

<p>
This section outlines general internationalization considerations to take into
account when utilizing this data model.
account when utilizing this data model and is intended to highlight specific
parts of the Strings on the Web: Language and Direction Metadata document
the Strings on the Web: Language and Direction Metadata document
->
the <em>Strings on the Web: Language and Direction Metadata</em> document
Fixed in f0b4f50.
index.html
Outdated
</p>

<p>
Implementers are strongly discouraged from encoding information as HTML
Implementers are
-> Despite that possibility, implementers are
Fixed in 966b348.
index.html
Outdated
internationalization.
Implementers considering the use of HTML to encode complex language and/or
base direction information might consider deconstructing the data into a
format that does not require complex markup, such as an array of elements
This advice worries me, since it could lead people to think that they should break a sentence up into separate strings everywhere the direction changes. That would cause problems for translation of the string (see Working with Composite Messages), not to mention the fact that an overall base direction (for the whole string) is still necessary, in order to pull together the individual parts in the correct order.
I'm not sure that much can be done for language, without markup.
But for base direction, the answer for plain text strings is to use the Unicode formatting characters (see How to use Unicode controls for bidi text).
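As a rough illustration of the Unicode formatting characters mentioned above, the following Python sketch wraps a plain-text string in directional isolate controls. The function name `isolate` and its `direction` parameter are hypothetical. The sketch only shows the mechanics; whether control characters should be stored in the data at all is a separate design question that it does not address.

```python
# Unicode directional isolate controls.
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text, direction=None):
    """Wrap `text` in an isolate so it carries its own base direction.

    direction: "ltr", "rtl", or None (let the first strong character decide).
    Hypothetical helper for illustration only.
    """
    if direction == "ltr":
        opener = LRI
    elif direction == "rtl":
        opener = RLI
    elif direction is None:
        opener = FSI
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return f"{opener}{text}{PDI}"


# The whole title is wrapped once with an RTL base direction; the sentence is
# not split into separate strings wherever the direction changes.
title = isolate("HTML و CSS: تصميم و إنشاء مواقع الويب", "rtl")
print([f"U+{ord(c):04X}" for c in (title[0], title[-1])])
```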
+1 to this.
Most "multilingual" text actually has a base language. It's true that it is possible to write a sentence that demonstrates how language identification is needed within a single string (particularly when it comes to font selection or language-specific styling). But this is a corner case and is the sort of thing that calls for markup in any event.
Done. Removed the bad advice in ea43b25.
Ok, all change requests implemented, merging.
I've implemented all changes requested by reviewers. This is also in line w/ what seems to be emerging between the JSON-LD WG, i18n community, IETF folks, and the VCWG. Merging.
Merged. Thanks to all for the help over the past month to get this PR to a point that was acceptable to all communities involved! :)
Attempt to address all remaining i18n considerations after talking w/ i18n folks, JSON-LD folks, and various people with strong opinions. This PR suggests a mechanism that works for JSON, JSON-LD, RDF, and HTML syntaxes/approaches while addressing all known i18n concerns raised to date.
Related to issue #436.