Rewrite Internationalization section. #641

Merged
merged 19 commits into from
Jun 30, 2019

msporny
Member

@msporny msporny commented May 25, 2019

Attempt to address all remaining i18n considerations after talking w/ i18n folks, JSON-LD folks, and various people with strong opinions. This PR suggests a mechanism that works for JSON, JSON-LD, RDF, and HTML syntaxes/approaches while addressing all known i18n concerns raised to date.

Related to issue #436.


Contributor

@aphillips aphillips left a comment

@msporny Here are my comments on the PR you mentioned earlier.

@r12a You should have a look also.

Contributor

@aphillips aphillips left a comment

b2f3fdb looks much improved. Much depends on the outcome of our discussions, of course.

Member Author

msporny commented May 26, 2019

@aphillips wrote:

b2f3fdb looks much improved. Much depends on the outcome of our discussions, of course.

You might also find the more human readable version easier to read through:

https://pr-preview.s3.amazonaws.com/w3c/vc-data-model/pull/641.html#internationalization-considerations

Member Author

msporny commented Jun 9, 2019

Alright folks, after about 4 weeks of work and 5 complete rewrites, I think we're getting close to fixing all the i18n issues in the Verifiable Credentials Data Model specification! :P

A number of us from the VCWG, JSON-LD WG, I18N WG, IETF langtag community, and random friends on the Internet got together and worked through a variety of the concerns here and elsewhere. The result was a new PR against the string-meta spec:

w3c/string-meta#35

A complete rewrite of this PR (again), human-readable version here:

https://pr-preview.s3.amazonaws.com/w3c/vc-data-model/pull/641.html#internationalization-considerations

Please review and see if you agree with the latest, noting that the thing that is really time critical is the Verifiable Credentials spec (this PR). We have time to fine tune the string-meta spec (and of course, fix JSON-LD, RDF, etc.).

/cc @aphillips @r12a @dlongley @burnburn @chaals @iherman @gkellogg and @pchampin.

@msporny msporny requested review from r12a, brentzundel and TallTed June 9, 2019 19:31
Contributor

@chaals chaals left a comment

I think some of the explanations and examples should be changed. Examples should generally show a meaningful case, and tagging plain Arabic text as rtl isn't necessary.

</p>

<pre class="example nohighlight" title="Arabic text with a base direction of right-to-left">
Contributor

Pure Arabic text already has strong directionality and doesn't really need this. The sort of example that does need it is one where the text that starts the string runs in the opposite direction to the string itself:

מאַזל-טאָוו W3C, how do we handle this now?

is a left-to-right text in English (for people whose English includes a sprinkling of Yiddish, written as it often is in the Hebrew alphabet). But without being tagged ltr, it can end up with the Yiddish text and the question mark swapping places.

בינלאומי!
bidi בינלאומי

are broken: the exclamation mark should be on the left, and the word "bidi" should be on the right, as they would be if the entire string were tagged as RTL.

Contributor

@dlongley dlongley Jun 10, 2019

My understanding (which has changed many times on this issue, so I wouldn't be surprised if it needs more updating) is that the primary reason for adding "direction" is to handle cases where the language tag is incorrect or missing but the direction metadata is correct. There is apparently a common pattern where the wrong language tag gets assigned to some data but the direction is correct and should be given preference. Because of this situation, "direction" is being added here to capture metadata that is correct but would otherwise be dropped.

For example, the text is Arabic, it is marked with an incorrect tag of "en" but the "direction" is properly "rtl".

I realize it would be odd to add an example with this erroneous information, but isn't that the point of the feature? We don't need it otherwise.
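
The precedence described in this comment can be sketched as follows. This is a minimal illustration only, not the mechanism the spec defines: the key names `value`, `lang`, and `direction` are taken from this thread, while the `RTL_LANGS` set and the `resolve_direction` helper are hypothetical names introduced here.

```python
# Hypothetical, deliberately incomplete set of primary language subtags
# that are usually written right-to-left (an assumption for illustration).
RTL_LANGS = {"ar", "fa", "he", "ur", "yi"}


def resolve_direction(string_obj):
    """Pick a base direction, preferring explicit direction metadata.

    Returns "ltr", "rtl", or None when nothing usable is present.
    """
    explicit = string_obj.get("direction")
    if explicit in ("ltr", "rtl"):
        # Explicit direction wins, even if the language tag disagrees.
        return explicit
    lang = string_obj.get("lang")
    if lang:
        # Fall back to a guess from the primary language subtag.
        primary = lang.split("-")[0].lower()
        return "rtl" if primary in RTL_LANGS else "ltr"
    return None


# Arabic text carrying a wrong language tag but a correct direction.
mislabeled = {
    "value": "تصميم و إنشاء مواقع الويب",
    "lang": "en",
    "direction": "rtl",
}
print(resolve_direction(mislabeled))
```

With the direction present, the wrong `en` tag no longer determines how the string is laid out; without it, the fallback would guess `ltr`.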

Copy link


Chaals is correct to point out that the direction information is not essential for a single Arabic word, such as the one used in the current example.

If the word was "Revoked!" it would be necessary for the consumer to apply RTL base direction, so that the exclamation mark appears on the correct side of the Arabic word. However, if the consumer is using first-strong heuristics to detect direction (which we recommend), it can determine the direction that way, since the first letter in the Arabic word is RTL.

If, however, the first letter in an Arabic translation is not RTL, then the base direction needs to be indicated, since otherwise the heuristics will make the wrong decision. So an example with some initial text in Latin script would be a stronger example.

In string-meta we use:
HTML و CSS: تصميم و إنشاء مواقع الويب
which means
HTML and CSS: Designing and Creating Websites
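
A rough sketch of the first-strong heuristic mentioned above, using Python's standard `unicodedata` module. The function name `first_strong_direction` is made up for this illustration, and the sketch is simplified: it does not skip characters inside directional isolates, which the full Unicode Bidirectional Algorithm does.

```python
import unicodedata


def first_strong_direction(text):
    """Return "ltr" or "rtl" based on the first strongly directional
    character, or None if the string has no strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


# Arabic sentence that begins with Latin letters: the heuristic guesses
# "ltr", which is why explicit base direction metadata is needed here.
print(first_strong_direction("HTML و CSS: تصميم و إنشاء مواقع الويب"))

# English sentence that begins with Hebrew-script text: guessed as "rtl".
print(first_strong_direction("מאַזל-טאָוו W3C, how do we handle this now?"))
```

Both strings are cases where the guess disagrees with the intended base direction, so they make stronger examples than a single Arabic word.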

Btw, there's another issue which is common when dealing with examples of bidi text written in English: how do you show the Arabic text in the example? The way a user would see it, if correctly displayed, is with "CSS و HTML" to the right of the remaining Arabic text. To achieve that in the example, you'd need to use markup under the hood. However, it may then be less clear to the casual reader of the spec that the string begins with H. It's not a problem with an easy solution, especially in Arabic, where overriding the text to show the order of characters in memory produces strange joining behaviour (which is why we use Hebrew for that).

Contributor

I slightly disagree with @r12a here. The base direction is needed for other reasons besides just the bidi algorithm. For example, layout, text alignment, or bullet placement in a list might depend on the base direction. Sometimes we need base direction information even when it doesn't affect the text composition.

It's of course helpful to choose examples in which setting the direction has some observable effect, and this always means mixed-direction text. The best examples start (as ours does) with a strongly "anti-directional" character and end (as ours does not) with punctuation (? is ideal); even better if one can get a number and some enclosing punctuation (parentheses).

Copy link


@aphillips, if you're referring to the first para in my comment about a single Arabic word, I agree that you'd need an overall base direction to be applied by the consumer if it wasn't inserted into some other text, but the FS heuristics would get you that base direction.

Btw, a caution about ?, since Arabic text would use ؟ U+061F ARABIC QUESTION MARK, which has the bidi category of AL - Right-to-left Arabic. In short, it will end up in the right place when dropped into a LTR context if it follows arabic letters. It could be used with Hebrew examples, however, since Hebrew uses the european question mark.
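
The bidi classes involved can be checked with Python's standard `unicodedata` module. This is only a quick lookup for illustration; the helper name `bidi_class` is introduced here.

```python
import unicodedata


def bidi_class(ch):
    """Return the Unicode Bidi_Class abbreviation for one character."""
    return unicodedata.bidirectional(ch)


# European question mark: a neutral, so its placement depends on context.
print("U+003F", bidi_class("?"))
# ARABIC QUESTION MARK: strongly right-to-left (Arabic letter class).
print("U+061F", bidi_class("\u061F"))
# A Hebrew letter, for comparison: strongly right-to-left.
print("U+05D0", bidi_class("\u05D0"))
```

Because U+003F is neutral while U+061F is strong, only the European question mark shows the misplacement problem, which is why it works with Hebrew examples but not with Arabic ones that use ؟.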

following key-value pair:

<pre>
"nameHtml": "&lt;span dir="rtl" lang="ar">HTML و CSS: تصميم و إنشاء مواقع الويب&lt;/span>"
Contributor

This text is a more meaningful example, because if you don't have the base direction it comes out wrong.

Member Author

Reverted to this example in 37daadb.
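
As an illustration of how a consumer might read the `nameHtml` value above without rendering the markup directly, the sketch below pulls the text plus the `lang` and `dir` attributes out of the outermost `span` with Python's standard `html.parser`. The class and function names are hypothetical, the input is assumed to be the unescaped markup, and this is not a sanitizer.

```python
from html.parser import HTMLParser


class SpanMetadataParser(HTMLParser):
    """Collects text content and the lang/dir of the first <span>."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.dir = None
        self.seen_span = False
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first span supplies the string-level metadata.
        if tag == "span" and not self.seen_span:
            self.seen_span = True
            attr_map = dict(attrs)
            self.lang = attr_map.get("lang")
            self.dir = attr_map.get("dir")

    def handle_data(self, data):
        self.text_parts.append(data)


def parse_name_html(markup):
    """Return a plain dict with the text, language, and base direction."""
    parser = SpanMetadataParser()
    parser.feed(markup)
    parser.close()
    return {
        "value": "".join(parser.text_parts),
        "lang": parser.lang,
        "dir": parser.dir,
    }


example = '<span dir="rtl" lang="ar">HTML و CSS: تصميم و إنشاء مواقع الويب</span>'
print(parse_name_html(example))
```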

index.html Outdated

<pre>
"nameHtml": "&lt;span dir="rtl" lang="ar">HTML و CSS: تصميم و إنشاء مواقع الويب&lt;/span>"
The next example expresses the English text "revoked" in the Arabic
Contributor

instead of

expresses the English text "revoked" in the Arabic

how about something like

uses a term in Arabic

Member Author

Fixed in 905fec0.

index.html Outdated
<code>nameHtml</code> property is of type <code>rdf:HTML</code>, then software
agents have sufficient information to deterministically identify the text
direction of the language.
Utilization of the design pattern above assumes that the JSON-LD
Contributor

This paragraph is confusing especially if you're not automatically thinking about JSON-LD. I'll have a think about how to make it clearer...

Member Author

Attempted to do this in 9956dfa.

Contributor

chaals commented Jun 10, 2019

I also agree that discouraging HTML content is generally the right thing to do... (which makes me think we should be asking the HTML and Security communities to provide some guidance around how to sanitise HTML input and why...)

index.html Outdated
<pre class="example nohighlight" title="Expressing natural language text as English">
"statusReason": {
"value": "<span class="highlight">Revoked</span>",
"lang: "<code>en</code>"
Contributor

Suggested change
- "lang: "<code>en</code>"
+ "lang": "<code>en</code>"

Member Author

Fixed in 77e1d9f.

index.html Outdated
"id": "did:example:c276e12ec21ebfeb1f712ebc6f1",
"name": [{
"value": "Example University",
"language": "en"
Contributor

These examples use the alias "language" instead of "lang". Which should it be?

Member Author

@msporny msporny Jun 30, 2019

Replaced all "language" with "lang" in ce371e3.

specification and significantly diminish its value as a standard.
There are a number of internationalization considerations implementers
are advised to be aware of when publishing data described in this specification.
As with any web standards or protocols implementation, ignoring
Member

As with any web standards or protocols implementation, ignoring
-> As with implementation of any web standard or protocol, ignoring

Member Author

Fixed in f0b4f50.

index.html Outdated

<p>
Implementers are strongly advised to read the Strings on the Web: Language and
Direction Metadata document [[STRING-META]] published by the W3C
Member

the Strings on the Web: Language and
Direction Metadata document

->

the <em>Strings on the Web: Language and
Direction Metadata</em> document

Copy link


I just published an update of the Strings on the Web doc in /TR at https://www.w3.org/TR/string-meta/. You may want to point to that location, instead of the ED. I'll try to update regularly using echidna when we add new text to the ED.

Member Author

@msporny msporny Jun 30, 2019

@TallTed done in 34e6d2b.

@r12a done in 34e6d2b.

index.html Outdated
</p>

<p>
This section outlines general internationalization considerations to take into
- account when utilizing this data model.
+ account when utilizing this data model and is intended to highlight specific
+ parts of the Strings on the Web: Language and Direction Metadata document
Member

the Strings on the Web: Language and Direction Metadata document
->
the <em>Strings on the Web: Language and Direction Metadata</em> document

Member Author

Fixed in f0b4f50.

index.html Outdated
</p>

<p>
Implementers are strongly discouraged from encoding information as HTML
Member

Implementers are
-> Despite that possibility, implementers are

Member Author

Fixed in 966b348.

index.html Outdated
internationalization.
Implementers considering the use of HTML to encode complex language and/or
base direction information might consider deconstructing the data into a
format that does not require complex markup, such as an array of elements
Copy link


This advice worries me, since it could lead people to think that they should break a sentence up into separate strings everywhere the direction changes. That would cause problems for translation of the string (see Working with Composite Messages), not to mention the fact that an overall base direction (for the whole string) is still necessary, in order to pull together the individual parts in the correct order.

I'm not sure that much can be done for language, without markup.

But for base direction, the answer for plain text strings is to use the Unicode formatting characters (see How to use Unicode controls for bidi text).
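
One way to picture the Unicode formatting-character approach for a plain text string is to wrap the whole string in a directional isolate. The sketch below is an assumption-laden illustration, not guidance from the spec: the helper name `wrap_with_direction` is invented here, and the choice of isolate characters (rather than, say, a leading RLM) is just one of the options such controls offer.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def wrap_with_direction(text, direction):
    """Wrap a whole string in an isolate that sets its base direction.

    The string stays in one piece, so it can still be translated as a
    unit; only two invisible control characters are added.
    """
    if direction == "rtl":
        return RLI + text + PDI
    if direction == "ltr":
        return LRI + text + PDI
    raise ValueError("direction must be 'ltr' or 'rtl'")


title = "HTML و CSS: تصميم و إنشاء مواقع الويب"
wrapped = wrap_with_direction(title, "rtl")
# Two code points longer than the original string.
print(len(wrapped) - len(title))
```

Unlike splitting the sentence into an array wherever the direction changes, this keeps a single string with a single overall base direction.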

Contributor

+1 to this.

Most "multilingual" text actually has a base language. It's true that it is possible to write a sentence that demonstrates how language identification is needed within a single string (particularly when it comes to font selection or language-specific styling). But this is a corner case and is the sort of thing that calls for markup in any event.

Member Author

Done. Removed the bad advice in ea43b25.

Member Author

@msporny msporny left a comment

Ok, all change requests implemented, merging.

Member Author

msporny commented Jun 30, 2019

I've implemented all changes requested by reviewers. This is also in line w/ what seems to be emerging between the JSON-LD WG, i18n community, IETF folks, and the VCWG.

Merging.

@msporny msporny merged commit 92ba5d5 into gh-pages Jun 30, 2019
Member Author

msporny commented Jun 30, 2019

Merged. Thanks to all for the help over the past month to get this PR to a point that was acceptable to all communities involved! :)

7 participants