Support providing text field data in multiple languages #39

cjerdonek · 2015-03-07T18:33:07Z

For all text fields, the spec should have a way to allow providing that information in multiple languages. In San Francisco, for example, it is a requirement that all election information be provided in English, Spanish, and Chinese (as well as in Filipino starting January 1, 2016).

Currently, it seems like VIP consumers wouldn't be able to meet the same language requirements that jurisdictions may have (unless there is a way of providing additional languages that isn't documented in the web site documentation).

jungshadow · 2015-03-10T16:15:14Z

I'm adding this to the "Version 5.0" bucket, but there may be some push back. Personally, I want this to happen and soon. It will ultimately come down to how easy it is to incorporate translations in every VR/EMS system (assuming the translation data is even held in either).

I imagine that @pstenbjorn might have some insight on this.

cjerdonek · 2015-03-12T22:46:24Z

One of my action items from the meeting was to propose something for this issue.

Before coding it, I wanted to describe what I was going for. My goal was to make adding multi-language support to an element in the schema as simple and DRY as possible, for example by changing this--

<xs:element name="greeting" type="xs:string"/>

to this--

<xs:element name="greeting" type="multiLangText"/>

A concrete example would look like this--

<greeting>
    <text lang="en">Hello</text>
    <text lang="es">Hola</text>
    <text lang="fr">Bonjour</text>
</greeting>

Ideally, the following would also be acceptable (if only English were available and for backwards compatibility, etc)--

<greeting>Hello</greeting>

(It looks like <xs:union> would allow a type to be defined that is either an xs:string or a complex type.)

Does this seem good to people?
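
To show what the two accepted forms would mean for a consumer, here is a minimal Python sketch. The function name read_multilang and the choice to treat a plain string as English are assumptions made for illustration, not part of the spec.

```python
import xml.etree.ElementTree as ET

def read_multilang(element, default_lang="en"):
    """Return a {lang: text} dict for either form of a multi-language element."""
    children = element.findall("text")
    if children:
        # Multi-language form: one <text lang="..."> child per translation.
        return {child.get("lang"): (child.text or "") for child in children}
    # Plain-string form: assume the content is in the default language.
    return {default_lang: (element.text or "").strip()}

multi = ET.fromstring(
    '<greeting><text lang="en">Hello</text><text lang="es">Hola</text></greeting>'
)
plain = ET.fromstring("<greeting>Hello</greeting>")

print(read_multilang(multi))
print(read_multilang(plain))
```

Either way the consumer ends up with the same dictionary shape, which is the point of allowing the short form for backwards compatibility.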

cjerdonek · 2015-03-12T23:17:35Z

And here is a rough stab at a schema definition for the idea described in the previous comment (I make no claims to have expertise in XML):

<!-- A string with a required language specifier. -->
<xs:complexType name="langString">
    <xs:simpleContent>
        <xs:extension base="xs:string">
            <!-- TODO: should the language values be restricted to certain values? -->
            <xs:attribute name="lang" type="xs:string" use="required"/>
        </xs:extension>
    </xs:simpleContent>
</xs:complexType>
<!-- A text value in one or more languages. -->
<!-- NOTE: xs:union can only combine simple types, so mixed content is used here instead. -->
<xs:complexType name="multiLangText" mixed="true">
    <xs:sequence>
        <!-- A simple string can be provided instead (with no <text> children) if only English is available. -->
        <xs:element name="text" type="langString" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
</xs:complexType>

pstenbjorn · 2015-03-13T12:33:43Z

@cjerdonek this is a good start. The W3C has defined an xsd type of xs:language - see documentation here.

Below is an example based on our conversation using the existing VIP schema. The xs:language type expects a valid RFC 3066 language tag, e.g. en-US.

<xs:element name="referendum">
  <xs:complexType>
    <xs:choice maxOccurs="unbounded">
      <xs:element name="title" type="xs:string" />
      <xs:element minOccurs="0" maxOccurs="unbounded" name="subtitle" type="ballotLanguage" />
      <xs:element minOccurs="0" maxOccurs="unbounded" name="brief" type="ballotLanguage" />
      <xs:element minOccurs="0" maxOccurs="unbounded" name="text" type="ballotLanguage" />
      <xs:element minOccurs="0" maxOccurs="unbounded" name="pro_statement" type="ballotLanguage" />
      <xs:element minOccurs="0" maxOccurs="unbounded" name="con_statement" type="ballotLanguage" />
      <xs:element minOccurs="0" name="passage_threshold" type="xs:string" />
      <xs:element minOccurs="0" name="effect_of_abstain" type="xs:string" />
      <xs:element name="ballot_response_id">
        <xs:complexType>
          <xs:simpleContent>
            <xs:extension base="xs:string">
              <xs:attribute name="sort_order" type="xs:integer" />
            </xs:extension>
          </xs:simpleContent>
        </xs:complexType>
      </xs:element>
    </xs:choice>
    <xs:attribute name="id" type="xs:string" use="required" />
  </xs:complexType>
</xs:element>

<xs:complexType name="ballotLanguage">
  <xs:all>
    <xs:element name="text" type="xs:string" />
    <xs:element name="lang" type="xs:language" />
  </xs:all>
</xs:complexType>
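
For anyone who wants to pre-check language values outside of a schema validator, here is a rough Python sketch. It assumes the lexical pattern given for xs:language in XML Schema Part 2 (Second Edition). The helper name is_valid_language is made up for this example. The pattern only checks the shape of the tag, not whether the language is actually registered.

```python
import re

# Lexical pattern for xs:language (assumed from XML Schema Part 2, Second Edition).
XS_LANGUAGE = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")

def is_valid_language(value):
    """Return True if value has the shape of an xs:language tag."""
    return XS_LANGUAGE.fullmatch(value) is not None

for tag in ["en-US", "zh", "en_US", ""]:
    print(repr(tag), is_valid_language(tag))
```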

cjerdonek · 2015-03-13T17:35:04Z

@pstenbjorn Thanks. I have two main comments on your proposal.

First, my preference is that the type definition itself include the sequence aspect. This makes the schema simpler and more DRY (i.e. by not having to include the maxOccurs="unbounded" in every usage of the type, but rather just once in the type definition).

Second, I also think it's important that the translations of a particular element be semantically grouped to reflect the structure, as opposed to having everything flattened.

So this--

<subtitle type="multiLang">
    <!-- Translations -->
</subtitle>
<brief type="multiLang">
    <!-- Translations -->
</brief>
<text type="multiLang">
    <!-- Translations -->
</text>

as opposed to this--

<subtitle type="ballotLanguage"></subtitle>
<subtitle type="ballotLanguage"></subtitle>
<subtitle type="ballotLanguage"></subtitle>
<brief type="ballotLanguage"></brief>
<brief type="ballotLanguage"></brief>
<brief type="ballotLanguage"></brief>
<text type="ballotLanguage"></text>
<text type="ballotLanguage"></text>
<text type="ballotLanguage"></text>

I think this is conceptually clearer. The grouped approach also has the advantage that if you wanted to allow multiple text elements with the same tag, then you could still do this. For example, for--

<xs:element maxOccurs="unbounded" name="alias" type="multiLang" />

you could do--

<!-- The first alias has an English form and translations. -->
<alias type="multiLang"></alias>
<!-- The second alias has an English form and translations. -->
<alias type="multiLang"></alias>

With the "flattened" approach, the maxOccurs="unbounded" has already been "used up" for the translations, so you wouldn't be able to simply add the maxOccurs attribute as you normally would to allow more elements.

Both of these issues I addressed in my proposal. Otherwise, I'm okay with the suggestion to use xs:language.
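
To make the grouping point concrete, here is a small Python sketch comparing how the two layouts parse. The party/alias data and both helper functions are hypothetical and only serve to illustrate the difference.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: two aliases, each in English and Spanish.
GROUPED = """
<party>
    <alias><text lang="en">Greens</text><text lang="es">Verdes</text></alias>
    <alias><text lang="en">Green Party</text><text lang="es">Partido Verde</text></alias>
</party>
"""

FLAT = """
<party>
    <alias><text>Greens</text><lang>en</lang></alias>
    <alias><text>Verdes</text><lang>es</lang></alias>
    <alias><text>Green Party</text><lang>en</lang></alias>
    <alias><text>Partido Verde</text><lang>es</lang></alias>
</party>
"""

def grouped_aliases(root):
    """One dict per alias; translations stay attached to their alias."""
    return [
        {t.get("lang"): t.text for t in alias.findall("text")}
        for alias in root.findall("alias")
    ]

def flat_aliases(root):
    """One (lang, text) pair per element; alias boundaries are not recorded."""
    return [(a.findtext("lang"), a.findtext("text")) for a in root.findall("alias")]

grouped = grouped_aliases(ET.fromstring(GROUPED))
flat = flat_aliases(ET.fromstring(FLAT))
print(len(grouped), grouped)
print(len(flat), flat)
```

With the grouped layout the parser sees two aliases. With the flattened one it sees four sibling elements and has to guess which translations belong together.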

cjerdonek · 2015-03-13T18:07:52Z

Rather than updating my suggestion in the discussion thread, I created a pull request #49 (which also incorporates @pstenbjorn's suggestion to use xs:language).

This commit adds two new types to the schema: langString and interText.

pkoms · 2015-03-17T17:26:47Z

In case it's dropped out of the project consciousness, I'd like to add a reminder here that we should hold on merging in any XML changes (e.g., #52) until we've considered the implications for CSV design and CSV -> XML transformation. @nomadaisy currently has point on that.

jungshadow · 2015-03-17T18:23:16Z

@pkoms Hmm...I didn't think we planned to hold on all changes. If you think there may be issues with certain changes, you should definitely flag those (NB: this is probably one), but the XML dev should constantly be moving forward. If you are going to request a block, please give a general idea of how long you'll need for an assessment.

pkoms · 2015-03-17T18:26:08Z

Sorry, poor word choice. I meant to say "merging in any XML changes [on multi-language support.]" Completely agreed that the XML should be constantly moving forward.

We should be considering our CSV options this week.

jungshadow · 2015-03-17T18:36:50Z

@pkoms Ah...lost in translation :) Great! Thanks!

cjerdonek · 2015-03-17T19:09:17Z

@pkoms @nomadaisy Could one of you describe what's needed on the CSV side for this, as well as any requirements that have to be met? I may be able to propose something or at least offer some suggestions.

pkoms · 2015-03-17T20:49:53Z

@cjerdonek before I go into any depth, can you let me know how familiar you are with the VIP CSV specification? Knowing that will make it easier to pitch my explanation appropriately.

cjerdonek · 2015-03-17T21:01:14Z

Thanks, @pkoms.

> can you let me know how familiar you are with the VIP CSV specification?

Not at all, honestly. But if you point me to documentation and/or a good sample file or two (if those already exist somewhere), I can bring myself up to speed on those aspects. That way you don't need to explain everything yourself from the beginning. If the main VIP docs have all I need to know, I can just read that.

cjerdonek · 2015-03-17T21:07:51Z

Okay, it looks like the VIP docs have quite a bit on this. A couple of CSV approaches occur to me. I can describe them briefly and see what you think.

cjerdonek · 2015-03-17T21:57:23Z

Here are a couple CSV approaches.

Both approaches use "resource files," which in the context of internationalization means that the translations are provided in one or more separate files (CSV files in this case).

In both approaches, for item types that contain internationalized strings, the comma-delimited flat file for an item would contain the internationalized_text_id as the value for the item, and not the actual text. For example, for the following--

<office id="1">
    <name internationalized_text_id="office_mayor">
        <text lang="en">Mayor</text>
        <text lang="es">Alcalde</text>
        <text lang="zh">市長</text>
    </name>
</office>

the CSV would look like--

name,id
office_mayor,1

The text translations would then be in separate resource files (which would be global for the entire XML feed). Two possible approaches for this are as follows.

Approach 1: Multiple resource files.

In this approach, there would be a separate file for each language (with the file suffix indicating the language). For example--

File language_en.csv:

internationalized_text_id,text
office_mayor,Mayor
...

File language_es.csv:

internationalized_text_id,text
office_mayor,Alcalde
...
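
As a rough illustration of how a consumer might read Approach 1, here is a Python sketch. It uses in-memory strings in place of real files, and the function name load_translations and the rule for pulling the language out of the file name are assumptions for the example.

```python
import csv
import io

# In-memory stand-ins for language_en.csv and language_es.csv.
FILES = {
    "language_en.csv": "internationalized_text_id,text\noffice_mayor,Mayor\n",
    "language_es.csv": "internationalized_text_id,text\noffice_mayor,Alcalde\n",
}

def load_translations(files):
    """Return {lang: {text_id: text}}, taking the language from the file suffix."""
    translations = {}
    for filename, content in files.items():
        # "language_es.csv" -> "language_es" -> "es"
        lang = filename.rsplit(".", 1)[0].split("_", 1)[1]
        reader = csv.DictReader(io.StringIO(content))
        translations[lang] = {
            row["internationalized_text_id"]: row["text"] for row in reader
        }
    return translations

translations = load_translations(FILES)
print(translations["es"]["office_mayor"])
```

Adding a language here means adding one more entry to FILES, with no change to the loader.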

Approach 2: Single resource file.

In this approach, a single file contains all the languages, with the column headers indicating the language for that column. For example--

File: translations.csv

internationalized_text_id,en,es,zh
office_mayor,Mayor,Alcalde,市長
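
Here is a comparable Python sketch for Approach 2, also resolving the office row from the earlier example against the resource table. The function name load_single_resource is hypothetical, and the CSV content is held in strings rather than files to keep the example self-contained.

```python
import csv
import io

TRANSLATIONS_CSV = "internationalized_text_id,en,es,zh\noffice_mayor,Mayor,Alcalde,市長\n"
OFFICE_CSV = "name,id\noffice_mayor,1\n"

def load_single_resource(content):
    """Return {text_id: {lang: text}} from the one-column-per-language layout."""
    table = {}
    for row in csv.DictReader(io.StringIO(content)):
        text_id = row.pop("internationalized_text_id")
        # Every remaining column header is taken to be a language code.
        table[text_id] = dict(row)
    return table

table = load_single_resource(TRANSLATIONS_CSV)
for office in csv.DictReader(io.StringIO(OFFICE_CSV)):
    # The office file carries only the ID; the text comes from the resource table.
    print(office["id"], table[office["name"]])
```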

Do either of those approaches sound okay?

cjerdonek · 2015-03-17T22:05:56Z

One advantage of both approaches above is that they are DRY (and so less verbose). The translations of a given string of text occur only once in the entire feed (i.e. in the resource file), as opposed to in every occurrence in the feed.

Also, one advantage to Approach 1 above is that support for additional languages can be provided simply by adding a new CSV file for that language, without having to touch any other part of the feed. Similarly, jurisdictions can add support for a language simply by sending their "English CSV" off to a translation service, and getting back another CSV for the new language.

pkoms · 2015-03-18T17:28:27Z

Thanks! We had something close to 2 on the table, but 1 is interesting as well. Tagging @nomadaisy just to make sure these get on her radar.

nomadaisy · 2015-03-19T21:53:15Z

Hi all, I've asked our Dev team, and the easiest thing for them to incorporate would be the single-resource file, like Chris's example 2, with additional columns for the additional languages for that field. I've identified the following fields as display text that should be translated:

* ballot_response.txt: text
* candidate.txt: party, biography
* contest.txt: primary_party, electorate_specifications, office
* custom_ballot.txt: heading
* early_vote_site.txt: directions, voter_services, days_times_open
* election.txt: registration_info, absentee_ballot_info
* election_official.txt: title
* polling_location.txt: directions
* referendum.txt: title, subtitle, brief, text, pro_statement, con_statement, passage_threshold, effect_of_abstain
* source.txt: description

I'm interested in @kennethmbennett's take on how many fields are stored in the database in other languages. Should we enable translation for display text fields only or include all text fields like precinct.txt: name? Does your system store non-ballot, non-display text fields in other languages?

I also did not include location names, such as polling_location.txt: name and early_vote_site.txt: name. Is it helpful to translate those, or should we leave them as they appear, since they refer to a proper name that might not need to be translated?

cjerdonek · 2015-03-19T22:57:40Z

@nomadaisy One brief comment re: proper names. Candidate names are one example of a type of proper name provided in other languages, at least in San Francisco. San Francisco provides them in Chinese. I imagine that in cases where the language uses different characters, a translation of a proper name would be possible (though I don't know these languages firsthand).

Second, on a slightly different topic, it might be worth talking about how to choose or generate the internationalized_text_id for a string.

A convention of something like object_type__field_name__id might be a good starting candidate. So for the following, we would have something like--

party.name: party_name_democrat, party_name_republican, etc.
party.description: party_description_democrat, party_description_republican, etc.

If the ID portion could be generated programmatically, that would be even better (e.g. the ID of the parent object).
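
A minimal Python sketch of generating such IDs programmatically follows. The function name make_text_id is made up, the separator defaults to the single underscore used in the examples above, and the normalization (lowercase, ASCII letters and digits only) is an assumption rather than anything agreed on.

```python
import re

def make_text_id(object_type, field_name, object_id, sep="_"):
    """Build an internationalized_text_id from its three parts."""
    parts = (object_type, field_name, str(object_id))
    # Lowercase each part and collapse runs of other characters to one underscore.
    cleaned = (re.sub(r"[^a-z0-9]+", "_", p.lower()).strip("_") for p in parts)
    return sep.join(cleaned)

print(make_text_id("party", "name", "democrat"))
print(make_text_id("party", "description", "Republican"))
print(make_text_id("office", "name", 1, sep="__"))
```

Passing sep="__" gives the double-underscore form, which keeps the three parts separable even when a field name itself contains an underscore.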

jungshadow · 2015-03-20T14:30:12Z

@nomadaisy Agree that a single resource file would be easier for the devs on our side, but I'm wondering how difficult it would be for the states to structure the information in that way (i.e. I have a feeling the data isn't linear or in the same system). Adding @Josh-LACRRCC to this conversation to assess since he's the database/ballot/XML expert.

Josh-LACRRCC · 2015-03-20T17:37:03Z

Here in LA County, we use a contracted vendor for all our translated election materials. We would like to bring the support of various languages into our EMS, but we have not yet begun building specifications for our requirements. We recently produced a bilingual text ballot as a manual process.

A single resource file does cut down on clutter, but operationally, language-specific resource files provide a smoother workflow. Also, as @cjerdonek points out, separate resource files scale better. LA County currently supports a total of eleven languages, with the possibility of adding another three.

Under our current system, we do not translate proper names. Candidate names and location names (both in proper elements like polling_location.txt: name and as a city name within referendum.txt: text) are left in English. If an element is missing / optional in the resource file and a system knows to "default" to English, then both LA County's and San Francisco's policies on candidate names would be covered.
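
The "default to English" behavior could be sketched in Python roughly as below. The table layout, the lookup function, and the placeholder candidate name are all hypothetical. Treating an empty cell the same as a missing one is an assumption.

```python
# Hypothetical resource table: {text_id: {lang: text}}.
TABLE = {
    "office_mayor": {"en": "Mayor", "es": "Alcalde"},
    # A proper name that a jurisdiction chose not to translate.
    "candidate_name_1": {"en": "Jane Doe", "es": ""},
}

def lookup(table, text_id, lang, default_lang="en"):
    """Return the text in lang, falling back to default_lang when it is absent."""
    entry = table[text_id]
    value = entry.get(lang)
    # Treat a missing key and an empty cell the same way.
    return value if value else entry[default_lang]

print(lookup(TABLE, "office_mayor", "es"))
print(lookup(TABLE, "candidate_name_1", "es"))
print(lookup(TABLE, "candidate_name_1", "zh"))
```

A jurisdiction that does translate candidate names would simply fill in the extra cells, and the same lookup would return them.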

There are a couple of items that I would add to @nomadaisy's list of translation elements:
* election.txt: name, election_type
* party.txt: name
Some of the enumerations would also be relevant.

cjerdonek · 2015-03-20T18:31:32Z

FWIW, I learned a little about SF's process this week. SF also uses a translation vendor. They have some Excel spreadsheets that are roughly of the form: one word or phrase per row, with a different language in each column. Those spreadsheets currently have three and a half languages (English, Chinese, and Spanish, with Filipino still being worked on). You can see one of the spreadsheets exported to CSV form here. In that same repo, I'm going to be playing around with these files to experiment with different formats (e.g. generating YAML, which seems to be more suitable for editing multi-line strings by hand; JSON; HTML for display, etc).

Also FWIW, on the county side, a script to convert a single-file resource file to multiple files (i.e. one per language) or vice versa should be pretty simple as conversions go. If the files are structured to begin with, I would guess thirty lines of code or so.
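
For a sense of scale, here is a rough Python sketch of the single-file to multiple-file direction. It works on strings instead of real files, and the function name split_by_language and the language_<lang>.csv output naming are assumptions borrowed from the Approach 1 example, not an agreed convention.

```python
import csv
import io

def split_by_language(single_csv, id_column="internationalized_text_id"):
    """Return {filename: csv_text} with one two-column file per language."""
    reader = csv.DictReader(io.StringIO(single_csv))
    # Every column other than the ID column is taken to be a language code.
    languages = [name for name in reader.fieldnames if name != id_column]
    buffers = {lang: io.StringIO() for lang in languages}
    writers = {}
    for lang, buf in buffers.items():
        writers[lang] = csv.writer(buf, lineterminator="\n")
        writers[lang].writerow([id_column, "text"])
    for row in reader:
        for lang in languages:
            # Skip empty cells so untranslated strings are simply absent.
            if row[lang]:
                writers[lang].writerow([row[id_column], row[lang]])
    return {f"language_{lang}.csv": buf.getvalue() for lang, buf in buffers.items()}

files = split_by_language(
    "internationalized_text_id,en,es,zh\noffice_mayor,Mayor,Alcalde,市長\n"
)
for name, content in files.items():
    print(name)
    print(content)
```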

This addresses issue #39 to add support for multiple languages (aka internationalization) and recreates pull request #52 as a feature branch in the VIP repo.

cjerdonek · 2015-03-22T22:04:22Z

If you're looking for data to play around with, here are some of the translations I mentioned in the previous comment cleaned up and converted to YAML files, one per language: https://github.com/cjerdonek/sf-base-election-data/tree/master/pre_data/i18n/auto

jungshadow · 2015-04-02T03:32:33Z

@cjerdonek Safe to close this issue for now? When we get into the implementation, we can reopen or create a new issue for any CSV conflicts.

cjerdonek · 2015-04-02T06:57:22Z

> @cjerdonek Safe to close this issue for now?

Yes, thank you, @jungshadow!

jungshadow added the enhancement label Mar 10, 2015

jungshadow added this to the Version 5.0 milestone Mar 10, 2015

cjerdonek mentioned this issue Mar 10, 2015

Add Office object #42

Closed

jungshadow modified the milestones: Up for Debate, Version 5.0 Mar 12, 2015

cjerdonek mentioned this issue Mar 13, 2015

Add a type to support text in multiple languages #49

Closed

cjerdonek added a commit to cjerdonek/vip-specification that referenced this issue Mar 17, 2015

Add support for internationalization (issue votinginfoproject#39).

673ae49

This commit adds two new types to the schema: langString and interText.

jungshadow mentioned this issue Mar 17, 2015

Better handling of apartments / multi-home lots #22

Closed

jungshadow mentioned this issue Mar 20, 2015

Add support for internationalization #52

Merged

jungshadow modified the milestones: Version 5.0, Up for Debate Apr 2, 2015

jungshadow assigned cjerdonek Apr 2, 2015

cjerdonek closed this as completed Apr 2, 2015

cjerdonek mentioned this issue Apr 8, 2015

Change xs:string to internationalizedText as appropriate #77

Closed
