Support providing text field data in multiple languages #39
Comments
I'm adding this to the "Version 5.0" bucket, but there may be some pushback. Personally, I want this to happen, and soon. It will ultimately come down to how easy it is to incorporate translations in every VR/EMS system (assuming the translation data is even held in either). I imagine that @pstenbjorn might have some insight on this. |
One of my action items from the meeting was to propose something for this issue. Before coding it, I wanted to describe what I was going for. My goal was to make adding multi-language support to an element in the schema as simple and DRY as possible, for example by changing this--
<xs:element name="greeting" type="xs:string"/>
to this--
<xs:element name="greeting" type="multiLangText"/>
A concrete example would look like this--
<greeting>
<text lang="en">Hello</text>
<text lang="es">Hola</text>
<text lang="fr">Bonjour</text>
</greeting> Ideally, the following would also be acceptable (if only English were available and for backwards compatibility, etc)-- <greeting>Hello</greeting> (It looks like Does this seem good to people? |
And here is a rough stab at a schema definition for the idea described in the previous comment (I make no claims to have expertise in XML): <!-- A string with a required language specifier. -->
<xs:complexType name="langString">
<xs:simpleContent>
<xs:extension base="xs:string">
<!-- TODO: should the language values be restricted to certain values? -->
<xs:attribute name="lang" type="xs:string" use="required"/>
</xs:extension>
</xs:simpleContent>
</xs:complexType>
<!-- A text value in one or more languages. -->
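<!-- xs:union is only allowed inside xs:simpleType, so mixed content is used here
     to accept either plain text or a list of <text> children. Note that this
     also permits both at once, which is looser than intended. -->
<xs:complexType name="multiLangText" mixed="true">
<xs:sequence>
<!-- A simple string can be provided instead if only English is available. -->
<xs:element name="text" type="langString" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType> |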
@cjerdonek this is a good start. The W3C has defined an xsd type of Below is an example based on our conversation using the existing VIP schema. The <xs:element name="referendum">
<xs:complexType>
<xs:choice maxOccurs="unbounded">
<xs:element name="title" type="xs:string" />
<xs:element minOccurs="0" maxOccurs="unbounded" name="subtitle" type="ballotLanguage" />
<xs:element minOccurs="0" maxOccurs="unbounded" name="brief" type="ballotLanguage" />
<xs:element minOccurs="0" maxOccurs="unbounded" name="text" type="ballotLanguage" />
<xs:element minOccurs="0" maxOccurs="unbounded" name="pro_statement" type="ballotLanguage" />
<xs:element minOccurs="0" maxOccurs="unbounded" name="con_statement" type="ballotLanguage" />
<xs:element minOccurs="0" name="passage_threshold" type="xs:string" />
<xs:element minOccurs="0" name="effect_of_abstain" type="xs:string" />
<xs:element name="ballot_response_id">
<xs:complexType>
<xs:simpleContent>
<xs:extension base="xs:string">
<xs:attribute name="sort_order" type="xs:integer" />
</xs:extension>
</xs:simpleContent>
</xs:complexType>
</xs:element>
</xs:choice>
<xs:attribute name="id" type="xs:string" use="required" />
</xs:complexType>
</xs:element>
<xs:complexType name="ballotLanguage">
<xs:all>
<xs:element name="text" type="xs:string" />
<xs:element name="lang" type="xs:language" />
</xs:all>
</xs:complexType> |
@pstenbjorn Thanks. I have two main comments on your proposal. First, my preference is that the type definition itself include the sequence aspect. This makes the schema simpler and more DRY (i.e. by not having to include the Second, I also think it's important that the translations of a particular element be semantically grouped to reflect the structure, as opposed to having everything flattened. So this-- <subtitle type="multiLang">
# Translations
</subtitle>
<brief type="multiLang">
# Translations
</brief>
<text type="multiLang">
# Translations
</text> as opposed to this-- <subtitle type="ballotLanguage"></subtitle>
<subtitle type="ballotLanguage"></subtitle>
<subtitle type="ballotLanguage"></subtitle>
<brief type="ballotLanguage"></brief>
<brief type="ballotLanguage"></brief>
<brief type="ballotLanguage"></brief>
<text type="ballotLanguage"></text>
<text type="ballotLanguage"></text>
<text type="ballotLanguage"></text> I think this is conceptually clearer. The grouped approach also has the advantage that if you wanted to allow multiple text elements with the same tag, then you could still do this. For example, for-- <xs:element maxOccurs="unbounded" name="alias" type="multiLang" /> you could do-- # The first alias has an English form and translations.
<alias type="multiLang"></alias>
# The second alias has an English form and translations.
<alias type="multiLang"></alias> With the "flattened" approach, the Both of these issues I addressed in my proposal. Otherwise, I'm okay with the suggestion to use |
Rather than updating my suggestion in the discussion thread, I created a pull request #49 (which also incorporates @pstenbjorn's suggestion to use |
This commit adds two new types to the schema: langString and interText.
In case it's dropped out of the project consciousness, I'd like to add a reminder here that we should hold on merging in any XML changes (e.g., #52) until we've considered the implications for CSV design and CSV -> XML transformation. @nomadaisy currently has point on that. |
@pkoms Hmm...I didn't think we planned to hold on all changes. If you think there may be issues with certain changes, you should definitely flag those (NB: this is probably one), but the XML dev should constantly be moving forward. If you are going to request a block, please give a general idea of how long you'll need for an assessment. |
Sorry, poor word choice. I meant to say "merging in any XML changes [on multi-language support.]" Completely agreed that the XML should be constantly moving forward. We should be considering our CSV options this week. |
@pkoms Ah...lost in translation :) Great! Thanks! |
@pkoms @nomadaisy Could one of you describe what's needed on the CSV side for this, as well as any requirements that have to be met? I may be able to propose something or at least offer some suggestions. |
@cjerdonek before I go into any depth, can you let me know how familiar you are with the VIP CSV specification? Knowing that will make it easier to pitch my explanation appropriately. |
Thanks, @pkoms.
Not at all, honestly. But if you point me to documentation and/or a good sample file or two (if those already exist somewhere), I can bring myself up to speed on those aspects. That way you don't need to explain everything yourself from the beginning. If the main VIP docs have all I need to know, I can just read that. |
Okay, it looks like the VIP docs have quite a bit on this. A couple of CSV approaches occur to me. I can describe them briefly and see what you think. |
Here are a couple of CSV approaches. Both approaches use "resource files," which in the context of internationalization means that the translations are provided in one or more separate files (CSV files in this case). In both approaches, for item types that contain internationalized strings, the comma-delimited flat file for an item would contain the <office id="1">
<name internationalized_text_id="office_mayor">
<text lang="en">Mayor</text>
<text lang="es">Alcalde</text>
<text lang="zh">市長</text>
</name>
</office> the CSV would look like--
The text translations would then be in separate resource files (which would be global for the entire XML feed). Two possible approaches for this are as follows. Approach 1: Multiple resource files. In this approach, there would be a separate file for each language (with the file suffix indicating the language). For example-- File
File
Approach 2: Single resource file. In this approach, a single file contains all the languages, with the column headers indicating the language for that column. For example-- File:
Does either of those approaches sound okay? |
One advantage of both approaches above is that they are DRY (and so less verbose). The translations of a given string of text occur only once in the entire feed (i.e. in the resource file), as opposed to in every occurrence in the feed. Also, one advantage to Approach 1 above is that support for additional languages can be provided simply by adding a new CSV file for that language, without having to touch any other part of the feed. Similarly, jurisdictions can add support for a language simply by sending their "English CSV" off to a translation service, and getting back another CSV for the new language. |
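To make the resource-file idea concrete, here is a minimal Python sketch of how a consumer might expand a single resource file (Approach 2) back into the grouped XML form. The column names (`id`, `en`, `es`, `zh`) and the helper names `load_translations` and `build_name_element` are assumptions for illustration only; they are not part of the VIP spec or of any proposal in this thread.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical single resource file: one row per text ID, one column per language.
RESOURCE_CSV = """id,en,es,zh
office_mayor,Mayor,Alcalde,市長
"""


def load_translations(csv_text):
    """Return {text_id: {lang: text}} parsed from a single resource file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    translations = {}
    for row in reader:
        text_id = row.pop("id")
        # Skip empty cells so that missing translations are simply absent.
        translations[text_id] = {lang: text for lang, text in row.items() if text}
    return translations


def build_name_element(text_id, translations):
    """Build a <name> element with one <text lang="..."> child per language."""
    name = ET.Element("name", {"internationalized_text_id": text_id})
    for lang, text in translations[text_id].items():
        child = ET.SubElement(name, "text", {"lang": lang})
        child.text = text
    return name


translations = load_translations(RESOURCE_CSV)
name_xml = ET.tostring(
    build_name_element("office_mayor", translations), encoding="unicode"
)
print(name_xml)
```

Because each translation lives in exactly one row of the resource file, the flat file for an item only needs to carry the text ID.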
Thanks! We had something close to 2 on the table, but 1 is interesting as well. Tagging @nomadaisy just to make sure these get on her radar. |
Hi all, I've asked our Dev team, and the easiest thing for them to incorporate would be the single-resource file, like Chris's example 2, with additional columns for the additional languages for that field. I've identified the following fields as display text that should be translated: *ballot_response.txt: text I'm interested in @kennethmbennett's take on how many fields are stored in the database in other languages. Should we enable translation for display text fields only or include all text fields like precinct.txt: name? Does your system store non-ballot, non-display text fields in other languages? I also did not include location names, such as polling_location.txt: name and early_vote_site.txt: name. Is it helpful to translate those, or should we leave them as they appear, since they refer to a proper name that might not need to be translated? |
@nomadaisy One brief comment re: proper names. Candidate names are one example of a type of proper name provided in other languages, at least in San Francisco. San Francisco provides them in Chinese. I imagine that in cases where the language uses different characters, a translation of a proper name would be possible (though I don't know these languages firsthand). Second, on a slightly different topic, it might be worth talking about how to choose or generate the A convention of something like
If the ID portion could be generated programmatically, that would be even better (e.g. the ID of the parent object). |
@nomadaisy Agree that a single resource file would be easier for the devs on our side, but I'm wondering how difficult it would be for the states to structure the information in that way (i.e. I have a feeling the data isn't linear or in the same system). Adding @Josh-LACRRCC to this conversation to assess since he's the database/ballot/XML expert. |
Here in LA County, we use a contracted vendor for all our translated election materials. We would like to bring the support of various languages into our EMS, but we have not yet begun building specifications for our requirements. We recently produced a bilingual text ballot as a manual process. A single resource file does cut down on clutter, but operationally, language-specific resource files provide a smoother workflow. Also, as @cjerdonek points out, the separate resource files scale better. LA County currently supports a total of eleven languages, with the possibility of adding another three. Under our current system, we do not translate proper names. Candidate names and location names (both in proper elements like polling_location.txt: name or a city name within referendum.txt: text) are left in English. If an element is missing / optional in the resource file and a system knows to "default" to English, then both LA County's and San Francisco's policies on candidate names would be covered. There are a couple of items that I would add to @nomadaisy's list of translation elements
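The "default to English" behavior described above could look roughly like the following Python sketch. The function name `lookup_text`, the data layout, and the sample IDs and values are all hypothetical, chosen only to illustrate the fallback rule.

```python
def lookup_text(translations, text_id, lang, default_lang="en"):
    """Return the text for `lang`, falling back to `default_lang` if it is missing or empty."""
    entry = translations[text_id]
    return entry.get(lang) or entry[default_lang]


# Hypothetical data: the office title is translated, the candidate name is not.
SAMPLE = {
    "office_mayor": {"en": "Mayor", "es": "Alcalde"},
    "candidate_jane_doe": {"en": "Jane Doe"},
}

translated_title = lookup_text(SAMPLE, "office_mayor", "es")
untranslated_name = lookup_text(SAMPLE, "candidate_jane_doe", "es")
```

With a rule like this, a jurisdiction that leaves proper names in English simply omits them from the non-English columns or files, while a jurisdiction that translates them fills those entries in.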
FWIW, I learned a little about SF's process this week. SF also uses a translation vendor. They have some Excel spreadsheets that are roughly of the form: one word or phrase per row, with different languages in each column. Those spreadsheets currently have three and a half languages (English, Chinese, and Spanish, and Filipino is still being worked on). You can see one of the spreadsheets exported to CSV form here. In that same repo, I'm going to be playing around with these files to experiment with different formats (e.g. generating YAML, which seems to be more suitable for editing multi-line strings by hand; JSON; HTML for display, etc). Also FWIW, on the county side, a script to convert a single-file resource file to multiple files (i.e. one per language) or vice versa should be pretty simple as conversions go. If the files are structured to begin with, I would guess thirty lines of code or so. |
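As a rough illustration of the single-file to per-language conversion mentioned above, here is a Python sketch. It works on in-memory strings rather than real files, and the column layout (`id` plus one column per language) and the function name `split_by_language` are assumptions, not an agreed format.

```python
import csv
import io


def split_by_language(single_csv_text, id_column="id"):
    """Split a single resource file into one CSV text per language.

    Returns {lang: csv_text}, where each per-language CSV has the
    columns (id_column, "text"). Empty cells are skipped.
    """
    reader = csv.DictReader(io.StringIO(single_csv_text))
    languages = [name for name in reader.fieldnames if name != id_column]

    buffers = {lang: io.StringIO() for lang in languages}
    writers = {}
    for lang in languages:
        writer = csv.writer(buffers[lang], lineterminator="\n")
        writer.writerow([id_column, "text"])
        writers[lang] = writer

    for row in reader:
        for lang in languages:
            if row[lang]:
                writers[lang].writerow([row[id_column], row[lang]])

    return {lang: buf.getvalue() for lang, buf in buffers.items()}


# Hypothetical input: the Spanish translation of the second row is missing.
SINGLE = "id,en,es\noffice_mayor,Mayor,Alcalde\noffice_sheriff,Sheriff,\n"
per_language = split_by_language(SINGLE)
```

Going the other way (merging per-language files into one) would be a similar loop keyed on the ID column.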
If you're looking for data to play around with, here are some of the translations I mentioned in the previous comment cleaned up and converted to YAML files, one per language: https://github.com/cjerdonek/sf-base-election-data/tree/master/pre_data/i18n/auto |
@cjerdonek Safe to close this issue for now? When we get into the implementation, we can reopen or create a new issue for any CSV conflicts. |
Yes, thank you, @jungshadow! |
For all text fields, the spec should have a way to allow providing that information in multiple languages. In San Francisco, for example, it is a requirement that all election information be provided in English, Spanish, and Chinese (as well as in Filipino starting January 1, 2016).
Currently, it seems like VIP consumers wouldn't be able to meet the same language requirements that jurisdictions may have (unless there is a way of providing additional languages that isn't documented in the web site documentation).