Abandon in-file translations, document full-file translation guidance #229

Closed
timgdavies opened this issue Aug 25, 2015 · 29 comments · Fixed by #1665
Labels
Schema Relating to other changes in the JSON Schema (renamed fields, schema properties, etc.)

@timgdavies
Contributor

timgdavies commented Aug 25, 2015

The current schema uses an approach to indicate language based on suffixes to property names.

{
    "language": "en",
    "tender": {
        "item": {
                "description":"Software consultancy services",
                "description_es":"Servicios de consultoria en software",
                "description_fr":"Services de conseil en logiciels"
        }
    }
}

This was discussed in #21 during the beta, and was chosen on the understanding that we wanted to avoid depth where we could (for easier flat renderings of the data). However, on reflection, the approach we adopted may not have been the most appropriate.
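
For illustration, here is a minimal Python sketch of what a data user has to do to gather the translations of a field under the suffix approach. The function name and the two-letter suffix rule are assumptions made for this example; the actual schema pattern accepts fuller language tags.

```python
import re

# Assumption: a translation suffix is "_" plus a two-letter language code.
# The pattern in the real schema is more permissive than this.
SUFFIX = re.compile(r"^(?P<field>.+)_(?P<lang>[a-z]{2})$")


def collect_translations(obj, default_lang):
    """Group an object's string fields by base field name and language code."""
    grouped = {}
    for key, value in obj.items():
        if not isinstance(value, str):
            continue
        match = SUFFIX.match(key)
        # Only treat the key as a translation if the base field also exists.
        if match and match["field"] in obj:
            grouped.setdefault(match["field"], {})[match["lang"]] = value
        else:
            grouped.setdefault(key, {})[default_lang] = value
    return grouped


item = {
    "description": "Software consultancy services",
    "description_es": "Servicios de consultoria en software",
    "description_fr": "Services de conseil en logiciels",
}
translations = collect_translations(item, default_lang="en")
```

Requiring the base field to exist is a design choice in this sketch: it keeps an unrelated field that merely ends in an underscore and two letters from being read as a translation.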

In particular, @elf-pavlik has pointed out here in #40 that JSON-LD uses language maps, and it would be much easier to create a JSON-LD rendering of our data this way.

{
    "@context": { "@language": "en" },
    "tender": {
        "item": {
            "description": {
                "en": "Software consultancy services",
                "es": "Servicios de consultoria en software",
                "fr": "Services de conseil en logiciels"
            }
        }
    }
}

Our flattening approach would also just render this as 'description/en', 'description/es' and so on, which seems fairly intuitive, and our earlier fear about over-use of objects does not appear to be a major concern.
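
As a rough sketch of that rendering (this is not the actual flattening tool; the `flatten` function is hypothetical and handles nested objects only, not arrays), a language map flattens like any other object:

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into {"path/to/field": value} pairs."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            # A language map is just another nested object here.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


release = {
    "tender": {
        "item": {
            "description": {
                "en": "Software consultancy services",
                "es": "Servicios de consultoria en software",
            }
        }
    }
}
columns = flatten(release)
```

Here `columns` has the keys `tender/item/description/en` and `tender/item/description/es`.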

However, to change to language maps would potentially be a backwards incompatible update.

It might make sense to do it early, but it would definitely need wide discussion.

@akuckartz

👍

@Bjwebb
Contributor

Bjwebb commented Sep 14, 2015

> In particular, @elf-pavlik has pointed out here in #40 that JSON-LD uses language maps, and it would be much easier to create a JSON-LD rendering of our data this way.

Looks to me like JSON-LD can also handle the current OCDS structure - #40 (comment)

> However, to change to language maps would potentially be a backwards incompatible update.

This would almost certainly break backwards compatibility, at least for data users. Is there a way of changing this that wouldn't?

@timgdavies
Contributor Author

We can't currently see a way of changing the language approach that does not involve a backwards-incompatible change, so this would have to go on the stack for a 2.0 update right now.

@timgdavies timgdavies added this to the Version 2.0 milestone Jul 25, 2016
@timgdavies
Contributor Author

After further team discussions, we agreed that this would not likely be a desirable change, even for 2.0, so closing.

@akuckartz

> we agreed that this would not likely be a desirable change, even for 2.0, so closing

Can the reasons be summarized?

@timgdavies
Contributor Author

I've unfortunately lost my full notes of the discussion, so @Bjwebb and others may be able to add, but from memory:

  • The backwards incompatibility is a big concern to many of our publishers;
  • We want to avoid fields that can have two types (string or object), so we would need to switch to a model in which there is no 'default language' (i.e. all strings are always a language map object) which makes things more difficult for data users;
  • We can still map to JSON-LD using our current approach.

@kindly kindly reopened this Nov 10, 2016
@kindly
Contributor

kindly commented Nov 10, 2016

Reopening this issue as nobody seems too happy with the current approach.

My suggestion would be as follows.

In the core schema, have a Translation definition, keeping its properties empty for now. It would look like this:

"definitions": {
   "Translation": {
         "type": "object",
         "properties": {}
    }
}

For any field that requires versions in different languages, you would add a ref to that object, i.e.

{
    "description_translation": {"$ref": "#/definitions/Translation"},
    "title_translation": {"$ref": "#/definitions/Translation"}
}

You would need this for every field that needs translating, and it would replace the pattern properties currently used.
Extensions would do the same thing: if they need a field translated, they would just add their field as above.

There would need to be a repo with an extension for every language. Each extension would place a language code into the Translation definition.

The release-schema.json of such an extension would look like this (say, for French):

"definitions": {
   "Translation": {
         "properties": {
             "fr": {
                  "type": "string"
               }
        }
    }
}

Doing it this way means that if you apply the language extensions it will work for all other extensions (i.e. they are composable and do not require any special ordering when applying patches to the schema), as long as they define translatable fields the same way as above.
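
To illustrate the composability claim, here is a minimal Python sketch. It uses a simplified recursive merge (`apply_extension` is a hypothetical helper, standing in for whatever patching mechanism the tooling actually uses) and a Spanish extension invented for the example:

```python
import copy


def apply_extension(schema, extension):
    """Recursively merge an extension into a copy of the schema (dicts only)."""
    merged = copy.deepcopy(schema)
    for key, value in extension.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_extension(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


core = {"definitions": {"Translation": {"type": "object", "properties": {}}}}
french = {"definitions": {"Translation": {"properties": {"fr": {"type": "string"}}}}}
# Hypothetical second language extension, following the same shape.
spanish = {"definitions": {"Translation": {"properties": {"es": {"type": "string"}}}}}

fr_then_es = apply_extension(apply_extension(core, french), spanish)
es_then_fr = apply_extension(apply_extension(core, spanish), french)
```

Both orderings produce the same Translation definition, because each language extension only adds its own key under `properties`.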

Downsides of this approach are:

  • Not entirely backward compatible, but better than replacing all text fields with an object. It would be very easy to convert to the new format using an upgrade tool, though.
  • I do not like the label "translations"; any suggestions as to what else to call them?
  • Cases like Canada, where two languages are considered equal, are still not very well supported. The main fields (the ones being translated) would still be kept mandatory, i.e. description will be mandatory.
  • It is also not easy to make the translated fields required, as required fields are defined as a list.

@timgdavies
Contributor Author

I think downside 3 (multiple equal languages) will be a particular blocker to this issue.

This does make me wonder whether we should take an entirely different track, and say that OCDS does not support in-file translations, and that translations should be provided in a separate file for each language, served up via content negotiation or an alternative URI structure.
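
As a sketch of what moving from in-file to full-file translations could involve, the Python below derives a single-language copy of a release from data that uses the suffix approach. The `translate_file` helper and the two-letter suffix rule are assumptions made for this example, not part of any OCDS tooling.

```python
import re

# Assumption: a translation suffix is "_" plus a two-letter language code.
SUFFIX = re.compile(r"^(?P<field>.+)_(?P<lang>[a-z]{2})$")


def translate_file(data, lang):
    """Return a copy of `data` where values for `lang` replace the base fields."""
    if isinstance(data, list):
        return [translate_file(item, lang) for item in data]
    if not isinstance(data, dict):
        return data
    result = {}
    for key, value in data.items():
        match = SUFFIX.match(key)
        if match and match["field"] in data:
            if match["lang"] == lang:
                result[match["field"]] = value
            continue  # suffixed fields never appear in the output
        if key not in result:
            result[key] = translate_file(value, lang)
    return result


release = {
    "language": "en",
    "tender": {
        "item": {
            "description": "Software consultancy services",
            "description_es": "Servicios de consultoria en software",
            "description_fr": "Services de conseil en logiciels",
        }
    },
}
spanish = translate_file(release, "es")
spanish["language"] = "es"  # the language field has to be updated separately
```

Fields without a translation in the requested language keep their default-language value, so a publisher would still need to check that each output file is complete.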

@mpostelnicu

mpostelnicu commented Nov 13, 2016

Hey,

Just an idea: handle this using the oneOf keyword from JSON Schema draft 04 (http://json-schema.org/example2.html).

It is fully backwards compatible with our (old) non-translated schema. I've made an example of the Item element below:

"Item": {
    "type": "object",
    "description": "A good, service, or work to be contracted.",
    "required": ["id"],
    "properties": {
        "id": {
            "description": "A local identifier to reference and merge the items by. Must be unique within a given array of items.",
            "type": ["string", "integer"],
            "mergeStrategy": "overwrite"
        },
        "description": {
            "description": "A description of the goods, services to be provided.",
            "mergeStrategy": "ocdsVersion",
            "type": ["array", "string"],
            "oneOf": [
                {
                    "type": "string"
                },
                {
                    "description": "An array of translated strings, for multilingual support",
                    "type": "array",
                    "minItems": 1,
                    "items": {
                        "$ref": "#/definitions/TranslatedString"
                    }
                }
            ]
        }
    }
}

......

"definitions": {
    "TranslatedString": {
        "type": "object",
        "properties": {
            "languageCode": {
                "description": "ISO 639 language code",
                "type": "string",
                "minLength": 2,
                "maxLength": 2
            },
            "value": {
                "description": "The value of the string in the specified language",
                "type": "string"
            }
        }
    }
}

Using this schema change, you could validate both this:

{
    "id": "1",
    "date": "2016-09-22T06:38:12Z",
    "tag": [
        "planning"
    ],
    "initiationType": "tender",
    "ocid": "ocds-11",
    "tender": {
        "id": "ocds-11",
        "status": "active",
        "items": [
            {
                "id": "213966",
                "description": [
                    {
                        "languageCode": "ro",
                        "value": "Descriere"
                    },
                    {
                        "languageCode": "en",
                        "value": "Some Description"
                    }
                ],
                "classification": {
                    "id": "5",
                    "description": "Five"
                }
            }
        ]
    }
}

and this (the old format)

{
    "id": "1",
    "date": "2016-09-22T06:38:12Z",
    "tag": [
        "planning"
    ],
    "initiationType": "tender",
    "ocid": "ocds-11",
    "tender": {
        "id": "ocds-11",
        "status": "active",
        "items": [
            {
                "id": "213966",
                "description": "Description",
                "classification": {
                    "id": "5",
                    "description": "Five"
                }
            }
        ]
    }
}

I tested all the JSON snippets above and they are valid (against both old and this new schema).
Hope it is helpful.

@timgdavies
Contributor Author

Thanks @mpostelnicu

I can see this works well from a schema perspective, but my fear is that it is difficult for users to deal with if they have to anticipate either a string or an object.

I guess this can be handled by:

  • Providing technically skilled users with guidance on the need to wrap any requests for string values in some sort of helper function which will handle language maps;
  • Providing more basic users with some conversion tooling which will simplify a file into a single-language version, or otherwise make it easy to work with language maps.
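
A helper of the kind described in the first bullet could look roughly like the Python sketch below. The name `get_text` and its fallback behaviour are assumptions made for this example; it accepts a plain string, a language map object, or the array of `languageCode`/`value` objects proposed above.

```python
def get_text(value, lang, fallback=None):
    """Return the text for `lang` from a string, language map, or translation list."""
    if value is None or isinstance(value, str):
        # Old format: a single string in the file's default language.
        return value
    if isinstance(value, dict):
        options = value
    else:
        # Array form: [{"languageCode": "ro", "value": "Descriere"}, ...]
        options = {entry["languageCode"]: entry["value"] for entry in value}
    if lang in options:
        return options[lang]
    return options.get(fallback)


old_style = "Description"
array_style = [
    {"languageCode": "ro", "value": "Descriere"},
    {"languageCode": "en", "value": "Some Description"},
]
map_style = {"en": "Some Description", "ro": "Descriere"}
```

With this, `get_text(old_style, "en")`, `get_text(array_style, "en")` and `get_text(map_style, "en")` can all be called the same way, which is the point of wrapping string access in one place.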

However, your proposal is particularly interesting with respect to backwards compatibility: data valid against 1.0 would still be valid against 1.1 if it included language maps in this way.

The pattern properties (title_es etc.) could be deprecated, but not removed until 2.0.

@kindly
Contributor

kindly commented Nov 22, 2016

Thank you @mpostelnicu

I think using oneOf is the only way to get where we want, even though it would be a pain for some data consumers, as @timgdavies said. We also already use it in the record package schema.

However, I would not represent the languages as a list, as it causes issues when making a flattened representation and limits validation options (i.e. it is hard to validate whether people have put the same language code twice).
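
To make the validation concern concrete: with the list form, a repeated language code has to be caught by a separate check, along the lines of this Python sketch (`duplicate_language_codes` is a hypothetical helper). With an object keyed by language code, a repeat cannot be represented in the first place.

```python
from collections import Counter


def duplicate_language_codes(translations):
    """Return the language codes that appear more than once in a translation list."""
    counts = Counter(entry["languageCode"] for entry in translations)
    return sorted(code for code, count in counts.items() if count > 1)


description = [
    {"languageCode": "ro", "value": "Descriere"},
    {"languageCode": "en", "value": "Some Description"},
    {"languageCode": "en", "value": "Another Description"},
]
repeated = duplicate_language_codes(description)
```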

I would suggest doing it like:

"definitions": {
    "multilingualString": {
        "oneOf": [
            {"type": "string"},
            {"type": "object", "properties": {"en": {"type": "string"}, ...}}
        ]
    }
}

(with one property for each language code, or use patternProperties instead)

So any translatable field would look like:

{
    "description": {"$ref": "#/definitions/multilingualString"},
    "title": {"$ref": "#/definitions/multilingualString"}
}

Use cases like Canada's could patch the schema (in one place) to make en and fr required and also limit the properties to just those two (so any language code outside them would look like an extra field).
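
A minimal Python sketch of such a patch, assuming the `multilingualString` definition above is expressed with patternProperties (the `restrict_languages` helper and the example pattern are hypothetical):

```python
import copy


def restrict_languages(schema, languages):
    """Return a copy of the schema whose language map requires exactly `languages`."""
    patched = copy.deepcopy(schema)
    for option in patched["definitions"]["multilingualString"]["oneOf"]:
        if option.get("type") == "object":
            option.pop("patternProperties", None)
            option["properties"] = {code: {"type": "string"} for code in languages}
            option["required"] = sorted(languages)
            # Any other language code now counts as an extra field.
            option["additionalProperties"] = False
    return patched


schema = {
    "definitions": {
        "multilingualString": {
            "oneOf": [
                {"type": "string"},
                # Assumed pattern, for illustration only.
                {"type": "object", "patternProperties": {"^[a-z]{2}$": {"type": "string"}}},
            ]
        }
    }
}
canada = restrict_languages(schema, ["en", "fr"])
```

Note that this sketch leaves the plain-string alternative in place; a publisher that wants every translatable field to carry both languages would also need to drop that branch.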

@timgdavies timgdavies modified the milestones: Version 1.1, Version 2.0 Jan 27, 2017
@timgdavies
Contributor Author

Discussed with @kindly

We will work this up and put it forward to peer reviewers for 1.1, for a view on accepting it or pushing it to 2.0 (as it creates some backwards compatibility issues).

@kindly
Contributor

kindly commented Feb 9, 2017

A first attempt at a patch for this can be found here.

https://github.com/open-contracting/ocds_upgrade_1_1_patches/blob/master/229_language_map/release-schema.json

At the moment this uses patternProperties for what can be in the Language Map, but we could rely on extensions to add particular languages to the map instead, or as well. Having the pattern properties lowers the initial barrier to using the language map but is less explicit.

More importantly, this patch breaks backwards compatibility and for that reason needs a lot of consideration before inclusion in a non-major release.

@timgdavies timgdavies removed this from the Version 1.1 milestone Jun 15, 2017
@timgdavies timgdavies added this to the Version 2.0 milestone Jun 15, 2017
@timgdavies
Contributor Author

Moving to 2.0.

This was postponed beyond 1.1. Whilst this change is considered a good move in the long run, it involves a backwards incompatible change, and would require substantial refactoring of documentation generation tools and other resources.

@duncandewhurst
Contributor

Jordan expressed a preference for translating entire releases, rather than in-file translations of individual fields.

@jpmckinney jpmckinney changed the title Use language maps (2.0) or abandon in-file translations (1.2) Discussion: Use language maps (2.0) or abandon in-file translations (1.2) Jul 17, 2020
@jpmckinney jpmckinney changed the title Discussion: Use language maps (2.0) or abandon in-file translations (1.2) Discussion: Abandon in-file translations (1.2) or use language maps (2.0) Jul 17, 2020
@jpmckinney
Member

jpmckinney commented Jun 7, 2023

I checked using the aggregated data at https://ocdsdata.fra1.digitaloceanspaces.com/metadata/stats.json. It's missing changes made to spiders over the past year. I didn't match the full pattern, because that pattern matches almost any non-translation field. In short, the main users of _xx suffixes are:

  • armenia (lapsed)
  • canada_buyandsell (pilot)

Then there are a few fields for:

  • honduras_portal_api_records 2023-06-06 classifications_ga
  • honduras_portal_api_records 2023-06-06 classifications_ue
  • honduras_portal_api_releases 2023-05-30 classifications_ga
  • honduras_portal_api_releases 2023-05-30 classifications_ue
  • honduras_portal_bulk 2023-06-04 classifications_ga
  • honduras_portal_bulk 2023-06-04 classifications_ue
  • kyrgyzstan 2023-01-06 identifier_legalName_kg

So, I think it's okay to abandon in-file translations as a standardized method. We can recommend full-file translations for publishers like Canada, and for the few fields above, there can be local extensions.
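
For anyone wanting to repeat a check like this on their own list of field names, a rough Python sketch follows. It assumes the input is simply a list of field names (the structure of stats.json is not shown here) and looks only for a two-letter suffix rather than the full schema pattern, so every hit still needs manual review.

```python
import re
from collections import Counter

# Assumption: only "_" plus two lowercase letters at the end of the name counts.
TWO_LETTER_SUFFIX = re.compile(r"_([a-z]{2})$")


def count_suffixes(field_names):
    """Count the two-letter suffixes found at the end of the given field names."""
    counts = Counter()
    for name in field_names:
        match = TWO_LETTER_SUFFIX.search(name)
        if match:
            counts[match.group(1)] += 1
    return counts


names = [
    "title",
    "title_fr",
    "identifier_legalName_kg",
    "classifications_ga",
    "classifications_ue",
]
suffixes = count_suffixes(names)
```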

What do you think, @yolile ?

@jpmckinney jpmckinney changed the title Discussion: Abandon in-file translations (1.2) or use language maps (2.0) Abandon in-file translations Jun 7, 2023
@jpmckinney
Member

Noting that we can add some guidance relating to this (#1064).

@yolile
Member

yolile commented Jun 7, 2023

The Honduras ones look more like typos than actual use of in-file translations. So I think it is fine to abandon them.

@duncandewhurst
Contributor

Abandoning in-file translations sounds good to me. OC4IDS 0.9.4 adds some guidance that might be relevant: https://standard.open-contracting.org/staging/infrastructure/0.9-dev/en/guidance/language/#publishing-data-in-your-own-language

Clarifying what is required to close this issue:

  1. Identify and remove mechanisms for in-file translations in the schema
  2. Identify and remove normative and non-normative documentation about in-file translations
  3. Add non-normative documentation about publishing full-file translations.

Sound good?

@jpmckinney
Member

Sounds good! (1) is mostly removing the patternProperties fields.

@duncandewhurst duncandewhurst changed the title Abandon in-file translations Abandon in-file translations, document full-file translation guidance Jun 19, 2023
@duncandewhurst
Contributor

@jpmckinney for (3), would reframing the OC4IDS guidance on publishing data in your own language as a worked example on translations be sufficient?

@jpmckinney
Member

Yes, that would be a good start. I think for the lookup we can just have a list of translatable fields (as it is otherwise very long).

@duncandewhurst
Contributor

> Sounds good! (1) is mostly removing the patternProperties fields.

I'm assuming that we actually want to deprecate the patternProperties fields. Let me know if not.

@jpmckinney
Member

lib-cove doesn't check patternProperties in _get_schema_deprecated_paths. I think it's better to just remove them, as that will solve a bug related to patternProperties (open-contracting/lib-cove-ocds#73). Users will just get warnings about additional fields instead of warnings about deprecated fields, which is about equivalent.
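
As a sketch of what the removal amounts to on a schema loaded as Python data (the `remove_pattern_properties` helper is hypothetical and not part of lib-cove):

```python
def remove_pattern_properties(schema):
    """Return a copy of a schema (nested dicts and lists) with patternProperties removed."""
    if isinstance(schema, dict):
        return {
            key: remove_pattern_properties(value)
            for key, value in schema.items()
            if key != "patternProperties"
        }
    if isinstance(schema, list):
        return [remove_pattern_properties(item) for item in schema]
    return schema


# Shortened, made-up fragment in the shape of the release schema.
schema = {
    "definitions": {
        "Item": {
            "type": "object",
            "properties": {"description": {"type": ["string", "null"]}},
            "patternProperties": {"^(description_xx)$": {"type": ["string", "null"]}},
        }
    }
}
cleaned = remove_pattern_properties(schema)
```

One caveat of this simple approach: it would also drop a field that is itself named `patternProperties`, so it is only suitable where no such field exists.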

@duncandewhurst
Contributor

Sounds good! Shall I remove or deprecate the language section of the schema reference page?

@jpmckinney
Member

jpmckinney commented Nov 29, 2023

I think just delete, since that content can't be followed if patternProperties is removed. Implementers of 1.1 can navigate to the docs for 1.1.
