This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

Machine readable data dictionaries inside data.json #332

Closed
exafox opened this issue Jul 16, 2014 · 8 comments

Comments

@exafox

exafox commented Jul 16, 2014

As an extension to the current metadata schema, it would be useful to have one standard way to store data dictionary information to enable future collaboration and integrations.

Some work has taken place in this area, but I am not aware of a format that is universally accepted and also compatible with CKAN and the work done to support Project Open Data. I hope and expect that these other efforts can be rolled up into this format rather than duplicated or discarded.

Key traits of the format might include:

  • Embeddable inside the data.json data structure. If this is impractical, then linked from data.json as linked data. The dataDictionary field as it stands in the current schema may already be sufficient for the latter.
  • Can be persisted by CKAN (perhaps aided by an extension).
  • Not super opinionated about the semantic aspects of the format to allow different audiences to use the same format and tools as much as possible.
@whitten

whitten commented Jul 16, 2014

I understood that the JSON-LD format was attempting to meet the need of a standard way to have data values refer to an external source for the definitions of the data. (Which is what I think is your goal for a Data Dictionary link).
Have I misunderstood your question, or does JSON-LD fail in some way that I didn't understand?

@exafox
Author

exafox commented Jul 16, 2014

Agreed that linking is straightforward. I started the issue to suggest a format for describing field-specific information within this or a related metadata schema.

@haleyvandyck
Contributor

Thanks for raising this, @exafox. We'll add this to the discussion topics for the metadata offsite tomorrow and be sure to report back what was discussed here for those who can't make it in person.

@smrgeoinfo
Contributor

@exafox -- is the issue you're interested in related to #291 ?

@gbinal
Contributor

gbinal commented Jul 24, 2014

I definitely like the idea of encouraging agencies to provide more machine readable data dictionaries. @exafox - do you know of any standards or examples we could point to?

@philipashlock
Contributor

To reference machine readable data dictionaries in a tightly coupled way, you'd really want to be able to do it on the distribution level. Let me suggest two new fields for this, describedBy and describedByType, and let's use them to document a resource in a distribution that has a JSON Schema file serving as a machine readable data dictionary (the data.json file itself would be an example). We'll use describedBy for a URL that points to a JSON Schema file, and describedByType to specify a media type that makes it clear we're pointing to a JSON Schema file. Our distribution might then look something like this:

"distribution": [
    {
        "description": "Widgets data as a JSON file", 
        "describedBy": "https://data.agency.gov/datasets/widgets-statistics/widgets-schema.json",
        "describedByType": "application/schema+json",
        "downloadURL": "https://data.agency.gov/datasets/widgets-statistics/widgets.json", 
        "format": "JSON", 
        "mediaType": "application/json", 
        "title": "widgets.json"
    }
]

This is inspired by the widespread use of link relations, including the "describedby" relation. See similar uses with JSON Hyper Schema and the Protocol for Web Description Resources (wdrs:describedby is even used in ADMS, which is a profile of DCAT).

The more flexible way of using link relations would be to abstract it one level out and enable lots of link relations, so describedby might just be one kind. Here's what that might look like:

"distribution": [
    {
        "description": "Widgets data as a JSON file", 
        "downloadURL": "https://data.agency.gov/datasets/widgets-statistics/widgets.json", 
        "format": "JSON", 
        "link": [
            {
                "href": "https://data.agency.gov/datasets/widgets-statistics/widgets-schema.json", 
                "rel": "describedby", 
                "type": "application/schema+json"
            }
        ], 
        "mediaType": "application/json", 
        "title": "widgets.json"
    }
]

This is a more common way of doing this kind of link relation, but it adds a bit of extra complexity and might not be worth it.
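To make the trade-off concrete, here is a rough sketch of what a consumer would need to do to locate the data dictionary reference under either layout. The function name `find_described_by` is hypothetical and not part of any schema or tool; it only assumes the field names shown in the two examples above.

```python
from typing import Optional, Tuple


def find_described_by(distribution: dict) -> Optional[Tuple[str, Optional[str]]]:
    """Return (url, media_type) for a distribution's data dictionary, or None.

    Hypothetical helper: handles both the flat describedBy/describedByType
    layout and the nested "link" array layout sketched in this thread.
    """
    # Flat layout: the URL and media type sit directly on the distribution.
    if "describedBy" in distribution:
        return distribution["describedBy"], distribution.get("describedByType")

    # Link-relation layout: scan the "link" array for rel == "describedby".
    for link in distribution.get("link", []):
        if link.get("rel", "").lower() == "describedby":
            return link["href"], link.get("type")

    # Neither layout is present.
    return None


flat = {
    "describedBy": "https://data.agency.gov/datasets/widgets-statistics/widgets-schema.json",
    "describedByType": "application/schema+json",
}
nested = {
    "link": [
        {
            "href": "https://data.agency.gov/datasets/widgets-statistics/widgets-schema.json",
            "rel": "describedby",
            "type": "application/schema+json",
        }
    ]
}
print(find_described_by(flat) == find_described_by(nested))
```

The flat layout needs a single key lookup, while the link layout needs a loop and a comparison on `rel`, which is the extra complexity mentioned above.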

We could have describedBy serve as a replacement for dataDictionary and use it both at the dataset level and at the distribution level. If describedBy was used on its own without describedByType, it would be assumed to be a human readable HTML resource, just like people use dataDictionary now, but if they wanted to reference a machine readable file, they could make that clear by specifying the type of machine readable file with describedByType. Taking this approach rather than creating another level with an array of link objects seems simpler and much more like how we're already using dataDictionary, so it would be a painless transition for those wanting to continue using it the way they already are.
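As a minimal sketch of the fallback rule described above, a consumer might interpret the pair of fields like this. The function name `resolve_described_by`, the returned keys, and the choice of `text/html` as the default media type are assumptions made for illustration, not something the schema defines.

```python
from typing import Optional

# Assumed default: describedBy without describedByType is treated as a
# human readable HTML page, the way dataDictionary is used today.
DEFAULT_MEDIA_TYPE = "text/html"


def resolve_described_by(obj: dict) -> Optional[dict]:
    """Interpret describedBy/describedByType on a dataset or distribution.

    Hypothetical helper: returns None when no describedBy is present,
    otherwise the URL, the effective media type, and whether the type
    was stated explicitly.
    """
    url = obj.get("describedBy")
    if url is None:
        return None

    declared = obj.get("describedByType")
    return {
        "url": url,
        "mediaType": declared if declared is not None else DEFAULT_MEDIA_TYPE,
        "typeDeclared": declared is not None,
    }


human = resolve_described_by({"describedBy": "https://data.agency.gov/widgets/dictionary.html"})
machine = resolve_described_by({
    "describedBy": "https://data.agency.gov/datasets/widgets-statistics/widgets-schema.json",
    "describedByType": "application/schema+json",
})
print(human["mediaType"], machine["mediaType"])
```

The same function works on a dataset object or a distribution object, since the proposal uses the same two field names at both levels. The `dictionary.html` URL is a made-up example.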

So I'm in favor of doing this with describedBy and describedByType, which can be used on both the dataset and distribution level. At the dataset level, describedBy would replace dataDictionary. This would also be consistent with the use of describedBy in #309 (comment) and would also help address #291.

rebeccawilliams pushed a commit that referenced this issue Oct 2, 2014
Changes that still need to be addressed are changes in structure and should we add usage notes additions here or no?:

* Adds optional describedByType field at the dataset and distribution level (#291, #332)
* Changes contactPoint field to an object that contains the name (fn) and email address (hasEmail) (#358)
* Adds fn field as part of contactPoint replacing earlier use of contactPoint (#358)
* Changes publisher field to an object that allows multiple levels of organizations (#296)
* Changes accessURL field to represent indirect access and to exist only within distribution (#217, #335) 
* Changes format field to a human readable description and to exist only within distribution (#272, #293)
* Adds optional description field for use within distribution (#248)
* Adds optional title field for use within distribution (#248)
* Changes accrualPeriodicity field to use ISO 8601 date syntax (#292)
* Changes distribution field to become required-if-applicable and to always contain the accessURL or downloadURL fields (#217)
* Changes license field to be a URL (#196)
@philipashlock
Contributor

A first pass at including describedBy and describedByType was done with 6945364.

@philipashlock
Contributor

There are several ways we've addressed this. We're now able to reference definitions for both the metadata itself and the data from within the metadata.

To reference definitions of the metadata, we can now use @context and @type (#388) as well as describedBy and conformsTo (#309) on the catalog object.

To reference definitions for the data, we can now use describedBy/describedByType (#332) and conformsTo (#362) on both the dataset and distribution objects.

If you have additional feedback on using these fields, please add it to the relevant issue or open a new one as needed.


7 participants