
Added W3C Cookbook for Open Government Linked Data #21

Merged
merged 5 commits on May 20, 2013

Conversation

jqnatividad
Contributor

The W3C Cookbook was drafted by the W3C Government Linked Data Working Group.

DCAT, which is aligned with the common core metadata schema promulgated by the policy, was likewise drafted by the Government Linked Data Working Group.
@benbalter
Contributor

@philipashlock @MarinaMartin @kachok... linked data isn't my thing. I'm all for clean, human readable URLs, but thoughts on adding?

@konklone
Contributor

I don't wish to start a holy war here, but Linked Data isn't at the level I'd like to see as an officially endorsed recommendation for government agencies. The scope of this project, of this Order, is to get data published in open, machine-readable formats where the barrier to use the data is as low as possible. Keeping recommendations at the transport level (JSON, XML, CSV, not binary, etc.) makes a lot more sense than the schema level (FOAF, RDF, OWL, etc.). It leaves more flexibility in both publishing and consumption, and focuses interoperability efforts on more fundamental concerns (like unique identifiers).

@kachok

kachok commented May 12, 2013

I completely agree with Eric. The only place where anything resembling linked data will make sense at this moment is to restrict dataset IDs to be URIs. That will help tremendously in harvesting and metadata reconciliation between different data.json files.


@sbma44

sbma44 commented May 12, 2013

Agreed with Dmitry and Eric. This project shouldn't push any standards that haven't won broad acceptance among the types of engineers in the target audience. Linked Data clearly doesn't yet qualify. There's a real danger of inhibiting use with overly baroque standards -- the lackluster use of the SEC's XBRL disclosures outside of the enterprise software world should be a sobering example.

URIs enforce a lot of good and necessary identifier habits, though, so I'm on board for that as well.

@georgethomas

Guys, agencies are already doing LD. More of them are cutting through the unintentional FUD created by developers (who typically have never done any LD work, or otherwise have a limited understanding of what it's all about), and they are bringing more LD based on voluntary consensus, open standard, machine readable (and interpretable!) formats every day :-)

The structure of resource representations / messages / document objects, regardless of whether they're serialized as XML, JSON, CSV, etc., is about schema, not transport. The lack of machine readable metadata and data structure is a huge complaint from open gov data consumers that LD fixes without breaking a sweat. Take a look at the core metadata page - notice anything different about the metadata tags listed there or the syntax supported? JSON keys are 'things, not strings' (using the soundbite/tagline Google introduced to describe their Knowledge Graph), but RDFa tags (and JSON-LD keys!) are globally disambiguated using HTTP URIs that your app client can also dereference. So it's not only instances of object classes (like /id/datasets or /id/agency or any real world thing or abstract concept) that URIs are useful for, but also the tags being used (like /def/{schema-name}/{ClassName-or-propertyName}). Both metadata tags/terms and instance data benefit tremendously from having a GUID in a globally shared namespace, which means there's nothing you need to diff.

Saying LD isn't useful is like saying schema.org isn't useful.

The cool thing here is that LD folks can fork the repo and add other LD-friendly RDF serializations (besides RDFa, which is already supported!), such as Turtle or JSON-LD, to the catalog generator :-)

@sbma44

sbma44 commented May 12, 2013

I certainly understand that some agencies are pushing LD, George. But my impression is that this has happened largely through interventions by advocates and academics, not in response to outside demand or systems' functional requirements.

At Sunlight we consume and publish a tremendous amount of government data; to the best of my knowledge, LD technologies have yet to prove themselves useful to any of these applications (I believe we might have used a DBpedia SPARQL endpoint for matching entity bios in our IE database, but the results were riddled with errors due to limitations in the data). The experience of GovTrack is also instructive, I think. Josh can speak for himself, but my understanding is that he invested heavily in RDFa, but ultimately found it to be so poorly adopted as to be less useful than simpler alternatives.

Obviously there are systems that employ LD technologies to useful ends. But they tend to be either massive black boxes (e.g. Calais) or tightly limited subsets of functionality (e.g. schema.org -- and schema.org is useful not primarily because of its inherent technical attributes (there are plenty of other systems that solve or solved this problem, from microformats to Common Tag) but because it's being pushed by the most important search engines in the English-speaking world). There are comparatively few small startups or shops that employ LD heavily.

Ultimately, working developers will adopt the parts of these technologies that are useful and discard the rest. This happens again and again (I still remember thinking, at the dawn of XML, that I would have to learn XSLT to be a competent web programmer).

But this process needs to happen organically, not through anointment by government. Agencies will do well to experiment, but the XBRL example needs to be taken seriously: I've spoken to the people at the SEC who pushed for it, got it through, and are now scratching their heads over why it isn't being used by watchdog groups and reporters -- all while watching the internal political support for their project dry up. They're desperate for use, but they aren't getting it. XBRL is complex partly by necessity (US GAAP is hugely complicated), but this dynamic is real. IMO, open data is too important to be repurposed as a vehicle for promoting a technical agenda that alienates many developers.

@waldoj

waldoj commented May 12, 2013

RDF reminds me of SOAP—a great solution to an important problem, but one that's so complex that one cannot ease into it, and instead must dive into it wholly. That obstacle doomed SOAP's popular adoption, resulting in the brief rise of its parent, XML-RPC, which in turn lost out to the simpler-still REST. There's nothing wrong with SOAP, as there's nothing wrong with RDF—it's just too complex for me and, I think, too complex for a lot of other developers.

@JoshData
Contributor

Since Tom brought up what I had done with GovTrack, I'll expand on that a little because I think it's a helpful story:

From about 2006-2010 I generated linked data RDF for all bills in Congress, Members of Congress, etc. I had an RDF/XML dump, <link> tags in the corresponding HTML pages to semweb URIs, a SPARQL endpoint, and a little RDFa (though not much and perhaps not for very long). The reason I dropped it wasn't exactly that it wasn't worth it for me, but that absolutely no one was using it. Regular ol' XML data dumps were what people wanted. (I still get requests for my Census and SEC data in RDF that I did around the same time, but never the GovTrack data, and the GovTrack data was a far more compelling example of RDF than the Census and SEC data.) There were at least a few reasons why no one used it, including complexity along the whole tool chain as Waldo mentions, but also that no one else was generating other data it links to. So "linked to what?" comes to mind. That last bit is something that will change over time (there can't be any fewer people generating linked data around Congress!).

So I think Tom is right to say that, at least in certain domains, there is zero demand for Linked Data and in those cases it would be a major failure of “engage with customers” (as the memorandum says) to jump ahead to LD. If this github account is for settling on existing standards of practice, Linked Data is certainly not one (except perhaps in a few particular domains).

That doesn't mean agencies shouldn't experiment with it. There are useful pieces too, such as just deploying URIs.

@georgethomas

typo correction, I said:

"JSON keys are 'things, not strings'"

but what I meant to say was JSON keys are strings, not things. That is, they have NO global network identity - that's the point.

Instead, what we really want are entities and relations that do have global network identity via HTTP URIs, so that's an important correction.

Sorry, I was in a hurry, gotta visit Mom on Mothers Day :)

@georgethomas

@sbma44

Tom, yes, agencies are 'doing' LD (as opposed to 'pushing' it, which maybe sounds a bit illicit), and that typically began with some advocate, who is often informed by any and all pursuits, including academia and industry. I'm a good example of an agency advocate, as you probably guessed :) I've certainly been informed by academia on this topic, just as I have by many startups and large enterprises using semantic technologies, with Linked Data representing the Semantic Web for Dummies. This is always the case for new technology understanding and rollout, isn't it? Fielding got a PhD defining REST, and over time the Web API world realized that SOAP+WS-* was very complex and often unnecessary when compared with simpler REST practices (which are still maturing as the 'hypermedia' part takes greater hold), thanks to REST advocates and developers using what they like. But doing a distributed transaction across a bunch of disparate REST APIs might be harder than doing it via SOAP/WS-* (if the participants can do that, which is often limited to enterprise partners).

I admire and am personally extremely grateful for the work of Sunlight and Sunlight Labs. I missed #tcamp13 for the first time this year and am bummed about that. I'm glad to hear you've experimented with DBPedia, but I imagine that the errors and limitations you cite had little to do with the mechanics of LD; rather, initial source data is always dirty, no matter what its source or how it's exposed.

As for black boxes, they don't call it the Linked Open Data cloud because it's closed and proprietary like GKG; rather, it's a large distributed database, a giant global graph of data in the Web that's radically open (although it can of course be secured and access controlled in various ways). Looking at the obligatory image of the LOD cloud (at http://lod-cloud.net/), imagine each of the data.agency.gov publisher sites as the nodes in that graph, with Data.gov at the center, instead of DBPedia. Once that's realized, reconciliation services such as those supplied by Calais and others will be based on the Open Data Knowledge Base aggregated by Data.gov, not some proprietary paywall info broker like Thomson Reuters.

I fully understand the pull of SEO for getting rich snippets from the GOOG as the rationale for schema.org, but the gov can, and has every right to, do their own domain-specific schemas (HHS has invested $$$ of public funds here), and can also index that domain using that schema, which doesn't obviate the value of leveraging schema.org too. Right now, afaik, we only have the search interface into the GKG, and we're not really sure how things described via schema.org find their way into the GKG, in stark contrast to the LOD cloud.

Yes, developers leverage what's useful, and that influences general practice in important ways, but they also must respond to customer requirements, and ideally both developers and customers get smarter over time and their practices evolve to reflect that. Sometimes that means embracing and incorporating new techniques! The customer gets a new capability and the developer gets a higher wage when employing the new skill, before the open source dynamics ultimately commoditize them :)

The LD community, especially those that are also a part of the Open Gov Data community, have gone out of their way to make LD accessible to the 'average' developer. Often (as is the case with the healthdata.gov/cqld effort, or http://data.gov.uk/linked-data) you don't need to know RDF or SPARQL in order to interact with LD. RESTful GETs parameterize key-value pairs in a familiar fashion, hiding the graph structure if you're scared of it, but exposing it in a standards-based way if you're not. JSON-LD was introduced for exactly this reason: to look and feel like the JSON that developers know and love, but whose keys (and often values) have global network identity. XSLT is still as useful as it was ten years ago, and today so is SPARQL. The more tools in your toolbox, the better chance of making something hard easier to do, and making more money.

Finally, it's the nature of your last paragraph that I find so startlingly wrong-headed in the Project Open Data context, or at least terribly misunderstanding of where folks like me are coming from. No one is suggesting an anointed agenda that hasn't been embraced by developers. What the LD community and agencies doing LD want is for LD not to be excluded from Project Open Data because a handful of developers continuously pee in our wheaties and spread unintentional FUD, like comparing LD technology to XBRL technology (although I acknowledge the new-tech-injection issue is similar, that's not what the unwashed will read and comprehend from your comments, imho).

This is not an either/or thing; LD builds on top of REST. Everyone's app client likes links and URIs. JSON-LD looks and feels and acts like JSON. These are open standards for machine readable data that folks are making good use of now, and the community that devises and promulgates Project Open Data principles and pursuits is wrong to exclude them. If you want POX, great, and I think what Data.gov is doing on behalf of POD accommodates this. But if you want to do LD, you can do that too. Then we might bootstrap our way out of the chicken-and-egg problem that LD (and so many other excellent and useful technologies) has faced, and more utility will be gained for the open data movement, because we can stop spreading FUD about publishing LD using these best practices and get on with the business of cross-linking.

We want the same things.

Thanks for your feedback, I appreciate this channel and your interaction.

@georgethomas

@waldoj

RPC and REST are both still useful implementation styles for exposing web services. CKAN's new 'Data API' is a relevant recent RPC example. There are still two worlds: one an Enterprise IT world with ESBs, and another, more Web-native world. Clearly the external presentation of enterprise information that the Digital Strategy and Open Data Policy want to manifest applies Web-native standards, tools and techniques. I'm all for that, as I think all open gov data technologists and developers are.

Given your well-earned reputation as a talented developer (among many other things I'm sure), I'd love to understand more about the generalizations you're making, and what in your experience has been so complicated that a developer like yourself can't grok the LD publishing/consuming technologies and toolchains (which don't require RDF, but are even more powerful when leveraging RDF).

Unfortunately, your remarks might seem to suggest to the uninformed or inexperienced that LD is a dead end, just too complex for smart people like you, when in fact it's already an important part of the global open gov data movement and many agency pursuits. Clearly LD is still an emerging technology (although the underlying standards have been well established for many years), and I'll admit from experience that it's still harder to do LD than to understand it, but it has business value that's recognized by many, including various gov agencies.

Elsewhere in the world, LD is said to be 'beguilingly simple' (Nigel Shadbolt) and embraced as an emerging innovation evolving the Web of Documents that humans find so useful into a Web of Data that's useful for machines, and btw also demonstrates a way to tear down the silos we continue to construct even in the open gov data movement. There are lots of startups doing semantic search for example, and in some way these players help move entrenched players in that market.

For some reason, particularly in the open gov data efforts orbiting around Washington DC, there seems to be, within a small but influential number of voices, a constant poo-pooing of LD at any opportunity, usually citing vague generalities and complex unsubstantiated relations with authority that sometimes, perhaps even often, paint a skewed perspective of both techne and utility. In these days of Graph Search and Knowledge Graph focus from the most monetized Web-native players industry has ever seen, with CRUD standards for LD being formalized by the largest Enterprise IT firms in the world (IBM, Oracle, etc., due to their recognition of LD as a killer integration/interoperability technology), I would think there would be intense interest from the open gov data developer community in a graph-oriented standard for data in the Web.

What gives?

@waldoj

waldoj commented May 13, 2013

George, I only said that RDF is complicated and that my peers and I have a hard time using it. That's all.

@georgethomas

@tauberer

Josh, not surprisingly, you were clearly ahead of the curve, and an inspiration for many I might add :)

RDF/XML syntax is probably singly responsible for driving many away from RDF, and deserves its reputation as the most horrific syntax for RDF. Turtle, JSON-LD, and HTML+RDFa are the important syntax encodings for RDF now. This probably wouldn't have affected the lack of uptake you experienced in the past, but I'm pretty confident will affect usage in the future.

POX hasn't led, and is unlikely to lead, to silo dissolution, auto graph merging, entity resolution via reconciliation services, or the network effect for open gov data in a decentralized and federated way. Sure, GOOG can do this, but we want to enable 'cooperation without coordination' of citizen analysts as social data web curators. Would you agree?

As I mentioned in my comments to Tom, there are existing customers that want LD, so excluding them would be a failure of the open gov data movement, and not something that will be tolerated by taxpayers like me! The future is here, it's just unevenly distributed.

I wish I could find a recent SVR quote in one of the fed rags, something like 'we're building a metadata management infrastructure' - it's not just about data, it's also about metadata, and LD offers a simple Web native way to up our game there in a big way - it's metadata on steroids (or just HTTP URI's really :).

Others outside of gov agencies routinely republish any/all data formats into the LOD cloud, and to your point, the real value of the network effect only kicks in when we get on to the business of (automating) cross linkages - and this is the main utility of LD for OGD. Even though the LOD cloud supercharges mashups for folks that understand it, it also brings provenance challenges that I think really only source publishers (agencies) can solve for open gov data consumers.

@georgethomas

@waldoj

OK, I appreciate that, but I think readers fill in the blanks and extrapolate the statements of those they respect or who appear authoritative, and I think most would agree that your initial comment suggests that there's no reason for anyone to be interested in LD because it's too complicated for rockstar developers, much less other mere mortals, to understand and use.

But I'm more interested in your experiences and the experiences of your colleagues. What do you find so complex? If you have the time to elaborate and share, I'm sure it would be instructive for all.

The examples on http://json-ld.org/spec/latest/json-ld/#relationship-to-other-linked-data-formats are useful. Given the fact that there are open source libraries for processing RDF in any of its syntactical forms for your favorite language, I'm wondering where the complexity and any other barriers to entry for developers exist.

@waldoj

waldoj commented May 13, 2013

George, you are inferring things that I neither stated nor implied. ("Rockstar developers"?) I pointedly said that "there's nothing wrong with RDF." I simply said that I find it more confusing than the current state of affairs (e.g., schemaless JSON). A schema is inherently more confusing to get started with than no schema. My attempts over the years to either produce or ingest RDF have all failed. I can't tell you why, what with the failure. You are clearly very enthusiastic about and supportive of RDF, and that's great. Personally, it doesn't interest me sufficiently to overcome my difficulties in implementing it. Your line of questioning would be like me demanding to know why you don't engage in astrophotography of supernovae. "I demand that you justify your disinterest! It's not that hard to hook up an SLR adapter to a Dobsonian eyepiece via a Barlow!" :)

@konklone
Contributor

Neither I, nor any commenter as I interpret them, is suggesting that this document should actively discourage agencies from trying Linked Data. What I am saying is that it's not appropriate for the Project Open Data document to actively encourage agencies to use Linked Data. The lack of widespread adoption and enthusiasm among web developers is what the Linked Data community needs to tackle first.

Since Linked Data is built on top of the transports that we all believe in -- XML, JSON, etc. -- there's nothing in the document discouraging an agency from trying a Linked Data solution if their customers are demanding JSON-LD over plain JSON. In another 10 years, there are surely going to be newer patterns spreading around above the transport level, and agencies should have a free hand to try those too. It is this flexibility that I'm advocating we keep.

@georgethomas

Waldo, I meant 'rockstar developer' as a compliment regarding your reputation (to include others on this thread!), and my point is that people always infer from implications. I was hoping you'd share more about why LD is of no interest to you, and specifically more about what failed for you, or what you and your colleagues found so difficult or complex. I haven't demanded anything of you, but I understand if you're not interested or simply don't have the bandwidth to participate and collaborate on this topic, so no harm no foul!

Eric, what I'm asserting is that it's inappropriate for Project Open Data to actively discourage LD (open-standards-based, machine-readable technologies) by excluding existing open standards and corresponding best practices for publishing and consuming machine-readable data. This is largely why RDFa is already supported as a resource representation (transport?) along with POX and JSON.

No winner need be picked, agencies can judge for themselves. Give them options, ranging from remedial to advanced, which can be introduced incrementally.

None of us as individuals has a singular authority of direction setting for Project Open Data in this collaborative endeavor, and each of us has a responsibility to engage and contribute what we can - so it's perfectly fine for you to assert that there's no interest in LD among developers, just as it's perfectly fine for me to point out that LD is the goal of some of the most successful open gov data participants, and many agencies in the US, and there are many developers that support those efforts whom you may not know.

Perhaps we might continue this exploration with input from the US CTO and CIO, the Data.gov team and others. This thread started out including some GLD work from the W3C, and then it appeared as if some folks seemed to want to exclude that for reasons that are not representative of the US, much less the global open data movement.

I hope no one will interpret this as anything resembling a holy war as Eric was originally concerned about. I'm an open gov data practitioner, advocating more options, and fully supporting the foundations that many on this thread have made significant contributions to establish in policy, which I think is a fantastic achievement.

@jqnatividad
Contributor Author

Fellow Datanauts,
To produce high-quality, 5-star Linked Data, the Linked Data Lifecycle described in the Cookbook prescribes the following seven steps:

  1. Model the Data
  2. Name things with URIs
  3. Re-use vocabularies whenever possible
  4. Publish human and machine readable descriptions
  5. Convert data to RDF
  6. Specify an appropriate license
  7. Host the Linked Data Set Publicly and Announce it!

Perhaps, until Linked Data technologies are widely adopted (which I believe we can all agree is the Nirvana we are all striving for: a Web of Data), we could modify the lifecycle: skip step 5 (for now), and tweak step 7 to drop the "Linked" adjective?

As the Hendler Hypothesis goes, "A little semantics goes a long way." In this case, URIs and, perhaps, schema.org markup.
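For illustration only, a minimal sketch of what that little bit of semantics could look like on a dataset landing page, using schema.org's Dataset type expressed as JSON-LD (the URLs and values here are made up, not from any actual agency catalog):

```json
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "@id": "https://data.agency.example.gov/id/dataset/crime-incidents-2012",
  "name": "Crime Incidents 2012",
  "description": "Reported crime incidents, one record per incident.",
  "url": "https://data.agency.example.gov/dataset/crime-incidents-2012"
}
```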

Dr. Hendler's session at Strata 2013 is most instructive.

@MarinaNitze
Contributor

@jqnatividad Thanks for contributing! I just saw #17 ... could you please submit a second, separate pull request for adding the link? Then we can track the two changes clearly/separately.

Marina

@benbalter
Contributor

@jqnatividad it looks like you committed jqnatividad@5694b65 to master to submit #17, but then branched https://github.com/jqnatividad/project-open-data.github.io/tree/LinkedDataCook off of master (stacking the commits). #17 should be clean, but depending on your git prowess, could you either remove jqnatividad@5694b65 (ideal), or at the very least commit a correction to this branch removing the extraneous line? That will make this pull request easier to merge if that's the direction the thread ends up going.

@jqnatividad
Contributor Author

@MarinaMartin @benbalter corrected as requested. Now two "clean" separate pull requests. Will use prose.io going forward.

@jpmckinney
Contributor

I find the comments by @konklone, @kachok, @sbma44 and others confusing. This project is proposing a schema with its "common core metadata". The schema is serialized as JSON, sure, but this project is clearly operating at the schema level and not solely at the transport level. You can have linked data in a variety of transports like JSON, N3, Turtle, XML, etc. I love JSON, and choosing JSON doesn't mean you can't also do linked data.

To get away from the philosophical debates and onto more practical concerns, turning the JSON schema proposed by this project into a linked data schema would only require adding three simple properties to each JSON document - you can see examples in this comment in #23.

One property is @id, which is a good idea so that a document can identify itself. Another is @type, which is also a good idea so that we know what kind of thing the document represents. Lastly, there's @context, which tells you where to find the schema that explains the semantics of the fields used (staring at a JSON file, I wouldn't necessarily know that mbox means "email address"). @context is very useful if the data gets separated from its documentation: a very likely scenario.

Hopefully, adding those three simple properties isn't rocking the boat too much. I think those properties are incredibly useful even if you don't care and don't want linked data. But it just so happens that those three properties would make the proposed schema a linked data schema.
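This is not the exact example from #23, but as a rough sketch of the idea, a common core catalog entry with those three properties added might look something like the following. The context URL and identifiers are hypothetical, and the context is assumed to define the dcat prefix and map title and mbox to their full URIs:

```json
{
  "@context": "https://project-open-data.example.gov/schema/catalog.jsonld",
  "@id": "https://agency.example.gov/id/dataset/1234",
  "@type": "dcat:Dataset",
  "title": "Example Dataset",
  "mbox": "opendata@agency.example.gov"
}
```

A plain JSON consumer can keep reading title and mbox as ordinary keys; a linked data consumer can resolve them through the context.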

@jqnatividad
Contributor Author

@jpmckinney 👍
Like I said, "a little semantics goes a long way!"
Perhaps the group can look into http://json-ld.org/ - a lightweight linked data format built on JSON.
The latest draft of the v1.0 spec was released days ago - http://json-ld.org/spec/latest/json-ld/
And already, Google is using it - https://developers.google.com/gmail/schemas/

@prototypo

Hi all,

Sorry to be joining this conversation late. I know I should have made the time earlier.

Full disclaimer: I chair the W3C RDF Working Group, which is currently in the process of standardizing RDF 1.1, including several new (and much easier) serialization formats for the RDF data model. I am also a member of the Semantic Web Coordination Group by dint of chairing a working group and a member of the Linked Data Platform Working Group. My company (3 Round Stones) sells commercial Linked Data applications. So, I'm rather invested.

My perspective on this is simple: Let's not turn this into a religious argument about formats. Developers care deeply about formats, I get that. I do too. However, the important thing for any kind of structured data sharing on the Web is having a sharable data model. That's the only way to allow others to use your data without coding. RDF is that data model.

Now, RDF has a lot of different formats. Pick the one you like and/or convert all or part of your data to one of them when and if you need to. It is no big deal. The advantage is that your data can be shared and reused by others, who may feel free to convert it as they like.

The alternative is to keep creating one-off solutions where the data dies in the browser or the app. Is that really the world we want to live in? Screen scraping works, sort of, but that doesn't make it a good idea. We should be working together to encourage structured data publication that we can all reuse and then re-share.

The Linked Data approach works. Just this week we saw Google (Google!) announce support for JSON-LD in GMail and a major revision of the German National Library Linked Data service. This stuff happens nearly every day now. The Linked Data approach is also being used in enterprise software and is being picked up by large vendors including IBM, Oracle and EMC. Something is working, so please don't say it's not.

"Don't like RDF"? Really? Have you looked at JSON-LD, Turtle or the new TriG? How about RDFa Lite, which has been adopted by the major search engines? It is not hard and gives you the benefit of a common data model across those many formats. The data remains available for others to use, unlike the traditional techniques of building a single app.

Siloed applications suck as much as data silos. We can do better. We do better. Come on in. The water's fine.

Regards,

Dave

http://about.me/david_wood

@georgethomas

@prototypo 👍

@jcarbaugh

Again, as @konklone said before, I don't believe anyone here wants to explicitly discourage agencies from choosing linked data, but neither should the open data policy encourage its use.

@prototypo presents a false dilemma that we either publish serialized RDF or "data dies in the browser or the app". A vast amount of data is shared and processed daily, free from silos, all without the help of RDF. The choice, as @prototypo seems to imply, isn't between using linked data or being sentenced to a terrible life of screen scraping.

I also take issue with @prototypo's assertion that linked data is the "only way to allow others to use your data without coding." Many journalists and non-programmers work with data just fine when they can open a CSV in Excel or import it into SQLite. Publishing linked data requires them to work in (and purchase 💰) tools built specifically for that purpose. I know my journalist colleagues would be delighted to find http://reference.data.gov.uk/id/year/1988 as a value for a year (seriously, that's straight from data.gov.uk). The time I now spend scraping will be spent simplifying and republishing data sets so they can be used outside of linked data toolchains.

I'm not saying that RDF et al. are inherently bad or that all agencies should avoid these technologies, but they present serious problems for people that aren't immersed in the (academic and enterprise) world of linked data. My goal is to make public data accessible to and usable by more people and encouraging linked data would make achieving my goal more difficult.

@jcarbaugh

@jcarbaugh 👍

@jpmckinney
Contributor

@jcarbaugh It's not a requirement to use URIs as values for years. If it's a Dublin Core property, at least, the range of possible values includes literals like "1998". I would say that the UK's choice was not the best. Mistakes are made. Let's not confuse flaws in technology with flaws in implementation. You might be right that the people responsible for RDF encodings tend to overcomplicate things, whereas those responsible for data model-free JSON encodings tend to simplify things - but that's not due to the underlying tech - it's due to people's choices.

@jcarbaugh

@jpmckinney quite true. It's not fair to blame the technology. I see it as a cultural tendency of the linked data community to overcomplicate things. I appreciate the practical approaches you've mentioned in this thread. However, the linked data I've seen in practice is far from simple. The data.gov.uk example was one. Here is another, where you only need to do data['http://health.data.gov/id/hospital/393303']['http://www.w3.org/2000/01/rdf-schema#label']['value'] to get the name of a hospital. Could it be any more common sense?!

Maybe I just have bad luck at picking which linked data sets to look at.

I feel that public data should be usable with minimal machine intervention. Encouraging the use of linked data has a good chance of resulting in significant machine intervention being required in order to use public data.

@jpmckinney
Contributor

Wow, that JSON-LD isn't even valid (or uses a very early draft of JSON-LD). Here's what it would look like done properly. Note that as a developer, you can just ignore the big @context key if your app already has built-in assumptions about what the terms mean. In practice, the JSON document would refer to a standard context via a URI instead of including that massive block in every document. In this (proper) document, you get the name of the hospital with data['label'] 😃
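The gist isn't reproduced here, but as a rough sketch of the same idea: a compacted document that points at a hypothetical external context (assumed to map label to rdfs:label and Hospital to a full class URI) could look like this, with the hospital name made up and the @id taken from the example above:

```json
{
  "@context": "http://health.data.gov/contexts/hospital.jsonld",
  "@id": "http://health.data.gov/id/hospital/393303",
  "@type": "Hospital",
  "label": "Example Community Hospital"
}
```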

Left to their own devices, governments and enterprise will often turn out the ugliest data (I've seen my share of awful CSVs on open data portals, requiring a fair bit of machine intervention to clean up). If we want data done right, we need to provide guidance. If governments are publishing linked data, let's try to make them do it as cleanly as in my above gist.

@georgethomas

Thanks to @jcarbaugh for a specific example. @jpmckinney it's not JSON-LD; it predates JSON-LD and uses an earlier JSON encoding for RDF, but thanks for the JSON-LD gist. As noted by many here already, JSON-LD isn't even finished yet, but the GOOG is already using it for Gmail, and there are plenty of other format comparison examples available. I'm curious what @jcarbaugh @konklone @sbma44 and @kachok think of this emerging format, since they're exactly its target audience.

Yes, there is a complexity spectrum that corresponds to the increased utility of global network identity, bringing both the ability to easily merge data and to disambiguate metadata tags/terms. This has a positive impact on open gov data, enabling the network effect. (Healthdata.gov has yet to purchase any runtime software, however; it's all FOSS.)

The central issue in this thread (imo) is not our individual capacity for loving or hating LD for OGD; it's that LD represents a collection of standards-based, machine-readable technologies and corresponding best practices, and therefore should be represented by POD along a spectrum of strength and complexity. POD should not recommend only one approach and exclude others; it's simply not appropriate in the public sphere to pick a 'winner' that excludes emerging technologies that have many existing practitioners today.

Instead, POD needs to acknowledge that there are many paths to the top of the mountain, each according to need and ability. If this is not the case, POD sponsors should expect criticism from the voluntary consensus standards organizations creating native Web standards that agencies are guided by OMB to use where applicable. LD is applicable to OGD.

@konklone
Contributor

This is getting off topic, but since I was asked my opinion of JSON-LD, I'll do my best while keeping it somewhat brief. @jpmckinney's example of JSON-LD is substantially better than the other examples I've seen, especially because it sticks the metadata inside of one ignorable @context key. The schema is much improved for that. The values are still a forest to me. I can respect the idea of using universally identifiable URIs now and then when namespace collisions are to be expected, but what it adds up to are very few values that are not URIs. This impacts both human readability and machine processability, by essentially normalizing everything.

If I want to use this JSON-LD data and render it into a web page (one of the most common use cases I face in my work with JSON), I cannot say "place the state field here". There is a resolution step, so either I need to parse the state out of the URI with text processing, or do an extra lookup step somewhere to turn the URI into the state field. It is weighty, and this does matter, especially when all of your metadata are suddenly treated like foreign keys. This level of indirection doesn't just impact rendering templates - it impacts just about any form of calculation or integration of those fields besides the storage and lookup processes themselves.

Once you get past tabular data, like CSV and SQLite-y stuff, I no longer place much value on data that can be used "without coding". Encoding the data model into the code that processes it is at worst a necessary evil to keep data simple, and at best a healthy separation of concerns.

For example, there has been a trend in web development over the past several years in moving data validation out of the database layer, and into the application layer. This has proved to be a more flexible and attractive system, and paved the way for schema-free databases to enter into more mainstream use in the web development community. At every step, systems have become easier to conceive, build, and manage over time. This is, most definitely, at the cost of more descriptive and portable data models, but it has felt like the opposite of a cost.

@prototypo

I certainly didn't mean to present a false dichotomy: Linked Data is not the only alternative to screen scraping, nor the only approach to data sharing. However, Linked Data has two specific benefits worth noting:

  1. Linked Data is based upon two widely deployed and adopted families of international standards: HTTP and RDF.

  2. Those two families of standards allow us to share data not just with ourselves and our friends (who can have our data model communicated to them), but to people we will never meet. This is "cooperation without coordination". Compare that to the example of CSV that @jcarbaugh suggests, where we cannot typically understand the column headers even if they are present.

As @georgethomas said, we should suggest and promote the use of Linked Data alongside other machine readable mechanisms.

@jpmckinney
Contributor

I agree with @konklone - RDF implementers sometimes go overboard with URIs. For controlled vocabularies like state codes and country codes, literal values should suffice. (I'm sure a mechanism can be devised to transform them to URIs if necessary.)
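As one hedged illustration of such a mechanism: a JSON-LD term definition can declare that a property's string values are vocabulary-relative, so the document keeps plain literals while a processor that wants URIs can expand them (all URIs below are made up):

```json
{
  "@context": {
    "@vocab": "http://example.gov/id/state/",
    "state": {
      "@id": "http://example.gov/def/state",
      "@type": "@vocab"
    }
  },
  "state": "MD"
}
```

A plain JSON consumer just sees "state": "MD"; a JSON-LD processor expands the value to http://example.gov/id/state/MD.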

@jqnatividad
Contributor Author

And perhaps this mechanism is one of the infrastructure pieces that Project Open Data can stand up. Perhaps a JSON-LD viewer (or insert your format viewer here) of sorts that allows users to have a "user-friendly" way of navigating the Web of Data, transforming the long URIs into an easily readable version.

Maybe this viewer can even double as a data viewer for regular users - leveraging the embedded metadata to display the data in context (e.g. creating dropdowns for state data, doing elementary validation, etc.) Think OKFN's Recline.js or the Wikipedia Data Browser made smarter with all the metadata.

@gkellogg

As @jpmckinney pointed out, the nice thing about JSON-LD is that you can stick all of the ugly parts in the @context. Furthermore, the context doesn't need to be part of the document itself, but can be referenced using a URL (not surprisingly, it is linked data). Moreover, specific keywords (e.g., @id and @type) can also be aliased, so you can simply use, for example, id and type as keys in an actual document.
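For illustration, a minimal sketch of such an aliased document, reusing the hospital example from earlier in the thread (the Hospital class URI is made up, and in practice the whole @context would more likely be referenced by URL rather than inlined):

```json
{
  "@context": {
    "id": "@id",
    "type": "@type",
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
    "Hospital": "http://example.gov/def/Hospital"
  },
  "id": "http://health.data.gov/id/hospital/393303",
  "type": "Hospital",
  "label": "Example Community Hospital"
}
```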

The key message, IMO, is not that documents need to be perfect LD, but that they have unambiguous meaning that doesn't require special knowledge. By associating (directly or indirectly) keys with URIs, and adding types to objects, you've gone most of the way there. Other details, such as whether a value is a string, date, or reference, can be hidden in the context, so that developers who don't care about treating the data as RDF don't need to worry about it getting in their way, while people who do care what the keys and values mean have a machine-readable way of determining this.

One of the advantages of associating URIs with keys, types and references, is that JSON-LD algorithms can be used to transform the data, through framing, flattening, expansion or compaction. You can even turn it into an abstract RDF representation, if that is useful. Furthermore, existing data that can be represented as RDF (say RDFa) can also be transformed to JSON-LD, allowing it to be used as simple JSON.

Also, note that JSON-LD was designed for non-RDF developers to be able to work with the data, and keep close to a natural JSON representation for the data they're already used to.

Full disclosure, I'm one of the editors of the JSON-LD specifications, and author of the Ruby JSON-LD gem.

@mhogeweg
Contributor

Perhaps a discussion of the purpose of the data.json file would be useful?

It appears to me that the data.json format is designed to facilitate harvesting of agency data inventories into data.gov, not so much for the description of the data sets directly. Is that so?

If so, what about other formats like site maps? Those are widely used to list things and crawlers know what to do with them (without extensive semantic inference happening). It's a list... With this focused purpose, one might argue less semantic enrichment is ok.

If the focus of data.json is on describing the datasets themselves, then semantics imho go further than defining that the title element is the title of the dataset. In this case, information about attribute types, data accuracy and quality (ps: quality is not a true/false as is suggested in data.json), currency, access and use constraints, lineage (where did the data come from), etc. Many of the agencies already have these more extensive descriptions of their data sets in the form of FGDC or ISO metadata (NOAA, EPA, DOI, Census, ...). These are XML-based and typically have schemas defined as well. Agencies have included these descriptions in agency registries and catalogs (such as EPA's Environmental Dataset Gateway: https://edg.epa.gov) that may be searched, harvested, etc. geo.data.gov contains close to 1,000,000 such data sets from federal, state, academia, etc. (http://www.geoplatform.gov/catalog/rest/index/stats).

So, what is the goal: provide a new form of site map, or describe the data sets? This should drive the design of the information model and the level of semantic enrichment that is needed.

Marten

(Full disclosure: I developed the open source technology behind EDG, http://geo.data.gov,
Geospatial One-Stop and various other federal, state and local registries that all are based on FGDC and ISO XML metadata specifications, and for over 10 years have harvested metadata in an automated fashion from 500+ national and international registries, catalogs, and catalogues using automated tools)

@jpmckinney
Contributor

This thread is kind of becoming about everything. For geodata specs/standards, I think it makes sense to continue that discussion in #4. Re: dataQuality, it can mean a lot of things, but the POD schema defines it as being something very specific that is boolean: "Whether the dataset meets the agency’s Information Quality Guidelines"

If people want to enrich the schema, new issues should be opened for additions of things like provenance, etc. Similarly, data viewers, etc. are good ideas but should be made into issues (ideally pull requests, or even separate repositories).

This thread has been very instructive, but it is just about adding a link to the W3C Cookbook for Open Government Linked Data. For what it's worth, all the other links are to .gov resources, with the exception of The Mosaic Effect, which frankly is not a very useful resource; I think it makes sense to keep the list to .gov resources.

@akuckartz

Some of the comments on this issue are unbelievable. Some people here seem to think that Links are the worst aspect of the World Wide Web.

If the U.S. does not like Links, that will not prevent Linked Open Data from being published and used in other parts of the world, such as Europe.

@lmatteis

Hello. Why are we talking about RDF? Linked Data isn't strictly about RDF.

It's about linking data to other data, so that users can merge and fuse things. Essentially creating a huge distributed database.

I'm happy with schemaless JSON, as long as you give me links to other related data, then it's Linked Data.

However, it seems like the U.S. government may be ignoring this extremely useful exercise. Doing so is similar to building a website without external links. May I ask why?

@lmatteis

@prototypo 👍
@jpmckinney 👍

@kidehen

kidehen commented May 19, 2013

All,

Does anyone here disagree with the notion of using the time-tested Entity Relationship Model as the basis for structured data representation?

Does anyone here disagree with the notion of using Entity->Attribute-Value (EAV) patterns to express Entity Relationship Model based structured data?

Does anyone here disagree with the utility of resolvable URIs as denotation (reference) mechanisms for the Entity, Attribute, and optionally the Value components of the EAV pattern outlined above?

Does anyone here disagree with the notion that "Attributes" in the EAV pattern really denote (refer to) "Relations" and that Relations enable understanding (meaning) of Relationships between Things?

Finally, does anyone here disagree with the utility of machines and humans being able to understand the "meaning" of relationships between things?

From my world view, the items above outline the real essence of structured data that provides utility to all, i.e., profiles that include end-users, domain experts (including real executives), systems analysts, systems integrators, programmers, etc.

BTW -- The U.S. Govt. is already making productive use of Linked (Open) Data based on the Entity Relationship Model. Naturally, that won't be changing anytime soon because such regression would be utterly illogical :-)

Excuse my typos, typed in haste en route to breakfast.

@konklone
Contributor

Let's either merge or close this thing.

@jcarbaugh

@konklone +1


@lmatteis

@konklone would it be democratic to close the most popular pull-request?

@jpmckinney
Contributor

@konklone +1 to merge or close. If people want to discuss the various issues raised here, there are more appropriate forums. In terms of whether to merge/close, I would opt to close, and maybe open a pull request to remove the "mosaic effect" link, to keep that list of links to non-contentious .gov links.

@benbalter
Contributor

Wow. Awesome, awesome discussion that I can already tell is going to inform a lot of the project's decisions moving forward.

Sounds like there's a general consensus, as to @jqnatividad's original pull request, that linking out to additional linked data resources is a good arrow for agencies to have in their quiver. 👍 to @konklone's suggestion to merge the pull request as is.

As for implementation / format, it sounds like there are a bunch of valid routes we can take. Would love for volunteers to shepherd breaking those ideas up into additional pull requests, so that we can create actionable, concrete steps that the community can recommend to agencies, even if it has to remain at a high level, or if we have to endorse multiple formats in parallel, until the broader linked data community settles on a standard.

Either way, this thread shows exactly why Project Open Data is so important... to hash out these types of opportunities, among those most invested with their resolution, before (hopefully) codifying as agency best practices. 🇺🇸

@georgethomas

@benbalter 👍

@haleyvandyck
Contributor

Thank you to everyone on this thread for the very interesting conversation. @benbalter said it best-- this is exactly why we are excited about Project Open Data, and the opportunity this provides for continual iteration and improvement.

Looks like there is agreement on adding @jqnatividad's suggested link. 👍 Merging now.

Thank you all for contributing and we're looking forward to many more discussions and pull requests to come. 🇺🇸

- Haley, Senior Adviser at the White House

haleyvandyck pushed a commit that referenced this pull request May 20, 2013
Added W3C Cookbook for Open Government Linked Data
@haleyvandyck haleyvandyck merged commit 029a68e into project-open-data:master May 20, 2013
@AlexeyAnshakov

AlexeyAnshakov commented May 10, 2017

First of all, thanks to all, and especially @prototypo, for such an interesting thread. Too bad I missed the conversation, but if someone is interested in JSON-LD startup projects, I'd like to share info about one of them: https://docs.google.com/document/d/1kdKoDfp9mT5M4pssJTqvA6sw4zzYQ5HFaN-osu7J-Nk/edit?usp=sharing

WRIO Internet OS is your door into the realm of decentralized, semantic and safe Internet based on Open & Linked Data.

Our repos: github.com/webRunes/

@akuckartz

@benbalter LD is not about the form of URLs. If people like to provide human-readable URLs, they can do so while providing LD at the same time.
