Model Series of Data as Distributions of a single Dataset #1429
Comments
Thanks @sabinem, very good points. I have seen at least three national implementations of data catalogues that take this approach with different files in temporal and/or spatial series as distributions of one dataset. What does worry me is that it makes it hard for reusers -- who harvest or otherwise receive DCAT descriptions -- to understand what's going on if there are some dataset/distribution combinations that follow the recommended pattern of 'distributions contain the same data' and some that follow what is now the anti-pattern. Would it be sensible to distinguish the pattern by using different classes? E.g.:
That way, it would be immediately obvious what pattern is being used.
I think @makxdekkers' suggestion is on the right track, recognizing that there are (in this case) two kinds of datasets:
Kinds of distributions:
What is a 'seriesItem'? I'd propose that it is a static snapshot file from a dynamic series (case 1 + 1 above).
@makxdekkers I think your suggestion is very good, since data publishers are usually very aware of their use case and of whether it is a Series with Series Items or a Dataset with same-content Distributions, just as @smrgeoinfo also describes in the mentioned use cases. Having a vocabulary in place that lets publishers translate that awareness of the use case into the appropriate terms seems like a good choice and will also help users to quickly understand the structure of the data.
It might be worth noting that the current DCAT Editor Draft and the second DCAT working draft acknowledge some flexibility on what to consider as items of dcat:DatasetSeries, which already includes the use of Distributions in place of Datasets. Indeed, the property dcat:inSeries, which links the items to data series, has no domain specified, and its usage note says
I think we can distinguish between informatively equivalent and informatively non-equivalent distributions using the properties dcat:distribution and dcat:inSeries. That said, the current dataset series section does not mention the cases in which distributions are used in place of datasets. I guess a couple more examples might help to understand to what extent the current design meets the emerging use cases.
@riccardoAlbertoni This is indeed useful information. If this pattern is already foreseen, more information should be provided in the specification showing how to do this. I do see that this pattern is mentioned in the information about the property

For one thing, the definition of the class says that the dataset series is a "collection of datasets ..." which should then say "collection of datasets or distributions ..." And the definition of the property dcat:inSeries should be changed from "dataset series of which the dataset is part" to "dataset series of which the dataset or distribution is part", or even "dataset series of which the resource is part", given that the domain is left open -- so anything at all can be a member of a dataset series. As far as I have understood, the pattern with

Another issue for me is that I think that, in general, having an 'or' in the definition might pose problems for processing. If both patterns are available, i.e. dataset series of datasets and dataset series of distributions, an application that receives such information will have to look for both datasets and distributions that link to it, and might need to take different actions in either case -- and what happens if there are both datasets and distributions linking to it? It would indeed be good to develop some examples for relevant cases.
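To make the processing concern concrete, here is a minimal sketch of what a harvester might have to do if both datasets and distributions can point at a series via dcat:inSeries. It is only an illustration under assumptions: triples are plain Python tuples of prefixed names rather than a real RDF graph, and the `ex:` resources and the function name `members_by_kind` are invented for the example.

```python
from collections import defaultdict


def members_by_kind(triples, series):
    """Group resources that link to `series` via dcat:inSeries by their rdf:type."""
    types = defaultdict(set)
    for s, p, o in triples:
        if p == "rdf:type":
            types[s].add(o)

    groups = {"datasets": [], "distributions": [], "other": []}
    for s, p, o in triples:
        if p == "dcat:inSeries" and o == series:
            if "dcat:Dataset" in types[s]:
                groups["datasets"].append(s)
            elif "dcat:Distribution" in types[s]:
                groups["distributions"].append(s)
            else:
                groups["other"].append(s)
    return groups


# Hypothetical catalogue fragment in which both patterns are mixed.
triples = [
    ("ex:budget", "rdf:type", "dcat:DatasetSeries"),
    ("ex:budget-2019", "rdf:type", "dcat:Dataset"),
    ("ex:budget-2019", "dcat:inSeries", "ex:budget"),
    ("ex:budget-2020-csv", "rdf:type", "dcat:Distribution"),
    ("ex:budget-2020-csv", "dcat:inSeries", "ex:budget"),
]

groups = members_by_kind(triples, "ex:budget")
# The open question from the comment: what should happen when both kinds are present?
mixed = bool(groups["datasets"]) and bool(groups["distributions"])
print(groups, mixed)
```

The `mixed` flag is exactly the case for which the specification currently gives no guidance, so a consumer would have to invent its own policy.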
Here's a couple sketches trying to elucidate some of these relations:

DatasetSeriesSubset Diagram:

Dataset: A collection of data, published or curated by a single agent, and available for access or download in one or more representations, and containing information conforming to some schema. NOTE: identity of dataset is based on the underlying schema, and other variable criteria like authorship, coverage extent, update version.

DataSeries: a collection of datasets sharing the same schema, but differentiated based on some extent criteria like temporal or spatial coverage. The 'member/inSeries' link from a Series to a Dataset is an association class that specifies parameters determining the extent of the series member.

ContentModel: a schema (conceptual, logical, or physical) that characterizes a dataset; defines entities, properties, domains, ranges, and other constraints for elements in the dataset.

Distribution: A specific representation of a dataset. A distribution has a Serialization based on some electronic format and profile that determines how that format is used. The serialization for a distribution must implement the schema for the dataset that is represented by the distribution.

PackagedSubset: a dataset that is subset from a sourceDataset based on some query and parameter values for that query; its content is fixed, and can be assigned an identifier.

FilteredDistribution: a representation of a subset of a Dataset based on some query and parameter values for the query, determined dynamically by a user requesting the data through some interface. Can be assigned an identifier to duplicate the query, but if the source dataset is updated, the actual content might vary over time.

Serialization: a scheme for representing information electronically; based on some format (specified by a MIME type), with optional additional constraints on the format for greater specificity in content, e.g. XML schema, RDF vocabulary used, CSV profile.

Parameters: values that specify criteria in query to define a dataset subset, or that define the extent (temporal, spatial, other...) of a particular DataSeries member. The associated downloadURL could be a URITemplate in which the parameters would be substituted.

Packaging Diagram:

Document: a file that is related to some dataset and included in a Bundle.

Package: a file that contains all the items in a bundle, e.g. a BagIT or ORE archive file.
I have to say I am very sceptical about allowing Distributions of a Dataset to be informatively non-equivalent, and on top of that, members of DatasetSeries. In my experience, the prevailing argument for allowing informatively non-equivalent distributions of a dataset, also mentioned in this thread, is "there would be too many datasets". However, to be able to properly describe the distributions, which would now be informatively non-equivalent, one would have to use many of the properties now used for describing datasets also for distributions, as @makxdekkers also points out. This would again result in having many objects, only now it would not be Datasets in a Dataset series, but Distributions of datasets, described as datasets, in a dataset series. And this time, "there would be too many distributions". I do not see the advantage in that. I think the potential number of datasets is actually not a problem. It is simply a manifestation of the state of things, and it is up to the user interfaces presenting the data to people to handle this, e.g. by grouping by topic, publisher, time, space, etc. I may be wrong, but it always seemed to me that informatively non-equivalent distributions were an artifact of
But now that we have the dataset series, this situation should be modelled as a dataset series made of individual datasets, properly described, served in informatively equivalent distributions. Otherwise, there will be too many options to do the same thing, resulting in interoperability issues. I may be wrong here, but I think I have not seen a case for informatively non-equivalent distributions that could not be solved by using informatively equivalent distributions of a dataset in a dataset series.
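As a rough illustration of the remodelling argued for here, the sketch below rewrites a dataset whose distributions each hold different content into a dataset series with one member dataset per former distribution. It is a sketch under assumptions, not a prescribed mapping: triples are plain tuples of prefixed names, the `ex:` resources are invented, and the `-dataset` suffix used to name the new member datasets is purely a convention made up for this example.

```python
def to_dataset_series(triples, dataset):
    """Turn `dataset` into a series; wrap each of its distributions in its own dataset."""
    distributions = [
        o for s, p, o in triples if s == dataset and p == "dcat:distribution"
    ]

    # Drop the old typing and the direct dataset -> distribution links.
    out = [
        (s, p, o)
        for s, p, o in triples
        if not (s == dataset and p == "dcat:distribution")
        and (s, p, o) != (dataset, "rdf:type", "dcat:Dataset")
    ]
    out.append((dataset, "rdf:type", "dcat:DatasetSeries"))

    for dist in distributions:
        member = f"{dist}-dataset"  # hypothetical naming convention
        out += [
            (member, "rdf:type", "dcat:Dataset"),
            (member, "dcat:inSeries", dataset),
            (member, "dcat:distribution", dist),
        ]
    return out


# Hypothetical input following the "one distribution per year" pattern.
before = [
    ("ex:elections", "rdf:type", "dcat:Dataset"),
    ("ex:elections", "dcat:distribution", "ex:elections-2019"),
    ("ex:elections", "dcat:distribution", "ex:elections-2020"),
    ("ex:elections-2019", "rdf:type", "dcat:Distribution"),
    ("ex:elections-2020", "rdf:type", "dcat:Distribution"),
]
after = to_dataset_series(before, "ex:elections")
members = [s for s, p, o in after if p == "dcat:inSeries"]
print(members)
```

Descriptive metadata for each member dataset (title, temporal coverage and so on) would still have to come from the publisher; the transform only rearranges the links.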
We have identified a similar problem in Sweden to what @sabinem described. In short, we need to make it easy for people to add more data into an existing dataset. But we have solved the problem in another way in the Swedish profile. We have allowed the dcat:downloadURL to be repeated. Like this:
This approach has the following merits:
It could be argued that repeating the dcat:downloadURL is bad, that it is not intended to be used that way. The specification says "the downloadable file" which indeed seems to indicate there should be a cardinality of one. However, I think it should be investigated if it can easily be tweaked to be compliant. I think the approach above should be considered as a more lightweight alternative to the dataset series approach.
@matthiaspalmer Thanks for the detailed information of your solution. Not questioning at all that this approach fits your needs and the needs of your data providers, I am still a bit uneasy about all the different variants. Section 12.3 in DCAT3 mentions two 'legacy' approaches:
You now outline yet another approach. One of the main problems I see with all these different solutions is that, while they obviously make absolute sense for data providers in a particular environment, they make it very hard for data consumers to understand what is happening. It seems to me that a data harvester needs to program quite a bit of logic to process these various approaches, and then still needs to do something smart to present data from various sources in a coherent way. As far as I see it, the approach with dcat:DatasetSeries tries to create a more coherent and widely interoperable approach so that life becomes a lot easier for data consumers.
I understand your concern @makxdekkers, but at the same time I am just reporting what kind of needs I have observed. I am also of the opinion that it is better to adapt the model to the world rather than trying to fit the world into the model. I think the key would be to provide good guidance on when to use the Dataset series and when to use dcat:downloadPartURL. For instance, you need to use dataset series if you need more metadata than a title of the file or when the files do not follow the same structure, e.g. when tabular data does not have the same columns in every file. To be frank, if the Dataset series approach is the only way forward (together with the legacy dcterms:hasPart option) I am confident that the following will happen as soon as the model is accepted (at least in Sweden):
If we do not do 1, people will just add one distribution per file, just like the antipattern @sabinem described. We have had half-day workshops for nearly all new customers and try to instruct them NOT to do this. Still it is happening all the time; it is an uphill battle. I fear that with the Dataset series being the only alternative the battle will be even harder (unless we provide a simplified "hidden" solution for the multiple file case). Another option is to diverge from the model, keep the existing approach and do a transform when exposing to the European data portal, but this seems suboptimal.
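The rule of thumb proposed in this comment (use a dataset series when files need more metadata than a title, or when the files do not share the same structure) could be expressed roughly as follows. This is only a sketch of that guidance: the function name, the dictionary keys `columns` and `extra_metadata`, and the returned labels are all invented for illustration and are not part of DCAT or of the Swedish profile.

```python
def suggest_pattern(files):
    """Suggest a modelling pattern for a list of file descriptions."""
    if not files:
        raise ValueError("at least one file is required")

    # Files share a structure when they all have the same column tuple.
    column_sets = {tuple(f["columns"]) for f in files}
    # Anything beyond a title counts as needing richer, dataset-level metadata.
    needs_metadata = any(f.get("extra_metadata") for f in files)

    if len(column_sets) == 1 and not needs_metadata:
        return "multi-file distribution"
    return "dataset series"


# Hypothetical file descriptions.
uniform = [
    {"title": "2019", "columns": ["region", "votes"]},
    {"title": "2020", "columns": ["region", "votes"]},
]
drifting = uniform + [{"title": "2021", "columns": ["region", "party", "votes"]}]

print(suggest_pattern(uniform))
print(suggest_pattern(drifting))
```

A catalogue editor could run such a check behind the scenes, which is one way to read the "hidden" solution mentioned above.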
I would also like to point to the discussion we already had in September 2019 about this approach, although at that point the discussion was postponed to DCAT3: #868 (comment)
I am curious, how would you handle a dataset that can be either accessed via 10 files or via a single API (containing all the data) as a dataset series? Would it be one Dataset series with a single distribution corresponding to the API and then have 10 datasets that all point to the dataset series via dcat:inSeries? If this is the case, should it then not be stated somewhere that a Dataset series distribution should have the same content as the sum of all the datasets it contains?
I am curious about the 'legacy' option number 1 in 12.3 you pointed to, @makxdekkers. It seems to me that it is the same as the antipattern @sabinem described and @jakubklimek argued against. The final statement in that section, "These options are not formally incompatible with DCAT", somehow legitimizes the idea that distributions need not contain the same data. I am surprised by this statement and I think it is in direct conflict with the definition of distributions in 6.8.
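One reading of the structure asked about here can be written down as triples, which may make the question easier to discuss. This is a sketch of the commenter's hypothetical only, not a pattern the specification is known to endorse: triples are plain tuples of prefixed names, and the `ex:` identifiers, URLs, and the `build_series` helper are invented for the example.

```python
def build_series(series, api_url, file_urls):
    """One series with an API distribution, plus one member dataset per file."""
    api_dist = f"{series}-api"
    triples = [
        (series, "rdf:type", "dcat:DatasetSeries"),
        (series, "dcat:distribution", api_dist),
        (api_dist, "rdf:type", "dcat:Distribution"),
        (api_dist, "dcat:accessURL", api_url),
    ]
    for i, url in enumerate(file_urls, start=1):
        member = f"{series}-part{i}"
        dist = f"{member}-file"
        triples += [
            (member, "rdf:type", "dcat:Dataset"),
            (member, "dcat:inSeries", series),
            (member, "dcat:distribution", dist),
            (dist, "rdf:type", "dcat:Distribution"),
            (dist, "dcat:downloadURL", url),
        ]
    return triples


# Hypothetical URLs for the ten files and the API.
files = [f"https://example.org/data/part{i}.csv" for i in range(1, 11)]
triples = build_series("ex:elections", "https://example.org/api", files)
members = [s for s, p, o in triples if p == "dcat:inSeries" and o == "ex:elections"]
print(len(members), len(triples))
```

Nothing in these triples says that the API distribution covers the union of the ten member datasets, which is the gap the question points at.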
Yes it is.
But the sentence continues: "so they can coexist with dcat:DatasetSeries during the upgrade to DCAT 3", which seems to imply that applications that do it this way are expected to upgrade to DCAT3 and then move to the approach with dataset series. But I agree you could read this in several ways, i.e. "if they upgrade" or "when they upgrade".
The way I see it, there are two sides to this:
It might be that option 2 is the most efficient as the data provider has all the information about both the internal approach and the common interoperable approach and therefore can make the best mapping. In option 1, the data consumer might wonder what the purpose of using a standard is, if all data providers do things their own way in any case. The data consumer would need to keep knowledge of all existing variants to be able to process the information.
That is a good point, but in this case, it might mean that if the world is a mess, the standard model should reproduce the mess. Or do you mean that DCAT should implement your model as the only one? This is a fundamental question with standardisation. Either the "world" aligns with a standard so that everybody knows what to provide and what to consume, i.e. interoperability, or the standard aligns with the world, in such a way that everybody can continue to do what they like, and there is basically no benefit of using a standard.
In the Swiss DCAT profile we are modeling data series such as yearly elections in a single dataset, where each distribution contains the election data of one year. I know that is considered an antipattern by DCAT. And I also know that all properties of Distributions are designed on the assumption that all Distributions of a Dataset have comparable content.
But there is also a sentence at https://www.w3.org/TR/vocab-dcat-3/#Class:Distribution:
I was recently surprised to learn that we are not the only profile modeling a series of data with Distributions, and that there is a certain resistance to giving up this pattern, since the overall impression is that doing so would make for too many datasets and the data portals would get harder to manage.
On the other hand, if DCAT were to approve of that pattern or antipattern, then properties would be needed to describe the content in each Distribution. In the Swiss profile we added the attribute dct:coverage to Distributions, with a domain of either spatial or temporal values. I am just curious what DCAT's opinion on this topic is, and I want to launch a discussion about this: can't it be that Distributions sometimes differ in their content, and shouldn't DCAT also support these use cases with appropriate properties?
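To illustrate how dct:coverage on Distributions could be used by a consumer, here is a minimal sketch of the yearly-elections case described above. It assumes triples held as plain tuples of prefixed names, coverage values written as simple year strings, and invented `ex:` identifiers; how the Swiss profile actually encodes coverage values is not shown in this thread.

```python
def distributions_covering(triples, dataset, value):
    """Return the distributions of `dataset` whose dct:coverage equals `value`."""
    dists = {o for s, p, o in triples if s == dataset and p == "dcat:distribution"}
    return sorted(
        s for s, p, o in triples
        if s in dists and p == "dct:coverage" and o == value
    )


# Hypothetical dataset with one distribution per election year.
triples = [("ex:elections", "rdf:type", "dcat:Dataset")]
for year in ("2019", "2020", "2021"):
    dist = f"ex:elections-{year}"
    triples += [
        ("ex:elections", "dcat:distribution", dist),
        (dist, "rdf:type", "dcat:Distribution"),
        (dist, "dct:coverage", year),
    ]

print(distributions_covering(triples, "ex:elections", "2020"))
```

Without a property of this kind, a consumer has no machine-readable way to tell the yearly files apart, which is the need the question raises.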