clarify where to put file paths (e.g ml-25m/ratings.csv) #639

pdurbin · 2024-04-24T13:58:55Z

During the 2024-03-20 Croissant Task Force meeting I asked where to put file paths (e.g. "ml-25m" for "ml-25m/ratings.csv") and @benjelloun said to go ahead and create an issue to clarify the spec.

I understand that the spec is pretty clear in the case where a zip file is available and contentUrl can be used to show the paths to files within the zip. Here's an example from https://github.com/mlcommons/croissant/blob/v1.0.5/datasets/1.0/movielens/metadata.json that shows a file path of "ml-25m/ratings.csv":

  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "ml-25m-archive",
      "name": "ml-25m-archive",
      "contentUrl": "https://files.grouplens.org/datasets/movielens/ml-25m.zip",
      "encodingFormat": "application/zip",
      "sha256": "8b21cfb7eb1706b4ec0aac894368d90acf26ebdfb6aced3ebd4ad5bd1eb9c6aa"
    },
    {
      "@type": "cr:FileObject",
      "@id": "ratings-table",
      "name": "ratings-table",
      "containedIn": {
        "@id": "ml-25m-archive"
      },
      "contentUrl": "ml-25m/ratings.csv",
      "encodingFormat": "text/csv"
    },
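
To make the zip case concrete, here is a minimal Python sketch of how a consumer might pair each FileObject with its download URL and its path inside the archive. The helper name `resolve_locations` is hypothetical (it is not part of any Croissant library), the dictionaries are trimmed copies of the two entries above, and the sketch assumes `containedIn` is a single object, as it is in this example.

```python
# Trimmed copies of the two FileObjects quoted above.
distribution = [
    {
        "@type": "cr:FileObject",
        "@id": "ml-25m-archive",
        "contentUrl": "https://files.grouplens.org/datasets/movielens/ml-25m.zip",
        "encodingFormat": "application/zip",
    },
    {
        "@type": "cr:FileObject",
        "@id": "ratings-table",
        "containedIn": {"@id": "ml-25m-archive"},
        "contentUrl": "ml-25m/ratings.csv",
        "encodingFormat": "text/csv",
    },
]


def resolve_locations(distribution):
    """Hypothetical helper: map each @id to (download_url, path_inside_archive).

    Assumes containedIn, when present, is a single {"@id": ...} object.
    """
    by_id = {obj["@id"]: obj for obj in distribution}
    locations = {}
    for obj in distribution:
        parent = obj.get("containedIn")
        if parent is None:
            # Top-level file: contentUrl is the download URL itself.
            locations[obj["@id"]] = (obj["contentUrl"], None)
        else:
            # Contained file: contentUrl is a path relative to the archive.
            archive = by_id[parent["@id"]]
            locations[obj["@id"]] = (archive["contentUrl"], obj["contentUrl"])
    return locations


locations = resolve_locations(distribution)
print(locations["ratings-table"])
```

In this reading, `contentUrl` plays two roles: a download URL for the archive and a relative path for the file inside it. That dual role is why it cannot also carry the path once it holds a direct link.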

However, while Dataverse can often provide a zip of all files in a dataset, files are increasingly large and zipping is expensive, so we plan to continue using contentUrl for direct links to the files. (Besides, why download an entire zip if you only need one file?) I say "continue" because, to support Google Dataset Search, we already provide the following, for example, in our Schema.org output:

{
  "@type": "DataDownload",
  "name": "2023-01-03.tab",
  "fileFormat": "text/tab-separated-values",
  "contentSize": 21865,
  "description": "Information on known Harvard repositories on GitHub, such as the number of stars, programming language, day last updated, number of open issues, size, number of forks, repository URL, create date, and description.",
  "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/6867331"
}

So, if not contentUrl, which field should I use for the file path? Thanks!

benjelloun · 2024-05-15T16:33:13Z

Just to make sure I understand the issue correctly: You have an organized set of files, with directories, etc. You would like to provide them individually as FileObjects instead of as a zip file. The contentUrl of each file does not contain its path, but uses a flat structure with an identifier instead, so you need a way to represent the path to the file in the original directory structure. Is that right?

Assuming my understanding is correct, how about encoding the path in the @id of each FileObject? e.g.,
"@id": "path/to/2023-01-03.tab". This also helps ensure that each @id is unique, which is a requirement.

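A minimal sketch of this suggestion in Python, under stated assumptions: the helper `file_object` and its parameter names are hypothetical, and the directory `path/to` is only the placeholder used above. The file name, format, and download URL are taken from the Schema.org example earlier in the thread.

```python
def file_object(directory_label, filename, download_url, encoding_format):
    """Hypothetical helper: build a FileObject whose @id carries the file path.

    directory_label is the file's folder in the original directory
    structure; it may be empty or None for files at the top level.
    """
    directory = (directory_label or "").strip("/")
    path = f"{directory}/{filename}" if directory else filename
    return {
        "@type": "cr:FileObject",
        "@id": path,  # path in the original directory structure
        "name": filename,
        "contentUrl": download_url,  # direct link, flat identifier
        "encodingFormat": encoding_format,
    }


obj = file_object(
    "path/to",  # placeholder directory, as in the comment above
    "2023-01-03.tab",
    "https://dataverse.harvard.edu/api/access/datafile/6867331",
    "text/tab-separated-values",
)
print(obj["@id"])
```

Two files with the same name in different directories get different @id values this way, which is what keeps the identifiers unique.
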
pdurbin · 2024-05-15T21:05:10Z

@benjelloun yes, you understand our situation perfectly.

Thanks for the suggestion. I implemented it here: gdcc/dataverse-exporters@52c9e72

Should I go ahead and close this issue or should we use it to add something to the spec about file paths?

Thanks again! ❤️

benjelloun · 2024-05-17T15:50:33Z

I think we can close this issue now, and reopen if you run into more issues with file paths. Thanks!

pdurbin added a commit to gdcc/dataverse-exporters that referenced this issue Apr 24, 2024

add a TODO: where to put directoryLabel (file path)

13b47e2

Related issues: - mlcommons/croissant#639 - IQSS/dataverse#10523

pdurbin mentioned this issue May 7, 2024

Project: Kaggle (Croissant) IQSS/dataverse-pm#163

Open

12 tasks

pdurbin added a commit to gdcc/dataverse-exporters that referenced this issue May 15, 2024

for file @id use path/to/file.txt

52c9e72

Per discussion in this issue: mlcommons/croissant#639

benjelloun closed this as completed May 17, 2024
