[feature request] support different webdataset format #10

Open
ethanhe42 opened this issue Sep 6, 2024 · 2 comments · May be fixed by #20

Comments

@ethanhe42
Member

To my understanding, the energon-compatible format needs everything in the same folder, like this:

shard_000.tar
├── samples/sample_0000.jpg
├── samples/sample_0000.txt
├── samples/sample_0000.json
├── samples/sample_0001.jpg
├── samples/sample_0001.txt
├── samples/sample_0001.json
└── ...

However, users might have a different dataset format; for example, images may live in one folder while labels live in another.

We need to be able to customize the dataset format definition, e.g., via a user-defined function that maps the data to a dictionary.
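As a rough illustration of what such a user-defined mapping could look like (a sketch only; `PathMapper`, `images_and_labels_mapper`, and `group_samples` are hypothetical names and not part of energon), the function below maps each file path to a sample key and a field name, so files from different folders end up in the same per-sample dictionary:

```python
from collections import defaultdict
from typing import Callable, Iterable

# Hypothetical hook signature: file path -> (sample key, field name).
PathMapper = Callable[[str], tuple[str, str]]


def images_and_labels_mapper(path: str) -> tuple[str, str]:
    """Example mapper: ignore the folder, use the file stem as the sample key."""
    _folder, _, name = path.rpartition("/")
    stem, _, ext = name.partition(".")
    return stem, ext


def group_samples(
    files: Iterable[tuple[str, bytes]], mapper: PathMapper
) -> dict[str, dict[str, bytes]]:
    """Group (path, data) pairs into one dictionary per sample key."""
    samples: dict[str, dict[str, bytes]] = defaultdict(dict)
    for path, data in files:
        key, field = mapper(path)
        samples[key][field] = data
    return dict(samples)


files = [("images/0000.jpg", b"img"), ("labels/0000.txt", b"cat")]
print(group_samples(files, images_and_labels_mapper))
```

With a different mapper, the same grouping step could handle other layouts without changing the loader itself.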

@lvoegtle added the enhancement label Sep 24, 2024
@lvoegtle
Collaborator

To clarify:

Using this feature, the following would be possible:

base/shard_000.tar
├── samples/sample_0000.jpg
├── samples/sample_0001.jpg
└── ...
ground_truth/shard_000.tar
├── samples/sample_0000.json
├── samples/sample_0001.json
└── ...
extracted_features/shard_000.tar
├── samples/sample_0000.features.pt
├── samples/sample_0001.features.pt
└── ...

I.e., each folder must contain the same shards, and each shard must contain the same sample keys. The features (here "jpg", "json", "features.pt") will then be available for the same samples.

This would allow e.g. precomputing features, or changing the ground truth without modifying the input data (in this case the image).
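To make the layout above concrete, here is a minimal sketch that joins parallel shards by sample key using only the Python standard library. It is not how energon reads shards; `split_key`, `read_parts`, and `join_shards` are hypothetical helpers, and the sketch assumes the extension is everything after the first dot of the file name:

```python
import io
import tarfile


def split_key(name: str) -> tuple[str, str]:
    """'samples/sample_0000.features.pt' -> ('samples/sample_0000', 'features.pt')."""
    dirname, _, base = name.rpartition("/")
    stem, _, ext = base.partition(".")
    return (f"{dirname}/{stem}" if dirname else stem), ext


def read_parts(fileobj) -> dict[str, dict[str, bytes]]:
    """Read one tar shard into {sample key: {extension: data}}."""
    parts: dict[str, dict[str, bytes]] = {}
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if member.isfile():
                key, ext = split_key(member.name)
                parts.setdefault(key, {})[ext] = tar.extractfile(member).read()
    return parts


def join_shards(shards: dict) -> dict[str, dict[str, bytes]]:
    """Join same-named shards from several folders; all must share the same keys.

    `shards` maps a folder name to an open binary file object of its shard.
    """
    joined = None
    for folder, fileobj in shards.items():
        parts = read_parts(fileobj)
        if joined is None:
            joined = parts
        elif parts.keys() != joined.keys():
            raise ValueError(f"{folder}: sample keys differ from the first shard")
        else:
            for key, fields in parts.items():
                joined[key].update(fields)
    return joined or {}
```

A call might look like `join_shards({"base": open("base/shard_000.tar", "rb"), "ground_truth": open("ground_truth/shard_000.tar", "rb")})`. This version loads whole shards into memory, which keeps the example short but would not scale to large shards.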

A difficulty might be the wizard for preparing datasets. That wizard must allow combining the shards.

Implementation thought:

  • This should likely be handled in the innermost WebdatasetSampleLoaderDataset, which would combine the multiple tar loaders.
  • Internally, the .nv-meta must somehow reflect these multiple files, which is not foreseen at the moment. Maybe one folder must be the "primary" folder and the others "secondary", such that the primary gives the sample keys, and the secondaries are only additions and may not introduce additional keys.
  • A slightly difficult case may be a feature that is optional, so that some samples do not have it. A solution may be that the .taridx points to the next sample, which will not match the key, and thus the feature is not loaded.
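The primary/secondary idea with optional features can be sketched as a streaming join over key-ordered iterators. This is only an illustration of the matching logic, not the actual WebdatasetSampleLoaderDataset or .taridx handling; `join_optional` is a hypothetical name, and the sketch assumes both streams yield `(key, parts)` pairs in the same key order:

```python
from typing import Iterable, Iterator


def join_optional(
    primary: Iterable[tuple[str, dict]],
    secondary: Iterable[tuple[str, dict]],
) -> Iterator[tuple[str, dict]]:
    """Yield primary samples, merging in secondary parts when the keys match.

    The secondary stream stays parked on its next entry. If that entry's key
    does not match the current primary key, the feature is treated as missing
    for this sample and the entry is kept for a later primary key.
    """
    sec_iter = iter(secondary)
    pending = next(sec_iter, None)
    for key, parts in primary:
        merged = dict(parts)
        if pending is not None and pending[0] == key:
            merged.update(pending[1])
            pending = next(sec_iter, None)
        yield key, merged
    if pending is not None:
        # The secondary may not introduce keys that the primary does not have.
        raise ValueError(f"secondary key {pending[0]!r} not found in primary")


primary = [("s0", {"jpg": b"a"}), ("s1", {"jpg": b"b"}), ("s2", {"jpg": b"c"})]
secondary = [("s0", {"features.pt": b"x"}), ("s2", {"features.pt": b"z"})]
for key, sample in join_optional(primary, secondary):
    print(key, sorted(sample))
```

Here `s1` has no precomputed feature, so it is yielded with only its primary parts, while an unmatched or out-of-order secondary key raises an error once the primary stream is exhausted.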

@ethanhe42
Member Author

Hi @lvoegtle, what you described is accurate! FWIW, in webdataset, your idea of a "primary" folder is supported through https://webdataset.github.io/webdataset/column-store/#using-webdataset-as-a-column-store

@voegtlel removed the enhancement label Oct 10, 2024
@voegtlel linked a pull request Nov 6, 2024 that will close this issue