Enable adding of additional data by merging another dataset #20

Draft: wants to merge 21 commits into base: develop

@voegtlel (Collaborator) commented Oct 12, 2024

See #10

Now made possible with this kind of configuration:

metadataset_minimal.yaml:

__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    mix:
      - weight: 1
        join:
          - path: ds1
          - path: ds2
        joiner:
          __module__: megatron.energon
          __class__: CaptioningSample

Extensive example:

metadataset_extended.yaml

__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    # "mix" can be omitted (if no mixing is desired), as can "join" (if not joining); either can instead be a direct dataset reference ("path").
    mix:
      - weight: 1
        join:
          - path: ds1
            dataset_config: dataset.yaml  # If override is needed
          - path: ds2
            dataset_config: dataset.yaml
            subflavor: ds1  # If needed, overrides the 'ds2' subflavor
            subflavors:  # If needed, merged into (and overriding) the parent subflavors
              ds2_extra: 2
            split_config: split.yaml  # Sets this for all joined datasets
        joiner:
          __module__: my_module
          __class__: JoinedSample # Type should implement from_joined(ds1, ds2)
          # __function__: my_join_function  # Alternative to __class__
        split_config: split.yaml  # Sets this for all joined datasets
        split_part: train  # Sets this for all joined datasets
        subflavor: ds1  # Sets this for all joined datasets
        subflavors:  # Sets this for all joined datasets (it will be merged with their individual subflavors)
          source: metadataset.yaml
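
The subflavor comments in the config above describe a merge between the join-level `subflavors` and each dataset's own `subflavors`. A minimal sketch of that merge, assuming per-dataset entries win on conflict (one reading of the "overriding" comment; the PR description does not spell out the precedence). `merge_subflavors` is a hypothetical helper for illustration, not part of `megatron.energon`:

```python
from typing import Any, Dict, Optional


def merge_subflavors(
    parent: Optional[Dict[str, Any]],
    child: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Merge two subflavor dicts (hypothetical helper, not energon API).

    Assumption: on a key conflict, the child (per-dataset) value wins.
    Neither input dict is modified.
    """
    merged = dict(parent or {})
    merged.update(child or {})
    return merged


# Values taken from metadataset_extended.yaml above.
join_level = {"source": "metadataset.yaml"}  # set on the join entry
ds2_level = {"ds2_extra": 2}                 # set on the ds2 entry

ds2_effective = merge_subflavors(join_level, ds2_level)
print(ds2_effective)
```

Under this assumption, `ds2` ends up with both `source` and `ds2_extra`, while `ds1` (which sets no `subflavors` of its own) only receives `source`.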

my_module.py:

from dataclasses import dataclass

import torch

from megatron.energon import Sample, TextSample

@dataclass
class JoinedSample(Sample):
    text1: torch.Tensor
    text2: torch.Tensor

    @staticmethod
    def from_joined(ds1: TextSample, ds2: TextSample) -> "JoinedSample":
        return ds1.to_joined(
            JoinedSample,
            text1=ds1.text,
            text2=ds2.text,
        )
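
To illustrate what the joiner is expected to do, here is a self-contained sketch that does not import `megatron.energon`. It uses stand-in dataclasses and assumes that samples from the joined datasets are matched by a shared key, which this description does not state explicitly. `join_by_key` and the `key` field are hypothetical names for illustration only; `my_join_function` mirrors the commented-out `__function__` alternative in the config.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class TextSample:
    """Stand-in for megatron.energon.TextSample (only the fields used here)."""
    key: str
    text: str


@dataclass
class JoinedSample:
    """Stand-in for the joined sample type from my_module.py."""
    key: str
    text1: str
    text2: str

    @staticmethod
    def from_joined(ds1: TextSample, ds2: TextSample) -> "JoinedSample":
        # Class-based joiner: build the joined sample from one sample per dataset.
        return JoinedSample(key=ds1.key, text1=ds1.text, text2=ds2.text)


def my_join_function(ds1: TextSample, ds2: TextSample) -> JoinedSample:
    # Function-based joiner, equivalent to JoinedSample.from_joined.
    return JoinedSample(key=ds1.key, text1=ds1.text, text2=ds2.text)


def join_by_key(
    primary: Iterable[TextSample],
    secondary: Iterable[TextSample],
    joiner: Callable[[TextSample, TextSample], JoinedSample],
) -> Iterator[JoinedSample]:
    """Hypothetical join: pair samples with equal keys and apply the joiner."""
    lookup = {sample.key: sample for sample in secondary}
    for sample in primary:
        if sample.key not in lookup:
            raise KeyError(f"No matching sample for key {sample.key!r}")
        yield joiner(sample, lookup[sample.key])


ds1 = [TextSample("000", "a cat"), TextSample("001", "a dog")]
ds2 = [TextSample("001", "ein Hund"), TextSample("000", "eine Katze")]

joined = list(join_by_key(ds1, ds2, JoinedSample.from_joined))
joined_fn = list(join_by_key(ds1, ds2, my_join_function))
```

Both joiner styles produce the same result here; the function form is convenient when the sample type should not carry the join logic itself.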

Contains a breaking change

  • Checkpoints will not be compatible, because:
    • The structure of the created datasets changed slightly (an IterMapDataset was removed because sample processing is now a single function).
    • The restore key for inner samples changed.

Renamed all joining stuff to join*
Allow to specify a join method separately from the sample type
@voegtlel voegtlel linked an issue Nov 6, 2024 that may be closed by this pull request
Successfully merging this pull request may close these issues.

[feature request] support different webdataset format
2 participants