[DataProto] Supporting new operations for DataProto#1761

Merged
eric-haibin-lin merged 4 commits into verl-project:main from hongpeng-guo:hpguo/data_proto
May 30, 2025

@hongpeng-guo
Collaborator

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Adds new operations to the DataProto data class and extends existing ones:

  1. Makes DataProto compatible with self.batch is None. This is useful when a DataProto contains only non-tensor data, e.g., images for VLM use cases.
  2. sample_level_repeat: repeats the rows of a DataProto multiple times at the sample level.
  3. unfold_column_chunks: splits along the second dimension into n_splits folds. This is useful for passing grouped tensors that should not be shuffled in the dataset.
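
The two new operations can be illustrated with a minimal sketch on plain Python lists. This is not the verl implementation, which operates on tensors inside a DataProto. The function bodies, the per-row `repeat_times` argument, and the `n_split` parameter name are assumptions made for illustration; they only mirror the behavior described above (rows repeated at the sample level, and the second dimension split into folds that stay grouped).

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_level_repeat(rows: Sequence[T], repeat_times: Sequence[int]) -> List[T]:
    """Repeat rows[i] repeat_times[i] times, keeping copies of a row adjacent.

    Assumption: one repeat count per row (hypothetical signature).
    """
    if len(rows) != len(repeat_times):
        raise ValueError("need exactly one repeat count per row")
    out: List[T] = []
    for row, times in zip(rows, repeat_times):
        out.extend([row] * times)
    return out


def unfold_column_chunks(rows: Sequence[Sequence[T]], n_split: int) -> List[List[T]]:
    """Split every row along its second dimension into n_split equal chunks.

    A [B, n_split * k] input becomes [B * n_split, k]. Chunks that come from
    the same row stay adjacent and in their original order.
    """
    out: List[List[T]] = []
    for row in rows:
        if len(row) % n_split != 0:
            raise ValueError("row length must be divisible by n_split")
        width = len(row) // n_split
        for i in range(n_split):
            out.append(list(row[i * width:(i + 1) * width]))
    return out


repeated = sample_level_repeat(["a", "b", "c"], [2, 0, 1])
unfolded = unfold_column_chunks([[1, 2, 3, 4], [5, 6, 7, 8]], n_split=2)
print(repeated)  # ['a', 'a', 'c']
print(unfolded)  # [[1, 2], [3, 4], [5, 6], [7, 8]]
```

Because the grouped items travel as one row until they are unfolded, any row-level shuffle applied beforehand cannot separate members of the same group.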

API & Test

For usage, see the unit tests added in tests/test_protocol.py. Three tests were added: test_dataproto_no_batch, test_sample_level_repeat, and test_dataproto_unfold_column_chunks.
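
As a rough illustration of the scenario the no-batch test targets, the sketch below shows a container whose tensor batch may be `None` while non-tensor data is still present. `MiniProto` and `select_rows` are hypothetical names invented for this example; they are not verl's DataProto or its API, and plain lists stand in for tensors and arrays.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class MiniProto:
    """Hypothetical stand-in for a batch container; not verl's DataProto."""

    batch: Optional[Dict[str, List[Any]]] = None
    non_tensor_batch: Dict[str, List[Any]] = field(default_factory=dict)

    def __len__(self) -> int:
        # Prefer the tensor batch; fall back to non-tensor data when it is None or empty.
        if self.batch:
            return len(next(iter(self.batch.values())))
        if self.non_tensor_batch:
            return len(next(iter(self.non_tensor_batch.values())))
        return 0

    def select_rows(self, idxs: List[int]) -> "MiniProto":
        # Every operation has to guard against batch being None.
        batch = None
        if self.batch is not None:
            batch = {k: [v[i] for i in idxs] for k, v in self.batch.items()}
        non_tensor = {k: [v[i] for i in idxs] for k, v in self.non_tensor_batch.items()}
        return MiniProto(batch=batch, non_tensor_batch=non_tensor)


images_only = MiniProto(non_tensor_batch={"image": ["img0.png", "img1.png", "img2.png"]})
subset = images_only.select_rows([0, 2])
print(len(images_only), subset.batch, subset.non_tensor_batch)
```

The point of the sketch is the guard pattern: length and row selection are derived from the non-tensor data whenever no tensor batch exists, rather than assuming `batch` is always populated.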

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

Signed-off-by: Hongpeng Guo <hg5@illinois.edu>
@eric-haibin-lin
Collaborator

some broken megatron-sglang tests will be fixed by #1717

@eric-haibin-lin eric-haibin-lin merged commit e23e67b into verl-project:main May 30, 2025
29 of 40 checks passed
@hongpeng-guo hongpeng-guo deleted the hpguo/data_proto branch May 30, 2025 21:38
yzlnew pushed a commit to yzlnew/verl that referenced this pull request Jun 4, 2025
…old_column_chunks) for `DataProto` (verl-project#1761)

### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

Adds new operations to the `DataProto` data class and extends existing ones:

1. Makes `DataProto` compatible with `self.batch is None`. This is
useful when a `DataProto` contains only non-tensor data, e.g., images
for VLM use cases.
2. `sample_level_repeat`: repeats the rows of a `DataProto` multiple
times at the sample level.
3. `unfold_column_chunks`: splits along the second dimension into
`n_splits` folds. This is useful for passing grouped tensors that should
not be shuffled in the dataset.

### API & Test

For usage, see the unit tests added in `tests/test_protocol.py`. Three
tests were added: `test_dataproto_no_batch`, `test_sample_level_repeat`,
and `test_dataproto_unfold_column_chunks`.

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.

---------

Signed-off-by: Hongpeng Guo <hg5@illinois.edu>
wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025
…old_column_chunks) for `DataProto` (verl-project#1761)

chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
…old_column_chunks) for `DataProto` (verl-project#1761)

TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
…old_column_chunks) for `DataProto` (verl-project#1761)

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…old_column_chunks) for `DataProto` (verl-project#1761)
