Skip to content

[BugFix][CI] Megatron: add ep CI#1726

Merged
vermouth1992 merged 11 commits intoverl-project:mainfrom
ETOgaosion:fix_ep
Jun 5, 2025
Merged

[BugFix][CI] Megatron: add ep CI#1726
vermouth1992 merged 11 commits intoverl-project:mainfrom
ETOgaosion:fix_ep

Conversation

@ETOgaosion
Copy link
Copy Markdown
Collaborator

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Fix ep bug and try to add CI with 15B model, finding smaller models which are more convenient to test.

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this 

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluatuion results, etc.

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
  • Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@vermouth1992
Copy link
Copy Markdown
Collaborator

15b is too big. I guess we should create a tiny fake model

@ETOgaosion ETOgaosion marked this pull request as draft May 27, 2025 14:42
@ETOgaosion ETOgaosion marked this pull request as ready for review May 27, 2025 14:45
@ETOgaosion
Copy link
Copy Markdown
Collaborator Author

@vermouth1992 Currently the checkpoint_converter test already involves the Qwen/Qwen1.5-MoE-A2.7B-Chat 15B model, so the CI machines have no need to download this, so we can directly test 1 step to find whether it works well.

@vermouth1992
Copy link
Copy Markdown
Collaborator

Sure, I guess we can test a small bsz and seqlen to ensure it will never cause OOM

@vermouth1992
Copy link
Copy Markdown
Collaborator

The CI system seems broken, could you reopen the PR?

Copy link
Copy Markdown
Collaborator

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RuntimeError: CUDA error: out of memory
:(

@vermouth1992
Copy link
Copy Markdown
Collaborator

15b is too large to perform CI. We can create a model with random weights at the beginning to perform CI.

@vermouth1992 vermouth1992 merged commit 2a386cf into verl-project:main Jun 5, 2025
34 checks passed
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 6, 2025
### Checklist Before Starting

- [ ] Search for similar PR(s).

### What does this PR do?

Fix ep bug and try to add CI with 15B model, finding smaller models
which are more convenient to test.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluatuion results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
### Checklist Before Starting

- [ ] Search for similar PR(s).

### What does this PR do?

Fix ep bug and try to add CI with 15B model, finding smaller models
which are more convenient to test.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluatuion results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
### Checklist Before Starting

- [ ] Search for similar PR(s).

### What does this PR do?

Fix ep bug and try to add CI with 15B model, finding smaller models
which are more convenient to test.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluatuion results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
### Checklist Before Starting

- [ ] Search for similar PR(s).

### What does this PR do?

Fix ep bug and try to add CI with 15B model, finding smaller models
which are more convenient to test.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluatuion results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants