Merged
38 commits
9f5895c
fix issue in accelerate. (#2121)
Quentin-Anthony Jul 21, 2022
a2506b5
[docs] website refresh (#2123)
jeffra Jul 21, 2022
ed02d38
[docs] link fixes (#2124)
jeffra Jul 22, 2022
fb72ccd
update info and links. (#2122)
awan-10 Jul 22, 2022
86e13b8
Update moe doc (#2098)
awan-10 Jul 22, 2022
6392f83
Bump tzinfo from 1.2.9 to 1.2.10 in /docs (#2126)
dependabot[bot] Jul 22, 2022
8413b7f
DS Benchmarks QoL Improvements (#2120)
Quentin-Anthony Jul 22, 2022
b6305d0
[ROCm] Wrong command broke ROCm build. (#2118)
jpvillam-amd Jul 25, 2022
5349347
DeepSpeed Communication Profiling and Logging (#2012)
Quentin-Anthony Jul 25, 2022
9c6cdec
Delete Gemfile.lock (#2130)
jeffra Jul 25, 2022
316c4a4
Add flake8 to pre-commit checks (#2051)
aphedges Jul 25, 2022
cf587b8
DS on Azure blog (#2133)
awan-10 Jul 26, 2022
2ff1cad
Azure blog news item (#2135)
awan-10 Jul 26, 2022
0e49b19
Update 2022-07-26-deepspeed-azure.md (#2136)
awan-10 Jul 26, 2022
31582d7
Fix conflict between Tutel and top-2 gate in MoE layer (#2053)
yetiansh Jul 26, 2022
9f3a540
adding HF Accelerate+DS tests workflow (#2134)
pacman100 Jul 26, 2022
d9ec8ef
Update 2022-07-26-deepspeed-azure.md (#2138)
awan-10 Jul 26, 2022
ddd2113
turn off time check for now (#2142)
jeffra Jul 26, 2022
5bd09a8
Allow turning off loss scaling wrt GAS + update tput calculator (#2140)
jeffra Jul 26, 2022
84dbe7c
[docs] add details about install reqs (#2143)
jeffra Jul 26, 2022
56a0d18
[docs] update build pipeline badges
jeffra Jul 26, 2022
76a01d5
[docs] README formatting fix
jeffra Jul 26, 2022
6f7137c
fix broken links (#2144)
jimwu6 Jul 27, 2022
5997589
Refactor ZeRO configs to use Pydantic (#2004)
mrwyattii Jul 27, 2022
be46ff6
Add purely-local sliding window sparse attention config (#1962)
Quentin-Anthony Jul 27, 2022
b442264
formatting fix for #1962
jeffra Jul 27, 2022
a54661a
force newer datasets version (#2147)
jeffra Jul 27, 2022
e669aaf
Trajepl/nebula ckpt engine (#2085)
trajepl Jul 28, 2022
66d29b0
Graceful exit on failures for multi-node runs (#2008)
Jul 28, 2022
57140e8
fix: fix BF16_Optimizer compatibility issue with optimizer state 0-di…
shjwudp Jul 28, 2022
556f005
Fix random token-generation issue + MP-checkpoint loading/saving (#2132)
RezaYazdaniAminabadi Jul 29, 2022
ba67bd9
Added retain_graph as a kwarg to the main engine backward function (#…
ncilfone Jul 29, 2022
1ed5aa9
Elastic Training support in DeepSpeed (#2153) (#2156)
aj-prime Jul 29, 2022
63f470e
prevent cuda 10 builds of inference kernels on ampere (#2157)
jeffra Jul 29, 2022
46401b3
[zero-3] shutdown zero.Init from within ds.init (#2150)
jeffra Jul 30, 2022
a039e22
enable fp16 input autocasting (#2158)
jeffra Jul 30, 2022
2210ebe
Release swap buffers for persisted params (#2089)
tjruwase Jul 31, 2022
5fe9d61
Tensor parallelism for Mixture of Experts (#2074)
siddharth9820 Aug 1, 2022
60 changes: 60 additions & 0 deletions .github/workflows/nv-accelerate-v100.yml
@@ -0,0 +1,60 @@
name: nv-accelerate-v100

on:
  push:
    branches:
      - 'master'
      - 'staging**'
    paths-ignore:
      - 'docs/**'
  pull_request:
    paths-ignore:
      - 'docs/**'

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  unit-tests:
    runs-on: [self-hosted, nvidia, cu111, v100]

    steps:
      - uses: actions/checkout@v2

      - name: environment
        run: |
          nvidia-smi
          which python
          python --version
          which nvcc
          nvcc --version
          pip install --upgrade pip
          pip uninstall --yes torch torchvision
          pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
          python -c "import torch; print('torch:', torch.__version__, torch)"
          python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

      - name: Python environment
        run: |
          pip list

      - name: Install deepspeed
        run: |
          pip uninstall --yes deepspeed
          pip install .[dev,autotuning]
          ds_report

      - name: HF Accelerate tests
        run: |
          if [[ -d ./torch-extensions ]]; then rm -rf ./torch-extensions; fi
          git clone https://github.com/huggingface/accelerate
          cd accelerate
          # installing dependencies
          pip install .[testing]
          # force protobuf version due to issues
          pip install "protobuf<4.21.0"
          # tmp fix: force newer datasets version
          pip install "datasets>=2.0.0"
          pip list
          TORCH_EXTENSIONS_DIR=./torch-extensions pytest --color=yes --durations=0 --verbose tests/deepspeed
8 changes: 7 additions & 1 deletion .pre-commit-config.yaml
@@ -39,7 +39,7 @@ repos:
name: check-torchdist
entry: ./scripts/check-torchdist.py
language: script
- exclude: ^(deepspeed/comm/|docs/|benchmarks/|scripts/check-torchdist.py|deepspeed/moe/sharded_moe.py|deepspeed/runtime/comm/coalesced_collectives.py)
+ exclude: ^(deepspeed/comm/|docs/|benchmarks/|scripts/check-torchdist.py|deepspeed/moe/sharded_moe.py|deepspeed/runtime/comm/coalesced_collectives.py|deepspeed/elasticity/elastic_agent.py|deepspeed/launcher/launch.py)
# Specific deepspeed/ files are excluded for now until we wrap ProcessGroup in deepspeed.comm

- repo: https://github.com/codespell-project/codespell
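The effect of the widened `exclude` pattern in this hunk can be checked with a short Python sketch. It assumes that pre-commit applies `exclude` as a regular-expression search against the repo-relative path; the helper name `is_excluded` and the sample path `deepspeed/launcher/runner.py` are illustrative and not part of the change. The pattern is copied verbatim, including its unescaped dots.

```python
import re

# Pattern copied verbatim from the updated `exclude:` line of the
# check-torchdist hook. The dots are unescaped in the original, so each
# one matches any character.
EXCLUDE = re.compile(
    r"^(deepspeed/comm/|docs/|benchmarks/|scripts/check-torchdist.py"
    r"|deepspeed/moe/sharded_moe.py"
    r"|deepspeed/runtime/comm/coalesced_collectives.py"
    r"|deepspeed/elasticity/elastic_agent.py"
    r"|deepspeed/launcher/launch.py)"
)


def is_excluded(path: str) -> bool:
    """Return True if the hook would skip `path`.

    Assumption: the pattern is applied with a regex search on the
    repo-relative path, so the leading `^` anchors it to the start.
    """
    return EXCLUDE.search(path) is not None


if __name__ == "__main__":
    # Sample paths; `runner.py` is a hypothetical non-excluded file.
    for path in (
        "deepspeed/elasticity/elastic_agent.py",
        "deepspeed/launcher/launch.py",
        "deepspeed/launcher/runner.py",
    ):
        print(path, "->", "skipped" if is_excluded(path) else "checked")
```

Because the pattern is anchored with `^`, a directory alternative such as `docs/` only matches at the repository root, not in a nested location like `tests/docs/`.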
@@ -54,3 +54,9 @@ repos:
--check-filenames,
--check-hidden
]

- repo: https://github.com/pycqa/flake8
rev: 4.0.1
hooks:
- id: flake8
args: ['--ignore=E,F403,F405,F541,F841,W', '--select=E9,F,W6', '--per-file-ignores=__init__.py:F401']
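The `--ignore` and `--select` lists in the new flake8 hook overlap (for example `E` is ignored while `E9` is selected), so it may help to see which codes end up reported. The sketch below is a simplified model, not flake8's own code: it assumes that when a code matches both lists, the longer (more specific) prefix wins, and that a code matching no `--select` prefix is not reported. The function names are hypothetical, and `--per-file-ignores` is not modeled.

```python
# Values copied from the hook's `args` in the hunk above.
IGNORE = ["E", "F403", "F405", "F541", "F841", "W"]
SELECT = ["E9", "F", "W6"]


def longest_prefix(code: str, prefixes: list[str]) -> str:
    """Return the longest entry of `prefixes` that `code` starts with, or ''."""
    matches = [p for p in prefixes if code.startswith(p)]
    return max(matches, key=len, default="")


def is_reported(code: str, select=SELECT, ignore=IGNORE) -> bool:
    """Simplified decision: report only if the select match is more specific.

    Assumption (not taken from the source): ties and unselected codes are
    treated as ignored.
    """
    selected = longest_prefix(code, select)
    ignored = longest_prefix(code, ignore)
    if not selected:
        return False
    return len(selected) > len(ignored)


if __name__ == "__main__":
    for code in ("E999", "E501", "F401", "F403", "W605", "W291"):
        print(code, "reported" if is_reported(code) else "ignored")
```

Under this model the hook reports syntax-level errors (`E9`), pyflakes checks other than the four listed `F` codes, and the `W6` deprecation warnings, while style-only `E` and `W` codes stay silent.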
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -3,3 +3,4 @@ recursive-include requirements *.txt
recursive-include deepspeed *.cpp *.h *.cu *.hip *.tr *.cuh *.cc *.json
recursive-include csrc *.cpp *.h *.cu *.tr *.cuh *.cc
recursive-include op_builder *.py
recursive-include benchmarks *.py
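As a rough illustration of what the added `recursive-include benchmarks *.py` line selects, the following sketch walks a directory tree and keeps files whose names match the given patterns. It approximates the `MANIFEST.in` command with `os.walk` and `fnmatch` rather than calling any packaging tool, and the function names and the `hypothetical_pkg` layout are invented for the demo; only `benchmarks/__init__.py` comes from this change.

```python
import fnmatch
import os
import tempfile


def recursive_include(root: str, directory: str, patterns: list[str]) -> list[str]:
    """Approximate `recursive-include <directory> <patterns...>`.

    Returns root-relative paths of all files under `directory` whose
    file name matches at least one glob pattern.
    """
    matched = []
    base = os.path.join(root, directory)
    for dirpath, _dirnames, filenames in os.walk(base):
        for name in filenames:
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                matched.append(rel.replace(os.sep, "/"))
    return sorted(matched)


def demo() -> list[str]:
    """Build a throwaway tree and apply the rule to it."""
    with tempfile.TemporaryDirectory() as root:
        for rel in (
            "benchmarks/__init__.py",
            "benchmarks/hypothetical_pkg/bench.py",  # hypothetical file
            "benchmarks/README.md",                  # hypothetical file
        ):
            path = os.path.join(root, rel)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            open(path, "w").close()
        return recursive_include(root, "benchmarks", ["*.py"])


if __name__ == "__main__":
    print(demo())
```

In this model only the `.py` files are picked up, which is consistent with the change also adding an empty `benchmarks/__init__.py`.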
209 changes: 90 additions & 119 deletions README.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion azure/README.md
@@ -1,3 +1,3 @@
# Getting Started with DeepSpeed on Azure

- Please see our [Azure tutorial](https://www.deepspeed.ai/tutorials/azure/) to get started with DeepSpeed on Azure!
+ The recommended and simplest method to try DeepSpeed on Azure is through [AzureML](https://azure.microsoft.com/en-us/services/machine-learning/). For more details, please see our [Azure tutorial](https://www.deepspeed.ai/tutorials/azure/).
4 changes: 0 additions & 4 deletions azure/attach.sh

This file was deleted.

7 changes: 0 additions & 7 deletions azure/azure_config.json

This file was deleted.

29 changes: 0 additions & 29 deletions azure/azure_ssh.sh

This file was deleted.

3 changes: 0 additions & 3 deletions azure/build_docker_image.sh

This file was deleted.

55 changes: 0 additions & 55 deletions azure/create_vms.sh

This file was deleted.

50 changes: 0 additions & 50 deletions azure/setup_docker.sh

This file was deleted.

54 changes: 0 additions & 54 deletions azure/setup_vms.sh

This file was deleted.

37 changes: 0 additions & 37 deletions azure/shutdown_vms.sh

This file was deleted.

11 changes: 0 additions & 11 deletions azure/start_container.sh

This file was deleted.

Empty file added benchmarks/__init__.py
Empty file.