Skip to content

Commit

Permalink
Merge branch 'master' into zhejiang/fix_runtime_dataloader_shuffle
Browse files Browse the repository at this point in the history
  • Loading branch information
loadams authored Nov 21, 2024
2 parents 4d3aaa5 + cd20a3b commit 068ff2d
Show file tree
Hide file tree
Showing 63 changed files with 1,067 additions and 207 deletions.
85 changes: 85 additions & 0 deletions .github/workflows/hpu-gaudi2-nightly.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
name: hpu-gaudi2-nightly

on:
workflow_dispatch:
schedule:
- cron: "0 0 * * *"
pull_request:
paths:
- ".github/workflows/hpu-gaudi2-nightly.yml"

concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true

permissions:
contents: read
issues: write

jobs:
unit-tests:
# The type of runner that the job will run on
runs-on: [self-hosted, intel, gaudi2]
container:
image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
ports:
- 80
options: --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice

env:
PT_HPU_LAZY_MODE: 0
TORCHINDUCTOR_COMPILE_THREADS: 1
TEST_LIST: |
test_adamw.py
test_bf16.py
test_ds_config_dict.py
test_dynamic_loss_scale.py
test_latest_checkpoint.py
test_moe_checkpoint.py
test_multi_output_model.py
test_other_optimizer.py
test_pipe.py
test_pipeline.py
test_universal_checkpoint.py
test_zero_context_return.py
test_zero_leaf_module.py
test_zero_offloadpp.py
test_zero_tiled.py
# Steps represent a sequence of tasks that will be executed as part of the job
steps:
# Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
- uses: actions/checkout@v4

- name: Check container state
run: |
ldd --version
hl-smi -L
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
- name: Install transformers
run: |
git clone https://github.com/huggingface/transformers
cd transformers
git rev-parse --short HEAD
pip install .
- name: Install deepspeed
run: |
pip install .[dev,autotuning]
ds_report
- name: Python environment
run: |
pip list
- name: Unit tests
run: |
unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
cd tests
export PT_HPU_LAZY_MODE=${PT_HPU_LAZY_MODE}
export TORCHINDUCTOR_COMPILE_THREADS=${TORCHINDUCTOR_COMPILE_THREADS}
TEST_LIST=$(echo "$TEST_LIST" | awk 'NF{printf "%s%s", (NR>1 ? " or " : ""), $0} END{if (NR>1) print ""}')
echo "TEST_LIST ${TEST_LIST}"
pytest --verbose unit/ -k "${TEST_LIST}"
1 change: 1 addition & 0 deletions .github/workflows/hpu-gaudi2.yml
Original file line number Diff line number Diff line change
Expand Up @@ -111,6 +111,7 @@ jobs:
run: |
git clone https://github.com/huggingface/transformers
cd transformers
git checkout 7df93d6ffb48946e532c7d766fe10372e98d78b6
git rev-parse --short HEAD
pip install .
Expand Down
3 changes: 2 additions & 1 deletion .github/workflows/nv-a6000.yml
Original file line number Diff line number Diff line change
Expand Up @@ -40,8 +40,9 @@ jobs:
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
- name: Install transformers
run: |
git clone --depth=1 https://github.com/huggingface/transformers
git clone https://github.com/huggingface/transformers
cd transformers
git checkout 7df93d6ffb48946e532c7d766fe10372e98d78b6
git rev-parse --short HEAD
python -m pip install .
- name: Install deepspeed
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/python.yml
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ jobs:
unit-tests:
strategy:
matrix:
pyVersion: ["3.8", "3.9", "3.10", "3.11", "3.12"]
pyVersion: ["3.8", "3.9", "3.10"]
fail-fast: false

runs-on: ubuntu-24.04
Expand Down
101 changes: 101 additions & 0 deletions GOVERNANCE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@

# DeepSpeed Project Charter and Governance

This charter sets forth the responsibilities and procedures for technical contribution to, and oversight of, the DeepSpeed open source project. All contributors (including committers, maintainers, and other technical positions) and other participants in the Project (collectively, "Collaborators") must comply with the terms of this Charter.

## Mission and Scope of the Project

The mission of the Project is to DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

The scope of the Project includes collaborative development under the Project License (as defined herein) supporting the mission, including documentation, testing, integration, and the creation of other artifacts that aid the development, deployment, operation, or adoption of the open source project.

## Technical Steering Committee

1. The Technical Steering Committee (the "TSC") will be responsible for all technical oversight of the open source Project.

2. The TSC voting members are initially the Project's Committers. At the inception of the project, the Committers of the Project will be as set forth within the "CONTRIBUTING" file within the Project's code repository. The TSC may choose an alternative approach for determining the voting members of the TSC, and any such alternative approach will be documented in the CONTRIBUTING file. Any meetings of the Technical Steering Committee are intended to be open to the public, and can be conducted electronically, via teleconference, or in person.

3. TSC projects generally will involve Contributors and Committers. The TSC may adopt or modify roles so long as the roles are documented in the CONTRIBUTING file. Unless otherwise documented:

- **Contributors** include anyone in the technical community that contributes code, documentation, or other technical artifacts to the Project.
- **Committers** are Contributors who have earned the ability to modify ("commit") source code, documentation, or other technical artifacts in a project's repository.

- A Contributor may become a Committer by a majority approval of the existing Committers. A Committer may be removed by a majority approval of the other existing Committers.

4. Participation in the Project through becoming a Contributor and Committer is open to anyone so long as they abide by the terms of this Charter.

5. The TSC may:
- Establish workflow procedures for the submission, approval, and closure/archiving of projects.
- Set requirements for the promotion of Contributors to Committer status, as applicable.
- Amend, adjust, refine and/or eliminate the roles of Contributors and Committers, and create new roles, and publicly document any TSC roles, as it sees fit.

6. The TSC may elect a TSC Chair, who will preside over meetings of the TSC and will serve until their resignation or replacement by the TSC. The TSC Chair, or any other TSC member so designated by the TSC, will serve as the primary communication contact between the Project and AI & Data, a directed fund of The Linux Foundation.

7. Responsibilities: The TSC will be responsible for all aspects of oversight relating to the Project, which may include:

- Coordinating the technical direction of the Project.
- Approving project or system proposals (including, but not limited to, incubation, deprecation, and changes to a sub-project's scope).
- Organizing sub-projects and removing sub-projects.
- Creating sub-committees or working groups to focus on cross-project technical issues and requirements.
- Appointing representatives to work with other open source or open standards communities.
- Establishing community norms, workflows, issuing releases, and security issue reporting policies.
- Approving and implementing policies and processes for contributing (to be published in the CONTRIBUTING file) and coordinating with the series manager of the Project (as provided for in the Series Agreement, the "Series Manager") to resolve matters or concerns that may arise as set forth in Section 7 of this Charter.
- Discussions, seeking consensus, and where necessary, voting on technical matters relating to the code base that affect multiple projects.
- Coordinating any marketing, events, or communications regarding the Project.

## TSC Voting

1. While the Project aims to operate as a consensus-based community, if any TSC decision requires a vote to move the Project forward, the voting members of the TSC will vote on a one vote per voting member basis.

2. Quorum for TSC meetings requires at least fifty percent of all voting members of the TSC to be present. The TSC may continue to meet if quorum is not met but will be prevented from making any decisions at the meeting.

3. Except as provided in Section 7.c. and 8.a, decisions by vote at a meeting require a majority vote of those in attendance, provided quorum is met. Decisions made by electronic vote without a meeting require a majority vote of all voting members of the TSC.

4. In the event a vote cannot be resolved by the TSC, any voting member of the TSC may refer the matter to the Series Manager for assistance in reaching a resolution.

## Compliance with Policies

1. This Charter is subject to the Series Agreement for the Project and the Operating Agreement of LF Projects. Contributors will comply with the policies of LF Projects as may be adopted and amended by LF Projects, including, without limitation, the policies listed at https://lfprojects.org/policies/.

2. The TSC may adopt a code of conduct ("CoC") for the Project, which is subject to approval by the Series Manager. In the event that a Project-specific CoC has not been approved, the LF Projects Code of Conduct listed at https://lfprojects.org/policies will apply for all Collaborators in the Project.

3. When amending or adopting any policy applicable to the Project, LF Projects will publish such policy, as to be amended or adopted, on its website at least 30 days prior to such policy taking effect; provided, however, that in the case of any amendment of the Trademark Policy or Terms of Use of LF Projects, any such amendment is effective upon publication on LF Project's website.

4. All Collaborators must allow open participation from any individual or organization meeting the requirements for contributing under this Charter and any policies adopted for all Collaborators by the TSC, regardless of competitive interests. Put another way, the Project community must not seek to exclude any participant based on any criteria, requirement, or reason other than those that are reasonable and applied on a non-discriminatory basis to all Collaborators in the Project community.

5. The Project will operate in a transparent, open, collaborative, and ethical manner at all times. The output of all Project discussions, proposals, timelines, decisions, and status should be made open and easily visible to all. Any potential violations of this requirement should be reported immediately to the Series Manager.

## Community Assets

1. LF Projects will hold title to all trade or service marks used by the Project ("Project Trademarks"), whether based on common law or registered rights. Project Trademarks will be transferred and assigned to LF Projects to hold on behalf of the Project. Any use of any Project Trademarks by Collaborators in the Project will be in accordance with the license from LF Projects and inure to the benefit of LF Projects.

2. The Project will, as permitted and in accordance with such license from LF Projects, develop and own all Project GitHub and social media accounts, and domain name registrations created by the Project community.

3. Under no circumstances will LF Projects be expected or required to undertake any action on behalf of the Project that is inconsistent with the tax-exempt status or purpose, as applicable, of the Joint Development Foundation or LF Projects, LLC.

## General Rules and Operations

The Project will:

1. Engage in the work of the Project in a professional manner consistent with maintaining a cohesive community, while also maintaining the goodwill and esteem of LF Projects, Joint Development Foundation, and other partner organizations in the open source community.
2. Respect the rights of all trademark owners, including any branding and trademark usage guidelines.

## Intellectual Property Policy

1. Collaborators acknowledge that the copyright in all new contributions will be retained by the copyright holder as independent works of authorship and that no contributor or copyright holder will be required to assign copyrights to the Project.

2. Except as described in Section 7.c., all contributions to the Project are subject to the following:

- All new inbound code contributions to the Project must be made using Apache License, Version 2.0 available at http://www.apache.org/licenses/LICENSE-2.0 (the "Project License").
- All new inbound code contributions must also be accompanied by a Developer Certificate of Origin (http://developercertificate.org) sign-off in the source code system that is submitted through a TSC-approved contribution process which will bind the authorized contributor and, if not self-employed, their employer to the applicable license.
- All outbound code will be made available under the Project License.
- Documentation will be received and made available by the Project under the Creative Commons Attribution 4.0 International License (available at http://creativecommons.org/licenses/by/4.0/).
- The Project may seek to integrate and contribute back to other open source projects ("Upstream Projects"). In such cases, the Project will conform to all license requirements of the Upstream Projects, including dependencies, leveraged by the Project. Upstream Project code contributions not stored within the Project's main code repository will comply with the contribution process and license terms for the applicable Upstream Project.

3. The TSC may approve the use of an alternative license or licenses for inbound or outbound contributions on an exception basis. To request an exception, please describe the contribution, the alternative open source license(s), and the justification for using an alternative open source license for the Project. License exceptions must be approved by a two-thirds vote of the entire TSC.

4. Contributed files should contain license information, such as SPDX short form identifiers, indicating the open source license or licenses pertaining to the file.

## Amendments

1. This charter may be amended by a two-thirds vote of the entire TSC and is subject to approval by LF Projects.
2 changes: 2 additions & 0 deletions accelerator/cpu_accelerator.py
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,8 @@ def device_count(self):
# In flat mode, HBM is in separate NUMA node with no cores on this node.
# Ignore these NUMA nodes with no cores.
numa_core_lists = get_numa_cores()
if not numa_core_lists:
return 1
numa_count = 0
prev_core_list = []
for core_list in numa_core_lists:
Expand Down
18 changes: 12 additions & 6 deletions csrc/aio/common/deepspeed_aio_utils.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -19,9 +19,14 @@ const int c_io_queue_depth = 8;

io_xfer_ctxt::io_xfer_ctxt(const int fd,
const int64_t file_offset,
const int64_t buffer_offset,
const int64_t num_bytes,
const void* buffer)
: _fd(fd), _base_offset(file_offset), _mem_buffer(buffer), _num_bytes(num_bytes)
: _fd(fd),
_file_base_offset(file_offset),
_buffer_base_offset(buffer_offset),
_mem_buffer(buffer),
_num_bytes(num_bytes)
{
}

Expand All @@ -41,9 +46,10 @@ void io_prep_context::prep_iocbs(const int n_iocbs,
assert(static_cast<size_t>(n_iocbs) <= _iocbs->size());
for (auto i = 0; i < n_iocbs; ++i) {
const auto shift = i * _block_size;
const auto xfer_buffer = (char*)start_buffer + _xfer_ctxt->_base_offset + shift;
const auto xfer_offset = _xfer_ctxt->_base_offset + start_offset + shift;
const auto xfer_buffer = (char*)start_buffer + _xfer_ctxt->_buffer_base_offset + shift;
const auto xfer_offset = _xfer_ctxt->_file_base_offset + start_offset + shift;
auto byte_count = _block_size;

if ((shift + _block_size) > num_bytes) { byte_count = num_bytes - shift; }

if (_read_op) {
Expand Down Expand Up @@ -79,10 +85,10 @@ int io_prep_generator::prep_iocbs(const int n_iocbs, std::vector<struct iocb*>*

auto actual_n_iocbs = min(static_cast<int64_t>(n_iocbs), _remaining_io_blocks);
for (auto i = 0; i < actual_n_iocbs; ++i, ++_next_iocb_index) {
const auto xfer_offset = _xfer_ctxt->_base_offset + (_next_iocb_index * _block_size);
const auto xfer_buffer = (char*)_xfer_ctxt->_mem_buffer + xfer_offset;
const auto xfer_buffer = (char*)_xfer_ctxt->_mem_buffer + _xfer_ctxt->_buffer_base_offset +
(_next_iocb_index * _block_size);
const auto xfer_offset = _xfer_ctxt->_file_base_offset + (_next_iocb_index * _block_size);
const auto num_bytes = min(static_cast<int64_t>(_block_size), _remaining_bytes);

if (_read_op) {
io_prep_pread(iocbs->at(i), _xfer_ctxt->_fd, xfer_buffer, num_bytes, xfer_offset);
} else {
Expand Down
4 changes: 3 additions & 1 deletion csrc/aio/common/deepspeed_aio_utils.h
Original file line number Diff line number Diff line change
Expand Up @@ -30,12 +30,14 @@ Functionality for swapping optimizer tensors to/from (NVMe) storage devices.

struct io_xfer_ctxt {
const int _fd;
const int64_t _base_offset;
const int64_t _file_base_offset;
const int64_t _buffer_base_offset;
const void* _mem_buffer;
const int64_t _num_bytes;

io_xfer_ctxt(const int fd,
const int64_t file_offset,
const int64_t buffer_offset,
const int64_t num_bytes,
const void* buffer);
};
Expand Down
6 changes: 4 additions & 2 deletions csrc/aio/py_lib/deepspeed_aio_op_desc.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -13,14 +13,16 @@ io_op_desc_t::io_op_desc_t(const bool read_op,
const char* filename,
const int64_t file_num_bytes,
const int intra_op_parallelism,
const bool validate)
const bool validate,
const int64_t file_offset)
: _read_op(read_op),
_buffer(buffer),
_fd(fd),
_filename(filename),
_file_num_bytes(file_num_bytes),
_file_offset(file_offset),
_intra_op_parallelism(intra_op_parallelism),
_num_bytes_per_thread(file_num_bytes / intra_op_parallelism),
_num_bytes_per_thread(static_cast<int64_t>(buffer.nbytes()) / intra_op_parallelism),
_validate(validate)
{
}
Expand Down
4 changes: 3 additions & 1 deletion csrc/aio/py_lib/deepspeed_aio_op_desc.h
Original file line number Diff line number Diff line change
Expand Up @@ -19,14 +19,16 @@ struct io_op_desc_t {
const int64_t _num_bytes_per_thread;
torch::Tensor _contiguous_buffer;
const bool _validate;
const int64_t _file_offset;

io_op_desc_t(const bool read_op,
const torch::Tensor& buffer,
const int fd,
const char* filename,
const int64_t file_num_bytes,
const int intra_op_parallelism,
const bool validate);
const bool validate,
const int64_t file_offset);

virtual void run(const int tid,
std::unique_ptr<aio_context>& aio_ctxt,
Expand Down
Loading

0 comments on commit 068ff2d

Please sign in to comment.