
ExecutionProvider API refactor - replace OrtMemoryInfo with OrtDevice #15618

Merged
merged 6 commits into from
May 1, 2023

Conversation

jslhcl
Contributor

@jslhcl jslhcl commented Apr 21, 2023

Description

ExecutionProvider API refactor - replace OrtMemoryInfo with OrtDevice

Motivation and Context

Currently a “location” is represented as an OrtMemoryInfo, which is an OrtDevice plus an OrtMemType, while an OrtDevice is itself DeviceType + DeviceId + MemType. This adds an unnecessary layer of hierarchy; the proposal is to make the definition clear and use OrtDevice directly as the abstraction for a location.
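
For readers not familiar with these types, a rough sketch of the relationship described above (field names and layout are illustrative, not the actual onnxruntime headers):

```
#include <cstdint>

// Illustrative sketch only, not the real onnxruntime definitions.
struct OrtDevice {
  using DeviceType = int8_t;
  using MemoryType = int8_t;
  using DeviceId   = int16_t;

  DeviceType device_type;  // CPU, GPU, FPGA, ...
  MemoryType mem_type;     // DEFAULT, CUDA_PINNED, ...
  DeviceId   device_id;    // ordinal of the physical device
};

// OrtMemoryInfo wraps an OrtDevice and adds another OrtMemType on top,
// which is the redundant layer this PR removes from the EP interface:
// after the refactor, EP APIs hand out an OrtDevice directly.
struct OrtMemoryInfo {
  const char* name;
  OrtDevice device;
  int ort_mem_type;  // OrtMemTypeDefault / OrtMemTypeCPUInput / ...
};
```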

@jslhcl jslhcl requested review from souptc and RandySheriffH April 21, 2023 00:43
@jslhcl
Contributor Author

jslhcl commented Apr 21, 2023

static const DeviceType FPGA = 2;

No FPGA device in EP?


Refers to: include/onnxruntime/core/framework/ortdevice.h:18 in 9bd7286. [](commit_id = 9bd7286, deletion_comment = False)
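
For context, the device-type constants in ortdevice.h look roughly like the sketch below; only `FPGA = 2` is quoted from the file, the other values are assumptions.

```
#include <cstdint>

// Illustrative sketch of include/onnxruntime/core/framework/ortdevice.h.
// Only FPGA = 2 is quoted in the comment above; the other values are assumptions.
struct OrtDevice {
  using DeviceType = int8_t;
  static const DeviceType CPU = 0;   // assumed
  static const DeviceType GPU = 1;   // assumed (CUDA, ROCm, ...)
  static const DeviceType FPGA = 2;  // the constant questioned in this comment
};
```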

if (use_metadef_id_creator) {
metadef_id_generator_ = std::make_unique<ModelMetadefIdGenerator>();
}
}
OrtDevice default_device_;
Member

@souptc souptc Apr 21, 2023


this should be private?

and please add comments for the new member/method. #Resolved

Contributor Author


it is protected.

Comments added

@jslhcl jslhcl marked this pull request as ready for review April 25, 2023 21:31
@jslhcl
Contributor Author

jslhcl commented Apr 26, 2023

/azp run Linux Android Emulator QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

souptc
souptc previously approved these changes Apr 28, 2023
Member

@souptc souptc left a comment


:shipit:

@souptc
Member

souptc commented Apr 28, 2023

static const DeviceType FPGA = 2;

FPGA is reserved for some first-party (1P) hardware.


In reply to: 1517106603


Refers to: include/onnxruntime/core/framework/ortdevice.h:18 in 9bd7286. [](commit_id = 9bd7286, deletion_comment = False)

@jslhcl
Contributor Author

jslhcl commented Apr 30, 2023

/azp run Android CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jslhcl
Contributor Author

jslhcl commented May 1, 2023

/azp run Windows GPU CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@souptc souptc left a comment


:shipit:

@jslhcl jslhcl merged commit d58fa98 into main May 1, 2023
@jslhcl jslhcl deleted the leca/OrtMemoryInfo2OrtDevice branch May 1, 2023 17:06
fs-eire added a commit that referenced this pull request May 4, 2023
### Description
Add the missing `OrtDevice` initialization in JSEP introduced by #15618
pengwa added a commit that referenced this pull request May 6, 2023
### Fix segfault for multiple GPU run

#15618 introduced
`GetOrtDeviceByMemType`. The intention is to handle the CPU device
differently in the if branch, but it mistakenly passes along the default
non-CPU device id.


```
OrtDevice CUDAExecutionProvider::GetOrtDeviceByMemType(OrtMemType mem_type) const {
  if (mem_type == OrtMemTypeCPUInput || mem_type == OrtMemTypeCPUOutput) {
    return OrtDevice(OrtDevice::CPU, OrtDevice::MemType::CUDA_PINNED, default_device_.Id());
  }
  return default_device_;
}
```

We observed a segmentation fault when running multi-GPU training:

```
CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch \
  --nproc_per_node=2 \
  examples/onnxruntime/training/language-modeling/run_mlm.py \
  --model_name_or_path distilbert-base-uncased --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 \
  --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
  --do_train --do_eval --overwrite_output_dir --output_dir ./outputs222/ \
  --seed 1137 --fp16 --report_to none --optim adamw_ort_fused \
  --max_steps 400 --logging_steps 1
```

It was found that GPU0 works fine while GPU1 throws a segmentation fault.
Looking further, a Shape node trying to allocate its output tensor attempts
to fetch the corresponding allocator for OrtDevice(Device:[DeviceType:0
MemoryType:1 DeviceId:1]); the CPU device does not have device id = 1, so no
allocator is returned. When we then call `AsStreamBasedAllocator` on the
returned allocator, the segfault happens because no null check is done there.
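
A minimal sketch of the kind of change that addresses this, assuming the CPU-side branch should use the host device id rather than the GPU ordinal (the actual fix in this commit may differ):

```
// Sketch only: the CPU/pinned branch should address the host-side device,
// so it must not reuse the GPU ordinal from default_device_. On rank 1 that
// ordinal is 1, for which no CPU allocator exists.
OrtDevice CUDAExecutionProvider::GetOrtDeviceByMemType(OrtMemType mem_type) const {
  if (mem_type == OrtMemTypeCPUInput || mem_type == OrtMemTypeCPUOutput) {
    return OrtDevice(OrtDevice::CPU, OrtDevice::MemType::CUDA_PINNED, /*device_id=*/0);
  }
  return default_device_;
}
```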



ShukantPal pushed a commit to ShukantPal/onnxruntime that referenced this pull request May 7, 2023
…microsoft#15618)

ShukantPal pushed a commit to ShukantPal/onnxruntime that referenced this pull request May 7, 2023
### Description
Add the missing `OrtDevice` initialization in JSEP introduced by microsoft#15618
ShukantPal pushed a commit to ShukantPal/onnxruntime that referenced this pull request May 7, 2023
### Fix segfault for multiple GPU run

fs-eire added a commit that referenced this pull request May 9, 2023
### Description
Add the missing `OrtDevice` initialization in JSEP introduced by #15618
jslhcl added a commit that referenced this pull request May 15, 2023
…Input (#15903)

### Description
Change the EP device to the default OrtDevice() when memoryType equals
CPUInput for the CUDA, ROCm, MIGraphX and TensorRT EPs.
x and tensorRT EP


### Motivation and Context
My previous PR (#15618)
caused random failures in the CUDA training test
GradientCheckerTest.TileGrad (see build
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=986784&view=logs&j=5076e696-f193-5f12-2d8a-703dda41a79b&t=a3824a7c-2162-5e3d-3fdd-8cf808834fbb)
and in the ROCm test:
root@a59558217e53:/workspace# pytest
orttraining/orttraining/test/python/orttraining_test_ortmodule_api.py::test_gradient_correctness_minmax
... 
E RuntimeError: Error in backward pass execution: Non-zero status code
returned while running ATen node.
Name:'/_original_module/ATen_Grad/ATen_1' Status Message: Storage size
calculation overflowed with sizes=[72340172838076673, 72340172838076673,
128]

The likely reason is that when the memType of the CUDA/TensorRT/ROCm/MIGraphX
EP is CPUInput, the corresponding device in the IAllocator's memoryInfo used
to be the default OrtDevice(), while after my change it becomes
OrtDevice(CPU, xx_PINNED, 0).

Changing it back fixes GradientCheckerTest.TileGrad in the Windows GPU
training build.
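
A rough sketch of the behavior this commit restores, assuming CPUInput maps back to a plain default OrtDevice() while CPUOutput keeps the pinned-memory device (illustrative; the actual diff may differ):

```
// Sketch only: for OrtMemTypeCPUInput, return the default OrtDevice()
// (CPU, default memory) so the allocator's memory info matches the
// pre-#15618 behavior; pinned host memory is only used for CPU output.
OrtDevice CUDAExecutionProvider::GetOrtDeviceByMemType(OrtMemType mem_type) const {
  if (mem_type == OrtMemTypeCPUInput) {
    return OrtDevice();
  }
  if (mem_type == OrtMemTypeCPUOutput) {
    return OrtDevice(OrtDevice::CPU, OrtDevice::MemType::CUDA_PINNED, /*device_id=*/0);
  }
  return default_device_;
}
```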
prathikr pushed a commit that referenced this pull request May 16, 2023
### Description
Add the missing `OrtDevice` initialization in JSEP introduced by #15618
prathikr pushed a commit that referenced this pull request May 16, 2023
### Fix segfault for multiple GPU run

prathikr pushed a commit that referenced this pull request May 16, 2023
…Input (#15903)

@snnn snnn removed the release:1.15 label May 18, 2023
snnn pushed a commit that referenced this pull request May 19, 2023
### Description
Add the missing `OrtDevice` initialization in JSEP introduced by #15618
snnn pushed a commit that referenced this pull request May 19, 2023
### Fix segfault for multiple GPU run

snnn pushed a commit that referenced this pull request May 19, 2023
…Input (#15903)

snnn pushed a commit that referenced this pull request May 19, 2023
### Description
Add the missing `OrtDevice` initialization in JSEP introduced by #15618
snnn pushed a commit that referenced this pull request May 19, 2023
### Fix segfault for multiple GPU run

snnn pushed a commit that referenced this pull request May 19, 2023
…Input (#15903)

fs-eire added a commit that referenced this pull request May 19, 2023
### Description
Because of #15618, the default allocator changed to the device allocator,
which is the GPU allocator instead of CPU. In the transpose optimizer we
expect to read data from initializers, so a CPU allocator is required there.

This change fixes the transpose optimizer on GPU EPs.

Fixes the issues referred to in #15869 and #15796.
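
A hypothetical illustration of the point (the helper and the lookup call are made up, not the actual optimizer code): when reading initializer data, request an allocator for the CPU device explicitly instead of relying on the EP's default, now GPU-backed, allocator.

```
// Hypothetical sketch, not the actual transpose-optimizer code.
// Reading initializer bytes must go through a CPU allocator, because after
// #15618 the EP's default allocator corresponds to the GPU device.
AllocatorPtr GetCpuAllocatorForInitializers(const SessionState& session_state) {
  const OrtDevice cpu_device(OrtDevice::CPU, OrtDevice::MemType::DEFAULT, /*device_id=*/0);
  return session_state.GetAllocator(cpu_device);  // hypothetical lookup by device
}
```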