
ROCm changes on uvm. #1142

Closed · wants to merge 59 commits into from

Changes from all commits · 59 commits
59267a9
Hipification of fbgemm for AMD GPUs/CPUs (#4)
jithunnair-amd Jan 25, 2022
a223936
Use SHEFL_SYNC_MACRO to replace __shefl() and __shefl_sync()
liligwu Jan 26, 2022
4610075
Merge pull request #6 from ROCmSoftwarePlatform/rocm4.3/develop
liligwu Jan 26, 2022
a506c52
Change the hipify dependency to hipify_torch (#7)
liligwu Jan 31, 2022
f596bde
IFU, merge from upstream commit c6df576 to main. (#8)
liligwu Feb 14, 2022
0cfb792
Enable `split_table_batched_embeddings_test.py` (#10)
liligwu Mar 2, 2022
f13af44
*Enable use_cache. *Enable split_embedding_inference_converter_test.p…
liligwu Mar 7, 2022
25e5b71
Skip use_cpu.
liligwu Mar 7, 2022
dcbe19f
Enable test_nbit_cache_pipeline and test_cache_miss_counter.
liligwu Mar 7, 2022
fda048e
Enable quantize_ops_test.py
liligwu Mar 7, 2022
00abba1
Merge branch 'main' into use_cache_enabled
liligwu Mar 7, 2022
cf307b6
Remove @skipIfRocm for test_nbit_cache_pipeline and test_cache_miss_c…
liligwu Mar 7, 2022
2d66ea8
*Uncondition use_cache in split_table_batched_embeddings_test.py *Rem…
liligwu Mar 7, 2022
958679b
Merge pull request #11 from ROCmSoftwarePlatform/use_cache_enabled
amathews-amd Mar 8, 2022
e642a48
Fix backward tests and test_cache_pipeline in split_table_batched_emb…
liligwu Mar 8, 2022
d0d294a
A minor change of removing a commented line.
liligwu Mar 8, 2022
146f2df
Remove skipIfRocm import in split_table_batched_embeddings_test.py.
liligwu Mar 8, 2022
eb0cf36
Merge pull request #12 from ROCmSoftwarePlatform/fix_backward
amathews-amd Mar 8, 2022
0c86f2b
*Removed post_hipify logic in setup.py. *Removed two headerfiles that…
liligwu Mar 11, 2022
6e7f13e
Merge pull request #16 from ROCmSoftwarePlatform/remove_post_hipify
amathews-amd Mar 11, 2022
edd3306
Pointing hipify_torch to the newer commit.
liligwu Mar 14, 2022
9a45f4a
Merge pull request #17 from ROCmSoftwarePlatform/pointing_hipify_torc…
amathews-amd Mar 14, 2022
309a3a1
Fixing #include <ATen/CUDAGeneratorImpl.h> by defining NEW_GENERATOR_…
liligwu Mar 16, 2022
358eaf5
Disabling all use_cpu in the tests. (#20)
liligwu Mar 16, 2022
3a915a8
Change py3.8 syntax to py3.7 syntax (#18)
pruthvistony Mar 16, 2022
40928ba
Match upstream setup (#21)
liligwu Mar 31, 2022
69abf78
Enable merge_pooled_embeddings op. in ROCm (#15)
reza-amd Apr 1, 2022
5c0096e
Merge remote-tracking branch 'upstream/main' into IFU-main-2022-04-07
liligwu Apr 14, 2022
bfac874
Fixing test_lxu_cache_lookup in AMD devices where warp_siize=64
liligwu Apr 14, 2022
1cf7e84
* Enabling the specificationn of hip architecture by using PYTORCH_RO…
liligwu Apr 15, 2022
5b33287
*Fixing the unit tests in sparse_ops_test.py. *Fixing the path of Ato…
liligwu Apr 19, 2022
2c514c5
Merge pull request #23 from ROCmSoftwarePlatform/IFU-main-2022-04-07
pruthvistony Apr 19, 2022
0d5a012
Enable use_cpu in the tests.
liligwu Apr 20, 2022
ae14a47
Merge remote-tracking branch 'upstream/main' into IFU-main-2022-04-20
liligwu Apr 20, 2022
1718605
*Taking @skipIfRocm back in the test_utils.py. *Fixing cublasGemmStri…
liligwu Apr 20, 2022
bc902a3
Cleaning up the code.
liligwu Apr 20, 2022
0d95948
Merge pull request #24 from ROCmSoftwarePlatform/IFU-main-2022-04-20
pruthvistony Apr 21, 2022
9a5a33b
Enabling cuda (#25)
liligwu Apr 21, 2022
6490dbc
Enabling cuda (#25)
liligwu Apr 21, 2022
77627ae
Merge branch 'main' of https://github.com/ROCmSoftwarePlatform/FBGEMM…
liligwu Apr 22, 2022
18b48e9
Merge remote-tracking branch 'upstream/main' into IFU-main-2022-05-02
liligwu May 2, 2022
99a70e1
Merge pull request #2 from ROCmSoftwarePlatform/IFU-main-2022-05-02
liligwu May 4, 2022
fed56ff
Merge branch 'main' into rocm_changes
liligwu May 4, 2022
4b39a70
Merge branch 'upstream_main' into rocm_changes
liligwu May 5, 2022
785afb8
Removing building and testing bash scripts.
liligwu May 5, 2022
bbd0ad1
* Addressing the comments in PR review ROCm changes #1102. * Reoganiz…
liligwu May 9, 2022
9db83d8
Minor changes that minimize the difference to upstream.
liligwu May 9, 2022
eabd0a8
A minor change on a blank line.
liligwu May 9, 2022
2038008
Fixing indentation and commented code in CMakeList.txt
liligwu May 10, 2022
0202078
Removing build script.
liligwu May 10, 2022
9cf8856
Addressing the second batch of comments of https://github.com/pytorch…
liligwu May 11, 2022
b885322
* Removing the condition on c++ standard * An indentation correction
liligwu May 12, 2022
0e3dfdb
* Changing the logic of detecting GPU vender, making CUDA as default.…
liligwu May 13, 2022
1f926e9
Merge remote-tracking branch 'upstream/main' into IFU-2022-05-23
liligwu May 24, 2022
adefcc0
fix enum macro to avoid missing symbols
jeffdaily May 24, 2022
b96bd9a
- Changing detection of ROCm to /opt/rocm. - Skipping 4 unit tests fo…
liligwu May 26, 2022
3a1c2a3
Cherry-pick 33c5e061e7aa47b8efbcb7dee83580b3844f6d67
amathews-amd May 23, 2022
f664267
Resolve the conflict in quantize_ops_benchmark.py
liligwu May 26, 2022
ebb0154
work around Python API for enum_utils.h.
liligwu Jun 8, 2022
2 changes: 1 addition & 1 deletion fbgemm_gpu/CMakeLists.txt
@@ -13,7 +13,7 @@ endif()
option(USE_CUDA "Use CUDA" ON)
option(USE_ROCM "Use ROCm" OFF)

if((EXISTS "/bin/hipcc") AND NOT (EXISTS "/bin/nvcc"))
if((EXISTS "/opt/rocm/") AND NOT (EXISTS "/bin/nvcc"))
message("AMD GPU detected.")
set(USE_CUDA OFF)
set(USE_ROCM ON)
46 changes: 35 additions & 11 deletions fbgemm_gpu/bench/quantize_ops_benchmark.py
@@ -33,17 +33,7 @@ def cli() -> None:
pass


@cli.command()
@click.option("--flush-gpu-cache-size-mb", default=0)
@click.option("--iters", default=100)
@click.option("--warmup-runs", default=2)
@settings(max_examples=10, deadline=None)
# pyre-ignore
@given(
num_columns=st.sampled_from([2**n for n in range(4, 10)]),
num_rows=st.sampled_from([2**n for n in range(4, 10)]),
)
def bench(
def bench_impl(
flush_gpu_cache_size_mb: int,
iters: int,
num_columns: int,
@@ -138,6 +128,40 @@ def bench(
logging.info(f"{k} time per iter: {t_time * 1.0e6:.0f}us")


@settings(max_examples=10, deadline=None)
# pyre-ignore
@given(
num_columns=st.sampled_from([2 ** n for n in range(4, 10)]),
num_rows=st.sampled_from([2 ** n for n in range(4, 10)]),
)
def bench_spectrum(
flush_gpu_cache_size_mb: int,
iters: int,
num_columns: int,
num_rows: int,
warmup_runs: int,
) -> None:
bench_impl(flush_gpu_cache_size_mb=flush_gpu_cache_size_mb, iters=iters, num_columns=num_columns, num_rows=num_rows, warmup_runs=warmup_runs)

@cli.command()
@click.option("--flush-gpu-cache-size-mb", default=0)
@click.option("--iters", default=100)
@click.option("--num-columns", default=-1)
@click.option("--num-rows", default=-1)
@click.option("--warmup-runs", default=2)
def bench(
flush_gpu_cache_size_mb: int,
iters: int,
num_columns: int,
num_rows: int,
warmup_runs: int,
) -> None:
if num_columns == -1 or num_rows == -1:
bench_spectrum(flush_gpu_cache_size_mb=flush_gpu_cache_size_mb, iters=iters, warmup_runs=warmup_runs)
else:
bench_impl(flush_gpu_cache_size_mb=flush_gpu_cache_size_mb, iters=iters, num_columns=num_columns, num_rows=num_rows, warmup_runs=warmup_runs)


Contributor:

Sorry, I missed this in the first review pass. I might just be missing the context, but why do we need to update the quantization test?

Reply:

Our discussions require testing specific values of num_rows and num_columns that fall outside the standard spectrum sampled here. The old implementation of this file accepted num_rows and num_columns as arguments; this brings some of that functionality back.
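For example, an invocation along the lines of `python quantize_ops_benchmark.py bench --num-rows 4096 --num-columns 256` (exact script path assumed) would benchmark one explicit shape, while omitting both flags falls back to the hypothesis-sampled spectrum in `bench_spectrum`.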

@cli.command()
@click.option("--flush-gpu-cache-size-mb", default=0)
@click.option("--iters", default=100)
21 changes: 14 additions & 7 deletions fbgemm_gpu/include/fbgemm_gpu/enum_utils.h
@@ -15,30 +15,37 @@ namespace fbgemm_gpu {

#define FBGEMM_GPU_ENUM_CREATE_TAG(module_name) \
struct fbgemm_gpu_enum_tag_##module_name {}; \
extern template enum_registration<struct fbgemm_gpu_enum_tag_##module_name>* \
template <> enum_registration<struct fbgemm_gpu_enum_tag_##module_name>* \
enum_registration< \
struct fbgemm_gpu_enum_tag_##module_name>::registration_list;
struct fbgemm_gpu_enum_tag_##module_name>::registration_list; \
extern template class enum_registration< \
struct fbgemm_gpu_enum_tag_##module_name>;
liligwu marked this conversation as resolved.

#define FBGEMM_GPU_ENUM_TAG(module_name) \
struct fbgemm_gpu_enum_tag_##module_name

#define FBGEMM_GPU_ENUM_GLOGAL(module_name) \
template class enum_registration<FBGEMM_GPU_ENUM_TAG(module_name)>; \
template <> \
enum_registration<FBGEMM_GPU_ENUM_TAG(module_name)>* \
enum_registration<FBGEMM_GPU_ENUM_TAG(module_name)>::registration_list = \
nullptr;

#define FBGEMM_GPU_ENUM_REGISTER_START(module_name, enum_name) \
enum_registration<FBGEMM_GPU_ENUM_TAG(module_name)> fbgemm_fpu_enum_reg_ ## enum_name( \
#enum_name,
// To work around (escape from) hipify_torch, the names of the identifiers
// are decomposed into `prefix` and `enum_name`.
#define FBGEMM_GPU_ENUM_REGISTER_START(module_name, prefix, enum_name) \
enum_registration<FBGEMM_GPU_ENUM_TAG(module_name)> fbgemm_fpu_enum_reg_ \
## prefix ## enum_name( #prefix #enum_name,

#define FBGEMM_GPU_ENUM_REGISTER_END );

#define FBGEMM_GPU_ENUM_OP(module_name, op_name) \
#op_name "() -> ((str, (str, int)[])[])", \
TORCH_FN(enum_query <FBGEMM_GPU_ENUM_TAG(module_name)>)
#define FBGEMM_GPU_ENUM_ITEM(x) \
{ #x, x }
// To work around (escape from) hipify_torch, the names of the identifiers
// are decomposed into `x` and `y`. `z` is supposed to be hipified.
#define FBGEMM_GPU_ENUM_ITEM(x, y, z) \
{ #x #y, z }

using enum_item = std::tuple<std::string, int64_t>;

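To make the hipify workaround above concrete, here is a minimal, self-contained C++ sketch of the idea; the names ENUM_ITEM, Item, and FakeMemoryAdvise are illustrative stand-ins, not part of fbgemm_gpu. The string half of each entry is pasted together from two fragments that hipify_torch does not recognize as a CUDA token, while the value argument remains a single identifier that hipify is still free to translate on ROCm.

// Minimal sketch of the name-splitting trick (illustrative only, not the
// fbgemm_gpu code): hipify_torch rewrites whole CUDA tokens such as
// cudaMemoryAdvise or cudaMemAdviseSetReadMostly, so the string name is built
// from two halves while the value stays a real, hipifiable identifier.
#include <cstdint>
#include <cstdio>

// Stand-in enumerator; on ROCm, hipify would rewrite the third macro argument
// below to the corresponding hipMemAdvise* enumerator instead.
enum FakeMemoryAdvise : int64_t { cudaMemAdviseSetReadMostly = 1 };

struct Item {
  const char* name;
  int64_t value;
};

// #x #y concatenates two string literals ("cudaMem" "AdviseSetReadMostly"),
// so the full CUDA name never appears as one token in the source.
#define ENUM_ITEM(x, y, z) { #x #y, static_cast<int64_t>(z) }

int main() {
  Item item = ENUM_ITEM(cudaMem, AdviseSetReadMostly, cudaMemAdviseSetReadMostly);
  std::printf("%s = %lld\n", item.name, static_cast<long long>(item.value));
  return 0;
}

Under that assumption, the registration in cumem_utils.cu below keeps exposing entries named with the CUDA spelling (for example "cudaMemAdviseSetReadMostly") on both back ends, while the enumerator value comes from either the CUDA constant or its hipified HIP counterpart.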
14 changes: 7 additions & 7 deletions fbgemm_gpu/src/cumem_utils.cu
@@ -357,13 +357,13 @@ Tensor uvm_to_cpu_clone(Tensor t) {

FBGEMM_GPU_ENUM_GLOGAL(uvm)

FBGEMM_GPU_ENUM_REGISTER_START(uvm, cudaMemoryAdvise){
FBGEMM_GPU_ENUM_ITEM(cudaMemAdviseSetReadMostly),
FBGEMM_GPU_ENUM_ITEM(cudaMemAdviseUnsetReadMostly),
FBGEMM_GPU_ENUM_ITEM(cudaMemAdviseSetPreferredLocation),
FBGEMM_GPU_ENUM_ITEM(cudaMemAdviseUnsetPreferredLocation),
FBGEMM_GPU_ENUM_ITEM(cudaMemAdviseSetAccessedBy),
FBGEMM_GPU_ENUM_ITEM(cudaMemAdviseUnsetAccessedBy),
FBGEMM_GPU_ENUM_REGISTER_START(uvm, cudaMemory, Advise){
FBGEMM_GPU_ENUM_ITEM(cudaMem, AdviseSetReadMostly, cudaMemAdviseSetReadMostly),
FBGEMM_GPU_ENUM_ITEM(cudaMem, AdviseUnsetReadMostly, cudaMemAdviseUnsetReadMostly),
FBGEMM_GPU_ENUM_ITEM(cudaMem, AdviseSetPreferredLocation, cudaMemAdviseSetPreferredLocation),
FBGEMM_GPU_ENUM_ITEM(cudaMem, AdviseUnsetPreferredLocation, cudaMemAdviseUnsetPreferredLocation),
FBGEMM_GPU_ENUM_ITEM(cudaMem, AdviseSetAccessedBy, cudaMemAdviseSetAccessedBy),
FBGEMM_GPU_ENUM_ITEM(cudaMem, AdviseUnsetAccessedBy, cudaMemAdviseUnsetAccessedBy),
} FBGEMM_GPU_ENUM_REGISTER_END

} // namespace fbgemm_gpu
8 changes: 0 additions & 8 deletions fbgemm_gpu/src/cumem_utils_host.cpp
@@ -39,11 +39,7 @@ TORCH_LIBRARY_FRAGMENT(fb, m) {
TORCH_FN(uvm_mem_advice_dont_fork));

m.def("uvm_to_cpu_clone(Tensor t) -> Tensor", TORCH_FN(uvm_to_cpu_clone));

#ifndef __HIP_PLATFORM_HCC__
// FIXME: some advanced "cudaMemAdvise" flags are not supported by HIP.
m.def(FBGEMM_GPU_ENUM_OP(uvm, fbgemm_gpu_uvm_enum_query));
#endif
}

TORCH_LIBRARY_FRAGMENT(fbgemm, m) {
@@ -69,11 +65,7 @@ TORCH_LIBRARY_FRAGMENT(fbgemm, m) {
TORCH_FN(uvm_mem_advice_dont_fork));

m.def("uvm_to_cpu_clone(Tensor t) -> Tensor", TORCH_FN(uvm_to_cpu_clone));

#ifndef __HIP_PLATFORM_HCC__
// FIXME: some advanced "cudaMemAdvise" flags are not supported by HIP.
m.def(FBGEMM_GPU_ENUM_OP(uvm, fbgemm_gpu_uvm_enum_query));
#endif
}

} // namespace fbgemm_gpu
6 changes: 5 additions & 1 deletion fbgemm_gpu/test/uvm_test.py
@@ -19,7 +19,7 @@

if open_source:
# pyre-ignore[21]
from test_utils import gpu_available, gpu_unavailable
from test_utils import gpu_available, gpu_unavailable, skipIfRocm
else:
from fbgemm_gpu.test.test_utils import gpu_available, gpu_unavailable

@@ -80,6 +80,7 @@ def test_enum(self) -> None:
# pyre-ignore[16]
assert cudaMemoryAdvise.cudaMemAdviseSetAccessedBy.value == 5

@skipIfRocm
Contributor:

Why are all these unit tests skipped?

Contributor Author (@liligwu, May 26, 2022):

It complains "out of memory" on some of the ROCm devices. We're investigating it; uvm_test.py is not on the list of the SOW.

The issue is that a large amount of memory (>1 TB) was allocated in the tests. We are checking whether this is valid.

Contributor:

Is there any progress here?

Contributor Author:

Sorry, not yet. I've submitted a JIRA ticket internally, but it may take some time.

Contributor:

I got it. Thanks for your updates.

@unittest.skipIf(*gpu_unavailable)
@given(
sizes=st.lists(
@@ -123,6 +124,7 @@ def test_cudaMemPrefetchAsync(self, sizes: List[int], vanilla: bool) -> None:

torch.cuda.synchronize(torch.device("cuda:0"))

@skipIfRocm
@unittest.skipIf(*gpu_unavailable or torch.cuda.device_count() < 2)
@given(
sizes=st.lists(
@@ -154,6 +156,7 @@ def test_uvm_to_device(self, sizes: List[int], vanilla: bool) -> None:
assert torch.ops.fbgemm.uvm_storage(second_t)
assert second_t.device == device_prototype.device

@skipIfRocm
@unittest.skipIf(*gpu_unavailable)
@given(
sizes=st.lists(
@@ -183,6 +186,7 @@ def test_uvm_slice(self, sizes: List[int], vanilla: bool) -> None:
assert torch.ops.fbgemm.is_uvm_tensor(uvm_slice)
assert torch.ops.fbgemm.uvm_storage(cpu_slice)

@skipIfRocm
@unittest.skipIf(*gpu_unavailable)
@given(
sizes=st.lists(