【PerfXLab】Optimize Randn and randn_like (#1247)
Conversation
Signed-off-by: bin913 <842884726@qq.com>
Review comment on tests/test_unary_pointwise_ops.py (outdated):

    unsqueeze_tensor,
    unsqueeze_tuple,
)
from .conftest import TO_CPU
Reply: This was introduced by a code conflict; I have removed it in the latest commit.
@bin913 Please sign the CLA in order for your PR to be considered as a merge candidate.
Please run
Thanks for the reminder, I have signed the CLA.
* optimize resolve_conj
* optimize argmin
* optimize mean
* recover
* [KUNLUNXIN] open threshold in flaggems benchmark for "index index_add instance_norm kron linspace nll_loss" (flagos-ai#1143)
* [KUNLUNXIN] Fix Bug For Un-Contiguous Copy (flagos-ai#1151)
* [AdvancedCompiler] Add moe_sum (flagos-ai#688): add op, tests in test_special_ops.py, and benchmark in test_special_perf.py with @pytest.mark.moe_sum
* [AdvancedCompiler] Sort (cpp wrapper) (flagos-ai#822)
* [AdvancedCompiler] Add TE.geglu & dgeglu (flagos-ai#1056)
* [MTHREADS] Adaptation for MUSA backend (flagos-ai#1150): adapt mean and argmax; optimize max, min, all, any, arange, argmin, batch_norm, celu, gather, log, and prod; fix addmm and index_put
* [AdvancedCompiler] Per-token-group quant fp8 (flagos-ai#716): implement and test per-token-group fp8 op
* 【Triton Copilot】Enhance test coverage for aten::index operator (flagos-ai#1083): fix AttributeError in the index operator for mixed basic/advanced indexing and support combining advanced and basic indexing using Triton (fixes flagos-ai#635); fix index_put logic inconsistency and precision issues — update get_max_rank_shape() and broadcast_indices() in index_put.py to support None values (consistent with index.py), create tensor_indices after broadcast_indices so broadcasted tensors are used, and add gen_indices_for_index_put() in test_reduction_ops.py to handle multi-dimensional index shapes; reduce test cases in INDEX_ACC_SHAPE back to the original 8 to prevent CI timeouts; add 10 new test cases covering None handling, non-contiguous subspace (transpose logic), boolean mask indexing, error-handling paths, and edge cases (empty tensor, 1D special case) to raise coverage from 70.8% toward >=90%; remove unsupported scenarios (PyTorch does not support all-None indices); fix flake8/black formatting
* ci (flagos-ai#1159)
* extract and cache device_info in cpp wrapper (flagos-ai#1152)
* enable index (flagos-ai#1161)
* Fix te ut (flagos-ai#1168)
* 【Operator】Replace autotuner with libtuner for index_put (flagos-ai#1166)
* fix typos (flagos-ai#1169)
* 【Hopper】Update mm kernel and tune configs (flagos-ai#1104): add mm configs and a TMA GEMM kernel, adjust the location of mm with the TMA implementation, fix CI, and update imports based on the Triton version (noqa comment for the mm import)
* [KUNLUNXIN] Fix Full Like (flagos-ai#1163)
* [KUNLUNXIN] skip lerp/lerp_ test on torch 2.0 with kunlunxin (flagos-ai#1170): the half dtype is unsupported on torch < 2.5, so the test is skipped for now
* [KUNLUNXIN] Fix Softmax accuracy (flagos-ai#1146)
* [KUNLUNXIN] Fix index_put_ (flagos-ai#1165)
* [KUNLUNXIN] Fix Select Scatter (flagos-ai#1164)
* [kunlunxin] update kron op input shapes provided by zhiyuan, except big shapes (flagos-ai#1158)
* [KUNLUNXIN] use manual_seed rwkv_mm_sparsity (flagos-ai#1162): tl.load uses fp32 to improve precision
* Remove register ops (flagos-ai#1171)
* [KUNLUNXIN] Turn on attention tests (flagos-ai#1147): pass in_h in max_pool2d_backward_kernel
* Fix bool type indices (flagos-ai#1023)
* optimize randn and randn_like
* fix code format and style
* remove the unnecessary change

Signed-off-by: Galaxy1458 <55453380+Galaxy1458@users.noreply.github.com>
Signed-off-by: wangrun06 <wangrun06@baidu.com>
Signed-off-by: bin913 <842884726@qq.com>
Co-authored-by: xuexingtu <88195961+xuexingtu@users.noreply.github.com>
Co-authored-by: xuerui <xuerui06@baidu.com>
Co-authored-by: fantasy666 <39185229+fantasy666@users.noreply.github.com>
Co-authored-by: zhaoyin <zhaoyin@zhaoyindeMacBook-Pro.local>
Co-authored-by: AdvancedCompiler <Pikachu_Jun@outlook.com>
Co-authored-by: nmpress1 <1935298275@qq.com>
Co-authored-by: you-and-you <1823382186@qq.com>
Co-authored-by: zhoubo567 <781266327@qq.com>
Co-authored-by: Kylin1207 <13345006231@163.com>
Co-authored-by: Ea760 <15236119052@163.com>
Co-authored-by: Zang Peiyu <166481866+factnn@users.noreply.github.com>
Co-authored-by: Bigeyes <qjk595391@gmail.com>
Co-authored-by: kiddyjinjin <54064850+kiddyjinjin@users.noreply.github.com>
Co-authored-by: WangZhen <23097963+0x45f@users.noreply.github.com>
Co-authored-by: Galaxy1458 <55453380+Galaxy1458@users.noreply.github.com>
Co-authored-by: ylyzty <50573767+ylyzty@users.noreply.github.com>
Co-authored-by: Ason93 <18817617225@163.com>
Co-authored-by: zeno_dongjibin <4pyqm84rz7@privaterelay.appleid.com>
Co-authored-by: purerli98 <82259540+purerli98@users.noreply.github.com>
Co-authored-by: mikiya1991 <anakinlancer@gmail.com>
Co-authored-by: wangrun06 <wangrun06@baidu.com>

PR Category
[Operator]
Type of Change
[Performance Optimization]
Description
Optimize randn and randn_like.
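FlagGems generates the random values on the GPU with a Triton kernel; randn-style kernels typically draw Philox uniforms and convert pairs of them into Gaussians via the Box–Muller transform. Below is a minimal pure-Python sketch of that conversion for illustration only (the function name and structure are hypothetical, not the PR's actual kernel):

```python
import math
import random


def box_muller(n, seed=0):
    """Generate n standard-normal samples from uniform randoms via the
    Box-Muller transform: each pair of uniforms (u1, u2) yields two
    independent Gaussians r*cos(theta) and r*sin(theta)."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u1 = rng.random() or 1e-12  # guard against log(0)
        u2 = rng.random()
        r = math.sqrt(-2.0 * math.log(u1))
        theta = 2.0 * math.pi * u2
        out.append(r * math.cos(theta))
        if len(out) < n:
            out.append(r * math.sin(theta))
    return out
```

On the GPU the same math is applied elementwise over blocks, with the Philox counter derived from the element offset so every thread gets an independent stream.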
Issue
Progress
Performance
test_tensor_constructor_perf.py::test_tensor_constructor_benchmark[randn-randn-generic_constructor_input_fn]

Operator: randn Performance Test (dtype=torch.float16, mode=kernel, level=core)

| Status | Torch Latency (ms) | Gems Latency (ms) | Gems Speedup | Size Detail |
|---|---|---|---|---|
| SUCCESS | 2.290912 | 1.666816 | 1.374 | {'size': [1073741824], 'dtype': torch.float16, 'device': 'cuda'} |
| SUCCESS | 0.043616 | 0.032896 | 1.326 | {'size': [4096, 4096], 'dtype': torch.float16, 'device': 'cuda'} |
| SUCCESS | 0.043712 | 0.033920 | 1.289 | {'size': [64, 512, 512], 'dtype': torch.float16, 'device': 'cuda'} |
| SUCCESS | 2.274112 | 1.653408 | 1.375 | {'size': [1024, 1024, 1024], 'dtype': torch.float16, 'device': 'cuda'} |
| SUCCESS | 0.005664 | 0.005632 | 1.006 | {'size': [64, 64], 'dtype': torch.float16, 'device': 'cuda'} |

Operator: randn Performance Test (dtype=torch.float32, mode=kernel, level=core)

| Status | Torch Latency (ms) | Gems Latency (ms) | Gems Speedup | Size Detail |
|---|---|---|---|---|
| SUCCESS | 2.276512 | 1.713344 | 1.329 | {'size': [1073741824], 'dtype': torch.float32, 'device': 'cuda'} |
| SUCCESS | 0.043392 | 0.032320 | 1.343 | {'size': [4096, 4096], 'dtype': torch.float32, 'device': 'cuda'} |
| SUCCESS | 0.043136 | 0.032384 | 1.332 | {'size': [64, 512, 512], 'dtype': torch.float32, 'device': 'cuda'} |
| SUCCESS | 2.259264 | 1.687392 | 1.339 | {'size': [1024, 1024, 1024], 'dtype': torch.float32, 'device': 'cuda'} |
| SUCCESS | 0.005856 | 0.005408 | 1.083 | {'size': [64, 64], 'dtype': torch.float32, 'device': 'cuda'} |

Operator: randn Performance Test (dtype=torch.bfloat16, mode=kernel, level=core)

| Status | Torch Latency (ms) | Gems Latency (ms) | Gems Speedup | Size Detail |
|---|---|---|---|---|
| SUCCESS | 2.273152 | 1.642272 | 1.384 | {'size': [1073741824], 'dtype': torch.bfloat16, 'device': 'cuda'} |
| SUCCESS | 0.043584 | 0.032256 | 1.351 | {'size': [4096, 4096], 'dtype': torch.bfloat16, 'device': 'cuda'} |
| SUCCESS | 0.043712 | 0.032128 | 1.361 | {'size': [64, 512, 512], 'dtype': torch.bfloat16, 'device': 'cuda'} |
| SUCCESS | 2.274240 | 1.641728 | 1.385 | {'size': [1024, 1024, 1024], 'dtype': torch.bfloat16, 'device': 'cuda'} |
| SUCCESS | 0.005856 | 0.005408 | 1.083 | {'size': [64, 64], 'dtype': torch.bfloat16, 'device': 'cuda'} |
test_tensor_constructor_perf.py::test_tensor_constructor_benchmark[randn_like-randn_like-unary_input_fn]

Operator: randn_like Performance Test (dtype=torch.float16, mode=kernel, level=core)

| Status | Torch Latency (ms) | Gems Latency (ms) | Gems Speedup | Size Detail |
|---|---|---|---|---|
| SUCCESS | 2.291104 | 1.667392 | 1.374 | [torch.Size([1073741824])] |
| SUCCESS | 0.044384 | 0.033152 | 1.339 | [torch.Size([4096, 4096])] |
| SUCCESS | 0.044160 | 0.032896 | 1.342 | [torch.Size([64, 512, 512])] |
| SUCCESS | 2.273856 | 1.654624 | 1.374 | [torch.Size([1024, 1024, 1024])] |
| SUCCESS | 0.005536 | 0.005344 | 1.036 | [torch.Size([64, 64])] |

Operator: randn_like Performance Test (dtype=torch.float32, mode=kernel, level=core)

| Status | Torch Latency (ms) | Gems Latency (ms) | Gems Speedup | Size Detail |
|---|---|---|---|---|
| SUCCESS | 2.260832 | 1.700096 | 1.330 | [torch.Size([1073741824])] |
| SUCCESS | 0.043168 | 0.032480 | 1.329 | [torch.Size([4096, 4096])] |
| SUCCESS | 0.043360 | 0.032480 | 1.335 | [torch.Size([64, 512, 512])] |
| SUCCESS | 2.260352 | 1.700224 | 1.329 | [torch.Size([1024, 1024, 1024])] |
| SUCCESS | 0.005696 | 0.005344 | 1.066 | [torch.Size([64, 64])] |

Operator: randn_like Performance Test (dtype=torch.bfloat16, mode=kernel, level=core)

| Status | Torch Latency (ms) | Gems Latency (ms) | Gems Speedup | Size Detail |
|---|---|---|---|---|
| SUCCESS | 2.275488 | 1.642208 | 1.386 | [torch.Size([1073741824])] |
| SUCCESS | 0.043552 | 0.032192 | 1.353 | [torch.Size([4096, 4096])] |
| SUCCESS | 0.043360 | 0.032160 | 1.348 | [torch.Size([64, 512, 512])] |
| SUCCESS | 2.275040 | 1.642432 | 1.385 | [torch.Size([1024, 1024, 1024])] |
| SUCCESS | 0.005536 | 0.005344 | 1.036 | [torch.Size([64, 64])] |
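For reference, "Gems Speedup" in the tables above is simply the Torch latency divided by the Gems latency. Recomputing one row (the 1073741824-element float16 randn case, values copied from the table) as a sanity check:

```python
# Speedup = Torch latency / Gems latency.
# Row: size [1073741824], dtype=torch.float16, from the randn table above.
torch_ms = 2.290912
gems_ms = 1.666816
speedup = torch_ms / gems_ms
print(f"{speedup:.3f}")  # 1.374, matching the reported value
```

The speedups cluster around 1.3–1.4x for the large shapes, while the tiny 64x64 case is closer to parity since kernel launch overhead dominates at that size.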