
【PerfXLab】Optimize Randn and randn_like #1247

Merged
kiddyjinjin merged 33 commits into flagos-ai:master from bin913:randn_randn_like
Dec 30, 2025

Conversation

@bin913 (Contributor) commented Dec 20, 2025

PR Category
[Operator]

Type of Change
[Performance Optimization]

Description
Optimize randn and randn_like.
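The PR description does not spell out the optimization itself. As background, GPU `randn` kernels typically turn counter-based uniform random bits into normal samples; a minimal pure-Python sketch of the Box-Muller transform commonly used for that step (illustrative only, not the kernel in this PR):

```python
import math

def box_muller(u1: float, u2: float):
    """Map two uniform samples in (0, 1] to two independent
    standard-normal samples via the classic Box-Muller transform."""
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

# With u1 = u2 = 0.5: r = sqrt(2 ln 2) ≈ 1.1774 and theta = pi,
# so z0 ≈ -1.1774 and z1 ≈ 0.
z0, z1 = box_muller(0.5, 0.5)
```

In a real kernel the uniforms come from a counter-based generator (e.g. Philox) so each output element can be produced independently in parallel.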

Issue

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by a UT.

Performance
test_tensor_constructor_perf.py::test_tensor_constructor_benchmark[randn-randn-generic_constructor_input_fn]
Operator: randn Performance Test (dtype=torch.float16, mode=kernel,level=core)
Status Torch Latency (ms) Gems Latency (ms) Gems Speedup Size Detail

SUCCESS 2.290912 1.666816 1.374 {'size': [1073741824], 'dtype': torch.float16, 'device': 'cuda'}
SUCCESS 0.043616 0.032896 1.326 {'size': [4096, 4096], 'dtype': torch.float16, 'device': 'cuda'}
SUCCESS 0.043712 0.033920 1.289 {'size': [64, 512, 512], 'dtype': torch.float16, 'device': 'cuda'}
SUCCESS 2.274112 1.653408 1.375 {'size': [1024, 1024, 1024], 'dtype': torch.float16, 'device': 'cuda'}
SUCCESS 0.005664 0.005632 1.006 {'size': [64, 64], 'dtype': torch.float16, 'device': 'cuda'}

Operator: randn Performance Test (dtype=torch.float32, mode=kernel,level=core)
Status Torch Latency (ms) Gems Latency (ms) Gems Speedup Size Detail

SUCCESS 2.276512 1.713344 1.329 {'size': [1073741824], 'dtype': torch.float32, 'device': 'cuda'}
SUCCESS 0.043392 0.032320 1.343 {'size': [4096, 4096], 'dtype': torch.float32, 'device': 'cuda'}
SUCCESS 0.043136 0.032384 1.332 {'size': [64, 512, 512], 'dtype': torch.float32, 'device': 'cuda'}
SUCCESS 2.259264 1.687392 1.339 {'size': [1024, 1024, 1024], 'dtype': torch.float32, 'device': 'cuda'}
SUCCESS 0.005856 0.005408 1.083 {'size': [64, 64], 'dtype': torch.float32, 'device': 'cuda'}

Operator: randn Performance Test (dtype=torch.bfloat16, mode=kernel,level=core)
Status Torch Latency (ms) Gems Latency (ms) Gems Speedup Size Detail

SUCCESS 2.273152 1.642272 1.384 {'size': [1073741824], 'dtype': torch.bfloat16, 'device': 'cuda'}
SUCCESS 0.043584 0.032256 1.351 {'size': [4096, 4096], 'dtype': torch.bfloat16, 'device': 'cuda'}
SUCCESS 0.043712 0.032128 1.361 {'size': [64, 512, 512], 'dtype': torch.bfloat16, 'device': 'cuda'}
SUCCESS 2.274240 1.641728 1.385 {'size': [1024, 1024, 1024], 'dtype': torch.bfloat16, 'device': 'cuda'}
SUCCESS 0.005856 0.005408 1.083 {'size': [64, 64], 'dtype': torch.bfloat16, 'device': 'cuda'}

test_tensor_constructor_perf.py::test_tensor_constructor_benchmark[randn_like-randn_like-unary_input_fn]
Operator: randn_like Performance Test (dtype=torch.float16, mode=kernel,level=core)
Status Torch Latency (ms) Gems Latency (ms) Gems Speedup Size Detail

SUCCESS 2.291104 1.667392 1.374 [torch.Size([1073741824])]
SUCCESS 0.044384 0.033152 1.339 [torch.Size([4096, 4096])]
SUCCESS 0.044160 0.032896 1.342 [torch.Size([64, 512, 512])]
SUCCESS 2.273856 1.654624 1.374 [torch.Size([1024, 1024, 1024])]
SUCCESS 0.005536 0.005344 1.036 [torch.Size([64, 64])]

Operator: randn_like Performance Test (dtype=torch.float32, mode=kernel,level=core)
Status Torch Latency (ms) Gems Latency (ms) Gems Speedup Size Detail

SUCCESS 2.260832 1.700096 1.330 [torch.Size([1073741824])]
SUCCESS 0.043168 0.032480 1.329 [torch.Size([4096, 4096])]
SUCCESS 0.043360 0.032480 1.335 [torch.Size([64, 512, 512])]
SUCCESS 2.260352 1.700224 1.329 [torch.Size([1024, 1024, 1024])]
SUCCESS 0.005696 0.005344 1.066 [torch.Size([64, 64])]

Operator: randn_like Performance Test (dtype=torch.bfloat16, mode=kernel,level=core)
Status Torch Latency (ms) Gems Latency (ms) Gems Speedup Size Detail

SUCCESS 2.275488 1.642208 1.386 [torch.Size([1073741824])]
SUCCESS 0.043552 0.032192 1.353 [torch.Size([4096, 4096])]
SUCCESS 0.043360 0.032160 1.348 [torch.Size([64, 512, 512])]
SUCCESS 2.275040 1.642432 1.385 [torch.Size([1024, 1024, 1024])]
SUCCESS 0.005536 0.005344 1.036 [torch.Size([64, 64])]
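For reference, the Gems Speedup column in the tables above is simply the ratio of the two measured latencies; checking the first float16 randn row:

```python
def speedup(torch_ms: float, gems_ms: float) -> float:
    """Speedup of the Gems kernel over the Torch baseline,
    as reported in the benchmark tables."""
    return torch_ms / gems_ms

# First float16 randn row: Torch 2.290912 ms, Gems 1.666816 ms
s = round(speedup(2.290912, 1.666816), 3)  # → 1.374
```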

bin913 and others added 30 commits December 3, 2025 11:22
… instance_norm kron linspace nll_loss" (#1143)

* [KUNLUNXIN] open threshold in flaggems benchmark for "index index_add instance_norm kron linspace nll_loss"

* [KUNLUNXIN] open threshold in flaggems benchmark for "index index_add instance_norm kron linspace nll_loss"

---------

Co-authored-by: xuerui <xuerui06@baidu.com>
Co-authored-by: zhaoyin <zhaoyin@zhaoyindeMacBook-Pro.local>
* moe_sum

* change format

* delete test_moe_sum

* update for build

* remove json

* add test

* add test to test_special_ops.py

* add benchmark to test_special_perf.py

* add @pytest.mark.moe_sum

---------

Co-authored-by: nmpress1 <1935298275@qq.com>
Co-authored-by: you-and-you <1823382186@qq.com>
* adaptation
    - mean
    - argmax
* optimize
    - max
    - min
    - all
    - any
    - arange
    - argmin
    - batch_norm
    - celu
    - gather
    - log
    - prod
* fix
    - addmm
    - index_put
* implement and test per-token-group fp8 op

---------

Co-authored-by: Ea760 <15236119052@163.com>
* Enhance test coverage for aten::index operator

- Fix AttributeError in index operator for mixed basic/advanced indexing
- Add comprehensive test cases for index operator
- Support combining advanced and basic indexing using Triton

Fixes #635

* Fix index_put logic inconsistency and precision issues

- Update get_max_rank_shape() and broadcast_indices() in index_put.py to support None values (consistent with index.py)
- Fix precision issue: create tensor_indices AFTER broadcast_indices to ensure using broadcasted tensors
- Add gen_indices_for_index_put() function in test_reduction_ops.py to properly handle multi-dimensional index shapes
- Update all index_put tests to use gen_indices_for_index_put()

This fixes the pipeline failures and ensures consistency between index and index_put operators.

* Fix code formatting issues (trailing whitespace and black formatting)

* Reduce test cases to prevent timeout

- Remove excessive test cases added to INDEX_ACC_SHAPE
- Keep only the original 8 test cases to match the baseline
- This should prevent CI timeout issues

* Add test cases to improve coverage for index and index_put operators

- Add test cases for None value handling in index operator
- Add test cases for non-contiguous subspace (transpose logic)
- Add test cases for boolean mask indexing
- Add test cases for error handling paths
- Add test cases for edge cases (empty tensor, all None, 1D special case)
- Add error handling tests for index_put operators

Total: 10 new test cases covering critical code paths to improve coverage from 70.8% to target >=90%

* Fix black formatting for long lines in test cases

* Fix failing test cases: remove unsupported scenarios

- Remove test_index_all_none: PyTorch doesn't support all-None indices
- Simplify test_index_with_none_basic_indexing: keep only working parameter combinations
- Remove test_index_non_contiguous_subspace: implementation issue

All remaining test cases now pass successfully (8/8 passed)

* Fix formatting: remove extra blank lines (flake8 and black)
* enable index

* replace autotuner to libtuner for index_put
* add mm configs

* tma gemm kernel

* adjust the location of mm with tma implementation

* fix ci

* Update imports based on Triton version

Signed-off-by: Galaxy1458 <55453380+Galaxy1458@users.noreply.github.com>

* Update import statement for mm with noqa comment

Signed-off-by: Galaxy1458 <55453380+Galaxy1458@users.noreply.github.com>

---------

Signed-off-by: Galaxy1458 <55453380+Galaxy1458@users.noreply.github.com>
Co-authored-by: Galaxy1458 <55453380+Galaxy1458@users.noreply.github.com>
* [KUNLUNXIN] skip lerp/lerp_ test on torch 2.0 with kunlunxin

The half dtype is unsupported on torch < 2.5, so we have to skip that for now

* [KUNLUNXIN] skip lerp/lerp_ test on torch 2.0 with kunlunxin

The half dtype is unsupported on torch < 2.5, so we have to skip that for now
…big shapes (#1158)

Co-authored-by: xuerui <xuerui06@baidu.com>
* [KUNLUNXIN] use manual_seed rwkv_mm_sparsity

* [KUNLUNXIN] tl.load use fp32 to update precision
* [KUNLUNXIN] Passin in_h in max_pool2d_backward_kernel

Signed-off-by: wangrun06 <wangrun06@baidu.com>

* [KUNLUNXIN] Turn on attention tests

Signed-off-by: wangrun06 <wangrun06@baidu.com>

---------

Signed-off-by: wangrun06 <wangrun06@baidu.com>
Co-authored-by: wangrun06 <wangrun06@baidu.com>
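Several commits above make get_max_rank_shape()/broadcast_indices() tolerate None entries in the index list. A minimal pure-Python sketch of that shape-broadcasting idea (the helper names and behavior are assumed from the commit messages, not taken from the actual FlagGems implementation):

```python
import itertools

def broadcast_shapes(*shapes):
    """NumPy-style broadcast of shapes, aligned from the right."""
    out = []
    for dims in itertools.zip_longest(*(reversed(s) for s in shapes), fillvalue=1):
        sizes = {d for d in dims if d != 1}
        if len(sizes) > 1:
            raise ValueError(f"incompatible dims: {dims}")
        out.append(sizes.pop() if sizes else 1)
    return tuple(reversed(out))

def get_max_rank_shape(index_shapes):
    """Broadcast shape over tensor-index shapes; a None entry marks a
    basic-indexing slot (e.g. a slice) and is skipped, as the commits
    describe for index/index_put consistency."""
    return broadcast_shapes(*(s for s in index_shapes if s is not None))

# A (3,1) index and a (1,4) index broadcast to (3,4); the None slot is ignored.
shape = get_max_rank_shape([(3, 1), None, (1, 4)])  # → (3, 4)
```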
@CLAassistant

CLAassistant commented Dec 20, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
15 out of 16 committers have signed the CLA.

✅ bin913
✅ xuexingtu
✅ fantasy666
✅ AdvancedCompiler
✅ factnn
✅ Kylin1207
✅ botbigeyes
✅ kiddyjinjin
✅ 0x45f
✅ Ason93
✅ dongjibin1996
✅ ylyzty
✅ purerli98
✅ Galaxy1458
✅ zhoubo567
❌ mikiya1991
You have signed the CLA already but the status is still pending? Let us recheck it.

Signed-off-by: bin913 <842884726@qq.com>
unsqueeze_tensor,
unsqueeze_tuple,
)
from .conftest import TO_CPU
Contributor
Why this change?

Contributor Author

This was introduced by a merge conflict; I have removed it in the latest commit.

@tengqm
Contributor

tengqm commented Dec 28, 2025

@bin913 Please sign the CLA in order for your PR to be considered as a merge candidate.

@huangyiqun
Collaborator

Please run `pre-commit run` to check and format the code.

@bin913
Contributor Author

bin913 commented Dec 29, 2025

@bin913 Please sign the CLA in order for your PR to be considered as a merge candidate.

Thanks for the reminder, I have signed the CLA.


Collaborator

@kiddyjinjin kiddyjinjin left a comment


lgtm

@kiddyjinjin kiddyjinjin merged commit e678a50 into flagos-ai:master Dec 30, 2025
11 of 14 checks passed
nicelynice pushed a commit to nicelynice/FlagGems that referenced this pull request Feb 24, 2026