
【PerfXLab】Optimize Randn and randn_like #1247

Merged
kiddyjinjin merged 33 commits into flagos-ai:master from bin913:randn_randn_like
Dec 30, 2025

Conversation

@bin913 (Contributor) commented Dec 20, 2025

PR Category
[Operator]

Type of Change
[Performance Optimization]

Description
Optimize randn and randn_like.
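The PR description does not spell out the optimization itself. As background, GPU `randn` kernels typically turn counter-based uniform random bits into normal samples; a minimal pure-Python sketch of the Box-Muller transform commonly used for that step (illustrative only, not the kernel in this PR):

```python
import math

def box_muller(u1: float, u2: float):
    """Map two uniform samples in (0, 1] to two independent
    standard-normal samples via the classic Box-Muller transform."""
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

# With u1 = u2 = 0.5: r = sqrt(2 ln 2) ≈ 1.1774 and theta = pi,
# so z0 ≈ -1.1774 and z1 ≈ 0.
z0, z1 = box_muller(0.5, 0.5)
```

In a real kernel the uniforms come from a counter-based generator (e.g. Philox) so each output element can be produced independently in parallel.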

Issue

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by a UT.

Performance
test_tensor_constructor_perf.py::test_tensor_constructor_benchmark[randn-randn-generic_constructor_input_fn]
Operator: randn Performance Test (dtype=torch.float16, mode=kernel,level=core)
Status Torch Latency (ms) Gems Latency (ms) Gems Speedup Size Detail

SUCCESS 2.290912 1.666816 1.374 {'size': [1073741824], 'dtype': torch.float16, 'device': 'cuda'}
SUCCESS 0.043616 0.032896 1.326 {'size': [4096, 4096], 'dtype': torch.float16, 'device': 'cuda'}
SUCCESS 0.043712 0.033920 1.289 {'size': [64, 512, 512], 'dtype': torch.float16, 'device': 'cuda'}
SUCCESS 2.274112 1.653408 1.375 {'size': [1024, 1024, 1024], 'dtype': torch.float16, 'device': 'cuda'}
SUCCESS 0.005664 0.005632 1.006 {'size': [64, 64], 'dtype': torch.float16, 'device': 'cuda'}

Operator: randn Performance Test (dtype=torch.float32, mode=kernel,level=core)
Status Torch Latency (ms) Gems Latency (ms) Gems Speedup Size Detail

SUCCESS 2.276512 1.713344 1.329 {'size': [1073741824], 'dtype': torch.float32, 'device': 'cuda'}
SUCCESS 0.043392 0.032320 1.343 {'size': [4096, 4096], 'dtype': torch.float32, 'device': 'cuda'}
SUCCESS 0.043136 0.032384 1.332 {'size': [64, 512, 512], 'dtype': torch.float32, 'device': 'cuda'}
SUCCESS 2.259264 1.687392 1.339 {'size': [1024, 1024, 1024], 'dtype': torch.float32, 'device': 'cuda'}
SUCCESS 0.005856 0.005408 1.083 {'size': [64, 64], 'dtype': torch.float32, 'device': 'cuda'}

Operator: randn Performance Test (dtype=torch.bfloat16, mode=kernel,level=core)
Status Torch Latency (ms) Gems Latency (ms) Gems Speedup Size Detail

SUCCESS 2.273152 1.642272 1.384 {'size': [1073741824], 'dtype': torch.bfloat16, 'device': 'cuda'}
SUCCESS 0.043584 0.032256 1.351 {'size': [4096, 4096], 'dtype': torch.bfloat16, 'device': 'cuda'}
SUCCESS 0.043712 0.032128 1.361 {'size': [64, 512, 512], 'dtype': torch.bfloat16, 'device': 'cuda'}
SUCCESS 2.274240 1.641728 1.385 {'size': [1024, 1024, 1024], 'dtype': torch.bfloat16, 'device': 'cuda'}
SUCCESS 0.005856 0.005408 1.083 {'size': [64, 64], 'dtype': torch.bfloat16, 'device': 'cuda'}

test_tensor_constructor_perf.py::test_tensor_constructor_benchmark[randn_like-randn_like-unary_input_fn]
Operator: randn_like Performance Test (dtype=torch.float16, mode=kernel,level=core)
Status Torch Latency (ms) Gems Latency (ms) Gems Speedup Size Detail

SUCCESS 2.291104 1.667392 1.374 [torch.Size([1073741824])]
SUCCESS 0.044384 0.033152 1.339 [torch.Size([4096, 4096])]
SUCCESS 0.044160 0.032896 1.342 [torch.Size([64, 512, 512])]
SUCCESS 2.273856 1.654624 1.374 [torch.Size([1024, 1024, 1024])]
SUCCESS 0.005536 0.005344 1.036 [torch.Size([64, 64])]

Operator: randn_like Performance Test (dtype=torch.float32, mode=kernel,level=core)
Status Torch Latency (ms) Gems Latency (ms) Gems Speedup Size Detail

SUCCESS 2.260832 1.700096 1.330 [torch.Size([1073741824])]
SUCCESS 0.043168 0.032480 1.329 [torch.Size([4096, 4096])]
SUCCESS 0.043360 0.032480 1.335 [torch.Size([64, 512, 512])]
SUCCESS 2.260352 1.700224 1.329 [torch.Size([1024, 1024, 1024])]
SUCCESS 0.005696 0.005344 1.066 [torch.Size([64, 64])]

Operator: randn_like Performance Test (dtype=torch.bfloat16, mode=kernel,level=core)
Status Torch Latency (ms) Gems Latency (ms) Gems Speedup Size Detail

SUCCESS 2.275488 1.642208 1.386 [torch.Size([1073741824])]
SUCCESS 0.043552 0.032192 1.353 [torch.Size([4096, 4096])]
SUCCESS 0.043360 0.032160 1.348 [torch.Size([64, 512, 512])]
SUCCESS 2.275040 1.642432 1.385 [torch.Size([1024, 1024, 1024])]
SUCCESS 0.005536 0.005344 1.036 [torch.Size([64, 64])]
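For reference, the Gems Speedup column in the tables above is simply the ratio of the two measured latencies; checking the first float16 randn row:

```python
def speedup(torch_ms: float, gems_ms: float) -> float:
    """Speedup of the Gems kernel over the Torch baseline,
    as reported in the benchmark tables."""
    return torch_ms / gems_ms

# First float16 randn row: Torch 2.290912 ms, Gems 1.666816 ms
s = round(speedup(2.290912, 1.666816), 3)  # → 1.374
```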

bin913 and others added 30 commits December 3, 2025 11:22
… instance_norm kron linspace nll_loss" (#1143)

* [KUNLUNXIN] open threshold in flaggems benchmark for "index index_add instance_norm kron linspace nll_loss"

* [KUNLUNXIN] open threshold in flaggems benchmark for "index index_add instance_norm kron linspace nll_loss"

---------

Co-authored-by: xuerui <xuerui06@baidu.com>
Co-authored-by: zhaoyin <zhaoyin@zhaoyindeMacBook-Pro.local>
* moe_sum

* change format

* delete test_moe_sum

* update for build

* remove json

* add test

* add test to test_special_ops.py

* add benchmark to test_special_perf.py

* add @pytest.mark.moe_sum

---------

Co-authored-by: nmpress1 <1935298275@qq.com>
Co-authored-by: you-and-you <1823382186@qq.com>
* adaptation
    - mean
    - argmax
* optimize
    - max
    - min
    - all
    - any
    - arange
    - argmin
    - batch_norm
    - celu
    - gather
    - log
    - prod
* fix
    - addmm
    - index_put
* implement and test per-token-group fp8 op

---------

Co-authored-by: Ea760 <15236119052@163.com>
* Enhance test coverage for aten::index operator

- Fix AttributeError in index operator for mixed basic/advanced indexing
- Add comprehensive test cases for index operator
- Support combining advanced and basic indexing using Triton

Fixes #635

* Fix index_put logic inconsistency and precision issues

- Update get_max_rank_shape() and broadcast_indices() in index_put.py to support None values (consistent with index.py)
- Fix precision issue: create tensor_indices AFTER broadcast_indices to ensure using broadcasted tensors
- Add gen_indices_for_index_put() function in test_reduction_ops.py to properly handle multi-dimensional index shapes
- Update all index_put tests to use gen_indices_for_index_put()

This fixes the pipeline failures and ensures consistency between index and index_put operators.

* Fix code formatting issues (trailing whitespace and black formatting)

* Reduce test cases to prevent timeout

- Remove excessive test cases added to INDEX_ACC_SHAPE
- Keep only the original 8 test cases to match the baseline
- This should prevent CI timeout issues

* Add test cases to improve coverage for index and index_put operators

- Add test cases for None value handling in index operator
- Add test cases for non-contiguous subspace (transpose logic)
- Add test cases for boolean mask indexing
- Add test cases for error handling paths
- Add test cases for edge cases (empty tensor, all None, 1D special case)
- Add error handling tests for index_put operators

Total: 10 new test cases covering critical code paths to improve coverage from 70.8% to target >=90%

* Fix black formatting for long lines in test cases

* Fix failing test cases: remove unsupported scenarios

- Remove test_index_all_none: PyTorch doesn't support all-None indices
- Simplify test_index_with_none_basic_indexing: keep only working parameter combinations
- Remove test_index_non_contiguous_subspace: implementation issue

All remaining test cases now pass successfully (8/8 passed)

* Fix formatting: remove extra blank lines (flake8 and black)
* enable index

* replace autotuner to libtuner for index_put
* add mm configs

* tma gemm kernel

* adjust the location of mm with tma implementation

* fix ci

* Update imports based on Triton version

Signed-off-by: Galaxy1458 <55453380+Galaxy1458@users.noreply.github.com>

* Update import statement for mm with noqa comment

Signed-off-by: Galaxy1458 <55453380+Galaxy1458@users.noreply.github.com>

---------

Signed-off-by: Galaxy1458 <55453380+Galaxy1458@users.noreply.github.com>
Co-authored-by: Galaxy1458 <55453380+Galaxy1458@users.noreply.github.com>
* [KUNLUNXIN] skip lerp/lerp_ test on torch 2.0 with kunlunxin

The half dtype is unsupported on torch < 2.5, so we have to skip that for now

* [KUNLUNXIN] skip lerp/lerp_ test on torch 2.0 with kunlunxin

The half dtype is unsupported on torch < 2.5, so we have to skip that for now
…big shapes (#1158)

Co-authored-by: xuerui <xuerui06@baidu.com>
* [KUNLUNXIN] use manual_seed rwkv_mm_sparsity

* [KUNLUNXIN] tl.load use fp32 to update precision
* [KUNLUNXIN] Passin in_h in max_pool2d_backward_kernel

Signed-off-by: wangrun06 <wangrun06@baidu.com>

* [KUNLUNXIN] Turn on attention tests

Signed-off-by: wangrun06 <wangrun06@baidu.com>

---------

Signed-off-by: wangrun06 <wangrun06@baidu.com>
Co-authored-by: wangrun06 <wangrun06@baidu.com>
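Several commits above make get_max_rank_shape()/broadcast_indices() tolerate None entries in the index list. A minimal pure-Python sketch of that shape-broadcasting idea (the helper names and behavior are assumed from the commit messages, not taken from the actual FlagGems implementation):

```python
import itertools

def broadcast_shapes(*shapes):
    """NumPy-style broadcast of shapes, aligned from the right."""
    out = []
    for dims in itertools.zip_longest(*(reversed(s) for s in shapes), fillvalue=1):
        sizes = {d for d in dims if d != 1}
        if len(sizes) > 1:
            raise ValueError(f"incompatible dims: {dims}")
        out.append(sizes.pop() if sizes else 1)
    return tuple(reversed(out))

def get_max_rank_shape(index_shapes):
    """Broadcast shape over tensor-index shapes; a None entry marks a
    basic-indexing slot (e.g. a slice) and is skipped, as the commits
    describe for index/index_put consistency."""
    return broadcast_shapes(*(s for s in index_shapes if s is not None))

# A (3,1) index and a (1,4) index broadcast to (3,4); the None slot is ignored.
shape = get_max_rank_shape([(3, 1), None, (1, 4)])  # → (3, 4)
```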
@CLAassistant

CLAassistant commented Dec 20, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
15 out of 16 committers have signed the CLA.

✅ bin913
✅ xuexingtu
✅ fantasy666
✅ AdvancedCompiler
✅ factnn
✅ Kylin1207
✅ botbigeyes
✅ kiddyjinjin
✅ 0x45f
✅ Ason93
✅ dongjibin1996
✅ ylyzty
✅ purerli98
✅ Galaxy1458
✅ zhoubo567
❌ mikiya1991
You have signed the CLA already but the status is still pending? Let us recheck it.

Signed-off-by: bin913 <842884726@qq.com>
unsqueeze_tensor,
unsqueeze_tuple,
)
from .conftest import TO_CPU
Contributor
Why this change?

Contributor Author

This was introduced by a merge conflict; I have removed it in the latest commit.

@tengqm
Contributor

tengqm commented Dec 28, 2025

@bin913 Please sign the CLA in order for your PR to be considered as a merge candidate.

@huangyiqun
Collaborator

Please run `pre-commit run` to check and format the code.

@bin913
Contributor Author

bin913 commented Dec 29, 2025

@bin913 Please sign the CLA in order for your PR to be considered as a merge candidate.

Thanks for the reminder, I have signed the CLA.


Collaborator

@kiddyjinjin kiddyjinjin left a comment


lgtm

@kiddyjinjin kiddyjinjin merged commit e678a50 into flagos-ai:master Dec 30, 2025
11 of 14 checks passed
nicelynice pushed a commit to nicelynice/FlagGems that referenced this pull request Feb 24, 2026