forked from llvm/llvm-project
Update GetElementPtr.rst #1
Closed
moar55
pushed a commit
that referenced
this pull request
Feb 19, 2025
For function declarations (i.e. the func op has no entry block), the FunctionOpInterface methods `insertArgument` and `eraseArgument` cause a segfault. This PR guards against manipulating an empty entry block by checking whether the func op is external.

An example can be seen in google/heir#1324.

The segfault trace:
```
#1 0x0000560f1289d9db PrintStackTraceSignalHandler(void*) /proc/self/cwd/external/llvm-project/llvm/lib/Support/Unix/Signals.inc:874:1
#2 0x0000560f1289b116 llvm::sys::RunSignalHandlers() /proc/self/cwd/external/llvm-project/llvm/lib/Support/Signals.cpp:105:5
#3 0x0000560f1289e145 SignalHandler(int) /proc/self/cwd/external/llvm-project/llvm/lib/Support/Unix/Signals.inc:415:1
#4 0x00007f829a3d9520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
#5 0x0000560f1257f8bc void __gnu_cxx::new_allocator<mlir::BlockArgument>::construct<mlir::BlockArgument, mlir::BlockArgument>(mlir::BlockArgument*, mlir::BlockArgument&&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/new_allocator.h:162:23
#6 0x0000560f1257f84d void std::allocator_traits<std::allocator<mlir::BlockArgument> >::construct<mlir::BlockArgument, mlir::BlockArgument>(std::allocator<mlir::BlockArgument>&, mlir::BlockArgument*, mlir::BlockArgument&&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/alloc_traits.h:520:2
#7 0x0000560f12580498 void std::vector<mlir::BlockArgument, std::allocator<mlir::BlockArgument> >::_M_insert_aux<mlir::BlockArgument>(__gnu_cxx::__normal_iterator<mlir::BlockArgument*, std::vector<mlir::BlockArgument, std::allocator<mlir::BlockArgument> > >, mlir::BlockArgument&&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/vector.tcc:405:7
#8 0x0000560f1257cf7e std::vector<mlir::BlockArgument, std::allocator<mlir::BlockArgument> >::insert(__gnu_cxx::__normal_iterator<mlir::BlockArgument const*, std::vector<mlir::BlockArgument, std::allocator<mlir::BlockArgument> > >, mlir::BlockArgument const&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/vector.tcc:154:6
#9 0x0000560f1257b349 mlir::Block::insertArgument(unsigned int, mlir::Type, mlir::Location) /proc/self/cwd/external/llvm-project/mlir/lib/IR/Block.cpp:178:13
#10 0x0000560f123d2a1c mlir::function_interface_impl::insertFunctionArguments(mlir::FunctionOpInterface, llvm::ArrayRef<unsigned int>, mlir::TypeRange, llvm::ArrayRef<mlir::DictionaryAttr>, llvm::ArrayRef<mlir::Location>, unsigned int, mlir::Type) /proc/self/cwd/external/llvm-project/mlir/lib/Interfaces/FunctionInterfaces.cpp:232:11
#11 0x0000560f0be6b727 mlir::detail::FunctionOpInterfaceTrait<mlir::func::FuncOp>::insertArguments(llvm::ArrayRef<unsigned int>, mlir::TypeRange, llvm::ArrayRef<mlir::DictionaryAttr>, llvm::ArrayRef<mlir::Location>) /proc/self/cwd/bazel-out/k8-dbg/bin/external/llvm-project/mlir/include/mlir/Interfaces/FunctionInterfaces.h.inc:809:7
#12 0x0000560f0be6b536 mlir::detail::FunctionOpInterfaceTrait<mlir::func::FuncOp>::insertArgument(unsigned int, mlir::Type, mlir::DictionaryAttr, mlir::Location) /proc/self/cwd/bazel-out/k8-dbg/bin/external/llvm-project/mlir/include/mlir/Interfaces/FunctionInterfaces.h.inc:796:7
```
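The guard described in this commit message can be illustrated with a toy model. The sketch below is plain Python, not MLIR: the `FuncOp` class, its `arg_types` and `entry_block_args` fields, and `is_external` are hypothetical stand-ins chosen for illustration. It only shows the idea that a declaration has a signature but no entry block, so argument insertion and erasure should update the signature and leave the missing block alone.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FuncOp:
    """Toy stand-in for a function op (hypothetical, not the MLIR class)."""

    arg_types: List[str] = field(default_factory=list)
    # None models a declaration: there is no entry block to hold arguments.
    entry_block_args: Optional[List[str]] = None

    def is_external(self) -> bool:
        return self.entry_block_args is None

    def insert_argument(self, index: int, arg_type: str) -> None:
        # The signature is always updated.
        self.arg_types.insert(index, arg_type)
        # Guard: only touch the entry block when the function has a body.
        if not self.is_external():
            self.entry_block_args.insert(index, arg_type)

    def erase_argument(self, index: int) -> None:
        del self.arg_types[index]
        if not self.is_external():
            del self.entry_block_args[index]


# A declaration: signature only, no entry block.
declaration = FuncOp(arg_types=["i32"])
declaration.insert_argument(0, "f32")

# A definition: the signature and the entry block stay in sync.
definition = FuncOp(arg_types=["i32"], entry_block_args=["i32"])
definition.insert_argument(1, "index")
definition.erase_argument(0)
```

Without the `is_external` check, the declaration case would try to modify a block that does not exist, which is the situation the trace above records.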
moar55
pushed a commit
that referenced
this pull request
Nov 23, 2025
## Summary
Fix `FindProcesses` to respect Android's `hidepid=2` security model and
enable name matching for Android apps.
## Problem
1. Called `adb shell pidof` or `adb shell ps` directly, bypassing
Android's process visibility restrictions
2. Name matching failed for Android apps - searched for
`com.example.myapp` but GDB Remote Protocol reports `app_process64`
Android apps fork from Zygote, so `/proc/PID/exe` points to
`app_process64` for all apps. The actual package name is only in
`/proc/PID/cmdline`. The previous implementation applied name filters
without supplementing with cmdline, so searches failed.
## Fix
- Delegate to lldb-server via GDB Remote Protocol (respects `hidepid=2`)
- Get all visible processes, supplement zygote/app_process entries with
cmdline, then apply name matching
- Only fetch cmdline for zygote apps (performance), parallelize with
`xargs -P 8`
- Remove redundant code (GDB Remote Protocol already provides GID/arch)
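The "supplement, then match" order in the fix list can be sketched as follows. This is a minimal Python illustration, not the LLDB implementation: the `ZYGOTE_NAMES` set, the dictionary layout, and the helper names `supplement_names` and `find_processes` are assumptions made for the example, and the sample data is modeled on the listing in the test results.

```python
from typing import Dict, List

# Executable names treated as Zygote-forked (an assumed set for this sketch).
ZYGOTE_NAMES = {"app_process", "app_process32", "app_process64"}


def supplement_names(processes: List[Dict], cmdlines: Dict[int, str]) -> List[Dict]:
    """Replace the name of zygote-style entries with the first cmdline argument."""
    result = []
    for proc in processes:
        proc = dict(proc)  # copy so the input list is left unchanged
        if proc["name"] in ZYGOTE_NAMES:
            cmdline = cmdlines.get(proc["pid"], "")
            # /proc/PID/cmdline separates arguments with NUL bytes.
            first_arg = cmdline.split("\0", 1)[0]
            if first_arg:
                proc["name"] = first_arg
        result.append(proc)
    return result


def find_processes(processes: List[Dict], cmdlines: Dict[int, str], name: str) -> List[Dict]:
    """Supplement names first, then apply an exact name match."""
    return [p for p in supplement_names(processes, cmdlines) if p["name"] == name]


visible = [
    {"pid": 359, "name": "app_process64"},
    {"pid": 5276, "name": "app_process64"},
    {"pid": 5357, "name": "sh"},
]
# Hypothetical cmdline contents; only the app process has one in this example.
cmdlines = {5276: "com.example.hellojni\0"}
matches = find_processes(visible, cmdlines, "com.example.hellojni")
```

Applying the filter before the supplement step would compare `com.example.hellojni` against `app_process64` and find nothing, which is the failure mode described under Problem.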
## Test Results
### Before this fix:
```
(lldb) platform process list
error: no processes were found on the "remote-android" platform
(lldb) platform process list -n com.example.hellojni
1 matching process was found on "remote-android"
PID PARENT USER TRIPLE NAME
====== ====== ========== ============================== ============================
5276 359 u0_a192 com.example.hellojni
^^^^^^^^ Missing triple!
```
### After this fix:
```
(lldb) platform process list
PID PARENT USER TRIPLE NAME
====== ====== ========== ============================== ============================
1 0 root aarch64-unknown-linux-android init
2 0 root [kthreadd]
359 1 system aarch64-unknown-linux-android app_process64
5276 359 u0_a192 aarch64-unknown-linux-android com.example.hellojni
5357 5355 u0_a192 aarch64-unknown-linux-android sh
5377 5370 u0_a192 aarch64-unknown-linux-android lldb-server
^^^^^^^^ User-space processes now have triples!
(lldb) platform process list -n com.example.hellojni
1 matching process was found on "remote-android"
PID PARENT USER TRIPLE NAME
====== ====== ========== ============================== ============================
5276 359 u0_a192 aarch64-unknown-linux-android com.example.hellojni
(lldb) process attach -n com.example.hellojni
Process 5276 stopped
* thread #1, name = 'example.hellojni', stop reason = signal SIGSTOP
```
## Test Plan
With an Android device/emulator connected:
1. Start lldb-server on device:
```bash
adb push lldb-server /data/local/tmp/
adb shell chmod +x /data/local/tmp/lldb-server
adb shell /data/local/tmp/lldb-server platform --listen 127.0.0.1:9500 --server
```
2. Connect from LLDB:
```
(lldb) platform select remote-android
(lldb) platform connect connect://127.0.0.1:9500
(lldb) platform process list
```
3. Verify:
- `platform process list` returns all processes with triple information
- `platform process list -n com.example.app` finds Android apps by
package name
- `process attach -n com.example.app` successfully attaches to Android
apps
## Impact
Restores `platform process list` on Android with architecture
information and package name lookup. All name matching modes now work
correctly.
Fixes llvm#164192
moar55
pushed a commit
that referenced
this pull request
Nov 23, 2025
…am (llvm#167724) This got exposed by `09262656f32ab3f2e1d82e5342ba37eecac52522`. The underlying stream of `m_os` is referenced by the `TextDiagnostic` member of `TextDiagnosticPrinter`. It got turned into a `llvm::formatted_raw_ostream` in the commit above. When `~TextDiagnosticPrinter` (and thus `~TextDiagnostic`) is invoked, we now call `~formatted_raw_ostream`, which tries to access the underlying stream. But `m_os` was already deleted because it is earlier in the order of destruction in `TextDiagnosticPrinter`. Move the `m_os` member before the `TextDiagnosticPrinter` to avoid a use-after-free.

Drive-by:
* Also move the `m_output` member which the `m_os` holds a reference to. The fact it's a reference indicates the expectation is most likely that the string outlives the stream.

The ASAN macOS bot is currently failing with this:
```
08:15:39 =================================================================
08:15:39 ==61103==ERROR: AddressSanitizer: heap-use-after-free on address 0x60600012cf40 at pc 0x00012140d304 bp 0x00016eecc850 sp 0x00016eecc848
08:15:39 READ of size 8 at 0x60600012cf40 thread T0
08:15:39     #0 0x00012140d300 in llvm::formatted_raw_ostream::releaseStream() FormattedStream.h:205
08:15:39     #1 0x00012140d3a4 in llvm::formatted_raw_ostream::~formatted_raw_ostream() FormattedStream.h:145
08:15:39     #2 0x00012604abf8 in clang::TextDiagnostic::~TextDiagnostic() TextDiagnostic.cpp:721
08:15:39     #3 0x00012605dc80 in clang::TextDiagnosticPrinter::~TextDiagnosticPrinter() TextDiagnosticPrinter.cpp:30
08:15:39     #4 0x00012605dd5c in clang::TextDiagnosticPrinter::~TextDiagnosticPrinter() TextDiagnosticPrinter.cpp:27
08:15:39     #5 0x0001231fb210 in (anonymous namespace)::StoringDiagnosticConsumer::~StoringDiagnosticConsumer() ClangModulesDeclVendor.cpp:47
08:15:39     #6 0x0001231fb3bc in (anonymous namespace)::StoringDiagnosticConsumer::~StoringDiagnosticConsumer() ClangModulesDeclVendor.cpp:47
08:15:39     #7 0x000129aa9d70 in clang::DiagnosticsEngine::~DiagnosticsEngine() Diagnostic.cpp:91
08:15:39     #8 0x0001230436b8 in llvm::RefCountedBase<clang::DiagnosticsEngine>::Release() const IntrusiveRefCntPtr.h:103
08:15:39     #9 0x0001231fe6c8 in (anonymous namespace)::ClangModulesDeclVendorImpl::~ClangModulesDeclVendorImpl() ClangModulesDeclVendor.cpp:93
08:15:39     #10 0x0001231fe858 in (anonymous namespace)::ClangModulesDeclVendorImpl::~ClangModulesDeclVendorImpl() ClangModulesDeclVendor.cpp:93
...
08:15:39
08:15:39 0x60600012cf40 is located 32 bytes inside of 56-byte region [0x60600012cf20,0x60600012cf58)
08:15:39 freed by thread T0 here:
08:15:39     #0 0x0001018abb88 in _ZdlPv+0x74 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x4bb88)
08:15:39     #1 0x0001231fb1c0 in (anonymous namespace)::StoringDiagnosticConsumer::~StoringDiagnosticConsumer() ClangModulesDeclVendor.cpp:47
08:15:39     #2 0x0001231fb3bc in (anonymous namespace)::StoringDiagnosticConsumer::~StoringDiagnosticConsumer() ClangModulesDeclVendor.cpp:47
08:15:39     #3 0x000129aa9d70 in clang::DiagnosticsEngine::~DiagnosticsEngine() Diagnostic.cpp:91
08:15:39     #4 0x0001230436b8 in llvm::RefCountedBase<clang::DiagnosticsEngine>::Release() const IntrusiveRefCntPtr.h:103
08:15:39     #5 0x0001231fe6c8 in (anonymous namespace)::ClangModulesDeclVendorImpl::~ClangModulesDeclVendorImpl() ClangModulesDeclVendor.cpp:93
08:15:39     #6 0x0001231fe858 in (anonymous namespace)::ClangModulesDeclVendorImpl::~ClangModulesDeclVendorImpl() ClangModulesDeclVendor.cpp:93
...
08:15:39
08:15:39 previously allocated by thread T0 here:
08:15:39     #0 0x0001018ab760 in _Znwm+0x74 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x4b760)
08:15:39     #1 0x0001231f8dec in lldb_private::ClangModulesDeclVendor::Create(lldb_private::Target&) ClangModulesDeclVendor.cpp:732
08:15:39     #2 0x00012320af58 in lldb_private::ClangPersistentVariables::GetClangModulesDeclVendor() ClangPersistentVariables.cpp:124
08:15:39     #3 0x0001232111f0 in lldb_private::ClangUserExpression::PrepareForParsing(lldb_private::DiagnosticManager&, lldb_private::ExecutionContext&, bool) ClangUserExpression.cpp:536
08:15:39     #4 0x000123213790 in lldb_private::ClangUserExpression::Parse(lldb_private::DiagnosticManager&, lldb_private::ExecutionContext&, lldb_private::ExecutionPolicy, bool, bool) ClangUserExpression.cpp:647
08:15:39     #5 0x00012032b258 in lldb_private::UserExpression::Evaluate(lldb_private::ExecutionContext&, lldb_private::EvaluateExpressionOptions const&, llvm::StringRef, llvm::StringRef, std::__1::shared_ptr<lldb_private::ValueObject>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, lldb_private::ValueObject*) UserExpression.cpp:280
08:15:39     #6 0x000120724010 in lldb_private::Target::EvaluateExpression(llvm::StringRef, lldb_private::ExecutionContextScope*, std::__1::shared_ptr<lldb_private::ValueObject>&, lldb_private::EvaluateExpressionOptions const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, lldb_private::ValueObject*) Target.cpp:2905
08:15:39     #7 0x00011fc7bde0 in lldb::SBTarget::EvaluateExpression(char const*, lldb::SBExpressionOptions const&) SBTarget.cpp:2305
08:15:39 ==61103==ABORTING
...
```
moar55
pushed a commit
that referenced
this pull request
Nov 23, 2025
llvm#168105) …63019)" This reverts commit 92e5608.
moar55
pushed a commit
that referenced
this pull request
Nov 23, 2025
llvm#168619) I've been working on some scripts that evaluate the parent and child frames. It has been annoying that a frame has a `parent` property but no `child` property, so I've added one to the extensions. I would have preferred to return None, but the existing implementation returns an invalid SBFrame, so I'm conforming to that API.
```
(lldb) script
Python Interactive Interpreter. To exit, type 'quit()', 'exit()' or Ctrl-D.
>>> lldb.frame
frame #0: 0x0000555555555200 fib.out`main
>>> lldb.frame.parent
frame #1: 0x00007ffff782a610 libc.so.6`__libc_start_call_main + 128
>>> lldb.frame.parent.child
frame #0: 0x0000555555555200 fib.out`main
```
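The API choice mentioned in this commit message (return an invalid frame rather than None) can be sketched with a toy class. The `Frame` class below is hypothetical and does not use the LLDB Python module; it only models a stack where index 0 is the innermost frame, as in the transcript.

```python
from typing import List


class Frame:
    """Toy frame (hypothetical); index 0 is the innermost frame."""

    def __init__(self, stack: List[str], index: int):
        self._stack = stack
        self._index = index

    def is_valid(self) -> bool:
        return 0 <= self._index < len(self._stack)

    @property
    def name(self) -> str:
        return self._stack[self._index] if self.is_valid() else "<invalid>"

    def _at(self, index: int) -> "Frame":
        # Navigating from an invalid frame stays invalid instead of raising.
        if not self.is_valid():
            return Frame(self._stack, -1)
        return Frame(self._stack, index)

    @property
    def parent(self) -> "Frame":
        # The caller sits one index higher.
        return self._at(self._index + 1)

    @property
    def child(self) -> "Frame":
        # The callee sits one index lower; below frame 0 this is an
        # invalid frame object rather than None.
        return self._at(self._index - 1)


stack = ["main", "__libc_start_call_main"]
frame = Frame(stack, 0)
```

Returning an invalid object keeps chained accesses such as `frame.child.parent` from raising an attribute error, at the cost of requiring callers to check `is_valid()`.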
moar55
pushed a commit
that referenced
this pull request
Feb 2, 2026
In this PR i move the insertion point in the `yieldReplacementForFusedProducer` because i ran into some issue where a `tensor.extract_slices` tried to use a result of `affine.apply` that was inserted at the end of the block instead of the start of it. This is the full error of the test i added before this change: ```mlir third-party/llvm-project/mlir/test/Interfaces/TilingInterface/tile-fuse-and-yield-using-scfforall.mlir:83:11: error: operand #1 does not dominate this use %pack = linalg.pack %gen#1 ^ third-party/llvm-project/mlir/test/Interfaces/TilingInterface/tile-fuse-and-yield-using-scfforall.mlir:83:11: note: see current operation: %24 = "tensor.extract_slice"(%23, %36, %8) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32> third-party/llvm-project/mlir/test/Interfaces/TilingInterface/tile-fuse-and-yield-using-scfforall.mlir:71:12: note: operand defined here (op in the same block) %gen:2 = linalg.generic { ^ // -----// IR Dump After InterpreterPass Failed (transform-interpreter) //----- // #map = affine_map<(d0, d1) -> (d0, d1)> #map1 = affine_map<(d0) -> (d0 * 16)> #map2 = affine_map<(d0) -> (d0 * -16 + 32)> #map3 = affine_map<(d0) -> (16, d0 * -16 + 32)> #map4 = affine_map<(d0) -> (d0 - 1)> "builtin.module"() ({ "func.func"() <{function_type = (tensor<32x1024xf32>) -> (tensor<32x1024xf32>, tensor<2x512x16x2xi8>), sym_name = "fuse_pack_consumer_into_multi_output_generic"}> ({ ^bb0(%arg1: tensor<32x1024xf32>): %2 = "arith.constant"() <{value = 0 : i8}> : () -> i8 %3 = "tensor.empty"() : () -> tensor<32x1024xf32> %4 = "tensor.empty"() : () -> tensor<32x1024xi8> %5 = "tensor.empty"() : () -> tensor<2x512x16x2xi8> %6:2 = "linalg.generic"(%arg1, %3, %4) <{indexing_maps = [#map, #map, #map], iterator_types = [#linalg.iterator_type<parallel>, 
#linalg.iterator_type<parallel>], operandSegmentSizes = array<i32: 1, 2>}> ({ ^bb0(%arg9: f32, %arg10: f32, %arg11: i8): %41 = "arith.fptoui"(%arg9) : (f32) -> i8 "linalg.yield"(%arg9, %41) : (f32, i8) -> () }) : (tensor<32x1024xf32>, tensor<32x1024xf32>, tensor<32x1024xi8>) -> (tensor<32x1024xf32>, tensor<32x1024xi8>) %7:3 = "scf.forall"(%5, %3, %4) <{operandSegmentSizes = array<i32: 0, 0, 0, 3>, staticLowerBound = array<i64: 0>, staticStep = array<i64: 1>, staticUpperBound = array<i64: 2>}> ({ ^bb0(%arg2: index, %arg3: tensor<2x512x16x2xi8>, %arg4: tensor<32x1024xf32>, %arg5: tensor<32x1024xi8>): %8 = "affine.apply"(%arg2) <{map = #map1}> : (index) -> index %9 = "affine.apply"(%arg2) <{map = #map2}> : (index) -> index %10 = "affine.min"(%arg2) <{map = #map3}> : (index) -> index %11 = "affine.apply"(%10) <{map = #map4}> : (index) -> index %12 = "affine.apply"(%arg2) <{map = #map1}> : (index) -> index %13 = "affine.apply"(%10) <{map = #map4}> : (index) -> index %14 = "affine.apply"(%arg2) <{map = #map1}> : (index) -> index %15 = "affine.apply"(%10) <{map = #map4}> : (index) -> index %16 = "affine.apply"(%arg2) <{map = #map1}> : (index) -> index %17 = "affine.apply"(%10) <{map = #map4}> : (index) -> index %18 = "tensor.extract_slice"(%arg1, %12, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32> %19 = "tensor.empty"() : () -> tensor<32x1024xf32> %20 = "tensor.extract_slice"(%19, %14, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32> %21 = "tensor.extract_slice"(%3, %14, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: 
-9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32> %22 = "tensor.empty"() : () -> tensor<32x1024xi8> %23 = "tensor.extract_slice"(%22, %16, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xi8>, index, index) -> tensor<?x1024xi8> %24 = "tensor.extract_slice"(%4, %16, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xi8>, index, index) -> tensor<?x1024xi8> %25 = "tensor.empty"() : () -> tensor<32x1024xf32> %26 = "tensor.extract_slice"(%25, %38, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32> %27 = "tensor.extract_slice"(%arg4, %38, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32> %28 = "tensor.empty"() : () -> tensor<32x1024xi8> %29 = "tensor.extract_slice"(%28, %8, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xi8>, index, index) -> tensor<?x1024xi8> %30 = "tensor.extract_slice"(%arg5, %8, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: 
-9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xi8>, index, index) -> tensor<?x1024xi8> %31:2 = "linalg.generic"(%18, %27, %30) <{indexing_maps = [#map, #map, #map], iterator_types = [#linalg.iterator_type<parallel>, #linalg.iterator_type<parallel>], operandSegmentSizes = array<i32: 1, 2>}> ({ ^bb0(%arg6: f32, %arg7: f32, %arg8: i8): %40 = "arith.fptoui"(%arg6) : (f32) -> i8 "linalg.yield"(%arg6, %40) : (f32, i8) -> () }) : (tensor<?x1024xf32>, tensor<?x1024xf32>, tensor<?x1024xi8>) -> (tensor<?x1024xf32>, tensor<?x1024xi8>) %32 = "tensor.extract_slice"(%6#1, %8, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xi8>, index, index) -> tensor<?x1024xi8> %33 = "tensor.empty"() : () -> tensor<2x512x16x2xi8> %34 = "tensor.extract_slice"(%33, %arg2) <{operandSegmentSizes = array<i32: 1, 1, 0, 0>, static_offsets = array<i64: -9223372036854775808, 0, 0, 0>, static_sizes = array<i64: 1, 512, 16, 2>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<2x512x16x2xi8>, index) -> tensor<1x512x16x2xi8> %35 = "tensor.extract_slice"(%arg3, %arg2) <{operandSegmentSizes = array<i32: 1, 1, 0, 0>, static_offsets = array<i64: -9223372036854775808, 0, 0, 0>, static_sizes = array<i64: 1, 512, 16, 2>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<2x512x16x2xi8>, index) -> tensor<1x512x16x2xi8> %36 = "linalg.pack"(%31#1, %35, %2) <{inner_dims_pos = array<i64: 0, 1>, operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_inner_tiles = array<i64: 16, 2>}> : (tensor<?x1024xi8>, tensor<1x512x16x2xi8>, i8) -> tensor<1x512x16x2xi8> %37 = "affine.apply"(%10) <{map = #map4}> : (index) -> index %38 = "affine.apply"(%arg2) <{map = #map1}> : (index) -> index %39 = "affine.apply"(%10) <{map = #map4}> : (index) -> index "scf.forall.in_parallel"() ({ "tensor.parallel_insert_slice"(%36, %arg3, %arg2) 
<{operandSegmentSizes = array<i32: 1, 1, 1, 0, 0>, static_offsets = array<i64: -9223372036854775808, 0, 0, 0>, static_sizes = array<i64: 1, 512, 16, 2>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<1x512x16x2xi8>, tensor<2x512x16x2xi8>, index) -> () "tensor.parallel_insert_slice"(%31#0, %arg4, %38, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<?x1024xf32>, tensor<32x1024xf32>, index, index) -> () "tensor.parallel_insert_slice"(%31#1, %arg5, %8, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<?x1024xi8>, tensor<32x1024xi8>, index, index) -> () }) : () -> () }) : (tensor<2x512x16x2xi8>, tensor<32x1024xf32>, tensor<32x1024xi8>) -> (tensor<2x512x16x2xi8>, tensor<32x1024xf32>, tensor<32x1024xi8>) "func.return"(%7#1, %7#0) : (tensor<32x1024xf32>, tensor<2x512x16x2xi8>) -> () }) : () -> () "builtin.module"() ({ "transform.named_sequence"() <{arg_attrs = [{transform.readonly}], function_type = (!transform.any_op) -> (), sym_name = "__transform_main"}> ({ ^bb0(%arg0: !transform.any_op): %0 = "transform.structured.match"(%arg0) <{ops = ["linalg.pack"]}> : (!transform.any_op) -> !transform.any_op %1:2 = "transform.test.fuse_and_yield"(%0) <{tile_interchange = [], tile_sizes = [1], use_forall = true}> : (!transform.any_op) -> (!transform.any_op, !transform.any_op) "transform.yield"() : () -> () }) : () -> () }) {transform.with_named_sequence} : () -> () }) : () -> () ``` I also noticed that Interface tests are missing from the bazel overlay so i also added this.
moar55
pushed a commit
that referenced
this pull request
Feb 3, 2026
…m#167446) Add SVE optimization for AArch64 architectures. The idea is to use predicate registers to avoid branching. Microbench in repo shows considerable improvements on NV GB10 (locked on largest X925):
```
======================================================================
BENCHMARK STATISTICS (time in nanoseconds)
======================================================================
memcpy_Google_A:
  Old - Mean: 3.1257 ns, Median: 3.1162 ns
  New - Mean: 2.8402 ns, Median: 2.8265 ns
  Improvement: +9.14% (mean), +9.30% (median)
memcpy_Google_B:
  Old - Mean: 2.3171 ns, Median: 2.3159 ns
  New - Mean: 1.6589 ns, Median: 1.6593 ns
  Improvement: +28.40% (mean), +28.35% (median)
memcpy_Google_D:
  Old - Mean: 8.7602 ns, Median: 8.7645 ns
  New - Mean: 8.4307 ns, Median: 8.4308 ns
  Improvement: +3.76% (mean), +3.81% (median)
memcpy_Google_L:
  Old - Mean: 1.7137 ns, Median: 1.7091 ns
  New - Mean: 1.4530 ns, Median: 1.4553 ns
  Improvement: +15.22% (mean), +14.85% (median)
memcpy_Google_M:
  Old - Mean: 1.9823 ns, Median: 1.9825 ns
  New - Mean: 1.4826 ns, Median: 1.4840 ns
  Improvement: +25.20% (mean), +25.15% (median)
memcpy_Google_Q:
  Old - Mean: 1.6812 ns, Median: 1.6784 ns
  New - Mean: 1.1538 ns, Median: 1.1517 ns
  Improvement: +31.37% (mean), +31.38% (median)
memcpy_Google_S:
  Old - Mean: 2.1816 ns, Median: 2.1786 ns
  New - Mean: 1.6297 ns, Median: 1.6287 ns
  Improvement: +25.29% (mean), +25.24% (median)
memcpy_Google_U:
  Old - Mean: 2.2851 ns, Median: 2.2825 ns
  New - Mean: 1.7219 ns, Median: 1.7187 ns
  Improvement: +24.65% (mean), +24.70% (median)
memcpy_Google_W:
  Old - Mean: 2.0408 ns, Median: 2.0361 ns
  New - Mean: 1.5260 ns, Median: 1.5252 ns
  Improvement: +25.23% (mean), +25.09% (median)
uniform_384_to_4096:
  Old - Mean: 26.9067 ns, Median: 26.8845 ns
  New - Mean: 26.8083 ns, Median: 26.8149 ns
  Improvement: +0.37% (mean), +0.26% (median)
```
The beginning of the memcpy function looks like the following:
```
Dump of assembler code for function _ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm:
   0x0000000000001340 <+0>:  cbz x2, 0x143c <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+252>
   0x0000000000001344 <+4>:  cbz x0, 0x1440 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+256>
   0x0000000000001348 <+8>:  cbz x1, 0x1444 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+260>
   0x000000000000134c <+12>: subs x8, x2, #0x20
   0x0000000000001350 <+16>: b.hi 0x1374 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+52>  // b.pmore
   0x0000000000001354 <+20>: rdvl x8, #1
   0x0000000000001358 <+24>: whilelo p0.b, xzr, x2
   0x000000000000135c <+28>: ld1b {z0.b}, p0/z, [x1]
   0x0000000000001360 <+32>: whilelo p1.b, x8, x2
   0x0000000000001364 <+36>: ld1b {z1.b}, p1/z, [x1, #1, mul vl]
   0x0000000000001368 <+40>: st1b {z0.b}, p0, [x0]
   0x000000000000136c <+44>: st1b {z1.b}, p1, [x0, #1, mul vl]
   0x0000000000001370 <+48>: ret
```
---------
Co-authored-by: Guillaume Chatelet <chatelet.guillaume@gmail.com>
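The predicate idea behind the small-size path in the disassembly above can be modeled in a few lines. This is a Python sketch of the concept only, based on one reading of the listing: two predicated vector loads and stores cover sizes up to two vector lengths, with `whilelo` producing the lane masks. The vector length `VL = 16` and the function names are assumptions for illustration; real SVE hardware may use a larger vector length, and the libc code is not written this way.

```python
VL = 16  # assumed vector length in bytes for this model


def whilelo(start: int, limit: int, lanes: int = VL) -> list:
    """Model of a WHILELO-style predicate: lane i is active while start + i < limit."""
    return [start + i < limit for i in range(lanes)]


def predicated_copy(dst: bytearray, src: bytes, n: int) -> None:
    """Copy n <= 2 * VL bytes with two masked vector moves."""
    assert n <= 2 * VL
    for base in (0, VL):
        mask = whilelo(base, n)
        for lane, active in enumerate(mask):
            # Models per-lane predication: inactive lanes are neither
            # loaded nor stored, so no branch on the exact size is needed.
            if active:
                dst[base + lane] = src[base + lane]


src = bytes(range(40))
dst = bytearray(40)
predicated_copy(dst, src, 21)
```

With `n = 21`, the first mask is fully active and the second has only its first five lanes active, so exactly 21 bytes are written and the rest of `dst` is untouched.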
moar55
pushed a commit
that referenced
this pull request
Feb 6, 2026
…8306) In FreeBSD, allproc is a prepend list: new processes are inserted at the head. This results in reverse pid order, so we first need to sort the processes by ascending pid and then print the threads in that order.

Before:
```
Process 0 stopped
* thread #1: tid = 101866, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff8015882f780, flags=259) at sched_ule.c:2448:26, name = '(pid 12991) dtrace'
  thread #2: tid = 101915, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff80158825780, flags=259) at sched_ule.c:2448:26, name = '(pid 11509) zsh'
  thread #3: tid = 101942, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff80142599000, flags=259) at sched_ule.c:2448:26, name = '(pid 11504) ftcleanup'
  thread #4: tid = 101545, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff80131898000, flags=259) at sched_ule.c:2448:26, name = '(pid 5599) zsh'
  thread #5: tid = 100905, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff80131899000, flags=259) at sched_ule.c:2448:26, name = '(pid 5598) sshd-session'
  thread #6: tid = 101693, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff8015886e780, flags=259) at sched_ule.c:2448:26, name = '(pid 5595) sshd-session'
  thread #7: tid = 101626, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801588be000, flags=259) at sched_ule.c:2448:26, name = '(pid 5592) sh'
...
```
After:
```
(lldb) thread list
Process 0 stopped
* thread #1: tid = 100000, 0xffffffff80bf9322 kernel`sched_switch(td=0xffffffff81abe840, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel'
  thread #2: tid = 100035, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d9780, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_0'
  thread #3: tid = 100036, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d9000, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_1'
  thread #4: tid = 100037, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d8780, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_2'
  thread #5: tid = 100038, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d8000, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_3'
  thread #6: tid = 100039, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d7780, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_4'
  thread #7: tid = 100040, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d7000, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_5'
...
```
Signed-off-by: Minsoo Choo <minsoochoo0122@proton.me>
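The ordering problem and its fix can be reproduced with a small model. The sketch below is illustrative Python, not the LLDB plugin code: the `deque` stands in for the kernel's process list, `appendleft` stands in for insertion at the head, and the pids and thread names are sample values taken from the listings above.

```python
from collections import deque

# allproc modeled as a deque; appendleft mimics insertion at the list head,
# so the newest process ends up first.
allproc = deque()
for pid, threads in [
    (0, ["kernel"]),
    (5592, ["sh"]),
    (5595, ["sshd-session"]),
    (12991, ["dtrace"]),
]:
    allproc.appendleft({"pid": pid, "threads": threads})


def thread_names(procs):
    """Flatten processes into thread labels, in the order the processes are given."""
    return [f"(pid {p['pid']}) {t}" for p in procs for t in p["threads"]]


# Walking the list as stored gives reverse pid order.
before = thread_names(allproc)
# Sorting by pid first gives the ascending order shown in the "After" listing.
after = thread_names(sorted(allproc, key=lambda p: p["pid"]))
```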
Very minor PR... I couldn't find the verb "indices", and it was actually a bit confusing for me reading this.
I think this should be "indexes" instead.