
Conversation

@moar55 (Owner) commented Feb 16, 2025

Very minor PR... "indices" doesn't exist as a verb, and it was actually a bit confusing for me reading this.
I think this should be "indexes" instead.

@moar55 moar55 closed this Feb 17, 2025
moar55 pushed a commit that referenced this pull request Feb 19, 2025
For function declarations (i.e. when the func op has no entry block), the
FunctionOpInterface methods `insertArgument` and `eraseArgument` cause a
segfault. This PR guards against manipulating the missing entry block by
checking whether the func op is external.
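
A minimal sketch of the guard described above (not the exact patch; it assumes the standard `FunctionOpInterface` API):

```cpp
#include "mlir/Interfaces/FunctionInterfaces.h"

// External functions (declarations) have no body, so there is no entry block
// whose arguments could be updated; bail out instead of segfaulting.
static void safeInsertArgument(mlir::FunctionOpInterface funcOp, unsigned index,
                               mlir::Type type, mlir::DictionaryAttr attrs,
                               mlir::Location loc) {
  if (funcOp.isExternal())
    return;
  funcOp.insertArgument(index, type, attrs, loc);
}
```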

An example can be seen in google/heir#1324

The segfault trace

```
 #1 0x0000560f1289d9db PrintStackTraceSignalHandler(void*) /proc/self/cwd/external/llvm-project/llvm/lib/Support/Unix/Signals.inc:874:1
 llvm#2 0x0000560f1289b116 llvm::sys::RunSignalHandlers() /proc/self/cwd/external/llvm-project/llvm/lib/Support/Signals.cpp:105:5
 llvm#3 0x0000560f1289e145 SignalHandler(int) /proc/self/cwd/external/llvm-project/llvm/lib/Support/Unix/Signals.inc:415:1
 llvm#4 0x00007f829a3d9520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 llvm#5 0x0000560f1257f8bc void __gnu_cxx::new_allocator<mlir::BlockArgument>::construct<mlir::BlockArgument, mlir::BlockArgument>(mlir::BlockArgument*, mlir::BlockArgument&&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/new_allocator.h:162:23
 llvm#6 0x0000560f1257f84d void std::allocator_traits<std::allocator<mlir::BlockArgument> >::construct<mlir::BlockArgument, mlir::BlockArgument>(std::allocator<mlir::BlockArgument>&, mlir::BlockArgument*, mlir::BlockArgument&&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/alloc_traits.h:520:2
 llvm#7 0x0000560f12580498 void std::vector<mlir::BlockArgument, std::allocator<mlir::BlockArgument> >::_M_insert_aux<mlir::BlockArgument>(__gnu_cxx::__normal_iterator<mlir::BlockArgument*, std::vector<mlir::BlockArgument, std::allocator<mlir::BlockArgument> > >, mlir::BlockArgument&&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/vector.tcc:405:7
 llvm#8 0x0000560f1257cf7e std::vector<mlir::BlockArgument, std::allocator<mlir::BlockArgument> >::insert(__gnu_cxx::__normal_iterator<mlir::BlockArgument const*, std::vector<mlir::BlockArgument, std::allocator<mlir::BlockArgument> > >, mlir::BlockArgument const&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/vector.tcc:154:6
 llvm#9 0x0000560f1257b349 mlir::Block::insertArgument(unsigned int, mlir::Type, mlir::Location) /proc/self/cwd/external/llvm-project/mlir/lib/IR/Block.cpp:178:13
llvm#10 0x0000560f123d2a1c mlir::function_interface_impl::insertFunctionArguments(mlir::FunctionOpInterface, llvm::ArrayRef<unsigned int>, mlir::TypeRange, llvm::ArrayRef<mlir::DictionaryAttr>, llvm::ArrayRef<mlir::Location>, unsigned int, mlir::Type) /proc/self/cwd/external/llvm-project/mlir/lib/Interfaces/FunctionInterfaces.cpp:232:11
llvm#11 0x0000560f0be6b727 mlir::detail::FunctionOpInterfaceTrait<mlir::func::FuncOp>::insertArguments(llvm::ArrayRef<unsigned int>, mlir::TypeRange, llvm::ArrayRef<mlir::DictionaryAttr>, llvm::ArrayRef<mlir::Location>) /proc/self/cwd/bazel-out/k8-dbg/bin/external/llvm-project/mlir/include/mlir/Interfaces/FunctionInterfaces.h.inc:809:7
llvm#12 0x0000560f0be6b536 mlir::detail::FunctionOpInterfaceTrait<mlir::func::FuncOp>::insertArgument(unsigned int, mlir::Type, mlir::DictionaryAttr, mlir::Location) /proc/self/cwd/bazel-out/k8-dbg/bin/external/llvm-project/mlir/include/mlir/Interfaces/FunctionInterfaces.h.inc:796:7
```
moar55 pushed a commit that referenced this pull request Nov 23, 2025
## Summary

Fix `FindProcesses` to respect Android's `hidepid=2` security model and
enable name matching for Android apps.

## Problem

1. The previous implementation called `adb shell pidof` or `adb shell ps`
directly, bypassing Android's process visibility restrictions
2. Name matching failed for Android apps - the search looked for
`com.example.myapp`, but the GDB Remote Protocol reports `app_process64`

Android apps fork from Zygote, so `/proc/PID/exe` points to
`app_process64` for all apps. The actual package name is only in
`/proc/PID/cmdline`. The previous implementation applied name filters
without supplementing with cmdline, so searches failed.

## Fix

- Delegate to lldb-server via the GDB Remote Protocol (respects `hidepid=2`)
- Get all visible processes, supplement zygote/app_process entries with
cmdline, then apply name matching (a sketch of this step follows the list)
- Only fetch cmdline for zygote apps (for performance), parallelized with
`xargs -P 8`
- Remove redundant code (the GDB Remote Protocol already provides GID/arch)
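
A hedged sketch of the supplementation step (simplified types and hypothetical helper names, not the actual LLDB plugin code):

```cpp
#include <string>
#include <vector>

struct ProcEntry {                // simplified stand-in for ProcessInstanceInfo
  int pid = 0;
  std::string name;               // as reported over the GDB Remote Protocol
};

// Hypothetical helper: Zygote-forked apps all report app_process32/64.
static bool IsZygoteChild(const ProcEntry &p) {
  return p.name == "app_process64" || p.name == "app_process32" ||
         p.name.rfind("zygote", 0) == 0;
}

// Hypothetical helper: in the real fix the cmdlines are fetched on the device
// in one parallel `xargs -P 8` pass; this stub just reports "not found".
static std::string ReadCmdline(int /*pid*/) { return ""; }

// Supplement Zygote-forked entries with their package name, then match.
static std::vector<ProcEntry> MatchByName(std::vector<ProcEntry> procs,
                                          const std::string &wanted) {
  std::vector<ProcEntry> matches;
  for (ProcEntry &p : procs) {
    if (IsZygoteChild(p)) {
      std::string cmdline = ReadCmdline(p.pid);
      if (!cmdline.empty())
        p.name = cmdline;         // e.g. "com.example.hellojni"
    }
    if (p.name == wanted)
      matches.push_back(p);
  }
  return matches;
}
```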

## Test Results

### Before this fix:

```
(lldb) platform process list
error: no processes were found on the "remote-android" platform

(lldb) platform process list -n com.example.hellojni
1 matching process was found on "remote-android"
PID    PARENT USER       TRIPLE                         NAME
====== ====== ========== ============================== ============================
5276   359    u0_a192                                   com.example.hellojni
                         ^^^^^^^^ Missing triple!
```

### After this fix:

```
(lldb) platform process list
PID    PARENT USER       TRIPLE                         NAME
====== ====== ========== ============================== ============================
1      0      root       aarch64-unknown-linux-android  init
2      0      root                                      [kthreadd]
359    1      system     aarch64-unknown-linux-android  app_process64
5276   359    u0_a192    aarch64-unknown-linux-android  com.example.hellojni
5357   5355   u0_a192    aarch64-unknown-linux-android  sh
5377   5370   u0_a192    aarch64-unknown-linux-android  lldb-server
                          ^^^^^^^^ User-space processes now have triples!

(lldb) platform process list -n com.example.hellojni
1 matching process was found on "remote-android"
PID    PARENT USER       TRIPLE                         NAME
====== ====== ========== ============================== ============================
5276   359    u0_a192    aarch64-unknown-linux-android  com.example.hellojni


(lldb) process attach -n com.example.hellojni
Process 5276 stopped
* thread #1, name = 'example.hellojni', stop reason = signal SIGSTOP
```

## Test Plan

With an Android device/emulator connected:

1. Start lldb-server on device:
```bash
adb push lldb-server /data/local/tmp/
adb shell chmod +x /data/local/tmp/lldb-server
adb shell /data/local/tmp/lldb-server platform  --listen 127.0.0.1:9500 --server
```

2. Connect from LLDB:
```
(lldb) platform select remote-android
(lldb) platform connect connect://127.0.0.1:9500
(lldb) platform process list
```

3. Verify:
- `platform process list` returns all processes with triple information
- `platform process list -n com.example.app` finds Android apps by
package name
- `process attach -n com.example.app` successfully attaches to Android
apps

## Impact

Restores `platform process list` on Android with architecture
information and package name lookup. All name matching modes now work
correctly.

Fixes llvm#164192
moar55 pushed a commit that referenced this pull request Nov 23, 2025
…am (llvm#167724)

This got exposed by `09262656f32ab3f2e1d82e5342ba37eecac52522`.

The underlying stream of `m_os` is referenced by the `TextDiagnostic`
member of `TextDiagnosticPrinter`. It got turned into an
`llvm::formatted_raw_ostream` in the commit above. When
`~TextDiagnosticPrinter` (and thus `~TextDiagnostic`) is invoked, we now
call `~formatted_raw_ostream`, which tries to access the underlying
stream. But `m_os` has already been destroyed, because it comes earlier in
the destruction order than the `TextDiagnosticPrinter` member. Move the
`m_os` member declaration before the `TextDiagnosticPrinter` member to
avoid the use-after-free.
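
The rule being relied on here, as a minimal self-contained sketch (members are destroyed in reverse declaration order, so a member that another member references must be declared first; this is illustrative, not the LLDB code):

```cpp
#include <iostream>
#include <string>

struct Stream {                    // stand-in for the stream wrapping m_output
  std::string &buffer;             // references another member of the owner
  explicit Stream(std::string &b) : buffer(b) {}
  ~Stream() { buffer += "[flushed]"; }  // touches the referenced member
};

// Fixed layout (mirrors the patch): declare m_output first, then m_os, so the
// stream is destroyed before the string it references. Declaring them the
// other way around would destroy the string first and make ~Stream() touch a
// dead object.
struct Owner {
  std::string m_output;
  Stream m_os{m_output};
};

int main() {
  Owner o;                         // safe: m_output outlives m_os
  std::cout << "ok\n";
}
```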

Drive-by:
* Also move the `m_output` member, which `m_os` holds a reference to.
The fact that it's a reference indicates the expectation is most likely
that the string outlives the stream.

The ASAN macOS bot is currently failing with this:
```
08:15:39  =================================================================
08:15:39  ==61103==ERROR: AddressSanitizer: heap-use-after-free on address 0x60600012cf40 at pc 0x00012140d304 bp 0x00016eecc850 sp 0x00016eecc848
08:15:39  READ of size 8 at 0x60600012cf40 thread T0
08:15:39      #0 0x00012140d300 in llvm::formatted_raw_ostream::releaseStream() FormattedStream.h:205
08:15:39      #1 0x00012140d3a4 in llvm::formatted_raw_ostream::~formatted_raw_ostream() FormattedStream.h:145
08:15:39      llvm#2 0x00012604abf8 in clang::TextDiagnostic::~TextDiagnostic() TextDiagnostic.cpp:721
08:15:39      llvm#3 0x00012605dc80 in clang::TextDiagnosticPrinter::~TextDiagnosticPrinter() TextDiagnosticPrinter.cpp:30
08:15:39      llvm#4 0x00012605dd5c in clang::TextDiagnosticPrinter::~TextDiagnosticPrinter() TextDiagnosticPrinter.cpp:27
08:15:39      llvm#5 0x0001231fb210 in (anonymous namespace)::StoringDiagnosticConsumer::~StoringDiagnosticConsumer() ClangModulesDeclVendor.cpp:47
08:15:39      llvm#6 0x0001231fb3bc in (anonymous namespace)::StoringDiagnosticConsumer::~StoringDiagnosticConsumer() ClangModulesDeclVendor.cpp:47
08:15:39      llvm#7 0x000129aa9d70 in clang::DiagnosticsEngine::~DiagnosticsEngine() Diagnostic.cpp:91
08:15:39      llvm#8 0x0001230436b8 in llvm::RefCountedBase<clang::DiagnosticsEngine>::Release() const IntrusiveRefCntPtr.h:103
08:15:39      llvm#9 0x0001231fe6c8 in (anonymous namespace)::ClangModulesDeclVendorImpl::~ClangModulesDeclVendorImpl() ClangModulesDeclVendor.cpp:93
08:15:39      llvm#10 0x0001231fe858 in (anonymous namespace)::ClangModulesDeclVendorImpl::~ClangModulesDeclVendorImpl() ClangModulesDeclVendor.cpp:93
...
08:15:39
08:15:39  0x60600012cf40 is located 32 bytes inside of 56-byte region [0x60600012cf20,0x60600012cf58)
08:15:39  freed by thread T0 here:
08:15:39      #0 0x0001018abb88 in _ZdlPv+0x74 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x4bb88)
08:15:39      #1 0x0001231fb1c0 in (anonymous namespace)::StoringDiagnosticConsumer::~StoringDiagnosticConsumer() ClangModulesDeclVendor.cpp:47
08:15:39      llvm#2 0x0001231fb3bc in (anonymous namespace)::StoringDiagnosticConsumer::~StoringDiagnosticConsumer() ClangModulesDeclVendor.cpp:47
08:15:39      llvm#3 0x000129aa9d70 in clang::DiagnosticsEngine::~DiagnosticsEngine() Diagnostic.cpp:91
08:15:39      llvm#4 0x0001230436b8 in llvm::RefCountedBase<clang::DiagnosticsEngine>::Release() const IntrusiveRefCntPtr.h:103
08:15:39      llvm#5 0x0001231fe6c8 in (anonymous namespace)::ClangModulesDeclVendorImpl::~ClangModulesDeclVendorImpl() ClangModulesDeclVendor.cpp:93
08:15:39      llvm#6 0x0001231fe858 in (anonymous namespace)::ClangModulesDeclVendorImpl::~ClangModulesDeclVendorImpl() ClangModulesDeclVendor.cpp:93
...
08:15:39
08:15:39  previously allocated by thread T0 here:
08:15:39      #0 0x0001018ab760 in _Znwm+0x74 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x4b760)
08:15:39      #1 0x0001231f8dec in lldb_private::ClangModulesDeclVendor::Create(lldb_private::Target&) ClangModulesDeclVendor.cpp:732
08:15:39      llvm#2 0x00012320af58 in lldb_private::ClangPersistentVariables::GetClangModulesDeclVendor() ClangPersistentVariables.cpp:124
08:15:39      llvm#3 0x0001232111f0 in lldb_private::ClangUserExpression::PrepareForParsing(lldb_private::DiagnosticManager&, lldb_private::ExecutionContext&, bool) ClangUserExpression.cpp:536
08:15:39      llvm#4 0x000123213790 in lldb_private::ClangUserExpression::Parse(lldb_private::DiagnosticManager&, lldb_private::ExecutionContext&, lldb_private::ExecutionPolicy, bool, bool) ClangUserExpression.cpp:647
08:15:39      llvm#5 0x00012032b258 in lldb_private::UserExpression::Evaluate(lldb_private::ExecutionContext&, lldb_private::EvaluateExpressionOptions const&, llvm::StringRef, llvm::StringRef, std::__1::shared_ptr<lldb_private::ValueObject>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, lldb_private::ValueObject*) UserExpression.cpp:280
08:15:39      llvm#6 0x000120724010 in lldb_private::Target::EvaluateExpression(llvm::StringRef, lldb_private::ExecutionContextScope*, std::__1::shared_ptr<lldb_private::ValueObject>&, lldb_private::EvaluateExpressionOptions const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, lldb_private::ValueObject*) Target.cpp:2905
08:15:39      llvm#7 0x00011fc7bde0 in lldb::SBTarget::EvaluateExpression(char const*, lldb::SBExpressionOptions const&) SBTarget.cpp:2305
08:15:39  ==61103==ABORTING
...
```
moar55 pushed a commit that referenced this pull request Nov 23, 2025
moar55 pushed a commit that referenced this pull request Nov 23, 2025
llvm#168619)

I've been working on some scripts that evaluate the parent and child
frames. It's been very annoying that the parent frame has a property but
the child does not, so I've added this to the extensions. I would've
preferred to return None, but the existing impl returns an invalid
SBFrame, so I'm conforming to that API.
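
Roughly what the new properties compute, sketched against the C++ SB API (the patch itself extends the Python extensions; this is an illustration, not the patch code):

```cpp
#include "lldb/API/SBFrame.h"
#include "lldb/API/SBThread.h"
#include <cstdint>

// Parent = the next-higher frame index on the same thread; returns an invalid
// SBFrame for the outermost frame, matching the existing API's convention.
lldb::SBFrame GetParentFrame(lldb::SBFrame frame) {
  return frame.GetThread().GetFrameAtIndex(frame.GetFrameID() + 1);
}

// Child = the next-lower frame index; frame #0 has no child.
lldb::SBFrame GetChildFrame(lldb::SBFrame frame) {
  uint32_t id = frame.GetFrameID();
  return id == 0 ? lldb::SBFrame() : frame.GetThread().GetFrameAtIndex(id - 1);
}
```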

```
(lldb) script
Python Interactive Interpreter. To exit, type 'quit()', 'exit()' or Ctrl-D.
>>> lldb.frame
frame #0: 0x0000555555555200 fib.out`main
>>> lldb.frame.parent
frame #1: 0x00007ffff782a610 libc.so.6`__libc_start_call_main + 128
>>> lldb.frame.parent.child
frame #0: 0x0000555555555200 fib.out`main
```
moar55 pushed a commit that referenced this pull request Feb 2, 2026
In this PR I move the insertion point in
`yieldReplacementForFusedProducer` because I ran into an issue where a
`tensor.extract_slice` tried to use the result of an `affine.apply` that
was inserted at the end of the block instead of at the start of it.
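
The general pattern of the fix, as a minimal sketch (it assumes the standard `OpBuilder` API; this is not the exact change):

```cpp
#include "mlir/IR/Builders.h"

// Create the replacement ops (e.g. the affine.apply feeding a
// tensor.extract_slice) right after the op that defines their operands,
// instead of at the end of the block, so every later user is dominated.
static void setReplacementInsertionPoint(mlir::OpBuilder &builder,
                                         mlir::Operation *definingOp) {
  builder.setInsertionPointAfter(definingOp);
  // ... build affine.apply / tensor.extract_slice here ...
}
```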

This is the full error from the test I added, before this change:

```mlir
third-party/llvm-project/mlir/test/Interfaces/TilingInterface/tile-fuse-and-yield-using-scfforall.mlir:83:11: error: operand #1 does not dominate this use
  %pack = linalg.pack %gen#1
          ^
third-party/llvm-project/mlir/test/Interfaces/TilingInterface/tile-fuse-and-yield-using-scfforall.mlir:83:11: note: see current operation: %24 = "tensor.extract_slice"(%23, %36, %8) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32>
third-party/llvm-project/mlir/test/Interfaces/TilingInterface/tile-fuse-and-yield-using-scfforall.mlir:71:12: note: operand defined here (op in the same block)
  %gen:2 = linalg.generic {
           ^
// -----// IR Dump After InterpreterPass Failed (transform-interpreter) //----- //
#map = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0) -> (d0 * 16)>
#map2 = affine_map<(d0) -> (d0 * -16 + 32)>
#map3 = affine_map<(d0) -> (16, d0 * -16 + 32)>
#map4 = affine_map<(d0) -> (d0 - 1)>
"builtin.module"() ({
  "func.func"() <{function_type = (tensor<32x1024xf32>) -> (tensor<32x1024xf32>, tensor<2x512x16x2xi8>), sym_name = "fuse_pack_consumer_into_multi_output_generic"}> ({
  ^bb0(%arg1: tensor<32x1024xf32>):
    %2 = "arith.constant"() <{value = 0 : i8}> : () -> i8
    %3 = "tensor.empty"() : () -> tensor<32x1024xf32>
    %4 = "tensor.empty"() : () -> tensor<32x1024xi8>
    %5 = "tensor.empty"() : () -> tensor<2x512x16x2xi8>
    %6:2 = "linalg.generic"(%arg1, %3, %4) <{indexing_maps = [#map, #map, #map], iterator_types = [#linalg.iterator_type<parallel>, #linalg.iterator_type<parallel>], operandSegmentSizes = array<i32: 1, 2>}> ({
    ^bb0(%arg9: f32, %arg10: f32, %arg11: i8):
      %41 = "arith.fptoui"(%arg9) : (f32) -> i8
      "linalg.yield"(%arg9, %41) : (f32, i8) -> ()
    }) : (tensor<32x1024xf32>, tensor<32x1024xf32>, tensor<32x1024xi8>) -> (tensor<32x1024xf32>, tensor<32x1024xi8>)
    %7:3 = "scf.forall"(%5, %3, %4) <{operandSegmentSizes = array<i32: 0, 0, 0, 3>, staticLowerBound = array<i64: 0>, staticStep = array<i64: 1>, staticUpperBound = array<i64: 2>}> ({
    ^bb0(%arg2: index, %arg3: tensor<2x512x16x2xi8>, %arg4: tensor<32x1024xf32>, %arg5: tensor<32x1024xi8>):
      %8 = "affine.apply"(%arg2) <{map = #map1}> : (index) -> index
      %9 = "affine.apply"(%arg2) <{map = #map2}> : (index) -> index
      %10 = "affine.min"(%arg2) <{map = #map3}> : (index) -> index
      %11 = "affine.apply"(%10) <{map = #map4}> : (index) -> index
      %12 = "affine.apply"(%arg2) <{map = #map1}> : (index) -> index
      %13 = "affine.apply"(%10) <{map = #map4}> : (index) -> index
      %14 = "affine.apply"(%arg2) <{map = #map1}> : (index) -> index
      %15 = "affine.apply"(%10) <{map = #map4}> : (index) -> index
      %16 = "affine.apply"(%arg2) <{map = #map1}> : (index) -> index
      %17 = "affine.apply"(%10) <{map = #map4}> : (index) -> index
      %18 = "tensor.extract_slice"(%arg1, %12, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32>
      %19 = "tensor.empty"() : () -> tensor<32x1024xf32>
      %20 = "tensor.extract_slice"(%19, %14, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32>
      %21 = "tensor.extract_slice"(%3, %14, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32>
      %22 = "tensor.empty"() : () -> tensor<32x1024xi8>
      %23 = "tensor.extract_slice"(%22, %16, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xi8>, index, index) -> tensor<?x1024xi8>
      %24 = "tensor.extract_slice"(%4, %16, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xi8>, index, index) -> tensor<?x1024xi8>
      %25 = "tensor.empty"() : () -> tensor<32x1024xf32>
      %26 = "tensor.extract_slice"(%25, %38, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32>
      %27 = "tensor.extract_slice"(%arg4, %38, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32>
      %28 = "tensor.empty"() : () -> tensor<32x1024xi8>
      %29 = "tensor.extract_slice"(%28, %8, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xi8>, index, index) -> tensor<?x1024xi8>
      %30 = "tensor.extract_slice"(%arg5, %8, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xi8>, index, index) -> tensor<?x1024xi8>
      %31:2 = "linalg.generic"(%18, %27, %30) <{indexing_maps = [#map, #map, #map], iterator_types = [#linalg.iterator_type<parallel>, #linalg.iterator_type<parallel>], operandSegmentSizes = array<i32: 1, 2>}> ({
      ^bb0(%arg6: f32, %arg7: f32, %arg8: i8):
        %40 = "arith.fptoui"(%arg6) : (f32) -> i8
        "linalg.yield"(%arg6, %40) : (f32, i8) -> ()
      }) : (tensor<?x1024xf32>, tensor<?x1024xf32>, tensor<?x1024xi8>) -> (tensor<?x1024xf32>, tensor<?x1024xi8>)
      %32 = "tensor.extract_slice"(%6#1, %8, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xi8>, index, index) -> tensor<?x1024xi8>
      %33 = "tensor.empty"() : () -> tensor<2x512x16x2xi8>
      %34 = "tensor.extract_slice"(%33, %arg2) <{operandSegmentSizes = array<i32: 1, 1, 0, 0>, static_offsets = array<i64: -9223372036854775808, 0, 0, 0>, static_sizes = array<i64: 1, 512, 16, 2>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<2x512x16x2xi8>, index) -> tensor<1x512x16x2xi8>
      %35 = "tensor.extract_slice"(%arg3, %arg2) <{operandSegmentSizes = array<i32: 1, 1, 0, 0>, static_offsets = array<i64: -9223372036854775808, 0, 0, 0>, static_sizes = array<i64: 1, 512, 16, 2>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<2x512x16x2xi8>, index) -> tensor<1x512x16x2xi8>
      %36 = "linalg.pack"(%31#1, %35, %2) <{inner_dims_pos = array<i64: 0, 1>, operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_inner_tiles = array<i64: 16, 2>}> : (tensor<?x1024xi8>, tensor<1x512x16x2xi8>, i8) -> tensor<1x512x16x2xi8>
      %37 = "affine.apply"(%10) <{map = #map4}> : (index) -> index
      %38 = "affine.apply"(%arg2) <{map = #map1}> : (index) -> index
      %39 = "affine.apply"(%10) <{map = #map4}> : (index) -> index
      "scf.forall.in_parallel"() ({
        "tensor.parallel_insert_slice"(%36, %arg3, %arg2) <{operandSegmentSizes = array<i32: 1, 1, 1, 0, 0>, static_offsets = array<i64: -9223372036854775808, 0, 0, 0>, static_sizes = array<i64: 1, 512, 16, 2>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<1x512x16x2xi8>, tensor<2x512x16x2xi8>, index) -> ()
        "tensor.parallel_insert_slice"(%31#0, %arg4, %38, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<?x1024xf32>, tensor<32x1024xf32>, index, index) -> ()
        "tensor.parallel_insert_slice"(%31#1, %arg5, %8, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<?x1024xi8>, tensor<32x1024xi8>, index, index) -> ()
      }) : () -> ()
    }) : (tensor<2x512x16x2xi8>, tensor<32x1024xf32>, tensor<32x1024xi8>) -> (tensor<2x512x16x2xi8>, tensor<32x1024xf32>, tensor<32x1024xi8>)
    "func.return"(%7#1, %7#0) : (tensor<32x1024xf32>, tensor<2x512x16x2xi8>) -> ()
  }) : () -> ()
  "builtin.module"() ({
    "transform.named_sequence"() <{arg_attrs = [{transform.readonly}], function_type = (!transform.any_op) -> (), sym_name = "__transform_main"}> ({
    ^bb0(%arg0: !transform.any_op):
      %0 = "transform.structured.match"(%arg0) <{ops = ["linalg.pack"]}> : (!transform.any_op) -> !transform.any_op
      %1:2 = "transform.test.fuse_and_yield"(%0) <{tile_interchange = [], tile_sizes = [1], use_forall = true}> : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
      "transform.yield"() : () -> ()
    }) : () -> ()
  }) {transform.with_named_sequence} : () -> ()
}) : () -> ()
``` 

I also noticed that the Interface tests are missing from the Bazel overlay,
so I added those as well.
moar55 pushed a commit that referenced this pull request Feb 3, 2026
…m#167446)

Add an SVE optimization for AArch64 architectures. The idea is to use
predicate registers to avoid branching.
The microbenchmark in the repo shows considerable improvements on an NV GB10
(locked to the largest X925 core):

```
======================================================================
BENCHMARK STATISTICS (time in nanoseconds)
======================================================================

memcpy_Google_A:
  Old - Mean: 3.1257 ns, Median: 3.1162 ns
  New - Mean: 2.8402 ns, Median: 2.8265 ns
  Improvement: +9.14% (mean), +9.30% (median)

memcpy_Google_B:
  Old - Mean: 2.3171 ns, Median: 2.3159 ns
  New - Mean: 1.6589 ns, Median: 1.6593 ns
  Improvement: +28.40% (mean), +28.35% (median)

memcpy_Google_D:
  Old - Mean: 8.7602 ns, Median: 8.7645 ns
  New - Mean: 8.4307 ns, Median: 8.4308 ns
  Improvement: +3.76% (mean), +3.81% (median)

memcpy_Google_L:
  Old - Mean: 1.7137 ns, Median: 1.7091 ns
  New - Mean: 1.4530 ns, Median: 1.4553 ns
  Improvement: +15.22% (mean), +14.85% (median)

memcpy_Google_M:
  Old - Mean: 1.9823 ns, Median: 1.9825 ns
  New - Mean: 1.4826 ns, Median: 1.4840 ns
  Improvement: +25.20% (mean), +25.15% (median)

memcpy_Google_Q:
  Old - Mean: 1.6812 ns, Median: 1.6784 ns
  New - Mean: 1.1538 ns, Median: 1.1517 ns
  Improvement: +31.37% (mean), +31.38% (median)

memcpy_Google_S:
  Old - Mean: 2.1816 ns, Median: 2.1786 ns
  New - Mean: 1.6297 ns, Median: 1.6287 ns
  Improvement: +25.29% (mean), +25.24% (median)

memcpy_Google_U:
  Old - Mean: 2.2851 ns, Median: 2.2825 ns
  New - Mean: 1.7219 ns, Median: 1.7187 ns
  Improvement: +24.65% (mean), +24.70% (median)

memcpy_Google_W:
  Old - Mean: 2.0408 ns, Median: 2.0361 ns
  New - Mean: 1.5260 ns, Median: 1.5252 ns
  Improvement: +25.23% (mean), +25.09% (median)

uniform_384_to_4096:
  Old - Mean: 26.9067 ns, Median: 26.8845 ns
  New - Mean: 26.8083 ns, Median: 26.8149 ns
  Improvement: +0.37% (mean), +0.26% (median)
```
The beginning of the memcpy function looks like the following:
```
Dump of assembler code for function _ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm:
   0x0000000000001340 <+0>:     cbz     x2, 0x143c <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+252>
   0x0000000000001344 <+4>:     cbz     x0, 0x1440 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+256>
   0x0000000000001348 <+8>:     cbz     x1, 0x1444 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+260>
   0x000000000000134c <+12>:    subs    x8, x2, #0x20
   0x0000000000001350 <+16>:    b.hi    0x1374 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+52>  // b.pmore
   0x0000000000001354 <+20>:    rdvl    x8, #1
   0x0000000000001358 <+24>:    whilelo p0.b, xzr, x2
   0x000000000000135c <+28>:    ld1b    {z0.b}, p0/z, [x1]
   0x0000000000001360 <+32>:    whilelo p1.b, x8, x2
   0x0000000000001364 <+36>:    ld1b    {z1.b}, p1/z, [x1, #1, mul vl]
   0x0000000000001368 <+40>:    st1b    {z0.b}, p0, [x0]
   0x000000000000136c <+44>:    st1b    {z1.b}, p1, [x0, #1, mul vl]
   0x0000000000001370 <+48>:    ret
```
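
The same predicated idea sketched with SVE ACLE intrinsics (a rough illustration assuming an SVE-enabled toolchain, not the actual libc implementation):

```cpp
#include <arm_sve.h>   // requires e.g. -march=armv8-a+sve
#include <stddef.h>
#include <stdint.h>

// Branch-free copy of up to two SVE vectors: whilelo builds predicates that
// mask off the lanes beyond n, so no length-dependent branches are needed.
static void copy_le_two_vectors(uint8_t *dst, const uint8_t *src, size_t n) {
  uint64_t vl = svcntb();                    // bytes per SVE vector (rdvl #1)
  svbool_t p0 = svwhilelt_b8_u64(0, n);      // lanes [0, n)
  svbool_t p1 = svwhilelt_b8_u64(vl, n);     // lanes [vl, n), empty if n <= vl
  svuint8_t v0 = svld1_u8(p0, src);
  svuint8_t v1 = svld1_u8(p1, src + vl);
  svst1_u8(p0, dst, v0);
  svst1_u8(p1, dst + vl, v1);
}
```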

---------

Co-authored-by: Guillaume Chatelet <chatelet.guillaume@gmail.com>
moar55 pushed a commit that referenced this pull request Feb 6, 2026
…8306)

In FreeBSD, allproc is a prepend list: new processes are inserted at the
head. This results in reverse PID order, so we first need to sort the PIDs
in increasing order and then print the threads in that order.
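
A minimal sketch of the reordering step (simplified types, not the actual plugin code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct KernelThreadEntry {   // simplified stand-in for the plugin's bookkeeping
  uint64_t pid;
  uint64_t tid;
};

// Entries harvested from allproc come out newest-first; sort by increasing
// PID (keeping per-process thread order stable) before building the list.
static void SortByPid(std::vector<KernelThreadEntry> &threads) {
  std::stable_sort(threads.begin(), threads.end(),
                   [](const KernelThreadEntry &a, const KernelThreadEntry &b) {
                     return a.pid < b.pid;
                   });
}
```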

Before:
```
Process 0 stopped
* thread #1: tid = 101866, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff8015882f780, flags=259) at sched_ule.c:2448:26, name = '(pid 12991) dtrace'
  thread llvm#2: tid = 101915, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff80158825780, flags=259) at sched_ule.c:2448:26, name = '(pid 11509) zsh'
  thread llvm#3: tid = 101942, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff80142599000, flags=259) at sched_ule.c:2448:26, name = '(pid 11504) ftcleanup'
  thread llvm#4: tid = 101545, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff80131898000, flags=259) at sched_ule.c:2448:26, name = '(pid 5599) zsh'
  thread llvm#5: tid = 100905, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff80131899000, flags=259) at sched_ule.c:2448:26, name = '(pid 5598) sshd-session'
  thread llvm#6: tid = 101693, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff8015886e780, flags=259) at sched_ule.c:2448:26, name = '(pid 5595) sshd-session'
  thread llvm#7: tid = 101626, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801588be000, flags=259) at sched_ule.c:2448:26, name = '(pid 5592) sh'
...
```

After:
```
(lldb) thread list
Process 0 stopped
* thread #1: tid = 100000, 0xffffffff80bf9322 kernel`sched_switch(td=0xffffffff81abe840, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel'
  thread llvm#2: tid = 100035, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d9780, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_0'
  thread llvm#3: tid = 100036, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d9000, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_1'
  thread llvm#4: tid = 100037, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d8780, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_2'
  thread llvm#5: tid = 100038, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d8000, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_3'
  thread llvm#6: tid = 100039, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d7780, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_4'
  thread llvm#7: tid = 100040, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d7000, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_5'
...
```

Signed-off-by: Minsoo Choo <minsoochoo0122@proton.me>