Skip to content

Conversation

@ggerganov
Copy link
Member

@ggerganov ggerganov commented Oct 17, 2025

  • Rework matrix-matrix multiplication
  • Use Tensor API when available

TODOs

  • Update mul_mm_id kernel
  • Test on M5 (looking for volunteers to test as I won't have hardware anytime soon)
  • How to handle missing bfloat tensor API? metal : initial Metal4 tensor API support #16634 (comment)
  • Confirm that using the Tensor API maintains the existing performance without using it on M4 and earlier

@github-actions github-actions bot added testing Everything test related ggml changes relating to the ggml tensor library for machine learning Apple Metal https://en.wikipedia.org/wiki/Metal_(API) labels Oct 17, 2025
@jeffbolznv
Copy link
Collaborator

Any early performance data?

@ggerganov ggerganov force-pushed the gg/metal-mul-mm-rework branch from 6271c44 to 6726e53 Compare October 18, 2025 14:24
@ggerganov
Copy link
Member Author

@jeffbolznv I think the performance using the tensor API is the same as the old simdgroup-based implementation, but I haven't done detailed analysis yet. I don't have hardware yet to test the actual Neural Accelerators that exist in the new chips and if they would be utilized with these changes.

@ggerganov ggerganov force-pushed the gg/metal-mul-mm-rework branch from 6726e53 to 57fa815 Compare October 20, 2025 13:01
@ggerganov ggerganov marked this pull request as ready for review October 20, 2025 16:30
@ggerganov ggerganov requested a review from slaren as a code owner October 20, 2025 16:30
@ggerganov
Copy link
Member Author

Looking for volunteers with iPhone 17 or MacBook M5 for testing

@ddh0
Copy link
Contributor

ddh0 commented Oct 23, 2025

Looking for volunteers with iPhone 17 or MacBook M5 for testing

I have an iPhone 17, how can I help?

@ggerganov
Copy link
Member Author

@ddh0 Run the benchmark from #4508 on master and on this branch and report the results.

@ngladitz
Copy link

I was just curious and don't have an M5. Disregard this if this isn't expected to work outside of that context yet. I see the following diagnostics (MacBook Air M2, Tahoe 26.0.1; Xcode 26.0.1; Apple Clang 17.0.0) when trying to run e.g. llama-server

Output
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: the device does not have a precompiled Metal library - this is unexpected
ggml_metal_init: will try to compile it on the fly
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: error: Error Domain=MTLLibraryErrorDomain Code=3 "In file included from program_source:2693:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MetalPerformancePrimitives.h:10:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:412:
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/__impl/MPPTensorOpsMatMul2dImpl.h:5885:9: error: static_assert failed due to requirement '__tensor_ops_detail::__assert_false_v<float>' "Unsupported type"
        static_assert(
        ^
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:441:24: note: in instantiation of function template specialization 'mpp::tensor_ops::__mutmul2d_detail::__run<{32, 64, 32, false, true, false, 1}, metal::execution_simdgroups<4>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{32, 64, 32, false, true, false, 1}, mpp::tensor_ops::matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, float, int>>>' requested here
    __mutmul2d_detail::__run<Descriptor, Scope, LeftOperandType,
                       ^
program_source:11054:12: note: in instantiation of function template specialization 'mpp::tensor_ops::matmul2d<{32, 64, 32, false, true, false, 1}, metal::execution_simdgroups<4>>::run<metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{32, 64, 32, false, true, false, 1}, mpp::tensor_ops::matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, float, int>>, void>' requested here
        mm.run(sB, sA, cT);
           ^
In file included from program_source:2693:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MetalPerformancePrimitives.h:10:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:412:
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/__impl/MPPTensorOpsMatMul2dImpl.h:5885:9: error: static_assert failed due to requirement '__tensor_ops_detail::__assert_false_v<float>' "Unsupported type"
        static_assert(
        ^
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:441:24: note: in instantiation of function template specialization 'mpp::tensor_ops::__mutmul2d_detail::__run<{32, 64, 32, false, true, false, 1}, metal::execution_simdgroups<4>, metal::tensor<threadgroup half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{32, 64, 32, false, true, false, 1}, mpp::tensor_ops::matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::tensor<threadgroup half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, float, int>>>' requested here
    __mutmul2d_detail::__run<Descriptor, Scope, LeftOperandType,
                       ^
program_source:11054:12: note: in instantiation of function template specialization 'mpp::tensor_ops::matmul2d<{32, 64, 32, false, true, false, 1}, metal::execution_simdgroups<4>>::run<metal::tensor<threadgroup half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{32, 64, 32, false, true, false, 1}, mpp::tensor_ops::matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::tensor<threadgroup half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, float, int>>, void>' requested here
        mm.run(sB, sA, cT);
           ^
" UserInfo={NSLocalizedDescription=In file included from program_source:2693:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MetalPerformancePrimitives.h:10:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:412:
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/__impl/MPPTensorOpsMatMul2dImpl.h:5885:9: error: static_assert failed due to requirement '__tensor_ops_detail::__assert_false_v<float>' "Unsupported type"
        static_assert(
        ^
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:441:24: note: in instantiation of function template specialization 'mpp::tensor_ops::__mutmul2d_detail::__run<{32, 64, 32, false, true, false, 1}, metal::execution_simdgroups<4>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{32, 64, 32, false, true, false, 1}, mpp::tensor_ops::matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, float, int>>>' requested here
    __mutmul2d_detail::__run<Descriptor, Scope, LeftOperandType,
                       ^
program_source:11054:12: note: in instantiation of function template specialization 'mpp::tensor_ops::matmul2d<{32, 64, 32, false, true, false, 1}, metal::execution_simdgroups<4>>::run<metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{32, 64, 32, false, true, false, 1}, mpp::tensor_ops::matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, float, int>>, void>' requested here
        mm.run(sB, sA, cT);
           ^
In file included from program_source:2693:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MetalPerformancePrimitives.h:10:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:412:
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/__impl/MPPTensorOpsMatMul2dImpl.h:5885:9: error: static_assert failed due to requirement '__tensor_ops_detail::__assert_false_v<float>' "Unsupported type"
        static_assert(
        ^
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:441:24: note: in instantiation of function template specialization 'mpp::tensor_ops::__mutmul2d_detail::__run<{32, 64, 32, false, true, false, 1}, metal::execution_simdgroups<4>, metal::tensor<threadgroup half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{32, 64, 32, false, true, false, 1}, mpp::tensor_ops::matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::tensor<threadgroup half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, float, int>>>' requested here
    __mutmul2d_detail::__run<Descriptor, Scope, LeftOperandType,
                       ^
program_source:11054:12: note: in instantiation of function template specialization 'mpp::tensor_ops::matmul2d<{32, 64, 32, false, true, false, 1}, metal::execution_simdgroups<4>>::run<metal::tensor<threadgroup half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{32, 64, 32, false, true, false, 1}, mpp::tensor_ops::matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, metal::tensor<threadgroup bfloat, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, metal::tensor<threadgroup half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_inline>, float, int>>, void>' requested here
        mm.run(sB, sA, cT);
           ^
}
ggml_metal_init: error: failed to initialize the Metal library
ggml_backend_metal_device_init: error: failed to allocate context
llama_init_from_model: failed to initialize the context: failed to initialize Metal backend

@ggerganov
Copy link
Member Author

@ngladitz Pushed a temporary workaround. It appears that some old versions of the MetalPerformancePrimitives framework are not compatible with bfloat.

Could you provide the output of the following commands:

defaults read /System/Library/Frameworks/MetalPerformancePrimitives.framework/Versions/Current/Resources/Info.plist CFBundleShortVersionString

xcrun metal -x metal -E -dM /dev/null

@ngladitz
Copy link

@ggerganov Thank you.

# defaults read /System/Library/Frameworks/MetalPerformancePrimitives.framework/Versions/Current/Resources/Info.plist CFBundleShortVersionString
1.0
# xcrun metal -x metal -E -dM /dev/null
error: error: cannot execute tool 'metal' due to missing Metal Toolchain; use: xcodebuild -downloadComponent MetalToolchain

I ran the download command indicated in that error and reran the initial command:

Output
#define _LP64 1
#define __AIR64__ 1
#define __AIR_ABI__ __AIR_PB_ABI__
#define __AIR_MB_ABI__ 1
#define __AIR_PB_ABI__ 0
#define __AIR_VB_ABI__ 2
#define __AIR_VERSION__ 20800
#define __APPLE_CC__ 6000
#define __APPLE__ 1
#define __ATOMIC_ACQUIRE 2
#define __ATOMIC_ACQ_REL 4
#define __ATOMIC_CONSUME 1
#define __ATOMIC_RELAXED 0
#define __ATOMIC_RELEASE 3
#define __ATOMIC_SEQ_CST 5
#define __BFLT16_DECIMAL_DIG__ 4
#define __BFLT16_DENORM_MIN__ 9.183550e-41BF16
#define __BFLT16_DIG__ 2
#define __BFLT16_EPSILON__ 7.812500e-03BF16
#define __BFLT16_HAS_DENORM__ 1
#define __BFLT16_HAS_INFINITY__ 1
#define __BFLT16_HAS_QUIET_NAN__ 1
#define __BFLT16_MANT_DIG__ 8
#define __BFLT16_MAX_10_EXP__ 38
#define __BFLT16_MAX_EXP__ 128
#define __BFLT16_MAX__ 3.389531e+38BF16
#define __BFLT16_MIN_10_EXP__ (-37)
#define __BFLT16_MIN_EXP__ (-125)
#define __BFLT16_MIN__ 1.175494e-38BF16
#define __BFLT16_M_1_PI__ 0.318309886183790671537767526745028724BF16
#define __BFLT16_M_2_PI__ 0.636619772367581343075535053490057448BF16
#define __BFLT16_M_2_SQRTPI__ 1.12837916709551257389615890312154517BF16
#define __BFLT16_M_E__ 2.71828182845904523536028747135266250BF16
#define __BFLT16_M_LN10__ 2.3025850929940456840179914546843642BF16
#define __BFLT16_M_LN2__ 0.693147180559945309417232121458176568BF16
#define __BFLT16_M_LOG10E__ 0.434294481903251827651128918916605082BF16
#define __BFLT16_M_LOG2E__ 1.44269504088896340735992468100189214BF16
#define __BFLT16_M_PI_2__ 1.57079632679489661923132169163975144BF16
#define __BFLT16_M_PI_4__ 0.785398163397448309615660845819875721BF16
#define __BFLT16_M_PI__ 3.14159265358979323846264338327950288BF16
#define __BFLT16_M_SQRT1_2__ 0.707106781186547524400844362104849039BF16
#define __BFLT16_M_SQRT2__ 1.41421356237309504880168872420969808BF16
#define __BFLT16_RADIX__ 2
#define __BIGGEST_ALIGNMENT__ 8
#define __BITINT_MAXWIDTH__ 128
#define __BOOL_WIDTH__ 8
#define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__
#define __CHAR16_TYPE__ unsigned short
#define __CHAR32_TYPE__ unsigned int
#define __CHAR_BIT__ 8
#define __CLANG_ATOMIC_BOOL_LOCK_FREE 1
#define __CLANG_ATOMIC_CHAR16_T_LOCK_FREE 1
#define __CLANG_ATOMIC_CHAR32_T_LOCK_FREE 1
#define __CLANG_ATOMIC_CHAR_LOCK_FREE 1
#define __CLANG_ATOMIC_INT_LOCK_FREE 1
#define __CLANG_ATOMIC_LLONG_LOCK_FREE 1
#define __CLANG_ATOMIC_LONG_LOCK_FREE 1
#define __CLANG_ATOMIC_POINTER_LOCK_FREE 1
#define __CLANG_ATOMIC_SHORT_LOCK_FREE 1
#define __CLANG_ATOMIC_WCHAR_T_LOCK_FREE 1
#define __CLANG_CUDA_APPROX_TRANSCENDENTALS__ 1
#define __CONSTANT_CFSTRINGS__ 1
#define __DBL_DECIMAL_DIG__ 17
#define __DBL_DENORM_MIN__ 4.9406564584124654e-324
#define __DBL_DIG__ 15
#define __DBL_EPSILON__ 2.2204460492503131e-16
#define __DBL_HAS_DENORM__ 1
#define __DBL_HAS_INFINITY__ 1
#define __DBL_HAS_QUIET_NAN__ 1
#define __DBL_MANT_DIG__ 53
#define __DBL_MAX_10_EXP__ 308
#define __DBL_MAX_EXP__ 1024
#define __DBL_MAX__ 1.7976931348623157e+308
#define __DBL_MIN_10_EXP__ (-307)
#define __DBL_MIN_EXP__ (-1021)
#define __DBL_MIN__ 2.2250738585072014e-308
#define __DBL_M_1_PI__ 0.318309886183790671537767526745028724
#define __DBL_M_2_PI__ 0.636619772367581343075535053490057448
#define __DBL_M_2_SQRTPI__ 1.12837916709551257389615890312154517
#define __DBL_M_E__ 2.71828182845904523536028747135266250
#define __DBL_M_LN10__ 2.3025850929940456840179914546843642
#define __DBL_M_LN2__ 0.693147180559945309417232121458176568
#define __DBL_M_LOG10E__ 0.434294481903251827651128918916605082
#define __DBL_M_LOG2E__ 1.44269504088896340735992468100189214
#define __DBL_M_PI_2__ 1.57079632679489661923132169163975144
#define __DBL_M_PI_4__ 0.785398163397448309615660845819875721
#define __DBL_M_PI__ 3.14159265358979323846264338327950288
#define __DBL_M_SQRT1_2__ 0.707106781186547524400844362104849039
#define __DBL_M_SQRT2__ 1.41421356237309504880168872420969808
#define __DBL_RADIX__ 2
#define __DECIMAL_DIG__ __LDBL_DECIMAL_DIG__
#define __DYNAMIC__ 1
#define __ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__ 260000
#define __ENVIRONMENT_OS_VERSION_MIN_REQUIRED__ 260000
#define __FAST_MATH__ 1
#define __FINITE_MATH_ONLY__ 1
#define __FLT16_DECIMAL_DIG__ 5
#define __FLT16_DENORM_MIN__ 5.9604644775390625e-8F16
#define __FLT16_DIG__ 3
#define __FLT16_EPSILON__ 9.765625e-4F16
#define __FLT16_HAS_DENORM__ 1
#define __FLT16_HAS_INFINITY__ 1
#define __FLT16_HAS_QUIET_NAN__ 1
#define __FLT16_MANT_DIG__ 11
#define __FLT16_MAX_10_EXP__ 4
#define __FLT16_MAX_EXP__ 16
#define __FLT16_MAX__ 6.5504e+4F16
#define __FLT16_MIN_10_EXP__ (-4)
#define __FLT16_MIN_EXP__ (-13)
#define __FLT16_MIN__ 6.103515625e-5F16
#define __FLT16_M_1_PI__ 0.318309886183790671537767526745028724F16
#define __FLT16_M_2_PI__ 0.636619772367581343075535053490057448F16
#define __FLT16_M_2_SQRTPI__ 1.12837916709551257389615890312154517F16
#define __FLT16_M_E__ 2.71828182845904523536028747135266250F16
#define __FLT16_M_LN10__ 2.3025850929940456840179914546843642F16
#define __FLT16_M_LN2__ 0.693147180559945309417232121458176568F16
#define __FLT16_M_LOG10E__ 0.434294481903251827651128918916605082F16
#define __FLT16_M_LOG2E__ 1.44269504088896340735992468100189214F16
#define __FLT16_M_PI_2__ 1.57079632679489661923132169163975144F16
#define __FLT16_M_PI_4__ 0.785398163397448309615660845819875721F16
#define __FLT16_M_PI__ 3.14159265358979323846264338327950288F16
#define __FLT16_M_SQRT1_2__ 0.707106781186547524400844362104849039F16
#define __FLT16_M_SQRT2__ 1.41421356237309504880168872420969808F16
#define __FLT16_RADIX__ 2
#define __FLT_DECIMAL_DIG__ 9
#define __FLT_DENORM_MIN__ 1.40129846e-45F
#define __FLT_DIG__ 6
#define __FLT_EPSILON__ 1.19209290e-7F
#define __FLT_HAS_DENORM__ 1
#define __FLT_HAS_INFINITY__ 1
#define __FLT_HAS_QUIET_NAN__ 1
#define __FLT_MANT_DIG__ 24
#define __FLT_MAX_10_EXP__ 38
#define __FLT_MAX_EXP__ 128
#define __FLT_MAX__ 3.40282347e+38F
#define __FLT_MIN_10_EXP__ (-37)
#define __FLT_MIN_EXP__ (-125)
#define __FLT_MIN__ 1.17549435e-38F
#define __FLT_M_1_PI__ 0.318309886183790671537767526745028724F
#define __FLT_M_2_PI__ 0.636619772367581343075535053490057448F
#define __FLT_M_2_SQRTPI__ 1.12837916709551257389615890312154517F
#define __FLT_M_E__ 2.71828182845904523536028747135266250F
#define __FLT_M_LN10__ 2.3025850929940456840179914546843642F
#define __FLT_M_LN2__ 0.693147180559945309417232121458176568F
#define __FLT_M_LOG10E__ 0.434294481903251827651128918916605082F
#define __FLT_M_LOG2E__ 1.44269504088896340735992468100189214F
#define __FLT_M_PI_2__ 1.57079632679489661923132169163975144F
#define __FLT_M_PI_4__ 0.785398163397448309615660845819875721F
#define __FLT_M_PI__ 3.14159265358979323846264338327950288F
#define __FLT_M_SQRT1_2__ 0.707106781186547524400844362104849039F
#define __FLT_M_SQRT2__ 1.41421356237309504880168872420969808F
#define __FLT_RADIX__ 2
#define __FP_ILOGB0__ (-__INT_MAX__ - 1)
#define __FP_ILOGBNAN__ (-__INT_MAX__ - 1)
#define __GCC_ATOMIC_BOOL_LOCK_FREE 1
#define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 1
#define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 1
#define __GCC_ATOMIC_CHAR_LOCK_FREE 1
#define __GCC_ATOMIC_INT_LOCK_FREE 1
#define __GCC_ATOMIC_LLONG_LOCK_FREE 1
#define __GCC_ATOMIC_LONG_LOCK_FREE 1
#define __GCC_ATOMIC_POINTER_LOCK_FREE 1
#define __GCC_ATOMIC_SHORT_LOCK_FREE 1
#define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1
#define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 1
#define __GNUC_GNU_INLINE__ 1
#define __GNUC_MINOR__ 2
#define __GNUC_PATCHLEVEL__ 1
#define __GNUC__ 4
#define __GNUG__ 4
#define __GXX_ABI_VERSION 1002
#define __GXX_EXPERIMENTAL_CXX0X__ 1
#define __GXX_RTTI 1
#define __GXX_WEAK__ 1
#define __HALF_DECIMAL_DIG__ 5
#define __HALF_DENORM_MIN__ 5.9604644775390625e-8H
#define __HALF_DIG__ 3
#define __HALF_EPSILON__ 9.765625e-4H
#define __HALF_HAS_DENORM__ 1
#define __HALF_HAS_INFINITY__ 1
#define __HALF_HAS_QUIET_NAN__ 1
#define __HALF_MANT_DIG__ 11
#define __HALF_MAX_10_EXP__ 4
#define __HALF_MAX_EXP__ 16
#define __HALF_MAX__ 6.5504e+4H
#define __HALF_MIN_10_EXP__ (-4)
#define __HALF_MIN_EXP__ (-13)
#define __HALF_MIN__ 6.103515625e-5H
#define __HALF_M_1_PI__ 0.318309886183790671537767526745028724H
#define __HALF_M_2_PI__ 0.636619772367581343075535053490057448H
#define __HALF_M_2_SQRTPI__ 1.12837916709551257389615890312154517H
#define __HALF_M_E__ 2.71828182845904523536028747135266250H
#define __HALF_M_LN10__ 2.3025850929940456840179914546843642H
#define __HALF_M_LN2__ 0.693147180559945309417232121458176568H
#define __HALF_M_LOG10E__ 0.434294481903251827651128918916605082H
#define __HALF_M_LOG2E__ 1.44269504088896340735992468100189214H
#define __HALF_M_PI_2__ 1.57079632679489661923132169163975144H
#define __HALF_M_PI_4__ 0.785398163397448309615660845819875721H
#define __HALF_M_PI__ 3.14159265358979323846264338327950288H
#define __HALF_M_SQRT1_2__ 0.707106781186547524400844362104849039H
#define __HALF_M_SQRT2__ 1.41421356237309504880168872420969808H
#define __HALF_RADIX__ 2
#define __INT128_C_SUFFIX__ LL
#define __INT128_FMTd__ "lld"
#define __INT128_FMTi__ "lli"
#define __INT128_MAX__ 170141183460469231731687303715884105727LL
#define __INT128_TYPE__ long long int
#define __INT16_C_SUFFIX__ 
#define __INT16_FMTd__ "hd"
#define __INT16_FMTi__ "hi"
#define __INT16_MAX__ 32767
#define __INT16_TYPE__ short
#define __INT32_C_SUFFIX__ 
#define __INT32_FMTd__ "d"
#define __INT32_FMTi__ "i"
#define __INT32_MAX__ 2147483647
#define __INT32_TYPE__ int
#define __INT64_C_SUFFIX__ L
#define __INT64_FMTd__ "ld"
#define __INT64_FMTi__ "li"
#define __INT64_MAX__ 9223372036854775807L
#define __INT64_TYPE__ long int
#define __INT8_C_SUFFIX__ 
#define __INT8_FMTd__ "hhd"
#define __INT8_FMTi__ "hhi"
#define __INT8_MAX__ 127
#define __INT8_TYPE__ signed char
#define __INTMAX_C_SUFFIX__ LL
#define __INTMAX_FMTd__ "lld"
#define __INTMAX_FMTi__ "lli"
#define __INTMAX_MAX__ 170141183460469231731687303715884105727LL
#define __INTMAX_TYPE__ long long int
#define __INTMAX_WIDTH__ 128
#define __INTPTR_FMTd__ "ld"
#define __INTPTR_FMTi__ "li"
#define __INTPTR_MAX__ 9223372036854775807L
#define __INTPTR_TYPE__ long int
#define __INTPTR_WIDTH__ 64
#define __INT_FAST16_FMTd__ "hd"
#define __INT_FAST16_FMTi__ "hi"
#define __INT_FAST16_MAX__ 32767
#define __INT_FAST16_TYPE__ short
#define __INT_FAST16_WIDTH__ 16
#define __INT_FAST32_FMTd__ "d"
#define __INT_FAST32_FMTi__ "i"
#define __INT_FAST32_MAX__ 2147483647
#define __INT_FAST32_TYPE__ int
#define __INT_FAST32_WIDTH__ 32
#define __INT_FAST64_FMTd__ "lld"
#define __INT_FAST64_FMTi__ "lli"
#define __INT_FAST64_MAX__ 170141183460469231731687303715884105727LL
#define __INT_FAST64_TYPE__ long long int
#define __INT_FAST64_WIDTH__ 128
#define __INT_FAST8_FMTd__ "hhd"
#define __INT_FAST8_FMTi__ "hhi"
#define __INT_FAST8_MAX__ 127
#define __INT_FAST8_TYPE__ signed char
#define __INT_FAST8_WIDTH__ 8
#define __INT_LEAST16_FMTd__ "hd"
#define __INT_LEAST16_FMTi__ "hi"
#define __INT_LEAST16_MAX__ 32767
#define __INT_LEAST16_TYPE__ short
#define __INT_LEAST16_WIDTH__ 16
#define __INT_LEAST32_FMTd__ "d"
#define __INT_LEAST32_FMTi__ "i"
#define __INT_LEAST32_MAX__ 2147483647
#define __INT_LEAST32_TYPE__ int
#define __INT_LEAST32_WIDTH__ 32
#define __INT_LEAST64_FMTd__ "lld"
#define __INT_LEAST64_FMTi__ "lli"
#define __INT_LEAST64_MAX__ 170141183460469231731687303715884105727LL
#define __INT_LEAST64_TYPE__ long long int
#define __INT_LEAST64_WIDTH__ 128
#define __INT_LEAST8_FMTd__ "hhd"
#define __INT_LEAST8_FMTi__ "hhi"
#define __INT_LEAST8_MAX__ 127
#define __INT_LEAST8_TYPE__ signed char
#define __INT_LEAST8_WIDTH__ 8
#define __INT_MAX__ 2147483647
#define __INT_WIDTH__ 32
#define __LDBL_DECIMAL_DIG__ 17
#define __LDBL_DENORM_MIN__ 4.9406564584124654e-324L
#define __LDBL_DIG__ 15
#define __LDBL_EPSILON__ 2.2204460492503131e-16L
#define __LDBL_HAS_DENORM__ 1
#define __LDBL_HAS_INFINITY__ 1
#define __LDBL_HAS_QUIET_NAN__ 1
#define __LDBL_MANT_DIG__ 53
#define __LDBL_MAX_10_EXP__ 308
#define __LDBL_MAX_EXP__ 1024
#define __LDBL_MAX__ 1.7976931348623157e+308L
#define __LDBL_MIN_10_EXP__ (-307)
#define __LDBL_MIN_EXP__ (-1021)
#define __LDBL_MIN__ 2.2250738585072014e-308L
#define __LDBL_M_1_PI__ 0.318309886183790671537767526745028724L
#define __LDBL_M_2_PI__ 0.636619772367581343075535053490057448L
#define __LDBL_M_2_SQRTPI__ 1.12837916709551257389615890312154517L
#define __LDBL_M_E__ 2.71828182845904523536028747135266250L
#define __LDBL_M_LN10__ 2.3025850929940456840179914546843642L
#define __LDBL_M_LN2__ 0.693147180559945309417232121458176568L
#define __LDBL_M_LOG10E__ 0.434294481903251827651128918916605082L
#define __LDBL_M_LOG2E__ 1.44269504088896340735992468100189214L
#define __LDBL_M_PI_2__ 1.57079632679489661923132169163975144L
#define __LDBL_M_PI_4__ 0.785398163397448309615660845819875721L
#define __LDBL_M_PI__ 3.14159265358979323846264338327950288L
#define __LDBL_M_SQRT1_2__ 0.707106781186547524400844362104849039L
#define __LDBL_M_SQRT2__ 1.41421356237309504880168872420969808L
#define __LITTLE_ENDIAN__ 1
#define __LLONG_WIDTH__ 128
#define __LONG_LONG_MAX__ 170141183460469231731687303715884105727LL
#define __LONG_MAX__ 9223372036854775807L
#define __LONG_WIDTH__ 64
#define __LP64__ 1
#define __MACH__ 1
#define __MAX_BUFFERS__ 31u
#define __MAX_COLORS__ 8u
#define __MAX_CONSTANT_BUFFERS__ 31u
#define __MAX_CONSTANT_BUFFER_SIZE__ 4294967295u
#define __MAX_FUNCTION_CONSTANT_INDEX__ 65535
#define __MAX_PATCH_DATA_ATTRIBUTES__ 29u
#define __MAX_READ_WRITE_TEXTURES__ 8u
#define __MAX_SAMPLERS__ 16u
#define __MAX_TESSELLATION_CONTROL_POINTS__ 32u
#define __MAX_TEXTURES__ 128u
#define __MAX_THREADGROUP_BUFFERS__ 31u
#define __METAL_ACCESS_READ_WRITE__ 3
#define __METAL_ACCESS_READ__ 1
#define __METAL_ACCESS_SAMPLE__ 0
#define __METAL_ACCESS_WRITE__ 2
#define __METAL_ADDRESS_CLAMP_TO_BORDER__ 4
#define __METAL_ADDRESS_CLAMP_TO_EDGE__ 1
#define __METAL_ADDRESS_CLAMP_TO_ZERO__ 0
#define __METAL_ADDRESS_MIRRORED_REPEAT__ 3
#define __METAL_ADDRESS_REPEAT__ 2
#define __METAL_BORDER_COLOR_OPAQUE_BLACK__ 1
#define __METAL_BORDER_COLOR_OPAQUE_WHITE__ 2
#define __METAL_BORDER_COLOR_TRANSPARENT_BLACK__ 0
#define __METAL_COHERENCE_DEVICE__ 1
#define __METAL_COHERENCE_THREADGROUP__ 0
#define __METAL_COMPARE_FUNC_ALWAYS__ 7
#define __METAL_COMPARE_FUNC_EQUAL__ 5
#define __METAL_COMPARE_FUNC_GREATER_EQUAL__ 4
#define __METAL_COMPARE_FUNC_GREATER__ 3
#define __METAL_COMPARE_FUNC_LESS_EQUAL__ 2
#define __METAL_COMPARE_FUNC_LESS__ 1
#define __METAL_COMPARE_FUNC_NEVER__ 8
#define __METAL_COMPARE_FUNC_NONE__ 0
#define __METAL_COMPARE_FUNC_NOT_EQUAL__ 6
#define __METAL_COORD_NORMALIZED__ 0
#define __METAL_COORD_PIXEL__ 1
#define __METAL_CULL_MODE_BACK__ 2
#define __METAL_CULL_MODE_FRONT__ 1
#define __METAL_CULL_MODE_NONE__ 0
#define __METAL_DEPTH_CLIP_MODE_CLAMP__ 1
#define __METAL_DEPTH_CLIP_MODE_CLIP__ 0
#define __METAL_FAST_MATH__ 0
#define __METAL_FILTER_BICUBIC__ 2
#define __METAL_FILTER_LINEAR__ 1
#define __METAL_FILTER_NEAREST__ 0
#define __METAL_HALF_MATH__ 1
#define __METAL_MAG_FILTER_BICUBIC__ 2
#define __METAL_MAG_FILTER_LINEAR__ 1
#define __METAL_MAG_FILTER_NEAREST__ 0
#define __METAL_MATH_FP32_FUNCTIONS_FAST__ 1
#define __METAL_MEMORY_FLAGS_DEVICE__ 1
#define __METAL_MEMORY_FLAGS_NONE__ 0
#define __METAL_MEMORY_FLAGS_OBJECT_DATA__ 16
#define __METAL_MEMORY_FLAGS_TEXTURE__ 4
#define __METAL_MEMORY_FLAGS_THREADGROUP_IMAGEBLOCK__ 8
#define __METAL_MEMORY_FLAGS_THREADGROUP__ 2
#define __METAL_MEMORY_ORDER_ACQUIRE__ 2
#define __METAL_MEMORY_ORDER_ACQ_REL__ 4
#define __METAL_MEMORY_ORDER_RELAXED__ 0
#define __METAL_MEMORY_ORDER_RELEASE__ 3
#define __METAL_MEMORY_ORDER_SEQ_CST__ 5
#define __METAL_MEMORY_SCOPE_DEVICE__ 2
#define __METAL_MEMORY_SCOPE_SIMDGROUP__ 4
#define __METAL_MEMORY_SCOPE_THREADGROUP__ 1
#define __METAL_MEMORY_SCOPE_THREAD__ 0
#define __METAL_MIN_FILTER_BICUBIC__ 2
#define __METAL_MIN_FILTER_LINEAR__ 1
#define __METAL_MIN_FILTER_NEAREST__ 0
#define __METAL_MIP_FILTER_LINEAR__ 2
#define __METAL_MIP_FILTER_NEAREST__ 1
#define __METAL_MIP_FILTER_NONE__ 0
#define __METAL_NATIVE_MATH__ 2
#define __METAL_OS_LOG_TYPE_DEBUG__ 0x02
#define __METAL_OS_LOG_TYPE_DEFAULT__ 0x00
#define __METAL_OS_LOG_TYPE_ERROR__ 0x10
#define __METAL_OS_LOG_TYPE_FAULT__ 0x11
#define __METAL_OS_LOG_TYPE_INFO__ 0x01
#define __METAL_PRECISE_MATH__ 3
#define __METAL_PRIMITIVE_TYPE_LINE_STRIP__ 2
#define __METAL_PRIMITIVE_TYPE_LINE__ 1
#define __METAL_PRIMITIVE_TYPE_POINT__ 0
#define __METAL_PRIMITIVE_TYPE_TRIANGLE_STRIP__ 4
#define __METAL_PRIMITIVE_TYPE_TRIANGLE__ 3
#define __METAL_RAYTRACING_CURVE_BASIS_ALL__ 0xFFFFFFFF
#define __METAL_RAYTRACING_CURVE_BASIS_BEZIER__ 3
#define __METAL_RAYTRACING_CURVE_BASIS_BSPLINE__ 0
#define __METAL_RAYTRACING_CURVE_BASIS_CATMULL_ROM__ 1
#define __METAL_RAYTRACING_CURVE_BASIS_LINEAR__ 2
#define __METAL_RAYTRACING_CURVE_TYPE_ALL__ 0xFFFFFFFF
#define __METAL_RAYTRACING_CURVE_TYPE_FLAT__ 1
#define __METAL_RAYTRACING_CURVE_TYPE_ROUND__ 0
#define __METAL_RAYTRACING_FORCED_OPACITY_NONE__ 0
#define __METAL_RAYTRACING_FORCED_OPACITY_NON_OPAQUE__ 2
#define __METAL_RAYTRACING_FORCED_OPACITY_OPAQUE__ 1
#define __METAL_RAYTRACING_GEOMETRY_CULL_MODE_BOUNDING_BOX__ 2
#define __METAL_RAYTRACING_GEOMETRY_CULL_MODE_CURVE__ 4
#define __METAL_RAYTRACING_GEOMETRY_CULL_MODE_NONE__ 0
#define __METAL_RAYTRACING_GEOMETRY_CULL_MODE_TRIANGLE__ 1
#define __METAL_RAYTRACING_GEOMETRY_TYPE_BOUNDING_BOX__ 2
#define __METAL_RAYTRACING_GEOMETRY_TYPE_CURVE__ 4
#define __METAL_RAYTRACING_GEOMETRY_TYPE_NONE__ 0
#define __METAL_RAYTRACING_GEOMETRY_TYPE_TRIANGLE__ 1
#define __METAL_RAYTRACING_INTERSECTION_TAG_CURVE_DATA__ 16
#define __METAL_RAYTRACING_INTERSECTION_TAG_EXTENDED_LIMITS__ 512
#define __METAL_RAYTRACING_INTERSECTION_TAG_INSTANCE_MOTION__ 256
#define __METAL_RAYTRACING_INTERSECTION_TAG_INSTANCING__ 2
#define __METAL_RAYTRACING_INTERSECTION_TAG_INTERSECTION_FUNCTION_BUFFER__ 1
#define __METAL_RAYTRACING_INTERSECTION_TAG_MAX_LEVELS__ 4
#define __METAL_RAYTRACING_INTERSECTION_TAG_NONE__ 0
#define __METAL_RAYTRACING_INTERSECTION_TAG_PRIMITIVE_MOTION__ 128
#define __METAL_RAYTRACING_INTERSECTION_TAG_TRIANGLE_DATA__ 8
#define __METAL_RAYTRACING_INTERSECTION_TAG_USER_DATA__ 64
#define __METAL_RAYTRACING_INTERSECTION_TAG_WORLD_SPACE_DATA__ 32
#define __METAL_RAYTRACING_INTERSECTION_TYPE_BOUNDING_BOX__ 2
#define __METAL_RAYTRACING_INTERSECTION_TYPE_CURVE__ 3
#define __METAL_RAYTRACING_INTERSECTION_TYPE_NONE__ 0
#define __METAL_RAYTRACING_INTERSECTION_TYPE_TRIANGLE__ 1
#define __METAL_RAYTRACING_OPACITY_CULL_MODE_NONE__ 0
#define __METAL_RAYTRACING_OPACITY_CULL_MODE_NON_OPAQUE__ 2
#define __METAL_RAYTRACING_OPACITY_CULL_MODE_OPAQUE__ 1
#define __METAL_RAYTRACING_TRIANGLE_CULL_MODE_BACK__ 2
#define __METAL_RAYTRACING_TRIANGLE_CULL_MODE_FRONT__ 1
#define __METAL_RAYTRACING_TRIANGLE_CULL_MODE_NONE__ 0
#define __METAL_REDUCTION_MAXIMUM__ 2
#define __METAL_REDUCTION_MINIMUM__ 1
#define __METAL_REDUCTION_WEIGHTED_AVERAGE__ 0
#define __METAL_ROUNDING_RTE__ 0
#define __METAL_ROUNDING_RTNA__ 4
#define __METAL_ROUNDING_RTN__ 3
#define __METAL_ROUNDING_RTP__ 2
#define __METAL_ROUNDING_RTZ__ 1
#define __METAL_SIMDGROUP_LOAD_STORE_BOUNDS_CHECK_BOTH__ 3
#define __METAL_SIMDGROUP_LOAD_STORE_BOUNDS_CHECK_COLS__ 1
#define __METAL_SIMDGROUP_LOAD_STORE_BOUNDS_CHECK_NONE__ 0
#define __METAL_SIMDGROUP_LOAD_STORE_BOUNDS_CHECK_ROWS__ 2
#define __METAL_TEXTURE_WRITE_ROUNDING_MODE_NATIVE__ 0
#define __METAL_TEXTURE_WRITE_ROUNDING_MODE_RTE__ 1
#define __METAL_TEXTURE_WRITE_ROUNDING_MODE_RTZ__ 2
#define __METAL_TEXTURE_WRITE_ROUNDING_MODE__ 0
#define __METAL_TOPOLOGY_LINE__ 1
#define __METAL_TOPOLOGY_POINT__ 0
#define __METAL_TOPOLOGY_TRIANGLE__ 2
#define __METAL_TRIANGLE_FILL_MODE_FILL__ 0
#define __METAL_TRIANGLE_FILL_MODE_LINES__ 1
#define __METAL_VERSION__ 400
#define __METAL_VERTEX_INDEX_FIRST__ 0
#define __METAL_VERTEX_INDEX_SECOND__ 1
#define __METAL_VERTEX_INDEX_THIRD__ 2
#define __METAL_VOTE_T__ long unsigned int
#define __METAL_WINDING_CLOCKWISE__ 0
#define __METAL_WINDING_COUNTERCLOCKWISE__ 1
#define __METAL__ 1
#define __NO_MATH_ERRNO__ 1
#define __OBJC_BOOL_IS_BOOL 0
#define __OPENCL_MEMORY_SCOPE_ALL_SVM_DEVICES 3
#define __OPENCL_MEMORY_SCOPE_DEVICE 2
#define __OPENCL_MEMORY_SCOPE_SUB_GROUP 4
#define __OPENCL_MEMORY_SCOPE_WORK_GROUP 1
#define __OPENCL_MEMORY_SCOPE_WORK_ITEM 0
#define __OPTIMIZE__ 1
#define __ORDER_BIG_ENDIAN__ 4321
#define __ORDER_LITTLE_ENDIAN__ 1234
#define __ORDER_PDP_ENDIAN__ 3412
#define __POINTER_WIDTH__ 64
#define __PRAGMA_REDEFINE_EXTNAME 1
#define __PTRDIFF_FMTd__ "ld"
#define __PTRDIFF_FMTi__ "li"
#define __PTRDIFF_MAX__ 9223372036854775807L
#define __PTRDIFF_TYPE__ long int
#define __PTRDIFF_WIDTH__ 64
#define __SCHAR_MAX__ 127
#define __SHRT_MAX__ 32767
#define __SHRT_WIDTH__ 16
#define __SIG_ATOMIC_MAX__ 2147483647
#define __SIG_ATOMIC_WIDTH__ 32
#define __SIZEOF_DOUBLE__ 8
#define __SIZEOF_FLOAT__ 4
#define __SIZEOF_INT128__ 16
#define __SIZEOF_INT__ 4
#define __SIZEOF_LONG_DOUBLE__ 8
#define __SIZEOF_LONG_LONG__ 16
#define __SIZEOF_LONG__ 8
#define __SIZEOF_POINTER__ 8
#define __SIZEOF_PTRDIFF_T__ 8
#define __SIZEOF_SHORT__ 2
#define __SIZEOF_SIZE_T__ 8
#define __SIZEOF_WCHAR_T__ 4
#define __SIZEOF_WINT_T__ 4
#define __SIZE_FMTX__ "lX"
#define __SIZE_FMTo__ "lo"
#define __SIZE_FMTu__ "lu"
#define __SIZE_FMTx__ "lx"
#define __SIZE_MAX__ 18446744073709551615UL
#define __SIZE_TYPE__ long unsigned int
#define __SIZE_WIDTH__ 64
#define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16UL
#define __STDCPP_THREADS__ 1
#define __STDC_HOSTED__ 1
#define __STDC_NO_THREADS__ 1
#define __STDC_UTF_16__ 1
#define __STDC_UTF_32__ 1
#define __STDC__ 1
#define __STRICT_ANSI__ 1
#define __UINT128_C_SUFFIX__ ULL
#define __UINT128_FMTX__ "llX"
#define __UINT128_FMTo__ "llo"
#define __UINT128_FMTu__ "llu"
#define __UINT128_FMTx__ "llx"
#define __UINT128_MAX__ 340282366920938463463374607431768211455ULL
#define __UINT128_TYPE__ long long unsigned int
#define __UINT16_C_SUFFIX__ 
#define __UINT16_FMTX__ "hX"
#define __UINT16_FMTo__ "ho"
#define __UINT16_FMTu__ "hu"
#define __UINT16_FMTx__ "hx"
#define __UINT16_MAX__ 65535
#define __UINT16_TYPE__ unsigned short
#define __UINT32_C_SUFFIX__ U
#define __UINT32_FMTX__ "X"
#define __UINT32_FMTo__ "o"
#define __UINT32_FMTu__ "u"
#define __UINT32_FMTx__ "x"
#define __UINT32_MAX__ 4294967295U
#define __UINT32_TYPE__ unsigned int
#define __UINT64_C_SUFFIX__ UL
#define __UINT64_FMTX__ "lX"
#define __UINT64_FMTo__ "lo"
#define __UINT64_FMTu__ "lu"
#define __UINT64_FMTx__ "lx"
#define __UINT64_MAX__ 18446744073709551615UL
#define __UINT64_TYPE__ long unsigned int
#define __UINT8_C_SUFFIX__ 
#define __UINT8_FMTX__ "hhX"
#define __UINT8_FMTo__ "hho"
#define __UINT8_FMTu__ "hhu"
#define __UINT8_FMTx__ "hhx"
#define __UINT8_MAX__ 255
#define __UINT8_TYPE__ unsigned char
#define __UINTMAX_C_SUFFIX__ ULL
#define __UINTMAX_FMTX__ "llX"
#define __UINTMAX_FMTo__ "llo"
#define __UINTMAX_FMTu__ "llu"
#define __UINTMAX_FMTx__ "llx"
#define __UINTMAX_MAX__ 340282366920938463463374607431768211455ULL
#define __UINTMAX_TYPE__ long long unsigned int
#define __UINTMAX_WIDTH__ 128
#define __UINTPTR_FMTX__ "lX"
#define __UINTPTR_FMTo__ "lo"
#define __UINTPTR_FMTu__ "lu"
#define __UINTPTR_FMTx__ "lx"
#define __UINTPTR_MAX__ 18446744073709551615UL
#define __UINTPTR_TYPE__ long unsigned int
#define __UINTPTR_WIDTH__ 64
#define __UINT_FAST16_FMTX__ "hX"
#define __UINT_FAST16_FMTo__ "ho"
#define __UINT_FAST16_FMTu__ "hu"
#define __UINT_FAST16_FMTx__ "hx"
#define __UINT_FAST16_MAX__ 65535
#define __UINT_FAST16_TYPE__ unsigned short
#define __UINT_FAST32_FMTX__ "X"
#define __UINT_FAST32_FMTo__ "o"
#define __UINT_FAST32_FMTu__ "u"
#define __UINT_FAST32_FMTx__ "x"
#define __UINT_FAST32_MAX__ 4294967295U
#define __UINT_FAST32_TYPE__ unsigned int
#define __UINT_FAST64_FMTX__ "llX"
#define __UINT_FAST64_FMTo__ "llo"
#define __UINT_FAST64_FMTu__ "llu"
#define __UINT_FAST64_FMTx__ "llx"
#define __UINT_FAST64_MAX__ 340282366920938463463374607431768211455ULL
#define __UINT_FAST64_TYPE__ long long unsigned int
#define __UINT_FAST8_FMTX__ "hhX"
#define __UINT_FAST8_FMTo__ "hho"
#define __UINT_FAST8_FMTu__ "hhu"
#define __UINT_FAST8_FMTx__ "hhx"
#define __UINT_FAST8_MAX__ 255
#define __UINT_FAST8_TYPE__ unsigned char
#define __UINT_LEAST16_FMTX__ "hX"
#define __UINT_LEAST16_FMTo__ "ho"
#define __UINT_LEAST16_FMTu__ "hu"
#define __UINT_LEAST16_FMTx__ "hx"
#define __UINT_LEAST16_MAX__ 65535
#define __UINT_LEAST16_TYPE__ unsigned short
#define __UINT_LEAST32_FMTX__ "X"
#define __UINT_LEAST32_FMTo__ "o"
#define __UINT_LEAST32_FMTu__ "u"
#define __UINT_LEAST32_FMTx__ "x"
#define __UINT_LEAST32_MAX__ 4294967295U
#define __UINT_LEAST32_TYPE__ unsigned int
#define __UINT_LEAST64_FMTX__ "llX"
#define __UINT_LEAST64_FMTo__ "llo"
#define __UINT_LEAST64_FMTu__ "llu"
#define __UINT_LEAST64_FMTx__ "llx"
#define __UINT_LEAST64_MAX__ 340282366920938463463374607431768211455ULL
#define __UINT_LEAST64_TYPE__ long long unsigned int
#define __UINT_LEAST8_FMTX__ "hhX"
#define __UINT_LEAST8_FMTo__ "hho"
#define __UINT_LEAST8_FMTu__ "hhu"
#define __UINT_LEAST8_FMTx__ "hhx"
#define __UINT_LEAST8_MAX__ 255
#define __UINT_LEAST8_TYPE__ unsigned char
#define __USER_LABEL_PREFIX__ 
#define __VERSION__ "Apple Metal 32023.830 (metalfe-32023.830.2)"
#define __WCHAR_MAX__ 2147483647
#define __WCHAR_TYPE__ int
#define __WCHAR_WIDTH__ 32
#define __WINT_MAX__ 2147483647
#define __WINT_TYPE__ int
#define __WINT_WIDTH__ 32
#define __clang__ 1
#define __clang_literal_encoding__ "UTF-8"
#define __clang_major__ 32023
#define __clang_minor__ 830
#define __clang_patchlevel__ 2
#define __clang_version__ "32023.830 (metalfe-32023.830.2)"
#define __clang_wide_literal_encoding__ "UTF-32"
#define __cplusplus 201703L
#define __cpp_aggregate_bases 201603L
#define __cpp_aggregate_nsdmi 201304L
#define __cpp_alias_templates 200704L
#define __cpp_attributes 200809L
#define __cpp_binary_literals 201304L
#define __cpp_capture_star_this 201603L
#define __cpp_constexpr 201603L
#define __cpp_constexpr_in_decltype 201711L
#define __cpp_decltype 200707L
#define __cpp_decltype_auto 201304L
#define __cpp_deduction_guides 201703L
#define __cpp_delegating_constructors 200604L
#define __cpp_digit_separators 201309L
#define __cpp_enumerator_attributes 201411L
#define __cpp_fold_expressions 201603L
#define __cpp_generic_lambdas 201304L
#define __cpp_guaranteed_copy_elision 201606L
#define __cpp_hex_float 201603L
#define __cpp_if_constexpr 201606L
#define __cpp_impl_destroying_delete 201806L
#define __cpp_inheriting_constructors 201511L
#define __cpp_init_captures 201304L
#define __cpp_initializer_lists 200806L
#define __cpp_inline_variables 201606L
#define __cpp_lambdas 200907L
#define __cpp_namespace_attributes 201411L
#define __cpp_nested_namespace_definitions 201411L
#define __cpp_noexcept_function_type 201510L
#define __cpp_nontype_template_args 201411L
#define __cpp_nontype_template_parameter_auto 201606L
#define __cpp_nsdmi 200809L
#define __cpp_range_based_for 201603L
#define __cpp_raw_strings 200710L
#define __cpp_ref_qualifiers 200710L
#define __cpp_return_type_deduction 201304L
#define __cpp_rtti 199711L
#define __cpp_rvalue_references 200610L
#define __cpp_static_assert 201411L
#define __cpp_structured_bindings 201606L
#define __cpp_template_auto 201606L
#define __cpp_threadsafe_static_init 200806L
#define __cpp_unicode_characters 200704L
#define __cpp_unicode_literals 200710L
#define __cpp_user_defined_literals 200809L
#define __cpp_variable_templates 201304L
#define __cpp_variadic_templates 200704L
#define __cpp_variadic_using 201611L
#define __llvm__ 1
#define __nonnull _Nonnull
#define __null_unspecified _Null_unspecified
#define __nullable _Nullable
#define __private_extern__ extern
#define __strong 
#define __unsafe_unretained 
#define __weak __attribute__((objc_gc(weak)))

@ggerganov
Copy link
Member Author

Thanks. Does this command produce any output on your end:

cat /System/Library/Frameworks/MetalPerformancePrimitives.framework/Versions/Current/Headers/__impl/MPPTensorOpsMatMul2dImpl.h  | grep bfloat

In any case, the latest branch should work for you. The expectation is that the performance should be comparable with master (i.e. because M2 does not have Neural Accelerators).

@ngladitz
Copy link

Yes, thank you the latest branch does build and run with the now presumably expected:

ggml_metal_device_init: disabling bfloat support as a workaround for tensor API incompatibility
cat /System/Library/Frameworks/MetalPerformancePrimitives.framework/Versions/Current/Headers/__impl/MPPTensorOpsMatMul2dImpl.h  | grep bfloat

This produces no output. (Quick glance at the file I think I see e.g. half but no bfloat)

@ggerganov
Copy link
Member Author

Yup, my version has bfloat as well and also I see this in the public header comments:

head -n 50 /System/Library/Frameworks/MetalPerformancePrimitives.framework/Versions/A/Headers/MPPTensorOpsMatMul2d.h 

// -*- Metal -*-
//===-- MetalTensorOpsMatMul2d
//------------------------------------------------------===//
// Copyright (c) 2025 Apple Inc. All rights reserved
//===----------------------------------------------------------------------===//
// This API performs generalized matrix multiplication operation
//             C = A*B + C;
// A and B can be tensor_handle, tensor_offset, and tensor_inline.
// C can be tensor_handle, tensor_offset, tensor_inline or cooperative_tensor.
// Data type combinations supported by this operation are as follows:
//
//  A          B         C
//  ---------------------------
//  half       half      half
//  half       int8_t    half
//  int8_t     half      half
//  half       half      float
//  half       float     float
//  half       int8_t    float
//  float      half      float
//  float      float     float
//  float      int8_t    float
//  int8_t     half      float
//  int8_t     float     float
//  int8_t     int8_t    int32_t
//  bfloat     bfloat    bfloat
//  bfloat     bfloat    float
//  bfloat     float     float
//  bfloat     int8_t    bfloat
//  bfloat     int8_t    float
//  float      bfloat    float
//  int8_t     bfloat    bfloat
//  int8_t     bfloat    float
//  bfloat     half      bfloat
//  bfloat     half      half
//  bfloat     half      float
//  half       bfloat    bfloat
//  half       bfloat    half
//  half       bfloat    float
//
// Basic usage is in the following example which takes M x K matrix A of type
// half, K x N matrix B of type half, both in device memory and produces M x N
// matrix C of type float in device memory. It tiles this matrix multiplication
// in thread groups, where each thread group computes a 64 x 32 tile of output
// but multiplying 64 x K tile of A with K x 32 tile of B. This compute kernel
// will be launched with dispatch grid of
//
//        MTLSize threadgroups = MTLSizeMake((M + 63)/64, (N + 31)/32, 1);
//

So bfloat should be supported, but I think some versions of MPP do not have it yet.

Indeed I am working on the MacOS Developer Beta with Tahoe Beta 26.1, so probably this explains it.

@mweinbach
Copy link

mweinbach commented Oct 25, 2025

Just ran it on M5, this is what I got. It does seem to be working!

This branch:
Loading model...
Loaded model mistral-7b-v0.1.Q4_0.gguf

Running benchmark...
Model info: llama 7B Q4_0
Heat up time: 0.313745292 seconds, please wait...

model size params backend test t/s
llama 7B Q4_0 3.83 GiB 7.24 B Metal pp 512 608.05 ± 8.37
llama 7B Q4_0 3.83 GiB 7.24 B Metal tg 128 26.59 ± 0.88

Loading model...
Loaded model phi-2-q4_0.gguf

Running benchmark...
Model info: phi2 3B Q4_0
Heat up time: 0.314252375 seconds, please wait...

model size params backend test t/s
phi2 3B Q4_0 1.49 GiB 2.78 B Metal pp 512 1481.38 ± 62.43
phi2 3B Q4_0 1.49 GiB 2.78 B Metal tg 128 64.11 ± 2.11

Loading model...
Loaded model tinyllama-1.1b-1t-openorca.Q4_0.gguf

Running benchmark...
Model info: llama 1B Q4_0
Heat up time: 0.142863208 seconds, please wait...

model size params backend test t/s
llama 1B Q4_0 0.59 GiB 1.10 B Metal pp 512 3959.13 ± 869.34
llama 1B Q4_0 0.59 GiB 1.10 B Metal tg 128 149.69 ± 0.12

Loading model...
Loaded model tinyllama-1.1b-f16.gguf

Running benchmark...
Model info: llama 1B F16
Heat up time: 0.156012792 seconds, please wait...

model size params backend test t/s
llama 1B F16 2.05 GiB 1.10 B Metal pp 512 2304.44 ± 460.68
llama 1B F16 2.05 GiB 1.10 B Metal tg 128 56.34 ± 2.44

Main:
Load a model from the list below
Loading model...
Loaded model mistral-7b-v0.1.Q4_0.gguf

Running benchmark...
Model info: llama 7B Q4_0
Heat up time: 0.363522542 seconds, please wait...

model size params backend test t/s
llama 7B Q4_0 3.83 GiB 7.24 B Metal pp 512 252.82 ± 8.67
llama 7B Q4_0 3.83 GiB 7.24 B Metal tg 128 27.55 ± 0.83

Loading model...
Loaded model phi-2-q4_0.gguf

Running benchmark...
Model info: phi2 3B Q4_0
Heat up time: 0.226215583 seconds, please wait...

model size params backend test t/s
phi2 3B Q4_0 1.49 GiB 2.78 B Metal pp 512 662.44 ± 1.98
phi2 3B Q4_0 1.49 GiB 2.78 B Metal tg 128 66.10 ± 0.73

Loading model...
Loaded model tinyllama-1.1b-1t-openorca.Q4_0.gguf

Running benchmark...
Model info: llama 1B Q4_0
Heat up time: 0.067696875 seconds, please wait...

model size params backend test t/s
llama 1B Q4_0 0.59 GiB 1.10 B Metal pp 512 1645.79 ± 235.50
llama 1B Q4_0 0.59 GiB 1.10 B Metal tg 128 147.60 ± 0.90

Loading model...
Loaded model tinyllama-1.1b-f16.gguf

Running benchmark...
Model info: llama 1B F16
Heat up time: 0.133832209 seconds, please wait...

model size params backend test t/s
llama 1B F16 2.05 GiB 1.10 B Metal pp 512 1750.78 ± 146.33
llama 1B F16 2.05 GiB 1.10 B Metal tg 128 55.22 ± 2.67

@ggerganov
Copy link
Member Author

@mweinbach Is this an M5 iPad? If it is a MacBook, you can simply run llama-bench.

@mweinbach
Copy link

Was a Mac, will rerun with llama bench.

@mweinbach
Copy link

Here it is with gpt-oss 20b, just to see. About 2x speedup!

model size params backend threads test t/s
gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal,BLAS 4 pp512 846.69 ± 22.15
gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal,BLAS 4 tg128 42.63 ± 0.69

build: 9fce244 (6817)

model size params backend threads test t/s
gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal,BLAS 4 pp512 415.45 ± 30.55
gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal,BLAS 4 tg128 32.53 ± 6.07

build: 5cca254 (6835)

@Anemll
Copy link

Anemll commented Oct 26, 2025

model size params backend threads test t/s
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4936.38 ± 15.63
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 tg128 97.98 ± 1.32
build: 9fce244 (6817)
model size params backend threads test t/s
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 3073.19 ± 11.31
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 tg128 94.75 ± 0.61
build: 5cca254 (6835)

@ggerganov
Copy link
Member Author

Apple marketing used a 8B llama model, so you can try this as well:

llama-bench -m llama-3.1-8b.gguf -p 512,2048 -ub 2048 -fa 1

The marketing claimed ~3.5x pp when comparing M5 + NA vs M4. My estimates are that we are not there atm - probably closer to ~2.5x. I think they used 8k tokens prompt, but I don't recommend that since my experience on MacBook Pro M4 Max is that it starts heat throttling for this task. Not sure how M5 behaves in this case.

@aniolekx
Copy link

What about M3 Ultra?

@liuliu
Copy link

liuliu commented Oct 27, 2025

Try to share some learnings in github.com/liuliu/example_matmul_metal4 in the past a few days!

  1. It seems you get the best performance with 128x64x64 tile (possibly because it does better multi-stage pipelining under the hood?);
  2. If you do 64x64x32, executionSIMDGroups to be 2 seems to marginally beat 4 (possibly smaller tile size -> less pipelining opportunities?);
  3. PSO (pipeline state object) creation takes long time (usually 1 to 2 seconds per PSO containing tensor ops), I might need to explore binary archive to help this;
  4. If K is large (>= 4096), split K and additional reduction kernel invocation helps to reach the top performance numbers;
  5. Unroll K seems to help, but that also contribute to long PSO creation time;

Of course, our implementations are vastly different. I am trying to write a new matmul kernel while yours using the existing kernel and leverage the existing dequant to threadgroup memory logic etc. So take whatever I said with a pinch of doubt. All these experiments conducted under iOS / iPad OS 26.0.1 with A19 Pro and M5 (iPad).

Also, the tensor ops API is slower than properly implemented GEMM on older devices (MFAv2 GEMM kernels), but that possibly due to wrong tile size selections I do (as I am focusing on neural accelerators performance).

@mweinbach
Copy link

The 4x speed up was with Qwen3 8B on this branch of MLX ml-explore/mlx#2687

Not sure what context but at around 20K tokens I saw 3.65x speed up with neural accelerators vs without, and made sure the dtype for the model was fp16, not bf16. Model weights were int4.

https://creativestrategies.com/research/m5-apple-silicon-its-all-about-the-cache-and-tensors/

@ggerganov
Copy link
Member Author

ggerganov commented Oct 27, 2025

@liuliu Thank you for the insights. AFAICT you don't load the inputs into shared memory and instead use the tensors directly with device memory. Do you think that going through threadgroup memory is redundant when using the Neural Accelerators (i.e. somehow the implementation is clever enough to do it internally for us)?

Did you encounter the issue with the missing bfloat support in the MPP headers (#16634 (comment))? Any ideas how to check for this at runtime? Currently, I have disabled BF16 support when the tensor API is detected (i.e. Metal4 family is supported by the device), which is a temporary workaround until I figure out something better.

@liuliu
Copy link

liuliu commented Oct 27, 2025

@liuliu Thank you for the insights. AFAICT you don't load the inputs into shared memory and instead use the tensors directly with device memory. Do you think that going through threadgroup memory is redundant when using the Neural Accelerators (i.e. somehow the implementation is clever enough to do it internally for us)?

I didn't test it, since 1. we don't have a performant device -> threadgroup loader for MFAv2 (we rely on async_copy_2d previously, but that was removed in macOS 26) to do comparison fairly; 2. In M3 / M4 testing for MFAv2, we found that device -> sram directly seems to be faster than device -> tg memory (with async_copy_2d) -> sram.

Did you encounter the issue with the missing bfloat support in the MPP headers (#16634 (comment))? Any ideas how to check for this at runtime? Currently, I have disabled BF16 support when the tensor API is detected (i.e. Metal4 family is supported by the device), which is a temporary workaround until I figure out something better.

Yeah, bfloat is missing for 26.0 as you pointed out, hence following your instruction, I downloaded 26.1 to check it out (WIP!). From what I understand, if you want to test it, you can just JIT a simple shader with the device.makeLibrary function that contains something like

  auto Q = tensor<device bfloat,  dextents<int32_t, 2>, tensor_inline>(Q_buf, dextents<int32_t, 2>(128, 1024));
  auto K = tensor<device bfloat,  dextents<int32_t, 2>, tensor_inline>(K_buf, dextents<int32_t, 2>(128, 1024));
  constexpr auto qk_desc = matmul2d_descriptor(64, 32, 64, false, true, false, matmul2d_descriptor::mode::multiply_accumulate);
  matmul2d<qk_desc, execution_simdgroups<1>> matmul_qk_op;
  auto mQ = Q.slice<64, 64>(0, 0);
  auto cS = matmul_qk_op.get_destination_cooperative_tensor<decltype(mQ), decltype(mK), float>();
  matmul_qk_op.run(mQ, mK, cS);

(I am pretty sure the code is wrong, but you get the idea). Since it is static assert failure, you can check the failure of compilation for that.

Do you know if 26.1 compiled shader works with bfloat on 26.0 or not? That might give indication to see if I can backport bfloat support to 26.0.

@Anemll
Copy link

Anemll commented Oct 27, 2025

 -p 512,2048 -ub 2048 -fa 1

@ggerganov

model size params backend threads n_ubatch fa test t/s
llama 8B Q4_0 5.61 GiB 8.03 B Metal,BLAS 4 2048 1 pp512 609.26 ± 5.21
llama 8B Q4_0 5.61 GiB 8.03 B Metal,BLAS 4 2048 1 pp2048 540.15 ± 14.68
llama 8B Q4_0 5.61 GiB 8.03 B Metal,BLAS 4 2048 1 tg128 22.88 ± 0.55

build: 9fce244 (6817)

model size params backend threads n_ubatch fa test t/s
llama 8B Q4_0 5.61 GiB 8.03 B Metal,BLAS 4 2048 1 pp512 257.25 ± 0.15
llama 8B Q4_0 5.61 GiB 8.03 B Metal,BLAS 4 2048 1 pp2048 247.79 ± 2.58
llama 8B Q4_0 5.61 GiB 8.03 B Metal,BLAS 4 2048 1 tg128 22.56 ± 0.75

build: 5cca254 (6835)


MBP M5 32GB 
( I don't have base M4 to compare outside M4 iPad)
Model: https://huggingface.co/ggml-org/Meta-Llama-3.1-8B-Instruct-Q4_0-GGUF/tree/main

llama-bench -m meta-llama-3.1-8b-instruct-q4_0.gguf -p 512,2048 -ub 2048 -fa 1

@ggerganov ggerganov force-pushed the gg/metal-mul-mm-rework branch from 9fce244 to f2927f4 Compare October 28, 2025 11:41
@ggerganov
Copy link
Member Author

Since it is static assert failure, you can check the failure of compilation for that.

Thanks, that worked f2927f4.

Do you know if 26.1 compiled shader works with bfloat on 26.0 or not? That might give indication to see if I can backport bfloat support to 26.0.

I can test this later today - will update my second Mac from Sequoia to Tahoe 26.0 and will build a bfloat shader on 26.1 to test it.

@Anemll The pp jump from M3 10c -> M4 10c was ~20% (#4167). So if we assume a similar jump to M5 10c, we can estimate the expected performance for M4 10c on this test.

@Anemll
Copy link

Anemll commented Oct 30, 2025

Posted M5 results for LLAMA2 7B here, so it's apples-to-apples comparison
#4167 (comment)

@ggerganov
Copy link
Member Author

@Anemll Thanks. I think the only thing left for this PR is to confirm that using the Tensor API maintains the performance for old generations (M4 and earlier). I did some early tests on M4 Max and it looked like it did, but need to take a closer look.

Btw, could you run one more test for me - I want to see how much the M5 throttles. For example, here is the result on M4 Max:

llama-bench -m model.gguf -p 512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512

ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 4.538 sec
ggml_metal_device_init: GPU name: Apple M4 Max
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 30150.67 MB

model size params backend threads test t/s
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1335.10 ± 0.23
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1335.47 ± 1.31
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1332.93 ± 2.68
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1288.48 ± 21.24
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1258.64 ± 13.80
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1214.77 ± 22.13
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1211.14 ± 4.74
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1184.23 ± 16.44
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1187.92 ± 11.00
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1166.89 ± 17.64
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1179.78 ± 11.05
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1163.24 ± 21.32
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1153.51 ± 19.06
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1156.93 ± 19.18
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1165.60 ± 9.81
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1153.30 ± 32.84
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1160.31 ± 14.83
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1147.51 ± 19.14
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1152.52 ± 6.07
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1149.42 ± 14.83
qwen3 4B Q8_0 3.98 GiB 4.02 B Metal,BLAS 10 pp512 1155.22 ± 9.12
...

We can see that we immediately lose almost 20% of the performance and later as the test continues this increases up to 30%. I already received a report on my email that M5 does not throttle like this - just want to confirm from a second source.

@liuliu I'm a bit late on the backport test - probably will do this over the weekend. Also I am not really sure if my plan would actually work because I suspect that the Tensor API would not be compatible with M2 Ultra (it reports MTLGPUFamilyMetal3) even if I upgrade it to Tahoe 26.0. But I have to do the upgrade sooner or later either way, so I'll give it a try.

@Anemll
Copy link

Anemll commented Oct 30, 2025

@Anemll Thanks. I think the only thing left for this PR is to confirm that using the Tensor API maintains the performance for old generations (M4 and earlier). I did some early tests on M4 Max and it looked like it did, but need to take a closer look.

Btw, could you run one more test for me - I want to see how much the M5 throttles. For example, here is the result on M4 Max:

llama-bench -m model.gguf -p 512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512

It starts fan after a few iterations, with GPU temp rising to 99C, then quickly drops to 93C and remains stable.
There is a slight drop before the fan starts, but overall it's stable

./build/bin/llama-bench
-m /Users/anemll/Models/GUFF/llama-7b-v2/ggml-model-Q4_0.gguf -p 512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512
ggml_metal_device_init: disabling bfloat support as a workaround for tensor API incompatibility
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.004 sec
ggml_metal_device_init: GPU name: Apple M5
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10 (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = false
ggml_metal_device_init: has tensor = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 26800.60 MB

model size params backend threads test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 634.61 ± 5.60
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 637.85 ± 1.77
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 631.97 ± 8.60
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 633.35 ± 7.86
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 635.77 ± 2.81
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 624.00 ± 8.48
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 616.50 ± 2.07
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 615.18 ± 0.26
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 612.73 ± 1.72
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 610.39 ± 5.86
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 613.60 ± 0.95
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 614.39 ± 0.82
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 614.35 ± 0.71
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 615.77 ± 0.52
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 615.67 ± 1.03
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 611.37 ± 8.98
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 616.97 ± 1.89
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 619.50 ± 0.94
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 614.63 ± 11.30
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 620.03 ± 4.28
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 625.62 ± 1.41
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 627.08 ± 0.66
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 623.71 ± 8.20
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 627.13 ± 0.75
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 628.20 ± 0.47
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 628.17 ± 0.13
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 627.70 ± 0.30
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 628.26 ± 0.29
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 627.96 ± 0.52
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 627.18 ± 0.55
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 626.51 ± 0.92
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 tg128 30.95 ± 0.28

build: 9fce244 (6817)

anemll@M5 llama.cpp % ./build/bin/llama-bench
-m /Users/anemll/Models/GUFF/Qwen3-0.6B-f16.gguf -p 512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512
ggml_metal_device_init: disabling bfloat support as a workaround for tensor API incompatibility
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.004 sec
ggml_metal_device_init: GPU name: Apple M5
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10 (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = false
ggml_metal_device_init: has tensor = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 26800.60 MB

model size params backend threads test t/s
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4969.77 ± 16.43
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4907.53 ± 19.46
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4915.25 ± 38.97
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4763.46 ± 132.12
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4854.48 ± 56.51
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4939.65 ± 28.34
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4943.06 ± 15.57
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4932.36 ± 46.98
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4939.72 ± 6.22
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4905.53 ± 16.24
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4899.29 ± 29.58
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4932.59 ± 13.90
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4864.04 ± 56.45
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4914.91 ± 29.42
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4778.87 ± 198.36
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4911.75 ± 29.40
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4941.25 ± 15.29
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4923.28 ± 9.91
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4951.99 ± 16.98
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4936.14 ± 13.39
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4915.69 ± 17.02
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4936.57 ± 6.61
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4923.80 ± 25.46
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4923.19 ± 27.27
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4936.56 ± 10.71
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4899.82 ± 22.19
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4916.24 ± 8.44
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4930.55 ± 20.63
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4908.07 ± 21.09
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4918.28 ± 7.69
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4907.66 ± 15.22
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4920.92 ± 37.02
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4916.82 ± 12.09
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4906.58 ± 25.31
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4919.94 ± 18.00
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4912.20 ± 30.31
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4904.09 ± 30.26
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4934.28 ± 11.05
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4917.58 ± 35.15
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4913.54 ± 25.63
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4897.12 ± 14.97
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4879.32 ± 27.22
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4920.41 ± 20.18
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4923.09 ± 34.84
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4910.28 ± 17.27
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4928.76 ± 15.77
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4921.05 ± 13.94
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4938.79 ± 6.04
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4935.23 ± 18.38
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4926.61 ± 29.28
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4924.65 ± 10.25
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4903.62 ± 18.22
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4900.12 ± 30.90
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4917.40 ± 13.48
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4905.65 ± 18.07
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4924.18 ± 10.62
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4907.99 ± 20.11
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4895.83 ± 22.88
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4905.75 ± 11.89
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4894.41 ± 40.62
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4906.40 ± 12.91
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4923.66 ± 8.75
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4900.90 ± 29.51
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4922.25 ± 14.56
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4920.22 ± 18.51
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4930.56 ± 22.81
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4919.18 ± 15.50
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4906.05 ± 25.11
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4921.06 ± 16.45
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4905.53 ± 19.63
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4905.74 ± 24.46
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4921.10 ± 9.02
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4886.07 ± 40.93
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4928.91 ± 28.78
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4917.64 ± 15.14
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4919.63 ± 15.49
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4928.85 ± 13.95
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4928.72 ± 29.07
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4901.43 ± 28.54
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4908.88 ± 6.45
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4894.00 ± 31.53
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4928.00 ± 11.55
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4917.33 ± 26.53
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4924.06 ± 15.38
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4904.72 ± 11.29
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4927.65 ± 12.81
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4919.50 ± 14.98
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4930.61 ± 9.72
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4921.46 ± 30.65
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4900.73 ± 27.83
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4913.04 ± 7.92
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4916.14 ± 22.81
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 pp512 4914.78 ± 19.20
qwen3 0.6B F16 1.40 GiB 751.63 M Metal,BLAS 4 tg128 97.27 ± 1.17

build: 9fce244 (6817)
anemll@M5 llama.cpp %
anemll@M5 llama.cpp % ./build/bin/llama-bench
-m /Users/anemll/Models/GUFF/llama-7b-v2/ggml-model-f16.gguf -p 512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512,512
ggml_metal_device_init: disabling bfloat support as a workaround for tensor API incompatibility
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.004 sec
ggml_metal_device_init: GPU name: Apple M5
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10 (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = false
ggml_metal_device_init: has tensor = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 26800.60 MB

model size params backend threads test t/s
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 371.88 ± 7.30
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 374.71 ± 1.41
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 373.59 ± 1.23
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 373.37 ± 4.70
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 373.06 ± 1.69
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 358.16 ± 11.87
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 361.76 ± 5.85
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 353.27 ± 3.17
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 355.57 ± 10.46
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 364.70 ± 5.68
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 368.98 ± 0.72
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 369.63 ± 0.63
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 366.67 ± 4.47
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 371.78 ± 0.74
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 373.84 ± 0.64
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 366.97 ± 4.86
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 370.62 ± 2.36
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 360.47 ± 7.15
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 366.54 ± 5.77
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 367.39 ± 1.64
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 371.02 ± 1.89
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 362.85 ± 9.06
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 370.46 ± 0.40
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 366.92 ± 3.28
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 364.71 ± 5.98
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 364.63 ± 3.43
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 365.59 ± 5.58
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 367.44 ± 5.44
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 371.75 ± 0.57
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 371.30 ± 0.23
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 pp512 365.91 ± 6.40
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 4 tg128 9.69 ± 0.06

@ggerganov ggerganov force-pushed the gg/metal-mul-mm-rework branch from f2927f4 to 738b916 Compare November 4, 2025 17:17
@woachk
Copy link
Contributor

woachk commented Nov 5, 2025

26.1 is a release version now, so maybe the best thing to do is to not use the new APIs if on 26.0 entirely...

@ggerganov
Copy link
Member Author

ggerganov commented Nov 5, 2025

Looking to confirm that the new tensor API does not degrade the performance on pre-M5 chips.

If you are on Tahoe 26.1, please run this command and post the results:

CMAKE_OPTS="-DGGML_BLAS=OFF" ./scripts/compare-commits.sh 9af8394e5 master test-backend-ops -o MUL_MAT -p "type_a=(bf16|f16|q8_0|q4_0|mxfp4)"
M4 Max

ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_device_init: GPU name: Apple M4 Max
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 30150.67 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4 Max
ggml_metal_init: picking default device: Apple M4 Max
ggml_metal_init: use fusion = true
ggml_metal_init: use concurrency = true
ggml_metal_init: use graph optimize = true

Backend GGML op Op parameters TFLOPS 9af8394 TFLOPS master Speedup
Metal MUL_MAT type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.38 0.38 1.00
Metal MUL_MAT type_a=bf16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.77 0.77 1.00
Metal MUL_MAT type_a=bf16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.05 0.86 0.82
Metal MUL_MAT type_a=bf16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.14 0.95 0.83
Metal MUL_MAT type_a=bf16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.99 0.84 0.85
Metal MUL_MAT type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 8.87 9.11 1.03
Metal MUL_MAT type_a=bf16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.95 0.89 0.93
Metal MUL_MAT type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1,o=1 0.96 0.92 0.96
Metal MUL_MAT type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0,o=1 0.31 0.31 1.00
Metal MUL_MAT type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.37 0.38 1.01
Metal MUL_MAT type_a=f16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.73 0.73 1.00
Metal MUL_MAT type_a=f16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.08 1.06 0.98
Metal MUL_MAT type_a=f16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.46 1.42 0.97
Metal MUL_MAT type_a=f16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.30 1.25 0.96
Metal MUL_MAT type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 8.95 8.95 1.00
Metal MUL_MAT type_a=f16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.34 1.28 0.96
Metal MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.25 1.21 0.97
Metal MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.34 1.25 0.93
Metal MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.74 1.61 0.92
Metal MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.79 1.64 0.92
Metal MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.70 1.60 0.94
Metal MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 8.83 8.91 1.01
Metal MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.70 1.64 0.97
Metal MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.28 1.28 1.00
Metal MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.57 1.47 0.93
Metal MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.55 1.50 0.97
Metal MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.37 1.47 1.07
Metal MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.32 1.28 0.97
Metal MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 8.76 8.58 0.98
Metal MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.48 1.44 0.97
Metal MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.70 0.69 0.99
Metal MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.34 1.32 0.99
Metal MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.65 1.54 0.93
Metal MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.51 1.47 0.97
Metal MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.43 1.36 0.95
Metal MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 8.31 8.46 1.02
Metal MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 1.48 1.44 0.97
M4

ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 5.415 sec
ggml_metal_device_init: GPU name: Apple M4
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 12713.12 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4
ggml_metal_init: picking default device: Apple M4
ggml_metal_init: use bfloat = true
ggml_metal_init: use fusion = true
ggml_metal_init: use concurrency = true
ggml_metal_init: use graph optimize = true

Backend GGML op Op parameters TFLOPS gg/metal-mul-mm-rework TFLOPS master Speedup
Metal MUL_MAT type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.11 0.08 0.80
Metal MUL_MAT type_a=bf16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.22 0.20 0.92
Metal MUL_MAT type_a=bf16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.34 0.32 0.96
Metal MUL_MAT type_a=bf16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.35 0.31 0.89
Metal MUL_MAT type_a=bf16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.34 0.33 0.98
Metal MUL_MAT type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 3.22 3.30 1.03
Metal MUL_MAT type_a=bf16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.33 0.33 0.99
Metal MUL_MAT type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1,o=1 0.31 0.30 0.94
Metal MUL_MAT type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0,o=1 0.11 0.10 0.94
Metal MUL_MAT type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.10 0.08 0.79
Metal MUL_MAT type_a=f16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.20 0.19 0.95
Metal MUL_MAT type_a=f16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.30 0.29 0.98
Metal MUL_MAT type_a=f16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.40 0.29 0.73
Metal MUL_MAT type_a=f16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.35 0.37 1.04
Metal MUL_MAT type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 3.45 3.37 0.98
Metal MUL_MAT type_a=f16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.41 0.41 1.01
Metal MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.39 0.37 0.96
Metal MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.46 0.45 0.99
Metal MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.60 0.59 0.98
Metal MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.57 0.57 0.99
Metal MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.51 0.51 1.00
Metal MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 3.25 3.10 0.95
Metal MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.56 0.57 1.00
Metal MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.36 0.33 0.93
Metal MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.50 0.46 0.94
Metal MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.50 0.50 1.00
Metal MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.46 0.47 1.01
Metal MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.35 0.35 0.99
Metal MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 3.37 3.34 0.99
Metal MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.44 0.44 1.00
Metal MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.20 0.18 0.91
Metal MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.39 0.38 0.98
Metal MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.55 0.44 0.81
Metal MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.47 0.49 1.03
Metal MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.39 0.39 1.00
Metal MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 3.14 3.12 0.99
Metal MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1 0.45 0.45 1.00

@ggerganov ggerganov force-pushed the gg/metal-mul-mm-rework branch from 738b916 to e6aa68a Compare November 5, 2025 09:31
@ggerganov ggerganov force-pushed the gg/metal-mul-mm-rework branch from 678fa6d to afebf27 Compare November 5, 2025 19:54
@ggerganov
Copy link
Member Author

This should be ready to merge.

I've disable the MPP Tensors on old chips (i.e. pre-M5) because on my M2 Ultra, the current tensor implementation is slightly slower than the original simdgroup-based implementation. Later we can try to close the gap, but for now, I think it's fine to keep the old implementation as is.

@liuliu I wasn't able to test "compile on 26.1 -> deploy on 26.0" since my M2 Ultra upgraded directly to 26.1. However, I noticed something unusual - on M2 Ultra, the bfloat tensor kernels compile successfully, but attempting to run them will segfault. The maxTotalThreadsPerThreadgroup of the pipeline descriptor reports 0 for those bfloat kernels. I read somewhere that pre-M3 chips have some sort of "emulated" bfloat - maybe this is related. It's probably something to keep in mind if you decide to use MPP on older devices.

@ggerganov ggerganov merged commit 5b180c3 into master Nov 6, 2025
65 of 71 checks passed
@ggerganov ggerganov deleted the gg/metal-mul-mm-rework branch November 6, 2025 12:45
@ArjunDivecha
Copy link

ArjunDivecha commented Nov 8, 2025

URL: https://github.com/ArjunDivecha/llama.cpp/tree/metal4-test-harness

Metal-4 Tensor Performance Testing Results

Executive Summary

@ggerganov Tested Metal-4 Tensor implementation on iPhone 17 Pro Max (iOS 26.0.1)
with Mistral-7B-v0.1 Q4_0 model.

Key Result: Metal-4 Tensor delivers 23% performance improvement over
Legacy Metal backend.


Test Configuration

  • Device: iPhone 17 Pro Max
  • iOS Version: 26.0.1
  • Model: Mistral-7B-v0.1 (Q4_0, 3.8 GiB)
  • Test Date: 2025-11-07
  • Prompt: Complex multi-paragraph generation on neural network
    backpropagation

Performance Results

Comparison Table

Metal-4 Tensor

  • TTFT: 0.472s
  • Tokens/s: 13.66
  • Total Tokens: 409
  • Total Time: 29.943s
  • Memory: 4331.8 MB
  • Thermal: Fair

Metal Legacy

  • TTFT: 0.322s
  • Tokens/s: 11.08
  • Total Tokens: 409
  • Total Time: 36.926s
  • Memory: 4342.0 MB
  • Thermal: Fair

CPU

  • TTFT: 2.716s
  • Tokens/s: 5.26
  • Total Tokens: 400
  • Total Time: 76.091s
  • Memory: 4650.3 MB
  • Thermal: Serious

Performance Gains

  • Metal-4 vs Legacy Metal: +23% throughput (13.66 vs 11.08 t/s)
  • Metal-4 vs CPU: +160% throughput (2.6x faster)
  • End-to-end time: 19% faster than Legacy Metal (29.9s vs 36.9s)

Detailed Analysis

1. Throughput Performance

Metal-4 Tensor achieves 13.66 tokens/second sustained throughput,
representing a significant improvement over Legacy Metal's 11.08 t/s. This
23% gain demonstrates the effectiveness of the native tensor API
optimizations.

2. Time to First Token (TTFT)

Metal-4 shows 0.472s TTFT vs Legacy's 0.322s. Slightly higher
initialization latency is offset by superior sustained performance for
longer generations.

3. Thermal Management

Both Metal backends maintained "Fair" thermal state throughout
409-token generation, while CPU escalated to "Serious", indicating
better power efficiency of GPU implementations.

4. Memory Usage

Metal backends show similar memory footprint (~4.3GB), with CPU using
slightly more (4.65GB). Memory efficiency is comparable across GPU
implementations.


Technical Details

Implementation

The Metal-4 Tensor backend uses iOS 26.0.1's native MTLTensor API:

case .metalTensor:
    model_params.n_gpu_layers = 99
    print("Using Metal-4 Tensor backend")
    
case .metalLegacy:
    model_params.n_gpu_layers = 99
    print("Using Metal Legacy backend")
    
case .cpu:
    model_params.n_gpu_layers = 0
    print("Using CPU backend")

Key Optimizations

- Native MTLTensor API for optimized GPU memory layout
- Kernel fusion reducing memory bandwidth requirements
- Hardware-accelerated matrix multiplication for transformer layers
- Improved thermal management through efficient GPU utilization

---
Test Harness

Built comprehensive iOS testing application with:

- Backend Selection: Real-time switching between Metal-4 Tensor, Legacy
Metal, and CPU
- Metrics Collection: TTFT, tokens/sec, memory usage, thermal state
- Automated Comparison: Single-button A/B testing across all backends
- Export: Markdown-formatted results for sharing

Metrics Implementation

public struct InferenceMetrics {
    public var backend: Backend
    public var ttft: Double
    public var tokensPerSecond: Double
    public var totalTokens: Int32
    public var totalTime: Double
    public var memoryUsed: UInt64  // via mach_task_basic_info
    public var thermalState: String // via ProcessInfo
}



---
Conclusions

1. Performance Validated: Metal-4 Tensor shows measurable 23% improvement
over Legacy Metal
2. Production Ready: Maintains stable thermal state and memory usage
3. Real-world Impact: 19% faster end-to-end inference on 409-token
generation
4. Scalability: Benefits should increase with larger models and longer
contexts


---
Great work on this PR!  @ggerganov The Metal-4 Tensor implementation delivers real 
performance improvements on iOS 26.0.1. 🎉

As an aside I tested the Mistral 7B IT 7b v0.3 model on Private LLM with the same prompt 
and it generated at 5.78 t/s so its likely that its only using the CPU

@liuliu
Copy link

liuliu commented Nov 21, 2025

FYI: MLX merged their take on NA: ml-explore/mlx#2772 with OS 26.2 API.

OS 26.2 introduced API to set cooperative tensors for input directly, through the matmul_op.get_left_input_cooperative_tensor<inputLeftType, inputRightType, outputType>(); interface (needs Xcode 26.2 Beta 2). I think this will be useful for ggml since ggml's current implementation dequant to a threadgroup memory and with this new API can dequant to register file directly, without doing the barrier sync.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Apple Metal https://en.wikipedia.org/wiki/Metal_(API) ggml changes relating to the ggml tensor library for machine learning testing Everything test related

Projects

None yet

Development

Successfully merging this pull request may close these issues.