Releases: halide/Halide
v19.0.0
Major improvements
- Halide is now available for both C++ and Python usage via Pip. Try `pip install halide` today!
- The Vulkan backend has matured substantially.
- The HTML "conceptual statement" output now supports dark mode viewing.
- For developers, CMake 3.28 is now required and we no longer require an internet connection during the build.
- Thread pool improvements mean that workloads that do a small number of small tasks in parallel (e.g. a cheap operation applied to a small image) are up to 3x faster. If you have schedules that do not use parallelism for small inputs because you found it didn't provide any speedup, you may want to re-benchmark.
- You can now query properties of the compiled-for target as Exprs, simplifying helper code that wants to do different things depending on the target architecture. Example: `f(x) = select(target_arch_is(Target::ARM), 3, 7)`. Helpers include `target_arch_is`, `target_os_is`, `target_has_feature`, `target_bits`, and `target_natural_vector_size`. These are resolved to constants at compile-time and simplified away. Use with care, as this (intentionally) results in different behavior on different platforms.
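To make "resolved to constants at compile-time and simplified away" concrete, here is a toy model in plain Python. It is not Halide code and does not reflect how Halide implements this internally: the `Const`, `TargetArchIs`, `Select`, and `resolve` names are hypothetical stand-ins for an expression tree and a simplifier. It only shows the idea that the target query becomes a constant, after which the `select` collapses to one branch.

```python
from dataclasses import dataclass


# Hypothetical miniature expression tree (illustration only, not Halide's IR).
@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class TargetArchIs:
    arch: str


@dataclass(frozen=True)
class Select:
    cond: object
    true_value: object
    false_value: object


def resolve(expr, target_arch):
    """Replace target queries with constants, then fold selects on constant conditions."""
    if isinstance(expr, TargetArchIs):
        return Const(int(expr.arch == target_arch))
    if isinstance(expr, Select):
        cond = resolve(expr.cond, target_arch)
        true_value = resolve(expr.true_value, target_arch)
        false_value = resolve(expr.false_value, target_arch)
        if isinstance(cond, Const):
            # The condition is known, so only one branch survives.
            return true_value if cond.value else false_value
        return Select(cond, true_value, false_value)
    return expr


# Mirrors the shape of select(target_arch_is(Target::ARM), 3, 7).
expr = Select(TargetArchIs("arm"), Const(3), Const(7))
on_arm = resolve(expr, "arm")
on_x86 = resolve(expr, "x86")
print(on_arm, on_x86)
```

In this model the same expression resolves to a different constant per target, which is the platform-dependent behavior the note above warns about.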
Breaking changes
- We now distribute `libGenGen.a` rather than `GenGen.cpp`.
  - Downstream users should link to this library with `/WHOLEARCHIVE:` or `-Wl,--whole-archive` rather than build `GenGen.cpp` themselves.
  - Users of the CMake package should be unaffected.
- In keeping with our LLVM support policy, support for LLVM 16 has been removed.
- We no longer use the `le64`/`le32` generic targets for compiling runtime modules to LLVM. These targets were removed in LLVM upstream.
What's Changed
Apps and tests
- Reschedule the matrix multiply performance app by @abadams in #8418
- Update lesson_22_jit_performance.cpp by @abadams in #8438
- Add threadpool performance test by @abadams in #8447
- Don't allow internal_error to pass an error test by @alexreinking in #8458
- Get more consistent distributions in parallel scenarios test by @abadams in #8451
Autoschedulers
Build system
- `Python_bindings`-test-as-installed by @LebedevRI in #8355
- Bump Halide version to 19 in main branch by @steven-johnson in #8357
- Remove warning for unsupported compilers by @alexreinking in #8362
- Bump CMake minimum version to 3.28 by @alexreinking in #8363
- Quick CMake fixes enabled by 3.28 by @alexreinking in #8365
- Distribute GenGen as a static library by @alexreinking in #8367
- Clean up serialization build code by @alexreinking in #8369
- List headers with target_sources FILE_SETS by @alexreinking in #8370
- Clean up autoscheduler dependencies by @alexreinking in #8372
- Use a Find module for V8 by @alexreinking in #8373
- Use a Find module for NodeJS by @alexreinking in #8374
- Move dependencies/wasm to use sites by @alexreinking in #8377
- Replace FetchContent with a custom dependency provider by @alexreinking in #8378
- Two more build fixes by @LebedevRI in #8371
- Rework LLVM into Find module and enact new component policy. by @alexreinking in #8379
- Reflow src/CMakeLists.txt in logical groups by @alexreinking in #8383
- Introduce HalideFeatures system for optional components by @alexreinking in #8384
- Scan generated export files to determine dependencies. by @alexreinking in #8385
- Rewrite bundle_static to be much more efficient. by @alexreinking in #8386
- Support using vcpkg to build dependencies on all platforms by @alexreinking in #8387
- Fix bundling error on buildbots by @alexreinking in #8392
- Support CMAKE_OSX_ARCHITECTURES by @alexreinking in #8390
- Fix Homebrew LLVM 19 by @alexreinking in #8431
- Fix CPack package naming when cross-compiling by @alexreinking in #8492
- Fix Apple libtool detection in bundle_static by @alexreinking in #8495
CodeGen
- Select condition vector lanes must match the true and false value by @abadams in #8465
- Emit `vscale_range()` fn attribute in correct syntax by @steven-johnson in #8457
- Fix #8455 (in combination with #8457) by @steven-johnson in #8456
- Fix bonehead mistake in get_md_bool() by @steven-johnson in #8469
- Propagate some facts about inequalities with min/max by @shoaibkamil in #8475
  - This fixed an issue where predicates in `.specialize()` directives weren't able to eliminate `select()` cases. #8443
Debugging
- Add LLDB pretty-printing by @alexreinking in #8460
- Print constants in scientific precision by @antonysigma in #8506
- Adaptive Dark colorscheme for Stmt HTML. Ability to programmatically export conceptual stmt files. by @mcourteaux in #8327
Documentation
- Update README.md by @abadams in #8404
- Big documentation update by @alexreinking in #8410
- Document how to find Halide from a pip installation by @alexreinking in #8411
- Link to PyPI from Doxygen index.html by @alexreinking in #8415
- Include our Markdown documentation in the Doxygen site. by @alexreinking in #8417
- Add missing backslash by @abadams in #8419
Frontend
- Don't let users disguise RVars as Vars by @abadams in #8441
- Add helper functions to query properties of the lowered Target (#8192) by @steven-johnson in #8359
Hardware backends
- Fix injection of GPU buffers that do not go by a Func name (i.e. alloc groups). by @mcourteaux in #8333
- Remove vestigial AMDGPU backend by @alexreinking in #8382
- Add ARMv8.x feature flags by @steven-johnson in #4489
- [vulkan] Fixes to address outstanding validation failures by @derek-gerstmann in #8448
- [vulkan] Reduce descriptor sets, use official headers, improve allocator, remove module destructor by @derek-gerstmann in #8452
- [vulkan] Skip `async_copy_chain` and `gpu_allocation_cache` correctness tests on Windows by @derek-gerstmann in #8503
LLVM
- Don't use le32/le64 by @steven-johnson in #8344
- Fix for the removed DataLayout constructor. by @mcourteaux in #8391
- Drop support for LLVM 16 in main by @steven-johnson in #8358
- Allow LLVM 20 by @steven-johnson in #8352
- Fix for top-of-tree LLVM by @steven-johnson in #8421
- Fix for top-of-tree LLVM by @steven-johnson in #8425
- Fix for top-of-tree LLVM by @steven-johnson in #8442
- Fix datalayout for osx-arm-64 by @abadams in #8449
- Fix top of LLVM. by @mcourteaux in #8454
- Replace all use of getPointerTo() with PointerType::get() by @steven-johnson in #8473
Python
- Fix Numpy 2.0 compatibility bug in lesson 10 by @alexreinking in #8381
- Pip packaging at last! by @alexreinking in #8405
- Update pip package metadata by @alexreinking in #8412
- Fix classifier spelling by @alexreinking in #8413
- Upgrade LLVM to 19.1.0 in pip package by @alexreinking in #8423
- Update PIP LLVM to 19.1.4 by @alexreinking in #8488
- PythonExtensionGen: ~PyHalideBuffer should call device_free() (#8399) by @steven-johnson in #8439
Runtime
- Fix profiler to report time spent on GPU kernels again instead of on 'wait for parallel tasks'. by @mcourteaux in #8453
- Don't spin on the main mutex while waiting for new work by @abadams in #8433
Minor bugfixes / other cleanup
- Remove remaining dregs of tuple_select (oops) by @steven-johnson in https://github.com/halid...
Halide v18.0.0
Changes Of Note since Halide 17
- Ring-buffering now supported in schedules (`Func::ring_buffer()`). This is distinct from fold_storage in that it folds across time (the loop variables) rather than folding across space (the pure vars of the Func).
- Fixed a longstanding bug in `lossless_cast()`
- Lots of fixes for Vulkan backend
- OpenGLCompute is no longer supported
- Added support for ARM SVE2
- Added (basic) support for Intel APX and AVX10
- Added support for Hexagon HVX v68
- Added support for numpy's `.npy` format to `.debug_to_file()` and the code in halide_image_io.h
- Python bindings now support bfloat and int64 properly
- Hacky code that auto-named Funcs, Vars etc via DWARF introspection was removed
- The profiler was revamped to behave better when multiple Halide pipelines are in flight at the same time.
- Numerous lowering passes were sped up, resulting in faster compilation for large pipelines. However, time spent in LLVM is still the long pole for most pipelines.
- Fixed-point instruction selection has been improved via tracking constant integer bounds of expressions.
- Adds feature detection for ARM CPUs to the runtime library and to the host target feature computation. Supports Windows, macOS, Linux, iOS, and Android.
Deprecations / Removals
- `tuple_select()` has been removed in favor of overloads to `select()`.
- Various fixed-point operators have been removed from the `Halide::Internal` namespace and are now in the public `Halide` namespace.
What's Changed
- Detect ARM CPU features for host target and in runtime (#8298)
- Scheduling directive to support ring buffering by @vksnk in #7967
- Don't add ring_buffer semaphores if the function is not scheduled as async by @vksnk in #8015
- Quick fix for crash that is occurring in SVE2 tests. by @zvookin in #8020
- Don't use variable-length arrays by @steven-johnson in #8021
- Set warnings on tests as well as src by @steven-johnson in #8022
- Stronger chain detection in LoopCarry pass by @vksnk in #8016
- adds mappings for f16 variants of halide float math by @mikewoodworth in #8029
- Require LLVM >= 16.0 by @steven-johnson in #8003
- Add test for #8029 by @steven-johnson in #8032
- Tweak the Printer code in runtime for smaller code by @steven-johnson in #8023
- Fix bounds_of_nested_lanes by @abadams in #8039
- Track whether or not let expressions failed to solve in solver by @abadams in #7982
- Fix type error in VectorizeLoops by @abadams in #8055
- Update makefile to use test/common/terminate_handler.cpp by @abadams in #8066
- add unsafe_promise_clamped by @wraith1995 in #8071
- Don't require Halide_WebGPU when using wasm (#8063) by @steven-johnson in #8065
- Outsmart the LLVM optimizer by @steven-johnson in #8073
- Add hexagon_benchmarks app for CMake builds by @prasmish in #8069
- Fix bool conversion bug in Vulkan code generator by @derek-gerstmann in #8067
- Better validation of gpu schedules by @abadams in #8068
- Add an easy way to print vectors in debug output. by @zvookin in #8072
- [WebGPU] Update to latest native headers by @jrprice in #8081
- Remove OpenGLCompute by @steven-johnson in #8077
- Add checks to prevent people from using negative split factors by @abadams in #8076
- Fix rfactor adding too many pure loops by @abadams in #8086
- Forward the partition methods from generator outputs by @abadams in #8090
- Parallelize some tests by @abadams in #8078
- Allow disabling of multithreading in simd op check by @steven-johnson in #8096
- clang does not support `_Float16` when targeting i386 by @LebedevRI in #8085
- tests: correctness/float16_t: mark `__extendhfsf2` with default visibility by @LebedevRI in #8084
- Fix reduce_expr_modulo of vector in Solve.cpp by @abadams in #8089
- [Vulkan] Region allocator fixes for memory requirements and allocations by @derek-gerstmann in #8087
- Ensure string(REPLACE) is called with the right number of arguments by @alexreinking in #8097
- Strip asserts right at the end of lowering by @abadams in #8094
- Fix clang-tidy error in runtime.printer.h (parameter shadows member) by @steven-johnson in #8074
- Fix an issue where the Halide compiler hits an internal error for bool types in widening intrinsics. by @zvookin in #8099
- Small Tutorial Fix by @2022tgoel in #8111
- Optionally print the time taken by each lowering pass by @abadams in #8116
- Do less redundant work in UnpackBuffers by @abadams in #8104
- Avoid redundant scope lookups by @abadams in #8103
- Add Intel APX and AVX10 target flags and LLVM attribute setting. by @zvookin in #8052
- Use a caching version of stmt_uses_vars in TightenProducerConsumer nodes by @abadams in #8102
- Fix hoist_storage not handling condition correctly. by @abadams in #8123
- Rewrite the skip stages lowering pass by @abadams in #8115
- Remove two dead vars from the Makefile by @abadams in #8125
- Add support for setting the default allocator and deallocator functions in Halide::Runtime::Buffer. by @mcourteaux in #8132
- Make realization order invariant to unique_name suffixes by @abadams in #8124
- Make gpu thread and block for loop names opaque by @abadams in #8133
- Add class template type deduction guides to avoid CTAD warning. by @zvookin in #8135
- [vulkan] Add conform API methods to memory allocator to fix block allocations by @derek-gerstmann in #8130
- Add sobel in hexagon benchmarks app for CMake builds by @prasmish in #8127
- Handle loads of broadcasts in FlattenNestedRamps by @abadams in #8139
- Use python itself to get the extension suffix, not python-config by @abadams in #8148
- Rewrite the pass that adds mutexes for atomic nodes by @abadams in #8105
- Feature: mark a Func as no_profiling, to prevent injection of profiling. (2nd implementation) by @mcourteaux in #8143
- Bound allocation extents for hoist_storage using loop variables one-by-one by @vksnk in #8154
- Support for ARM SVE2. by @zvookin in #8051
- Fix two compute_with bugs. by @abadams in #8152
- Python bindings: `add_python_test()`: do set `HL_JIT_TARGET` too by @LebedevRI in #8156
- fix ub in lower rounding shift right by @abadams in #8173
- Add some missing _Float16 support by @steven-johnson in #8174
- Add conversion code for Float16 that was missed in #8174 by @steven-johnson in #8178
- Tighten bounds of abs() by @rootjalex in #8168
- Clarify the meaning of Shuffle::is_broadcast() by @abadams in #8158
- Add .npy support to halide_image_io by @steven-johnson in #8175
- Update Hexagon Install Instructions by @FabianSchuetze in #8182
- Add .npy support to debug_to_file() by @steven-johnson in #8177
- Don't print on parallel task entry/exit with -debug flag by @abadams in #8185
- Fix corner case in if_then_else simplification by @abadams in #8189
- Rewrite IREquality to use a more compact stack instead of deep recursion by @abadams in #8198
- [HEXAGON] Keep support for hexagon_remote/Makefile by @aankit-quic in #8186
- Faster substitute_facts by @abadams in #8200
- Make Interval::is_single_point check for deep equality by @abadams in #8202
- Refactor ConstantInterval by @abadams in #8179
- Faster vars used tracking in simplify let visitor by @abadams in #8205
- M...
Halide v17.0.2
What's Changed
- Backport a fix for the simpler bug in lossless_cast by @abadams in #8264
- Fix Vulkan SIMT mappings for GPU loop vars; avoid formatting the GPU kernel to a string for Vulkan (since its binary SPIR-V needs to remain intact) by @derek-gerstmann in #8270
Full Changelog: v17.0.1...v17.0.2
Halide v17.0.1
What's Changed
- Changes to make WebGPU code compliant with recent versions of Emscripten (#8106)
- Fix rfactor adding too many pure loops (#8107)
- Forward the partition methods from generator outputs (#8090)
- Fix reduce_expr_modulo of vector in Solve.cpp (#8107)
Full Changelog: v17.0.0...v17.0.1
Halide v17.0.0
Changes Of Note
- `ParamMap` has been removed entirely from the public API. All users of `ParamMap` should migrate to `Callable` instead.
- `Halide::Parameter` has been moved to the public Halide API (it was formerly "internal" and not intended for public use).
- New scheduling primitives:
  - `Func::partition()` and friends: Set the loop partition policy, which controls how/whether a loop is split into three loops (prologue/steady-state/epilogue). Loop partitioning can be useful to optimize boundary conditions (e.g. clamp_edge).
  - `Func::hoist_storage()` and friends: allows a function's storage to be moved to a given loop level. Unlike `Func::store_at()`, no optimizations are triggered (e.g. sliding window).
- New `TailStrategy` options for existing scheduling directives:
  - `ShiftInwardsAndBlend`: Equivalent to ShiftInwards, but protects values that would be re-evaluated by loading the memory location that would be stored to, modifying only the elements not contained within the overlap, and then storing the blended result. Unlike ShiftInwards, this is valid to use in update definitions.
  - `RoundUpAndBlend`: Equivalent to RoundUp, but protects values that would be written beyond the end by loading the memory location that would be stored to, modifying only the elements within the region being computed, and then storing the blended result. Unlike RoundUp, this is valid to use on non-outermost splits in update definitions.
- Substantially improved performance and display in the VizIR output.
- Profiler improvements:
  - Substantially nicer text output
  - Injects timing into calls for `copy_to_host` and `copy_to_device` so you can measure host<->device copy overhead
  - Allows optional sorting via `HL_PROFILER_SORT` env var
- Substantially faster codegen for several GPU backends.
- Experimental serialization/deserialization feature allows for saving of Halide IR code.
- Various bug fixes and improvements in the `Anderson2021` autoscheduler.
- Improved ARM codegen, including: better patterns for sdot/udot; improved shift/mul codegen.
- Support for Zen4 architecture in the x86 backend.
- Updates to the ONNX app.
- Various fixes and improvements to sliding-window and storage-folding.
- Improvements to slow gather operations for some x86 variants.
- Improvements to correctness for the `.async()` scheduling directive.
- Improved codegen for float16 conversion, especially on x86.
- Several compile-time warnings of dubious usefulness disabled.
- WebAssembly codegen now defaults to assuming that the saturating-float-to-int and sign-extension instruction sets are always available.
- `Target` now does some reality-checking that it doesn't contain obviously nonsensical `Feature` combinations
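The two new `TailStrategy` options described above can be pictured with a toy model in plain Python. This is a sketch of the idea only, not Halide code and not Halide's generated loops: the function names are hypothetical, a Python list stands in for memory, and the update definition is assumed to be `f(x) += 1` over an extent of 10 split by a factor of 4.

```python
def shift_inwards(buf, extent, factor):
    """Apply buf[x] += 1 in vectors of `factor`, shifting the last vector inward."""
    assert extent >= factor
    for start in range(0, extent, factor):
        start = min(start, extent - factor)  # tail vector overlaps the previous one
        for x in range(start, start + factor):
            buf[x] += 1  # overlapped elements are updated a second time


def shift_inwards_and_blend(buf, extent, factor):
    """Same split, but only lanes outside the overlap are modified."""
    assert extent >= factor
    done = 0  # elements [0, done) have already been updated
    for start in range(0, extent, factor):
        start = min(start, extent - factor)
        vec = buf[start:start + factor]  # load what is already stored
        for lane in range(factor):
            if start + lane >= done:  # skip lanes inside the overlap
                vec[lane] += 1
        buf[start:start + factor] = vec  # store the blended vector
        done = start + factor


def round_up(buf, extent, factor):
    """Round the extent up to a multiple of `factor`; writes past `extent`."""
    for start in range(0, extent, factor):
        for x in range(start, start + factor):
            buf[x] += 1


def round_up_and_blend(buf, extent, factor):
    """Same split, but lanes at or beyond `extent` keep their stored value."""
    for start in range(0, extent, factor):
        vec = buf[start:start + factor]  # load what is already stored
        for lane in range(factor):
            if start + lane < extent:  # only touch the region being computed
                vec[lane] += 1
        buf[start:start + factor] = vec  # store the blended vector


EXTENT, FACTOR, PADDED = 10, 4, 12  # PADDED leaves room for the rounded-up writes

a = [0] * EXTENT
shift_inwards(a, EXTENT, FACTOR)
b = [0] * EXTENT
shift_inwards_and_blend(b, EXTENT, FACTOR)
c = [0] * PADDED
round_up(c, EXTENT, FACTOR)
d = [0] * PADDED
round_up_and_blend(d, EXTENT, FACTOR)
print(a, b, c, d, sep="\n")
```

In this model the plain shift-inwards loop increments elements 6 and 7 twice, and the plain round-up loop modifies the two padding elements past the extent. The blended variants leave every in-range element incremented exactly once and the padding untouched, which is one way to read why the notes call them valid in update definitions.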
What's Changed
- Misc changes and fixes to RISCV codegen
- Revise LLVM fix to work when no V8 or WABT available by @steven-johnson in #7635
- Be more careful about overflow in trim_bounds_using_alignment by @abadams in #7645
- Add a compositing example app by @abadams in #7646
- Get the ASAN toolchain working again by @steven-johnson in #7604
- Upgrade clang-format and clang-tidy to use v16 by @steven-johnson in #7660
- Enable the misc-use-anonymous-namespace clang-tidy check by @steven-johnson in #7661
- Enable clang-tidy's modernize-use-default-member-init check by @steven-johnson in #7662
- Update onnx app to Adams2019 autoscheduler and new autoscheduler API by @abadams in #7673
- Remove ParamMap by @steven-johnson in #7675
- Fix correctness_float16_t for ASAN builds by @steven-johnson in #7687
- Add a select overload for tuples by @abadams in #7672
- Add Sanitizer details to README_cmake.md by @steven-johnson in #7688
- Fix quadratic algorithm in simplify_correlated_differences by @abadams in #7686
- Fix float16 under asan, attempt #2 by @steven-johnson in #7691
- Add a warning if a Generator declares any Outputs before the final Input (Fixes #7669) by @steven-johnson in #7697
- Fixed the regularization for BGU. by @mcourteaux in #7684
- Fix clang and llvm versions in scripts by @TH3CHARLie in #7702
- Fix leaks caused by self-referential parameter constraints by @abadams in #7700
- Fix float16 warning for older clangs by @abadams in #7701
- Upgrade Halide main branch for LLVM18 by @steven-johnson in #7710
- Improved profiler result printing. by @mcourteaux in #7709
- Default WITH_TEST_FUZZ to OFF by @steven-johnson in #7695
- Throw an error if split is called with the same older and inner var name by @TH3CHARLie in #7715
- Making HLSL code-gen a couple orders of magnitude faster... by @slomp in #7719
- Making Metal code-gen a bit faster by @slomp in #7720
- Fix handling of thread features for scalars in Anderson2021 by @aekul in #7726
- Change default generator timeout to infinite by @abadams in #7718
- Remove unused using decl by @abadams in #7730
- [Hexagon] - Fix problems in sim_host.cpp by @pranavb-ca in #7725
- Fix RDom usage in anderson2021_test_apps_autoscheduler (Fixes #7729) by @steven-johnson in #7734
- Fix leak on cloning functions with update defs by @abadams in #7735
- Ignore code in src/runtime/hexagon_remote/bin/src for clang-format by @steven-johnson in #7736
- Clean up really long line lengths in Anderson2021 by @steven-johnson in #7728
- Revise labels on autoscheduler tests by @steven-johnson in #7732
- Speedup the VizIR HTML. by @mcourteaux in #7713
- Run clang-tidy on macOS runners instead of Linux by @steven-johnson in #7746
- Fix infinite recursion in loop partitioning by @abadams in #7743
- Fix leaks in test/correctness/memoize.cpp by @abadams in #7705
- Allow optional sorting of profiler output via HL_PROFILER_SORT env var (Fixes #7638) by @steven-johnson in #7639
- Permit llvm 15 on windows by @abadams in #7744
- Revert accidental typo change in #7746 by @steven-johnson in #7747
- [vulkan] Fix heap buffer overflow in Vulkan extension handling discovered by ASAN by @derek-gerstmann in #7740
- [vulkan] Fix SPIR-V IR references causing leaks by @derek-gerstmann in #7739
- Improve error-handling in Anderson2021, and ensure build deps are cor… by @steven-johnson in #7748
- StmtViz: Search for tooltip only in the child node by @antonysigma in #7754
- Experimental serializer by @TH3CHARLie in #7594
- Define `cast<i32>(u32)` overflow behavior by @rootjalex in #7769
- Fix vector reduce HTML by @mcourteaux in #7773
- Remove fragile simd_op_check test for mlal/mlsl on ARM by @rootjalex in #7775
- Speedup page loading of VizStmt. by @mcourteaux in #7755
- Try to fix remaining ASAN-reported leaks by @steven-johnson in #7767
- Fix out of bounds access in anderson2021_test_apps_autoscheduler by @aekul in #7771
- Don't introduce reinterprets in find/lower intrinsics by @rootjalex in #7776
- [Hexagon] -Build Hexagon runtime components using the Hexagon SDK (Clone of #7671) by @pranavb-ca in #7741
- slice IRMatcher should only match on slices by @abadams in #7772
- Don't inject undef() in the simplifier by @abadams in #7791
- Fix for top-of-tree LLVM by @steven-johnson in #7798
- [ARM] Distribute shifts as muls by @rootjalex in #7790
- [ARM] support new udot/sdot patterns by @rootjalex in #7800
- Remove some unused includes by @abadams in #7799
- Add support to the makefile for serialization by @abadams in #7762
- [wasm] Enable PIC for WebAssembly on LLVM v18.x by @derek-gerstmann in #7803
- Update WebGPU to latest Emscripten/Dawn API by @steven-johnson in #7804
- Add jump-buttons to get from Stmt directly to Assembly by @mcourteaux in #7793
- Update clang-tidy action to stop breaking by @...
Halide v16.0.0
What's Changed
General Notes
- Support for the Vulkan API (w/SPIR-V codegen)
- Support for WebGPU (experimental)
- Improved Halide IR HTML Visualization
- Fixed a regression in the Adams2019 auto-scheduler that disabled sub-tiling
- Added GPU auto-scheduler (Anderson2021)
Efficient Automatic Scheduling of Imaging and Vision Pipelines for the GPU
Luke Anderson, Andrew Adams, Karima Ma, Tzu-Mao Li, Tian Jin, Jonathan Ragan-Kelley
Proceedings of the ACM on Programming Languages (OOPSLA 2021)
Deprecations / Removals
- `OpenGLCompute` has been deprecated
- `ParamMap` has been deprecated
- Deprecated `HVX_shared_object` feature has been removed
- References to deprecated fixed-point operators have been removed
- Deprecated `halide_target_feature_disable_llvm_loop_opt` has been removed
- Deprecated `MIPS` device support has been removed
Notable Fixes & Changes
- Generate dot() in the Metal backend by @vksnk in #7085
- Add evaluate() and evaluate_may_gpu() to Python bindings by @steven-johnson in #7108
- Add support for generating LLVM vector predication intrinsics. by @zvookin in #7111
- RISC V vector predication support intrinsics support by @zvookin in #7119
- Add range-checking to Buffer objects in Python by @steven-johnson in #7128
- Fix Python buffer handling by @steven-johnson in #7125
- [WASM] Use rounding_mul_shift_right for q15mulr_sat_s pattern by @rootjalex in #7134
- [x86] Generate AVX512 fixed-point instructions by @rootjalex in #7129
- Fix readnone attribute for llvm 16 by @abadams in #7152
- Call cache.clear between internal functions in CG_C by @steven-johnson in #7155
- Add `bfloat` support to `halide_type_to_string()` by @steven-johnson in #7154
- Factor simd_op_check into separate files by architecture. by @zvookin in #7163
- Slightly improve error message for non-integer RDom min/extent by @abadams in #7151
- Migrate from MCJIT to ORC JIT by @dkurt in #7166
- Use n32:64 in RISC-V data layout by @dkurt in #7175
- Don't attempt to use makecontext()/swapcontext() on Android by @steven-johnson in #7196
- Add bridging for clang _Float16 type. by @zvookin in #7201
- Fix issue with vector predicated comparison and select instructions. by @zvookin in #7205
- Add RISC V zvl flag for LLVM version 16 or greater. by @zvookin in #7209
- Extend LLVM IR type mangling to handle scalars. by @zvookin in #7212
- Fix bitrot in PowerPC testing by @steven-johnson in #7211
- Use aligned_alloc() as default allocator for HalideBuffer.h on most platforms by @steven-johnson in #7190
- Tighten alignment promises for halide_malloc() by @steven-johnson in #7222
- Fix some sources of signed integer overflow in the compiler by @abadams in #7231
- Explicitly stage strided loads by @abadams in #7230
- Remove deprecated halide_target_feature_disable_llvm_loop_opt by @steven-johnson in #7247
- Conditional allocations shouldn't fail for size=0 in C++ backend (#7255) by @steven-johnson in #7256
- Inline into extern function args during bounds inference by @abadams in #7261
- Use ::aligned_alloc() instead of std::aligned_alloc() in HalideBuffer.h by @steven-johnson in #7268
- Optimize Module::compile() for some edge cases by @steven-johnson in #7269
- Drop support for MIPS (#7287) by @steven-johnson in #7289
- Emit prototypes for destructor functions in C Backend by @steven-johnson in #7296
- [HVX] Fix EliminateInterleaves by @rootjalex in #7279
- Remove dependency on platform threads library by @alexreinking in #7297
- Fix error of add_halide_generator in cross-compilation by @stevesuzuki-arm in #7283
- Fix issue in add_halide_runtime in cross-compilation by @stevesuzuki-arm in #7284
- Add workaround for the const-or-not user_context issue (#635) by @steven-johnson in #7291
- [x86 & wasm] Split up double saturating-narrows from i32 by @rootjalex in #7280
- Hoist vector slices using rewrite rules by @abadams in #7243
- Improved halide_popcount by @Aelphy in #7225
- halide_popcount<uint64_t> is broken by @steven-johnson in #7313
- Fix segfault by nonconstant bound in Adams2019 by @stevesuzuki-arm in #7321
- Make auto scheduler libs available in HalideHelpers package by @stevesuzuki-arm in #7285
- Improve support for Arm baremetal compilation and runtime by @stevesuzuki-arm in #7286
- Remove deprecated `HVX_shared_object` feature by @steven-johnson in #7331
- Fix a subtle uninitialized-memory-read in Buffer::for_each_value() by @steven-johnson in #7330
- Add a hook to Codegen_C::compile() by @steven-johnson in #7335
- Tiny improvements in codegen in C backend by @steven-johnson in #7337
- Devirtualize the protected compile() methods in Codegen_C by @steven-johnson in #7341
- Fix tuple output bounds checks by @abadams in #7345
- Change early-bound default args in Python bindings to late-bound by @steven-johnson in #7347
- Fix Python error handling by @steven-johnson in #7352
- Permit vectorization of non-recursive atomic operations by @abadams in #7346
- Update WABT to 1.0.32; Increase stack size for WASM AOT apps by @steven-johnson in #7373
- Bounds visitors for min/max were missing single_point mutated case by @abadams in #7377
- Fix overflow in x86 absd lowering by @abadams in #7407
- Add initial support for WebGPU by @jrprice in #6492
- Use pmaddubsw for non-RDom horizontal widening adds by @abadams in #7440
- Compute comparison masks in narrower types if possible by @abadams in #7392
- Fix bugs in PyTorch codegen. by @Yongqi-Zhuo in #7443
- Remove references to deprecated variants of fixed-point operators by @steven-johnson in #7457
- Add GPU autoscheduler by @aekul in #6856
- d3d12 runtime: replacing spinlocks by mutex objects by @slomp in #7489
- Feature Enhancement: Halide IR HTML Visualization by @maaz139 in #7421
- Deprecate ParamMap (#7121) by @steven-johnson in #7357
- Forbid assigning to Buffer(Expr) by introducing an intermediate type. by @abadams in #7517
- [vulkan phase2] Vulkan Runtime by @derek-gerstmann in #6924
- Add libfuzzer compatible fuzz harness by @silvergasp in #7512
- fuzz: Port correctness/cse fuzzer over to libfuzzer by @silvergasp in #7543
- metal : replacing spinlock by mutex by @slomp in #7532
- Fix save_tiff() PlanarConfig assignment for monochrome inputs by @philboske in #7568
- Fix various compilation errors with AppleClang 14.0.3 by @steven-johnson in #7578
- fuzz: Add libfuzzer compatible bounds fuzzer by @silvergasp in #7549
- Significant change to RISC V and scalable vector code generation. by @zvookin in #7616
- Fix inverted may_subtile checks by @abadams in #7626
- Deprecate OpenGLCompute for Halide 16 by @shoaibkamil in #7627
New Contributors
- @sashashura made their first contribution in #7136
- @twesterhout made their first contribution in #7315
- @terryheo made their first contribution in #7323
- @adrian-lebioda made their first contribution in #7379
- @Ttayu made their first contribution in #7402
- @Yongqi-Zhuo made their f...
Halide v15.0.1
What's Changed
- The Python binding of `compile_to_callable()` was not properly copying from device to host for output buffers, so output was typically black (or garbage) when used with a GPU target. (#7213)
- The `bin` directory was missing from the installs.
- Upgraded LLVM to 15.0.7
- New in 15.0.0, but restated here for visibility: The target flag disable_llvm_loop_opt is deprecated, as it's now the default behavior. This means that we have turned off llvm's autovectorization and loop unrolling. This should not affect any schedules with manually-specified vectorization and unrolling, other than trimming code size a little. However, schedules that do not vectorize or unroll may slow down because they were (intentionally or not) relying on llvm to do it automatically. If you see a performance regression with Halide 15, try turning on the enable_llvm_loop_opt target flag.
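As a minimal sketch of turning the flag on for JIT use from Python: the snippet below appends `enable_llvm_loop_opt` to a target string and exports it through `HL_JIT_TARGET`. It assumes Halide target strings are dash-separated feature lists and that the variable is set before Halide reads it. The `with_feature` helper is hypothetical, not part of Halide.

```python
import os


def with_feature(target: str, feature: str) -> str:
    """Append `feature` to a dash-separated target string unless already present."""
    parts = target.split("-")
    if feature not in parts:
        parts.append(feature)
    return "-".join(parts)


# Start from whatever target is already configured, defaulting to "host".
jit_target = with_feature(os.environ.get("HL_JIT_TARGET", "host"), "enable_llvm_loop_opt")

# Assumption: this must happen before Halide JIT-compiles anything in this process.
os.environ["HL_JIT_TARGET"] = jit_target
print(jit_target)
```

The helper leaves the string unchanged if the feature is already present, so it is safe to call more than once.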
Halide v15.0.0
What's Changed
General Notes
- Support for RISC V Vector architectures.
- Python-related:
  - Halide builds for Python are now being built and provided to PyPI, so it is now possible to use the Halide Python bindings simply by `pip install halide`
  - Major improvements were made to the Python bindings, with many missing or incomplete sections of the API added or filled in.
  - We now support the use of Generators from Python (for both JIT and AOT usage).
  - The standard CMake rules now support generating a Python extension directly.
  - Support for Python was removed from Halide's Makefiles; you must use CMake to build the Python bindings
- Halide::Func now allows you to (optionally) constrain the type(s) of Exprs that the Func can contain, and/or the dimensionality of the Func.
- Added a new way to use the JIT (`compile_to_callable`) that allows calling a jitted function with the same syntax as for AOT-compiled functions, allowing more control over JIT lifespan, as well as thread-safe arguments without requiring ParamMap
- General improvements to SIMD codegen
- Several rarely-used parts of the C++ Generator API were deprecated, and the way that autoschedulers are specified for AOT compilation is now completely different (but better for future expandability).
- CMake builds now require >= v3.22
- WABT usage requires >= v1.0.30
- LLVM 12 is no longer supported
- The target flag disable_llvm_loop_opt is deprecated, as it's now the default behavior. This means that we have turned off llvm's autovectorization and loop unrolling. This should not affect any schedules with manually-specified vectorization and unrolling, other than trimming code size a little. However, schedules that do not vectorize or unroll may slow down because they were (intentionally or not) relying on llvm to do it automatically. If you see a performance regression with Halide 15, try turning on the enable_llvm_loop_opt target flag.
Notable bug fixes
- Make Halide::round behave as documented (#7012)
- Incorrect folding of saturating_sub (#6883)
- The check for race conditions didn't consider where clauses (#6808)
- Performance regression for x86 for certain LLVM versions (#6783)
- Fusing a specialization drops compute_withs from generated code (#6770)
- Incorrect output when realize condition depends on tuple call (#6915)
- Python extensions should default to throwing exceptions rather than calling abort() for errors (#6986)
- Python bindings didn't support `bool` buffers (#7006)
- Python bindings didn't support `float16` buffers (#7060)
- Python extensions that executed on GPU didn't copy back to host properly (#6869)
- Fix bugs in `div_round_to_zero` and `fast_integer_divide_round_to_zero` (#7008)
- Bugs in `add_requirement()` (#7045)
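For readers unfamiliar with the distinction behind the `div_round_to_zero` entries above, here is a minimal pure-Python sketch of a quotient truncated toward zero (as C's signed `/` does) next to a floored quotient (Python's `//`). It assumes only that "round to zero" means truncation toward zero; `div_floor` and `div_round_to_zero_ref` are hypothetical reference helpers for illustration, not Halide APIs, and the sketch says nothing about how Halide implements the operation.

```python
def div_floor(a: int, b: int) -> int:
    """Quotient rounded toward negative infinity (Python's built-in `//`)."""
    return a // b


def div_round_to_zero_ref(a: int, b: int) -> int:
    """Quotient truncated toward zero, as C's `/` does for signed integers.

    Hypothetical reference helper for illustration; not a Halide API.
    Raises ZeroDivisionError when b == 0.
    """
    q = abs(a) // abs(b)  # magnitude of the quotient, floored
    # The sign is positive when both operands have the same sign.
    return q if (a < 0) == (b < 0) else -q


# Print both conventions side by side for each sign combination.
for a, b in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
    print(a, b, div_floor(a, b), div_round_to_zero_ref(a, b))
```

The two conventions agree whenever the exact quotient is non-negative or the division is exact; they differ by one otherwise, for example -7 divided by 2 gives -4 when floored and -3 when truncated.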
Major changes
- Augment Halide::Func to allow for constraining Type and Dimensionality by @steven-johnson in #6734 and #6735
- Add Target support for architectures with implementation specific vector size. by @zvookin in #6786
- Add support for vscale vector code generation. by @zvookin in #6802
- Remove Python bindings from Makefiles by @alexreinking in #6821
- Add a new, alternate JIT-call convention by @steven-johnson in #6777
- Pip packaging by @alexreinking in #6886 and #6938
- Define a Generator framework in Python by @steven-johnson in #6764
- Make Halide::round behave as documented by @abadams in #7012
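The `Halide::round` entry does not spell out what "as documented" means. Assuming it refers to round-to-nearest with exact ties going to the even integer (an assumption made here, not something these notes state), the following pure-Python sketch contrasts that convention with ties-away-from-zero rounding. Both function names are hypothetical and are not part of Halide.

```python
import math


def round_half_even(x: float) -> float:
    """Round to the nearest whole number; exact ties go to the even neighbor."""
    lower = math.floor(x)
    diff = x - lower
    if diff < 0.5:
        return float(lower)
    if diff > 0.5:
        return float(lower + 1)
    # Exact tie: pick whichever neighbor is even.
    return float(lower if lower % 2 == 0 else lower + 1)


def round_half_away(x: float) -> float:
    """Round to the nearest whole number; exact ties go away from zero."""
    magnitude = math.floor(abs(x))
    if abs(x) - magnitude >= 0.5:
        magnitude += 1
    return math.copysign(magnitude, x)


# The two conventions differ only on exact ties such as 0.5 and 2.5.
for x in [0.5, 1.5, 2.5, -2.5, 2.4, 2.6]:
    print(x, round_half_even(x), round_half_away(x))
```

Code that depended on the ties-away convention would see different results only at exact half-way values, which is why a change of this kind can go unnoticed in most pipelines.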
Minor changes
- `-mtune=`/`-mcpu=` support for x86 AMD CPU's by @LebedevRI in #6655
- Enable deprecations warnings by @steven-johnson in #6555
- Fix GPU depredication/scalarization by @shoaibkamil in #6669
- Allow PyPipeline and PyFunc to realize() scalar buffers by @steven-johnson in #6674
- Future-proof `processor` to `tune processor` by @LebedevRI in #6673
- Fix ctors for Realization by @steven-johnson in #6675
- `-mtune=native` CPU autodetection for AMD Zen 3 CPU by @LebedevRI in #6648
- Clean up Python extensions in python_bindings by @steven-johnson in #6670
- Halide::Tools::save_image() should accept buffers with `const` types by @steven-johnson in #6679
- Fix "set but not used" warnings/errors by @steven-johnson in #6683
- Drop support for LLVM12 by @steven-johnson in #6686
- Upgrade to clang-format 13 by @steven-johnson in #6689
- Always mark _ucon as 'unused' in Codegen_C by @steven-johnson in #6691
- Add `break` to avoid 'possible unintentional fallthru' warning by @steven-johnson in #6694
- Silence "unknown warning" in Clang 13 by @steven-johnson in #6693
- Fixes for top-of-tree LLVM by @steven-johnson in #6697
- Python: make Func implicitly convertible to Stage (#6702) by @steven-johnson in #6704
- llvm no longer wants a type suffix on vst intrinsics by @abadams in #6701
- Fix type-mangling for vst on arm32 for LLVM15 by @steven-johnson in #6705
- Remove the last remaining call to getPointerElementType() by @steven-johnson in #6715
- ARM vst mangling needs to be conditional on opaque ptrs by @steven-johnson in #6716
- Combine string constants in combine_strings() by @steven-johnson in #6717
- Update CodeGen_PTX_Dev to use new PassManager by @steven-johnson in #6718
- Closure functions for parallel tasks should be internal, not external by @steven-johnson in #6720
- Smarten type_of<> for fn ptrs; fix async_parallel for C backend by @steven-johnson in #6719
- Remove legacy::FunctionPassManager usage in Codegen_PTX_Dev by @steven-johnson in #6722
- `get_amd_processor()`: implement detection for the rest of supported AMD CPU's by @LebedevRI in #6711
- Add Func::output_type() method by @steven-johnson in #6724
- Grab-bag of minor Python fixes by @steven-johnson in #6725
- Remove `rounding_halving_sub` and non-existent arm rhsub instructions by @rootjalex in #6723
- Faster `widening_mul(int16x, int16x) -> int32x` for x86 (AVX2 and SSE2) by @rootjalex in #6677
- Add missing #include in ThreadPool.h by @steven-johnson in #6738
- Fix regression from #6734 by @steven-johnson in #6739
- Add forwarding for the recently-added Func::output_type() method by @steven-johnson in #6741
- Silence "unscheduled update stage" warnings in msan_generator.cpp by @steven-johnson in #6740
- Add `__pycache__` to toplevel .gitignore file by @steven-johnson in #6743
- Silence "may be used uninitialized" in Buffer::for_each_element() by @steven-johnson in #6747
- Update WABT to 1.0.29 by @steven-johnson in #6748
- Update hannk README link to hosted models page by @steven-johnson in #6749
- Add a `HalideError` base class to Python bindings by @steven-johnson in #6750
- Add GeneratorFactoryProvider to generate_filter_main() by @steven-johnson in #6755
- Minor metadata-related cleanups by @steven-johnson in #6759
- Expand the x86 SIMD variants tested in correctness_vector_reductions by @steven-johnson in #6762
- Fix Param::set_estimate for T=void by @steven-johnson in #6766
- add_python_aot_extension should use FUNCTION_NAME for the .so output … by @steven-johnson in #6767
- Fix fundamental confusion about target/tune CPU by @LebedevRI in #6765
- Fix annoying typo in Func.h by @steven-johnson in #6774
- Add execute_generator() API by @steven-johnson in #6771
- Allow overriding of ...
Halide 14.0.0
What's Changed
Major changes
- @abadams
- @alexreinking
- Add helper for cross-compiling Halide generators. (#6366)
- @LebedevRI
- @steven-johnson
- @zvookin
- Timer based profiler (#6642)
Minor changes
- @abadams
- Deprecate JIT runtime override methods that take void * (#6344)
- Allow users to use their own cuda contexts and streams in JIT mode (#6345)
- Add --help flag to rungenmain, fixing #5323 (#6354)
- Do target-specific lowering of lerp (#6432)
- Reduce overhead of sampling profiler by having only one thread do it (#6433)
- Skip custom cuda context test on older GPUs (#6437)
- Avoid needless gather in fast_integer_divide lowering (#6441)
- Fixes for c++20 (#6446)
- Add a fast integer divide that rounds to zero (#6455)
- Let lerp lowering incorporate a final cast. (#6480)
- Try removing optional buffer added to closure (#6481)
- rounding shift rights should use rounding halving add (#6494)
- Make random faster by putting the innermost var last (#6504)
- Make it possible to interpret a wide type as multiple smaller elements (#6506)
- Handle mixed-width args to mul-shift-right (#6526)
- Attempted redo of faster noise (#6539)
- Better default lowering of absd (#6545)
- Make HALIDE_REGISTER_GENERATOR work with multiple template args (#6556)
- Rename Output to OutputFileType and deprecated Output (#6568)
- Remove incorrect not-multiple-of-16 claim (#6573)
- Fix bug in mul_shift_right matching (#6610)
- @alexreinking
- @ashishUthama
- Include LICENSE.txt in package (#6428)
- @dsharletg
- Fix description of rounding_shift_left/rounding_shift_right (#6549)
- @Elarnon
- Only commutative reductions can be parallelized (#6609)
- @jinderek
- Support new warp shuffle intrinsics after CUDA Volta architecture (#6505)
- @knzivid
- python_bindings: Fix SIGSEGV in HalidePythonCompileTimeErrorReporter (#6635)
- @LebedevRI
- [CMake] Deduplicate `Halide_LLVM_VERSION` and `LLVM_PACKAGE_VERSION` (#6646)
- @masahi
- [APP] Fix `hexagon_benchmarks` build (use two-var prefetch) (#6563)
- @mcleary
- Add support for AMX instructions (#5818)
- @mcourteaux
- @mgharbi
- Fixes the Pytorch Wrapper Codegen for CPU-only machines. (#6590)
- @OmarEmaraDev
- @rootjalex
- Make bounds of let visitor use unique_name() (#6583)
- Remove incorrect docs on widening_add (#6625)
- Disallow `Type::narrow()` and `Type::widen()` from producing bitwidths between 1 and 8 bits (#6622)
- Wild match object should not be foldable (#6623)
- Clear bounds info on casts when value bounds are undefined for overflow types (#6640)
- @slomp
- decommissioning StackPrinter (#6470)
- @steven-johnson
- [hannk] Fix MeanOp (#6336)
- Add `using OpVisitor::visit;` to various OpVisitors to avoid overload warnings for some compilers (#6337)
- [hannk] Add a prepare() method for ops and interp (#6338)
- Fix WASM datalayout for top-of-tree LLVM (#6339)
- Make halide_type_t and halide_type_of constexpr (#6340)
- Harvest IWYU changes for LLVM, WABT (#6341)
- Fix HelloWasm (#6342)
- Fix Makefile for LLVM11 (injection from #5818) (#6343)
- [hannk] requantize() should never skip the operation (#6350)
- [hannk] augment SoftmaxOp to allow specifying axis (#6351)
- Use Node instead of d8 for Wasm AOT testing (#6356)
- [hannk] Add missing call to Interpreter::prepare in benchmark app (#6358)
- [hannk] Allow disabling TFLite+Delegate build in CMake (#6360)
- [hannk] Add support for building/running for wasm (#6361)
- Update Emscripten settings (#6362)
- [hannk] Clean up aliasing (v2) (#6364)
- [hannk] tests should only process .tflite files (#6368)
- Revamp Hannk IR (#6379)
- Fix for top-of-tree LLVM (#6380)
- Remove halide_assert() from halide_default_device_wrap_native (#6381)
- Rename halide_assert -> halide_abort_if_false (#6382)
- Convert various halide_assert -> static_assert (#6383)
- Fix for top-of-tree LLVM (#6386)
- Check results of all runtime function calls (#6389)
- Add halide_debug_assert() macro (#6390)
- [hannk] Have CMake emit .s, .stmt, .ll files (#6392)
- [hannk] Upgrade hannk to use TFLite 2.7.0 by default (#6393)
- Clean up CodeGen_LLVM names to match ASAN nomenclature changes (#6395)
- Drop support for LLVM11 (#6396)
- Move PyTorch test into standalone tests (#6397)
- Remove halide_abort_if_false() usage in runtime/metal (#6398)
- Fix OGLC debug builds (#6399)
- Add defensive checks to halide_buffer_copy_already_locked (#6401)
- _halide_buffer_crop() needs to check for runtime failures (v2) (#6403)
- Fix broken ASAN code (#6408)
- [hannk] Pacify clang-tidy (#6412)
- One more ASAN fix (#6413)
- [hannk] Fix lower_tflite_fullyconnected (#6414)
- Fix Introspection issues (#6424)
- Don't remap the function name or the target in the metadata (#6430)
- Set up SANITIZER_FLAGS and OPTIMIZE for apps/Makefile.inc (#6435)
- Ensure that halide_start_clock() is called before halide_current_time… (#6438)
- Codegen_C: buffer compilation needs to special-case scalar buffers (#6442)
- Add operator<< for Closure (#6443)
- Re-enable performance_async_gpu for D3D12Compute (#6450)
- Tweak Hexagon codegen output (#6461)
- Add LinkageType::ExternalPlusArgv (#6452) (#6463)
- Fix Closure API (#6464)
- Move null check from Printer to halide_string_to_string() (#6467)
- Deal with Printer::scratch (#6469) (#6472)
- Restore support for using V8 as the Wasm JIT interpreter (#6478)
- Fail if no_bounds_query specified for HL_JIT_TARGET (#6489)
- Document the usage of llvm::legacy::PassManager (#6491)
- Update WABT to 1.0.25 (#6497)
- Grab Bag of minor cleanups to LowerParallelTasks (#6498)
- Update simd_op_check for arm64 upz1 code generation (#6499) (#6500)
- Fix size_t -> int conversion warning (#6501)
- Fix simd-op-check for top-of-tree LLVM (#6529)
- Revert "Make random faster by putting the innermost var last" (#6538)
- Fix GeneratorOutput_Buffer::set_estimates() (#6540)
- Revert "Make it possible to interpret a wide type as multiple smaller elements" (#6541)
- Convert apps/hannk/Elementwise to use generate() (#6543)
- Fixes for top-of-tree LLVM (#6546) (#6548)
- Fix deprecation warnings in Python tutorials (#6552)
- Use add_halide_generator() everywhere in apps/ (#6554)
- Fix for top-of-tree LLVM (#6561)
- Enable simd_op_check test for wasm i8x16.popcnt (#6562)
- Revert "Fix for top-of-tree LLVM" (#6564)
- wasm simd cleanup (#6566)
- Add support for wasm-simd ops for integer-integer widening (#6567)
- Add `explicit` to a handful of Generator-related ctors. (#6569)
- Fix typo in comment in HalideBuffer.h (#6570)
- Allow calling scheduling methods on Output<Buffer[]> (#6577)
- Fix for top-of-tree LLVM (#6579)
- Fix Win32-specific breakage in top-of-tree LLVM (#6581)
- Convert apps/ to use static Buffer dims where useful (#6585)
- Various fixes to static-dimensioned Buffer (#6589)
- Convert Buffer<> usage in python_bindings/ to use static dimensions (#6591)
- Convert Buffer<> usage in test/generators to use static dimensions (#6592)
- Rename BufferDimsUnconstrained -> AnyDims (#6594)
- Allow building with LLVM15 (#6603)
- Update WasmExecutor for WABT API changes (#6612)
- Minor Generator cleanup (#6613)
- Unbreak WABT again by using main instead of a commit (#6614)
- Update apps/hannk to use TFLite 2.8.0 (#6617)
- Update WABT version to the just-released 1.0.27 (instead of main) (#6619)
- Clean up python_binding Makefile (#6634)
- Fix const-correctness in C/C++ backend (Issue #6636) (#6638)
- Convert most remaining Generators to prefer statically-dimensioned In… (#6641)
- Allow profiler feature under wasm iff wasm_threads is enabled (#6643)
- Fix UB in hannk FillWithRandom operation. (#6645)
- Update initialization of WABT `store` field to work with top-of-tree (#6649)
- Fix apparent typo in PR #6294 (#6653)
- Eliminate some unnecessary clamping in ClampUnsafeAccesses (#6297) (#6654)
- Python Bindings: fix Python `bool` -> `Expr` implicit conversion (#6657)
- Fix 'variable set but not used' warning/error (#6658)
- Allow `make test_apps` to work with ASAN (#6659)
- Add optional runtime H::R::Buffer access checks (#6660)
- Add ldscript code for Python extensions in CMake (#6665)
- Remove the nobuild/partialbuildmethod tests from...
Halide 13.0.4
This is a patch release that fixes a single bug relating to multiple outputs that depend on each other (#6375).