Releases: NVIDIA/MatX
v0.8.0
Release highlights:
- Features
- Updated cuTENSOR and cuTensorNet versions
- Added configurable print formatting
- ARM FFT support via NVPL
- New operators: abs2(), outer(), isnan(), isinf()
- Many more unit tests for CPU support
- Bug fixes for matmul on Hopper, 2D FFTs, and more
Full changelog below:
What's Changed
- Increase cublas workspace to 32 MiB for Hopper+ by @tbensonatl in #545
- matmul bug fixes. by @luitjens in #547
- Added missing synchronization by @luitjens in #552
- Refine some file I/O functions' doxygen comments by @AtomicVar in #549
- Update docs by @tmartin-gh in #551
- Export used environment variables in sphinx config by @tmartin-gh in #553
- Import os by @tmartin-gh in #554
- Add version info by @tmartin-gh in #555
- Fix typo by @tmartin-gh in #556
- Adds IsNan and IsInf Operators by @nvjonwong in #557
- Use cmake project version info in sphinx config by @tmartin-gh in #560
- outer() operator for outer product by @cliffburdick in #559
- Fix nans in QR and SVD. by @luitjens in #558
- Update CMakeLists.txt by @cliffburdick in #548
- Fix CMake to allow multiple rapids-cmake to coexist by @cliffburdick in #562
- Return 0D arrays for 0D shape in operators by @cliffburdick in #561
- Fix NVTX3 include path by @AtomicVar in #564
- Add .npy File I/O by @AtomicVar in #565
- SVD & QR improvements by @luitjens in #563
- chore: Fix typo s/whereever/wherever/ by @hugo-syn in #566
- Add rapids-cmake-dir, if defined, to CMAKE_MODULE_PATH by @tbensonatl in #567
- Add abs2() operator for squared abs() by @tbensonatl in #568
- Fixed issue on g++13 with nullptr dereference that cannot happen at r… by @cliffburdick in #571
- Force max(min) size of direct convolution dimension to be < 1024 by @cliffburdick in #573
- Remove incorrect warning check for any compiler other than gcc by @cliffburdick in #577
- stream memory cleanup by @cliffburdick in #579
- Update reshape indices by @cliffburdick in #580
- Update matlabpython.rst by @cliffburdick in #583
- Prevent potential oob read in matxOpTDKernel by @tbensonatl in #586
- Broadcast lower-rank tensors during batched matmul by @tbensonatl in #585
- Fix bugs in 2D FFTs and add tests by @benbarsdell in #587
- Added ARM FFT Support by @cliffburdick in #576
- Various bug fixes for older compilers by @cliffburdick in #588
- Renamed rmin/rmax functions to min/max and element-wise are now minimum/maximum to match Python by @cliffburdick in #589
- Fix clang macro by @cliffburdick in #592
- Fix misplaced sentence in README by @lucifer1004 in #594
- Add configurable print formatting types by @tmartin-gh in #593
- Fixing return types to allow either prvalue or lvalue in operator() by @cliffburdick in #598
- Rework einsum for new cache style. Fix for issue #597 by @tmartin-gh in #599
- Updated cutensornet to 24.03 and cutensor to 2.0.1 by @cliffburdick in #600
- adding file name and line number to ease debug by @bhaskarrakshit in #601
- Updating versions and notes for v0.8.0 by @cliffburdick in #602
New Contributors
- @hugo-syn made their first contribution in #566
- @benbarsdell made their first contribution in #587
- @lucifer1004 made their first contribution in #594
- @bhaskarrakshit made their first contribution in #601
Full Changelog: v0.7.0...v0.8.0
v0.7.0
Features
- Convert libcudacxx to CCCL by @cliffburdick in #501
- Add PreRun and tests for at/clone/diag operators by @tbensonatl in #502
- Add explicit FFT length to fft_conv example by @tbensonatl in #503
- Add Pre/PostRun support for collapse, concat ops by @tbensonatl in #506
- polyval operator by @cliffburdick in #508
- Optimize resample poly kernels by @tbensonatl in #512
- Allow negative indexing on slices by @cliffburdick in #516
- Automatically publish docs to GH Pages on merge to main by @tmartin-gh in #520
- Add configurable precision support of print() by @AtomicVar in #521
- Make matxHalf trivially copyable by @tbensonatl in #513
- Added operator for matvec by @cliffburdick in #514
- New rapids and nvbench by @cliffburdick in #529
Fixes
- Add FFT1D tensor size checks by @tbensonatl in #499
- Fix errors which caused some unit tests to fail to compile by @AtomicVar in #504
- Fix upsample output size by @cliffburdick in #507
- removing print characters accidentally left behind by @tylera-nvidia in #510
- Renamed host executor and prepared for multi-threaded additions by @cliffburdick in #511
- removing old hardcoded limit for repmat rank size by @tylera-nvidia in #515
- Avoid async alloc in some Cholesky decomp cases by @tbensonatl in #517
- Workaround for maybe_unused parse bug in old gcc by @tbensonatl in #522
- Fix matvec output dims to match A rather than B by @tbensonatl in #523
- Remove CUDA system include by @cliffburdick in #525
- Zero-initialize batches field in CUB params by @tbensonatl in #527
- Fixing host include guard on resample poly by @cliffburdick in #528
- Update device.h for host compiler by @cliffburdick in #530
- Made allocator an inline function by @cliffburdick in #532
- Build and publish documentation on merge to main by @tmartin-gh in #533
- Remove doxygen parameter to match tensor_t constructor signature by @tmartin-gh in #534
- Update iterator.h by @cliffburdick in #536
- Update Bug Report Issue Template by @AtomicVar in #539
- Fix CCCL libcudacxx path by @cliffburdick in #537
- Check matmul types and error at compile-time if the backend doesn't support them by @cliffburdick in #540
- Fix batched cov transform by @tbensonatl in #541
- Update caching for transforms to fix all leaks reported by compute-sanitizer by @cliffburdick in #542
- Update docs for v0.7.0 by @cliffburdick in #544
Full Changelog: v0.6.0...v0.7.0
v0.6.0
Notable Updates
- Transforms as operators by @cliffburdick in #452
- resample_poly optimizations and operator support by @tbensonatl in #465
Full changelog below:
What's Changed
- Added upsample and downsample operators by @cliffburdick in #442
- Added lvalue semantics to operators that needed it by @cliffburdick in #443
- Added operator support to solver functions by @cliffburdick in #444
- Added shapeless version of diag() and eye() by @cliffburdick in #445
- Deprecated random interface by @cliffburdick in #446
- Updated cuTENSOR/cuTensorNet and added example for trace by @cliffburdick in #447
- Fixing host compilation where device code snuck in by @cliffburdick in #453
- Added Protections for Shift Operator inputs and fixed issues with size/Shape returns for certain input sizes by @tylera-nvidia in #454
- Added isclose and allclose functions by @cliffburdick in #448
- Adds normalization options for fft and ifft by @nvjonwong in #456
- Updated 0D tensor syntax and expanded simple radar pipeline by @cliffburdick in #458
- Add initial polyphase channelizer operator by @tbensonatl in #459
- Fixed inverse from stomping on input by @cliffburdick in #461
- Fix cache issue with strides by @cliffburdick in #460
- Added const to Pre/PostRun by @cliffburdick in #462
- Revert inv by @cliffburdick in #463
- Added proper LHS handling for transforms by @cliffburdick in #464
- Updated incorrect license by @cliffburdick in #466
- Use device mem instead of managed for fft workbuf by @tbensonatl in #467
- Added at() and percentile() operators by @cliffburdick in #471
- Add overlap operator by @cliffburdick in #472
- Support stride 0 A/B batches for GEMMs by @cliffburdick in #473
- Added FFT-based convolution to conv1d() by @cliffburdick in #475
- Documentation cleanup by @tmartin-gh in #477
- Adding FFT convolution benchmarks by @cliffburdick in #476
- Fixed rank of output in matmul operator when A/B had 0 stride by @cliffburdick in #478
- Updating header image by @cliffburdick in #480
- Add pwelch operator by @tmartin-gh in #479
- Docs cleanup. Enforce warning-as-error for doxygen and sphinx. by @tmartin-gh in #481
- Fixes for CUDA 12.3 compiler by @cliffburdick in #483
- Update pwelch.h by @cliffburdick in #486
- Fixes for new compiler issues by @cliffburdick in #488
- Fixing sample Cmake Project by @tylera-nvidia in #489
- Update base_operator.h by @cliffburdick in #490
- Add window operator input to pwelch by @tmartin-gh in #491
- Add PreRun methods for slice/fftshift operators by @tbensonatl in #493
- PreRun support for r2c and other fft related fixes by @tbensonatl in #494
New Contributors
- @tmartin-gh made their first contribution in #477
Full Changelog: v0.5.0...v0.6.0
v0.5.0
Notable Updates
- Documentation rewritten to include working examples for every function based on unit tests
- Polyphase resampler based on SciPy/cuSignal's resample_poly
Full changelog below:
What's Changed
- Modifies TensorViewToNumpy and NumpyToTensorView for rank = 5 by @nvjonwong in #427
- NumpyToTensorView overload which returns new TensorView by @nvjonwong in #428
- Added fftfreq() generator by @cliffburdick in #430
- Latest NumpyToTensorView function requires complex conversion for complex types by @nvjonwong in #431
- Fixed print function to work on device in certain cases by @cliffburdick in #436
- Fixed unused variable warning by @cliffburdick in #435
- Adding initial polyphase resampler transform by @tbensonatl in #437
- Revamped documentation by @cliffburdick in #438
- Fixing typo in Cholesky docs by @cliffburdick in #439
- Added broadcasting documentation by @cliffburdick in #440
- Broadcast docs by @cliffburdick in #441
New Contributors
- @nvjonwong made their first contribution in #427
Full Changelog: v0.4.1...v0.5.0
v0.4.1
This is a minor release mostly focused on bug fixes for different compilers and CUDA versions. One major addition: all reductions are now supported on the host using a single-threaded executor, with multi-threaded executor support coming soon.
What's Changed
- Host reductions by @cliffburdick in #385
- Reduced cuBLASLt workspace size by @cliffburdick in #404
- Fix benchmarks that broke with new executors by @cliffburdick in #405
- All operator tests converted to use host and device, and improved 16b by @cliffburdick in #403
- Add single argument copy() and copy() tests by @tbensonatl in #407
- Add rank0 tensor remap support by @tbensonatl in #408
- Add Mutex to support multithread NVTX markers by @tylera-nvidia in #406
- Fix a few issues highlighted by linters/clang by @tbensonatl in #409
- Fixed compilation for Pascal by @cliffburdick in #412
- Fixed issue with constructor when passing strides and sizes by @cliffburdick in #413
- CMake fixes found by user by @cliffburdick in #416
- Update libcudacxx to 2.1.0 by @cliffburdick in #417
- Fixed cupy check for unit tests, default constructors, and file IO by @cliffburdick in #419
- Added delta degrees of freedom on var() to mimic Python by @cliffburdick in #421
- Adding correct license on files that were wrong by @cliffburdick in #423
- Fixed two issues with release mode and DLPack and reductions on the host by @cliffburdick in #424
Full Changelog: v0.4.0...v0.4.1
v0.4.0
New Features
- slice optimization to use builtin tensor function when possible by @luitjens in #360
- Slice support for std::array shapes by @luitjens in #363
- svd power iteration example, benchmark and unit tests. by @luitjens in #366
- matmul: support real/complex tensors by @kshitij12345 in #362
- Adding sign/index operators: by @luitjens in #369
- optimized cast and conj op to return a tensor view when possible. by @luitjens in #371
- implement QR for small batched matrices. by @luitjens in #373
- Implement block power iteration (qr iterations) for svd by @luitjens in #375
- Added output iterator support for CUB sums, and converted all sum() by @cliffburdick in #380
- Removing inheritance from std::iterator by @cliffburdick in #381
- DLPack support by @cliffburdick in #392
- Adding ref-count for DLPack by @cliffburdick in #394
- updating cub optimization selection for >= 2.0 by @tylera-nvidia in #395
- Refactored make_tensor to allow lvalue init by @cliffburdick in #397
- Updated notebook documentation and refactored some code by @cliffburdick in #398
- Allow 0-stride dimensions for cublas input/output by @tbensonatl in #400
- 16-bit float reductions + updated softmax by @cliffburdick in #399
Bug Fixes
- Fix Duplicate Print and remove member prints by @tylera-nvidia in #364
- cublasLT col major detection fix. by @luitjens in #368
- Fixes for 32b mode by @cliffburdick in #388
- Fixed a bogus maybe-uninitialized warning/error in release mode by @cliffburdick in #389
- Fixed issue with using const pointers by @cliffburdick in #393
- Generator Printing Patch by @tylera-nvidia in #370
New Contributors
- @kshitij12345 made their first contribution in #362
- @tbensonatl made their first contribution in #400
Full Changelog: v0.3.0...v0.4.0
v0.3.0
v0.3.0 marks a major release with over 100 features and bug fixes. Release cadence will occur more frequently after this release to support users not living at the HEAD.
What's Changed
- Added squeeze operator by @cliffburdick in #163
- Change name of squeeze to flatten by @cliffburdick in #164
- Updated version of cuTENSOR and fixed paths by @cliffburdick in #166
- Added reduction example with einsum by @cliffburdick in #168
- Fixed bug with wrong type on argmin/max by @cliffburdick in #170
- Fixed missing return on operator() for sum by @cliffburdick in #171
- Fixed error with reduction with invalid indices. Only shows up on Jetson by @cliffburdick in #172
- Fixed bug with matmul use-after-free by @cliffburdick in #173
- Added test for batched GEMMs by @cliffburdick in #174
- Throw an exception if using SetVals on non-managed pointer by @cliffburdick in #176
- Added missing assert in release mode by @cliffburdick in #178
- Fixed einsum in release mode by @cliffburdick in #179
- Updates to docs by @cliffburdick in #180
- Added unit test for transpose and fixed bug with grid size by @cliffburdick in #181
- Fix grid dimensions for transpose. by @galv in #182
- Added missing include by @cliffburdick in #184
- Remove CUB from sum reduction while bug is being investigated by @cliffburdick in #186
- Fix for cub reductions by @luitjens in #187
- Reenable CUB tests by @cliffburdick in #188
- Fixing incorrect parameter to CUB sort for 2D tensors by @cliffburdick in #190
- Remove 4D restriction on Clone by @cliffburdick in #191
- Added support for N-D convolutions by @cliffburdick in #189
- Download RAPIDS.cmake only if it does not exist. by @cwharris in #192
- Fix 11.4 compilation issues by @cliffburdick in #195
- Improve FFT batching by @cliffburdick in #196
- Fixed argmax initialization value by @cliffburdick in #198
- Fix issue #199 by @pkestene in #200
- Fix type on concatenate by @cliffburdick in #201
- Fix documentation typo by @dagardner-nv in #202
- Missing host annotation on some generators by @cliffburdick in #203
- Fixed TotalSize on cub operators by @cliffburdick in #204
- Implementing remap operator. by @luitjens in #205
- Update reverse/shift APIs by @luitjens in #207
- batching conv1d across filters. by @luitjens in #208
- Added Print for operators by @cliffburdick in #211
- Complex div by @cliffburdick in #213
- Added lcollapse and rcollapse operator by @luitjens in #212
- Baseops by @luitjens in #214
- Only allow View() on contiguous tensors. by @luitjens in #215
- Remove caching on some CUB types temporarily by @cliffburdick in #216
- Fixed convolution mode SAME and added unit tests by @cliffburdick in #217
- Added convolution VALID support by @cliffburdick in #218
- Allow operators on cumsum by @cliffburdick in #219
- Using async allocation in median() by @cliffburdick in #220
- Various CUB fixes -- got rid of offset pointers (async allocation + copy), allowed operators on more types, and fixed caching on sort by @cliffburdick in #222
- Fixed memory leak on CUB cache bypass by @cliffburdick in #223
- Update to pipe type through for scalars on set operation by @tylera-nvidia in #225
- Added complex version of mean and variance by @cliffburdick in #227
- Fixed FFT batching for non-contiguous tensors by @cliffburdick in #228
- Added fmod operator by @cliffburdick in #230
- Fmod by @cliffburdick in #231
- Changing name to fmod by @cliffburdick in #232
- Cloneop by @luitjens in #233
- Making the shift parameter in shift an operator by @luitjens in #234
- Change sign of shift to match python/matlab. by @luitjens in #235
- Changing output operator type to by value to allow temporary operators to be used as an output type. by @luitjens in #236
- Adding slice() operator. by @luitjens in #237
- Fix cuTensorNet workspace size by @leofang in #241
- adding permute operator by @luitjens in #239
- Cleaning up operators/transforms. by @luitjens in #243
- Rapids cmake no fetch by @cliffburdick in #245
- Cleanup of include directory by @luitjens in #246
- Fixed conv SAME mode by @cliffburdick in #248
- Use singleton on GIL interpreter by @cliffburdick in #249
- make owning a runtime parameter by @luitjens in #247
- Fixed bug with batched 1D convolution size by @cliffburdick in #250
- Adding 2d convolution tests by @luitjens in #251
- Properly initialize pybind object by @cliffburdick in #252
- Fixed sum() using wrong iterator type by @cliffburdick in #253
- g++11 fixes by @cliffburdick in #254
- Fixed size on conv and added benchmarks by @cliffburdick in #256
- Adding unit tests for collapse with remap by @luitjens in #255
- Collapse tests by @luitjens in #257
- adding madd function to improve convolution throughput by @luitjens in #258
- Conv opt by @luitjens in #259
- Fixed compiler errors in release mode by @cliffburdick in #261
- Add streaming make_tensor APIs. by @luitjens in #262
- adding random benchmark by @luitjens in #264
- remove deprecated APIs in make_tensor by @luitjens in #266
- Host unit tests by @luitjens in #267
- Fixed bug with FFT size shorter than length of tensor by @cliffburdick in #270
- removing unused pybind call made before pybind initialize by @tylera-nvidia in #271
- Fixed visualization tests by @cliffburdick in #275
- Fix cmake function check_python_libs. by @pkestene in #274
- Support CubSortSegmented by @tylera-nvidia in #272
- Executor cleanup. by @luitjens in #277
- Transpose operators changes by @luitjens in #278
- Remove Deprecated Shape and add metadata to Print by @tylera-nvidia in #280
- Update Documentation by @tylera-nvidia in #282
- NVTX Macros by @tylera-nvidia in #276
- Adding throw to file reading by @tylera-nvidia in #281
- Adding str() function to generators and operators by @luitjens in #283
- Added reshape op by @luitjens in #287
- 0D tensor printing was broken since they don't have a stride by @cliffburdick in #289
- Allow hermitian to take any rank by @cliffburdick in #292
- Hermitian nd by @cliffburdick in #293
- Fixed batched inverse by @cliffburdick in #294
- Added 4D matmul unit test and fixed batching bug by @cliffburdick in #297
- Fixing batched half precision complex GEMM by @cliffburdick in #298
- Rename simple_pipeline to simple_radar_pipeline for added clarity by @awthomp in #299
- Remove cuda::std::min/max by @cliffburdick in #301
- Fixed chained concatenations by @cliffburdick ...
v0.2.5
Minor fix on name collision
- Changed MAX name to not collide with other libraries (#162)

Minor fix
- Fixed argmin initialization issue that sometimes gave wrong results
v0.2.3
- Improved error messages
- Added support for the einsum function, including tensor contractions, GEMMs with transposed outputs, dot products, and trace
- Integrated cuTENSOR library
- Added real/imag/r2c operators
- Added chirp function
- Added file readers for .mat files
- Fixes to conv2, fft2
- Switched to CUB for certain reductions, resulting in a 4x speedup in some cases
- Added find() and find_idx() functions
- Added unique() function
- Many CMake fixes to clean up transitive targets
- Added casting operators
- Added negate operator