Skip to content

[origami] Major refactoring of codebase#2718

Merged
bnemanich merged 70 commits into
developfrom
users/nelumala/origami/origami-refactor
Dec 9, 2025
Merged

[origami] Major refactoring of codebase#2718
bnemanich merged 70 commits into
developfrom
users/nelumala/origami/origami-refactor

Conversation

@NaveenElumalaiAMD
Copy link
Copy Markdown
Contributor

@NaveenElumalaiAMD NaveenElumalaiAMD commented Nov 17, 2025

Technical Details

Rebase @neoblizz (Muhammad) changes (#1859) with latest develop

Refactor

  • New file types.hpp, consolidates various origami types.
  • New structs to replace the growing tuple.
  • API updates to make it scalable, down from 20+ parameters to a few.
  • Lots of redundant code removal.
  • Reorganize into types.hpp; data types (see hardware.hpp for what needs to be moved)
  • Refactor extract APIs
  • Add enums for transpose
  • Refactor debug/log reporting
  • Remove mutable and statics
  • Decouple latency out of config_t
  • Rebase develop into this branch

Python APIs & Unit Tests

  • Update the tests
  • Update rocroller
  • hipblaslt/tensilelite/scripts to use the new API

Testing Infrastructure

  • Replace YAML-based tests with Catch2 C++ tests
  • Replace GTest with Catch2 framework
  • Convert YAML-driven parameterized tests to pure C++ tests
  • Update common.hpp with direct C++ helper functions
  • Update CMakeLists.txt to use Catch2 instead of GTest
  • Remove Boost dependency (was only for YAML)
  • All test data now hardcoded in C++ for type safety
  • Better IDE support, debugging, and error messages

Questions

  • Rocroller: coordinate on the API design, should they be reused?
  • Reuse dim3_t from hip_runtime.h (does HIP's dim3 have the same functionality)?
  • Should transpose/data-types be part of config as well for precomputing?

Motivation

  1. Make it simpler to make model updates.
  2. Well-defined, scalable, clean interface.

@math-ci
Copy link
Copy Markdown

math-ci Bot commented Dec 4, 2025

perfci run on commit 5ac1a36

math-ci run

Comment thread shared/origami/include/origami/types.hpp Outdated
Comment thread shared/origami/include/origami/log.hpp Outdated
Comment thread shared/origami/include/origami/streamk.hpp
Comment thread shared/origami/include/origami/streamk.hpp Outdated
Comment thread shared/origami/python/origami_module.cpp
Comment thread shared/origami/src/origami/types.cpp Outdated
Comment thread projects/hipblaslt/tensilelite/src/ContractionSolution.cpp
Comment thread shared/origami/tests/include/common.hpp
Comment thread shared/origami/include/origami/types.hpp
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 40 out of 40 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread shared/origami/tests/CMakeLists.txt
Comment thread shared/origami/CMakeLists.txt
@math-ci
Copy link
Copy Markdown

math-ci Bot commented Dec 5, 2025

perfci run on commit ba05c37

math-ci run

@math-ci
Copy link
Copy Markdown

math-ci Bot commented Dec 9, 2025

perfci run on commit 2d68ed0

math-ci run

@stellaraccident
Copy link
Copy Markdown
Contributor

This patch needs to be reverted: ROCm/TheRock#2522 (comment)

Crashes the entire product and was clearly not tested. It cost devops a full day and a half to bisect and stalled everyone in the project for a week.

The level of detail around a commit of this magnitude is completely unacceptable and I will be calling a post mortem.

@jayhawk-commits
Copy link
Copy Markdown
Collaborator

jayhawk-commits commented Dec 16, 2025

It was determined that this commit introduces memory corruption issues with TheRock python packaging build. See ROCm/TheRock#2522 and need this commit reverted to move the rocm-libraries submodule pointer in TheRock forward.

Error

corrupted size vs. prev_size in fastbins

Affected Versions

  • python 3.11, 3.12, 3.13
  • pytorch 2.7, 2.8, 2.9, main branches
  • TheRock with rocm-libraries submodule pointed to this commit or newer

Ways to Reproduce

  1. Build ROCm via TheRock.
  2. Build pytorch wheels via TheRock.
  3. Run rocm-sdk test.
  4. Run python ./external-builds/pytorch/run_pytorch_smoke_tests.py.

Error happens during steps 2-4. I have not seen an instance of a workflow getting past step 4.

Test Results A

Test Results B

Test Results A show corrupted size vs. prev_size in fastbins throughout the different python 3.11-3.13 runs.
Test Results B do not show the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.