
[hipblaslt] Refactor Parallel.py to drop joblib, decimate resource usage #2073

Draft
LunNova wants to merge 12 commits into ROCm:develop from LunNova:tcl-refactor-space-saving

Conversation


@LunNova LunNova commented Oct 12, 2025

Fixes #2072
Fixes #288
Improves #316. Memory consumption is down by ~1/3rd but still too high.

Motivation

hipblaslt is too resource intensive to build.
This PR drastically improves peak build directory space usage by unlinking assembly and object files sooner.
Peak build dir space usage has been reduced from >240GB to 25GB for an all-ISA build.

The joblib dependency seems like overkill; multiprocessing is built into Python and is adequate as long as chunking is set appropriately. I recall an earlier discussion on one of the AMD issue trackers indicating you were open to removing joblib, but I don't recall where exactly.

Technical Details

  • swapped joblib for a multiprocessing fork/spawn pool, chosen per platform
    • use imap_unordered
    • support list and generator results
    • updated call sites
  • taught TensileCreateLibrary to unlink intermediary files after use
  • aggressively eliminated redundant copies of solution-related structures to reduce memory usage and logic loading time
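The pool swap described above can be sketched roughly as follows. This is a minimal illustration of the technique (fork on Linux, spawn elsewhere, imap_unordered with a chunksize), not the PR's actual code; `parallel_map` and `_work` are hypothetical names.

```python
import multiprocessing
import sys

def _work(x):
    # Stand-in for the real per-kernel work function.
    return x * x

def parallel_map(fn, items, jobs=4, chunk_size=64):
    # fork is cheap on Linux; spawn is the portable fallback elsewhere,
    # mirroring the platform switch described above.
    method = "fork" if sys.platform == "linux" else "spawn"
    ctx = multiprocessing.get_context(method)
    with ctx.Pool(processes=jobs) as pool:
        # imap_unordered yields results as workers finish; a larger
        # chunksize amortizes per-task submission overhead.
        yield from pool.imap_unordered(fn, items, chunksize=chunk_size)

if __name__ == "__main__":
    print(sorted(parallel_map(_work, range(10))))
```

Returning a generator lets callers consume either lists or streamed results, matching the "support list and generator results" bullet.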

Test Plan

Build hipblaslt for all ISAs as part of nixpkgs rocmPackage set.
Watch tensile build dir space usage.

clear; while true; sudo du -h -d0 (echo /nix/var/nix/builds/*)/build/source/projects/hipblaslt/build/Tensile; sleep 90s; end

Load logic from largest yaml file on each commit in patch series and compare time and peak memory.

Run full ISA TensileCreateLibrary.

Test Result

Builds successfully.

Peak space usage has been decimated to ≈25G just before TensileCreateLibrary finishes and deletes temporary files.

…
25G     /nix/var/nix/builds/nix-1862072-2932725840/build/source/projects/hipblaslt/build/Tensile
…

Large YAML Load Logic Perf

How expensive is it to load aquavanjaram_Cijk_Ailk_Bjlk_SB_Bias_HAS_SAV_UserArgs.yaml which is 80MiB?

| Subject | Time | Peak Memory | vs Baseline |
| --- | --- | --- | --- |
| [hipblaslt] Use multiprocessing.Pool for TensileCreateLibrary | 18.107s | 879.7 MB | baseline |
| [hipblaslt] CustomKernels: lru_cache for 20x speedup of some logic files | 18.301s | 878.9 MB | 0% faster, 0% less memory |
| [hipblaslt] reduce memory usage during logic load | 16.902s | 879.9 MB | 5% faster, 0% less memory |
| [hipblaslt] Remove unused key arg for getPrimitiveParameterValueAbbreviation | 16.823s | 878.6 MB | 7.5% faster, 0% less memory |
| [hipblaslt] intern strings to reduce duplicate memory for solution keys | 16.295s | 729.1 MB | 10% faster, 17.5% less memory |
| [hipblaslt] tensilelite: teach state_key_ordering slots | 16.141s | 728.1 MB | 10% faster, 17.5% less memory |
| [hipblaslt] tensilelite: intern FreeIndex, BatchIndex, BoundIndex and SizeMapping | 16.422s | 709.2 MB | 10% faster, 20% less memory |
| [hipblaslt] tensilelite: remove unused targetObjFilename | 16.399s | 709.2 MB | 10% faster, 20% less memory |
| [hipblaslt] WIP JANK tensilelite: record code object file index without mutations | 15.795s | 590.3 MB | 12.5% faster, 32.5% less memory |
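The interning and slots rows above rely on two standard CPython techniques: `sys.intern` dedups equal strings into one shared object, and `__slots__` drops the per-instance `__dict__`. A minimal illustration (the class name is hypothetical, not from the PR):

```python
import sys

class SolutionKeyPair:
    # __slots__ removes the per-instance __dict__, shrinking every object;
    # this is the same idea as the state_key_ordering slots commit above.
    __slots__ = ("name", "value")

    def __init__(self, name, value):
        # sys.intern dedups equal key strings so thousands of solutions
        # share one string object instead of each holding its own copy.
        self.name = sys.intern(name)
        self.value = value

# Two equal strings built at runtime are distinct objects...
k1 = "".join(["Macro", "Tile", "0"])
k2 = "".join(["Macro", "Tile", "0"])
print(k1 is k2)                          # False: separate allocations
# ...but interning collapses them into one shared object.
print(sys.intern(k1) is sys.intern(k2))  # True
# Slotted instances carry no __dict__ at all:
print(hasattr(SolutionKeyPair("GSU", 1), "__dict__"))  # False
```

The savings multiply when hundreds of thousands of kernel/solution objects repeat the same key strings.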

Full ISA TensileCreateLibrary

Time before

Total time (s): 4630.94
Total kernels processed: 196454
Kernels processed per second: 42.42
KernelHelperObjs: 328
# Peak memory ~34GB not logged

Time After

Total time (s): 3418.73
Total kernels processed: 196454
Kernels processed per second: 57.46
KernelHelperObjs: 328
Peak memory usage (MB): 24,266.2
Current memory usage (MB): 21,930.3

Submission Checklist

Comment on lines +95 to +106
# Build reference count map for .o files to handle shared object files
# FIXME: why are some .o files shared between multiple .co files?
objFileRefCount = collections.Counter()
for coFileRaw, objFiles in coFileMap.items():
    for objFile in objFiles:
        objFileRefCount[objFile] += 1

sharedObjFiles = {objFile: count for objFile, count in objFileRefCount.items() if count > 1}
if sharedObjFiles:
    print1(f"Found {len(sharedObjFiles)} .o files shared across multiple code objects:")
    for objFile, count in sharedObjFiles.items():
        print1(f"  {Path(objFile).name}: used by {count} code objects")
Author

There are a lot of .o files that are included in multiple .co files. Is that expected?

If there weren't we could simply unlink after calling linker() and drop this collections.Counter.
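With sharing, the Counter can still enable unlink-on-last-use: decrement after each link and unlink when the count hits zero. A sketch under that assumption (the `linker()` call is elided and all names here are hypothetical, not the PR's code):

```python
import collections
import os
import tempfile

def link_and_unlink(coFileMap):
    """Unlink each .o input once the last .co that needs it has been linked.
    coFileMap maps each .co output to its list of .o inputs."""
    objFileRefCount = collections.Counter(
        obj for objFiles in coFileMap.values() for obj in objFiles
    )
    for coFile, objFiles in coFileMap.items():
        # linker(coFile, objFiles) would run here.
        for obj in objFiles:
            objFileRefCount[obj] -= 1
            if objFileRefCount[obj] == 0:
                os.unlink(obj)  # last user is done: reclaim the space now

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        a, b = os.path.join(d, "a.o"), os.path.join(d, "b.o")
        for p in (a, b):
            open(p, "w").close()
        # b.o is shared between two code objects and survives until y.co links.
        link_and_unlink({"x.co": [a, b], "y.co": [b]})
        print(os.listdir(d))  # []
```

This keeps peak disk usage low without assuming a one-to-one .o/.co mapping.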

Contributor

Haven't checked it but it should be one arch per co, so the o files shouldn't be shared.

Author

Well that's not good 😅 I'm seeing thousands that are shared. Will investigate further.

Contributor

My mistake, it's per problem type per arch for one co file. Shouldn't be shared either.


Author

Marking draft to investigate.

Author

I think reusing some .o files is existing behavior:

visited = set()
duplicates = 0
splitGSU = False
for k in asmKernels:
    base = getKernelFileBase(splitGSU, k)
    k["BaseName"] = base
    k.duplicate = True if base in visited else False
    duplicates += k.duplicate
    print2(f"Duplicate: {base}")
    visited.add(base)
print1(f"Number of duplicate kernels: {duplicates}")
uniqueAsmKernels = [k for k in asmKernels if not k.duplicate]

The assembly files with .duplicate set to true are duplicates. This logic skips assembling and creating objects for them, and then in later steps they're used multiple times.

So I guess I should remove the verbose logging here and keep the Counter usage to allow unlinking on last use.

Author

No I think I got that wrong.

These duplicates look like they're from cases where the same solution exists in logic files for two different CU counts. There are a few of these for gfx90a for example.

Testing file: aldebaran_Cijk_Ailk_Bjlk_SB_Bias_HA_SAV.yaml
Loading from 104CU directory...
Loading from 110CU directory...

104CU solutions: 1122
110CU solutions: 1364
Solutions with matching str(): 2
Solutions only in 104CU: 1120
Solutions only in 110CU: 1362

Generating assembly for first common solution Cijk_Ailk_Bjlk_S_B_Bias_HA_S_SAV_UserArgs_MT128x128x16_MI32x32x1_SN_LDSB0_AA0_AFC1_AF1_AAIGTEn1_AAILTEn1_AFEM1_AFEM1_ASEM1_BL1_BS1_CLR1_CADS0_DSK0_DU16_DTL0_DTVA0_DTVB0_DTVSM0_EPS0_FDSI0_GRPM1_GRVWA4_GRVWB4_GSU1_GSUAMB_GSUC0_GSUWGMRR0_GLS0_ISA90a_IU1_IA0_KLA_LDSTI0_LBSPPA0_LBSPPB0_LBSPPM0_LPA0_LPB0_LPM0_LRVW1_LWPMn1_MIAV0_MIWT2_2_MDA2_MI32_32_2_1_MLDS65536_MO40_MPM0_NR0_NTn1_NTA0_NTB0_NTC0_NTD0_NTE0_NTM0_NTWS0_NEPBS2_NLCA1_NLCB1_ONLL1_PGR2_PLR1_PKA1_SGR1_SIA3_SLW1_SS1_SU0_SUM0_SUS64_SPO1_SRVW0_SSO0_SVW1_SK0_SKA0_SKFTR0_SKXCCM0_SNLL0_TT2_64_TLDS0_ULSGRO0_USL1_UCMLS0_UIOFGRO0_USFGROn1_VSn1_VWA1_VWB1_WSGRA0_WSGRB0_WSK0_WS64_WG64_4_1_WGM8_WGMXCC1_WGMXCCGn1_WGR0
  104CU CUCount: 104
  110CU CUCount: None
  Solutions have differing _state entries (beyond CUCount):
     SolutionIndex: 104CU=1122, 110CU=1362
  Assembler params:
     104CU: gfx=gfx90a, wavefront=64
     110CU: gfx=gfx90a, wavefront=64
  Assembly identical after filtering out random labels (905432 bytes)
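Comparing assembly "after filtering out random labels" can be done by normalizing generated label names before diffing. A rough sketch of that idea (the label regex and sample text are assumptions for illustration, not the actual comparison script):

```python
import re

def normalize_labels(asm: str) -> str:
    # Replace compiler-generated label suffixes (e.g. ".L_end_12345")
    # with a stable placeholder so identical code diffs clean even when
    # the random numeric suffixes differ between builds.
    return re.sub(r"\.L\w+_\d+", ".L_NORM", asm)

a = "v_add_u32 v0, v1, v2\n.L_end_12345:\ns_endpgm"
b = "v_add_u32 v0, v1, v2\n.L_end_98765:\ns_endpgm"
print(normalize_labels(a) == normalize_labels(b))  # True
```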

Contributor

KKyang commented Oct 12, 2025

Thanks for the PR. Did you test the build time with this change?

@LunNova LunNova marked this pull request as draft October 12, 2025 02:15
@LunNova LunNova force-pushed the tcl-refactor-space-saving branch 2 times, most recently from 16caf87 to 44ad192 on October 12, 2025 03:40
@LunNova

This comment was marked as outdated.

@LunNova LunNova force-pushed the tcl-refactor-space-saving branch from 44ad192 to f4681f4 on October 14, 2025 03:02
Author

LunNova commented Oct 14, 2025

cc @davidd-amd since you've attempted some of this before.

@LunNova LunNova changed the title [hipblaslt] Refactor Parallel.py to drop joblib, massively reduce peak disk space usage [hipblaslt] Refactor Parallel.py to drop joblib, decimate resource usage Oct 14, 2025
@JohnRTitor

CC @AviralGoelAMD @vidyasagar-amd @spolifroni-amd for review

We need this downstream; without it, builds take too much space on our builders.

Comment on lines +44 to +48
# 2. Parse MAKEFLAGS for -jN
makeflags = os.environ.get('MAKEFLAGS', '')
match = re.search(r'-j\s*(\d+)', makeflags)
if match:
    return int(match.group(1))
Contributor

We aren't guaranteed to have MAKEFLAGS, e.g. if ninja or another generator is used. Why not require that the parallel level is set explicitly, so you don't have to reach into the environment for this information? We can easily accommodate this in the build system.

Author

My goal here was to inherit from the build system if set, otherwise fall back to the existing logic. That way when I'm building with CMAKE_BUILD_PARALLEL_LEVEL=64 it doesn't try to run with 128 threads.
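The fallback chain described above could look something like this sketch (ordering per the discussion; `detect_jobs` is a hypothetical name, not the PR's function):

```python
import os
import re

def detect_jobs():
    # 1. Honor an explicit CMake parallel level when the build system sets one.
    level = os.environ.get("CMAKE_BUILD_PARALLEL_LEVEL", "")
    if level.isdigit():
        return int(level)
    # 2. Parse MAKEFLAGS for -jN (absent under ninja and other generators,
    #    which is the reviewer's objection above).
    match = re.search(r"-j\s*(\d+)", os.environ.get("MAKEFLAGS", ""))
    if match:
        return int(match.group(1))
    # 3. Fall back to the machine's CPU count.
    return os.cpu_count() or 1

if __name__ == "__main__":
    print(detect_jobs())
```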

Comment on lines +336 to +347
    value = sourceDictionary[key]
else:
    value = defaultDictionary[key]

if isinstance(value, (list, dict, set)):
    destinationDictionary[key] = deepcopy(value)
else:
    destinationDictionary[key] = value


# Keys in defaultSolution that contain list values
_SOLUTION_LIST_KEYS = frozenset({'WorkGroup', 'ThreadTile', 'MatrixInstruction'})
Contributor

So the idea here is to only deepcopy if the value is a container? That's a nice optimization, but we have to be very careful when eliminating deepcopies in this library as we have some very convoluted incidental data structures. I've been surprised in the past when making changes that seem OK on the surface, and then after scrutinizing the build you find missing symbols etc. We will have to do additional testing to be sure because the existing testing isn't sufficient for changes like these.
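The aliasing hazard behind that caution is easy to demonstrate: skipping deepcopy on a mutable value lets one solution's mutation silently leak into another. The dictionary names below echo the snippet above but the data is purely illustrative:

```python
from copy import deepcopy

defaultDictionary = {"WorkGroup": [16, 16, 1]}  # hypothetical default entry

# Without deepcopy, two "solutions" alias the same list object:
s1 = {"WorkGroup": defaultDictionary["WorkGroup"]}
s2 = {"WorkGroup": defaultDictionary["WorkGroup"]}
s1["WorkGroup"][0] = 32
print(s2["WorkGroup"][0])  # 32: s2 (and the default) were silently mutated

# deepcopy isolates each solution at the cost of extra allocations:
s3 = {"WorkGroup": deepcopy(defaultDictionary["WorkGroup"])}
s3["WorkGroup"][0] = 64
print(defaultDictionary["WorkGroup"][0])  # still 32, unaffected by s3
```

This is why the snippet above keeps deepcopy for list/dict/set values and only skips it for immutable scalars, which are safe to share.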

Author

The commit this was in had a marginal impact on memory, so if you think it's risky it's probably best if I drop that change.

@LunNova LunNova force-pushed the tcl-refactor-space-saving branch from f4681f4 to b0a1242 on October 17, 2025 15:03
- Drastically improves peak build directory space usage by unlinking assembly files sooner
- Improves build time by processing in chunks with lower submission overhead
100s -> 5s for aquavanjaram_Cijk_Ailk_Bljk_F8NH_HHS_BH_Bias_HAS_SAB_SAV_freesize_custom_GSUs
@LunNova LunNova force-pushed the tcl-refactor-space-saving branch from b0a1242 to c794f13 on October 17, 2025 15:06
@eiis1000

As someone currently sitting on 64GB of memory usage consumed by an infinitude of python3.13 processes while loading logics in an attempt to build hipblaslt-7.1.1, this would be a godsend :)

Contributor

GZGavinZhao commented Feb 20, 2026

I tried rebasing this PR on top of rocm-7.2.0 and patch tells me some of the changes have already been applied. On top of rocm-7.2.0, after the interning patch (GZGavinZhao/rocm-libraries@87ac38f) it gets down to a reasonable memory usage. I forgot how much memory it used, but IIRC it didn't exceed 128GB when running with 32 jobs.

Author

LunNova commented Feb 23, 2026

Yeah, looks like some parts of this have been upstreamed (or independently rederived, it's not rocket science!), e.g. f27f340

tensilelite isn't fun to work on and I haven't had time to progress this. It's possible to unlink the .o files too with proper tracking of the ones used twice, but the reuse kinda indicates something's fundamentally a bit weird with solution libraries having overlapping kernels, and I'm uneasy about it.


Labels

external contribution (code contribution from the user community) · project: hipblaslt


Development

Successfully merging this pull request may close these issues.

[Issue]: hipblaslt requires 240GB of space in build directory for all ISA build
[Issue]: failure to handle missing python module joblib

6 participants