
[hipblaslt] Refactor Parallel.py to drop joblib, decimate resource usage #2073

Draft
LunNova wants to merge 12 commits into ROCm:develop from LunNova:tcl-refactor-space-saving

Conversation


@LunNova LunNova commented Oct 12, 2025

Fixes #2072
Fixes #288
Improves #316. Memory consumption is down by ~1/3rd but still too high.

Motivation

hipblaslt is too resource intensive to build.
This PR drastically improves peak build directory space usage by unlinking assembly and object files sooner.
Peak build dir space usage has been reduced from >240GB to 25GB for an all-ISA build.

The joblib dependency seems like overkill; multiprocessing is built into Python and is adequate as long as chunking is set appropriately. I recall an earlier discussion on one of the AMD issue trackers indicating you were open to removing joblib, but I don't recall where exactly.

Technical Details

  • swapped joblib for a multiprocessing fork/spawn pool, chosen per platform
    • use imap_unordered
    • support list and generator results
    • updated call sites
  • taught TensileCreateLibrary to unlink intermediary files after use
  • aggressively eliminated redundant copies of solution-related structures to reduce memory usage and logic loading time
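The pool swap described above can be sketched roughly as follows. This is a minimal illustration of the technique (fork on Linux, spawn elsewhere, imap_unordered with a chunksize), not the PR's actual code; `parallel_map` and `_work` are hypothetical names.

```python
import multiprocessing
import sys

def _work(x):
    # Stand-in for the real per-kernel work function.
    return x * x

def parallel_map(fn, items, jobs=4, chunk_size=64):
    # fork is cheap on Linux; spawn is the portable fallback elsewhere,
    # mirroring the platform switch described above.
    method = "fork" if sys.platform == "linux" else "spawn"
    ctx = multiprocessing.get_context(method)
    with ctx.Pool(processes=jobs) as pool:
        # imap_unordered yields results as workers finish; a larger
        # chunksize amortizes per-task submission overhead.
        yield from pool.imap_unordered(fn, items, chunksize=chunk_size)

if __name__ == "__main__":
    print(sorted(parallel_map(_work, range(10))))
```

Returning a generator lets callers consume either lists or streamed results, matching the "support list and generator results" bullet.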

Test Plan

Build hipblaslt for all ISAs as part of nixpkgs rocmPackage set.
Watch tensile build dir space usage.

clear; while true; sudo du -h -d0 (echo /nix/var/nix/builds/*)/build/source/projects/hipblaslt/build/Tensile; sleep 90s; end

Load logic from largest yaml file on each commit in patch series and compare time and peak memory.

Run full ISA TensileCreateLibrary.

Test Result

Builds successfully.

Peak space usage has been decimated to ≈25G just before TensileCreateLibrary finishes and deletes temporary files.

…
25G     /nix/var/nix/builds/nix-1862072-2932725840/build/source/projects/hipblaslt/build/Tensile
…

Large YAML Load Logic Perf

How expensive is it to load aquavanjaram_Cijk_Ailk_Bjlk_SB_Bias_HAS_SAV_UserArgs.yaml which is 80MiB?

| Subject | Time | Peak Memory | vs Baseline |
| --- | --- | --- | --- |
| [hipblaslt] Use multiprocessing.Pool for TensileCreateLibrary | 18.107s | 879.7 MB | baseline |
| [hipblaslt] CustomKernels: lru_cache for 20x speedup of some logic files | 18.301s | 878.9 MB | 0% faster, 0% less memory |
| [hipblaslt] reduce memory usage during logic load | 16.902s | 879.9 MB | 5% faster, 0% less memory |
| [hipblaslt] Remove unused key arg for getPrimitiveParameterValueAbbreviation | 16.823s | 878.6 MB | 7.5% faster, 0% less memory |
| [hipblaslt] intern strings to reduce duplicate memory for solution keys | 16.295s | 729.1 MB | 10% faster, 17.5% less memory |
| [hipblaslt] tensilelite: teach state_key_ordering slots | 16.141s | 728.1 MB | 10% faster, 17.5% less memory |
| [hipblaslt] tensilelite: intern FreeIndex, BatchIndex, BoundIndex and SizeMapping | 16.422s | 709.2 MB | 10% faster, 20% less memory |
| [hipblaslt] tensilelite: remove unused targetObjFilename | 16.399s | 709.2 MB | 10% faster, 20% less memory |
| [hipblaslt] WIP JANK tensilelite: record code object file index without mutations | 15.795s | 590.3 MB | 12.5% faster, 32.5% less memory |
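The interning and slots rows above rely on two standard CPython techniques: `sys.intern` dedups equal strings into one shared object, and `__slots__` drops the per-instance `__dict__`. A minimal illustration (the class name is hypothetical, not from the PR):

```python
import sys

class SolutionKeyPair:
    # __slots__ removes the per-instance __dict__, shrinking every object;
    # this is the same idea as the state_key_ordering slots commit above.
    __slots__ = ("name", "value")

    def __init__(self, name, value):
        # sys.intern dedups equal key strings so thousands of solutions
        # share one string object instead of each holding its own copy.
        self.name = sys.intern(name)
        self.value = value

# Two equal strings built at runtime are distinct objects...
k1 = "".join(["Macro", "Tile", "0"])
k2 = "".join(["Macro", "Tile", "0"])
print(k1 is k2)                          # False: separate allocations
# ...but interning collapses them into one shared object.
print(sys.intern(k1) is sys.intern(k2))  # True
# Slotted instances carry no __dict__ at all:
print(hasattr(SolutionKeyPair("GSU", 1), "__dict__"))  # False
```

The savings multiply when hundreds of thousands of kernel/solution objects repeat the same key strings.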

Full ISA TensileCreateLibrary

Time before

Total time (s): 4630.94
Total kernels processed: 196454
Kernels processed per second: 42.42
KernelHelperObjs: 328
# Peak memory ~34GB not logged

Time After

Total time (s): 3418.73
Total kernels processed: 196454
Kernels processed per second: 57.46
KernelHelperObjs: 328
Peak memory usage (MB): 24,266.2
Current memory usage (MB): 21,930.3

Submission Checklist

Comment on lines +95 to +106
# Build reference count map for .o files to handle shared object files
# FIXME: why are some .o files shared between multiple .co files?
objFileRefCount = collections.Counter()
for coFileRaw, objFiles in coFileMap.items():
    for objFile in objFiles:
        objFileRefCount[objFile] += 1

sharedObjFiles = {objFile: count for objFile, count in objFileRefCount.items() if count > 1}
if sharedObjFiles:
    print1(f"Found {len(sharedObjFiles)} .o files shared across multiple code objects:")
    for objFile, count in sharedObjFiles.items():
        print1(f"  {Path(objFile).name}: used by {count} code objects")
Author

There are a lot of .o files that are included in multiple .co files. Is that expected?

If there weren't we could simply unlink after calling linker() and drop this collections.Counter.
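With sharing, the Counter can still enable unlink-on-last-use: decrement after each link and unlink when the count hits zero. A sketch under that assumption (the `linker()` call is elided and all names here are hypothetical, not the PR's code):

```python
import collections
import os
import tempfile

def link_and_unlink(coFileMap):
    """Unlink each .o input once the last .co that needs it has been linked.
    coFileMap maps each .co output to its list of .o inputs."""
    objFileRefCount = collections.Counter(
        obj for objFiles in coFileMap.values() for obj in objFiles
    )
    for coFile, objFiles in coFileMap.items():
        # linker(coFile, objFiles) would run here.
        for obj in objFiles:
            objFileRefCount[obj] -= 1
            if objFileRefCount[obj] == 0:
                os.unlink(obj)  # last user is done: reclaim the space now

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        a, b = os.path.join(d, "a.o"), os.path.join(d, "b.o")
        for p in (a, b):
            open(p, "w").close()
        # b.o is shared between two code objects and survives until y.co links.
        link_and_unlink({"x.co": [a, b], "y.co": [b]})
        print(os.listdir(d))  # []
```

This keeps peak disk usage low without assuming a one-to-one .o/.co mapping.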

Contributor

Haven't checked it but it should be one arch per co, so the o files shouldn't be shared.

Author

Well that's not good 😅 I'm seeing thousands that are shared. Will investigate further.

Contributor

My mistake, it's per problem type per arch for one co file. Shouldn't be shared either.


Author

Marking draft to investigate.

Author

I think reusing some .o files is existing behavior:

visited = set()
duplicates = 0
splitGSU = False
for k in asmKernels:
    base = getKernelFileBase(splitGSU, k)
    k["BaseName"] = base
    k.duplicate = True if base in visited else False
    duplicates += k.duplicate
    print2(f"Duplicate: {base}")
    visited.add(base)
print1(f"Number of duplicate kernels: {duplicates}")
uniqueAsmKernels = [k for k in asmKernels if not k.duplicate]

The assembly files with .duplicate set to true are duplicates. This logic skips assembling and creating objects for them, and then in later steps they're used multiple times.

So I guess I should remove the verbose logging here and keep the Counter usage to allow unlinking on last use.

Author

No I think I got that wrong.

These duplicates look like they're from cases where the same solution exists in logic files for two different CU counts. There are a few of these for gfx90a for example.

Testing file: aldebaran_Cijk_Ailk_Bjlk_SB_Bias_HA_SAV.yaml
Loading from 104CU directory...
Loading from 110CU directory...

104CU solutions: 1122
110CU solutions: 1364
Solutions with matching str(): 2
Solutions only in 104CU: 1120
Solutions only in 110CU: 1362

Generating assembly for first common solution Cijk_Ailk_Bjlk_S_B_Bias_HA_S_SAV_UserArgs_MT128x128x16_MI32x32x1_SN_LDSB0_AA0_AFC1_AF1_AAIGTEn1_AAILTEn1_AFEM1_AFEM1_ASEM1_BL1_BS1_CLR1_CADS0_DSK0_DU16_DTL0_DTVA0_DTVB0_DTVSM0_EPS0_FDSI0_GRPM1_GRVWA4_GRVWB4_GSU1_GSUAMB_GSUC0_GSUWGMRR0_GLS0_ISA90a_IU1_IA0_KLA_LDSTI0_LBSPPA0_LBSPPB0_LBSPPM0_LPA0_LPB0_LPM0_LRVW1_LWPMn1_MIAV0_MIWT2_2_MDA2_MI32_32_2_1_MLDS65536_MO40_MPM0_NR0_NTn1_NTA0_NTB0_NTC0_NTD0_NTE0_NTM0_NTWS0_NEPBS2_NLCA1_NLCB1_ONLL1_PGR2_PLR1_PKA1_SGR1_SIA3_SLW1_SS1_SU0_SUM0_SUS64_SPO1_SRVW0_SSO0_SVW1_SK0_SKA0_SKFTR0_SKXCCM0_SNLL0_TT2_64_TLDS0_ULSGRO0_USL1_UCMLS0_UIOFGRO0_USFGROn1_VSn1_VWA1_VWB1_WSGRA0_WSGRB0_WSK0_WS64_WG64_4_1_WGM8_WGMXCC1_WGMXCCGn1_WGR0
  104CU CUCount: 104
  110CU CUCount: None
  Solutions have differing _state entries (beyond CUCount):
     SolutionIndex: 104CU=1122, 110CU=1362
  Assembler params:
     104CU: gfx=gfx90a, wavefront=64
     110CU: gfx=gfx90a, wavefront=64
  Assembly identical after filtering out random labels (905432 bytes)
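Comparing assembly "after filtering out random labels" can be done by normalizing generated label names before diffing. A rough sketch of that idea (the label regex and sample text are assumptions for illustration, not the actual comparison script):

```python
import re

def normalize_labels(asm: str) -> str:
    # Replace compiler-generated label suffixes (e.g. ".L_end_12345")
    # with a stable placeholder so identical code diffs clean even when
    # the random numeric suffixes differ between builds.
    return re.sub(r"\.L\w+_\d+", ".L_NORM", asm)

a = "v_add_u32 v0, v1, v2\n.L_end_12345:\ns_endpgm"
b = "v_add_u32 v0, v1, v2\n.L_end_98765:\ns_endpgm"
print(normalize_labels(a) == normalize_labels(b))  # True
```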

Contributor

KKyang commented Oct 12, 2025

Thanks for the PR. Did you test the build time with this change?

@LunNova LunNova marked this pull request as draft October 12, 2025 02:15
@LunNova LunNova force-pushed the tcl-refactor-space-saving branch 2 times, most recently from 16caf87 to 44ad192 on October 12, 2025 03:40
@LunNova

This comment was marked as outdated.

@LunNova LunNova force-pushed the tcl-refactor-space-saving branch from 44ad192 to f4681f4 on October 14, 2025 03:02
Author

LunNova commented Oct 14, 2025

cc @davidd-amd since you've attempted some of this before.

@LunNova LunNova changed the title [hipblaslt] Refactor Parallel.py to drop joblib, massively reduce peak disk space usage [hipblaslt] Refactor Parallel.py to drop joblib, decimate resource usage Oct 14, 2025
@JohnRTitor

CC @AviralGoelAMD @vidyasagar-amd @spolifroni-amd for review

We need this downstream; without it, builds take too much space on our builders.

Comment on lines +44 to +48
# 2. Parse MAKEFLAGS for -jN
makeflags = os.environ.get('MAKEFLAGS', '')
match = re.search(r'-j\s*(\d+)', makeflags)
if match:
    return int(match.group(1))
Contributor

We aren't guaranteed to have MAKEFLAGS, e.g. if ninja or another generator is used. Why not require that the parallel level is set explicitly, so you don't have to reach into the environment for this information? We can easily accommodate this in the build system.

Author

My goal here was to inherit from the build system if set, otherwise fall back to the existing logic. That way when I'm building with CMAKE_BUILD_PARALLEL_LEVEL=64 it doesn't try to run with 128 threads.
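The fallback chain described above could look something like this sketch (ordering per the discussion; `detect_jobs` is a hypothetical name, not the PR's function):

```python
import os
import re

def detect_jobs():
    # 1. Honor an explicit CMake parallel level when the build system sets one.
    level = os.environ.get("CMAKE_BUILD_PARALLEL_LEVEL", "")
    if level.isdigit():
        return int(level)
    # 2. Parse MAKEFLAGS for -jN (absent under ninja and other generators,
    #    which is the reviewer's objection above).
    match = re.search(r"-j\s*(\d+)", os.environ.get("MAKEFLAGS", ""))
    if match:
        return int(match.group(1))
    # 3. Fall back to the machine's CPU count.
    return os.cpu_count() or 1

if __name__ == "__main__":
    print(detect_jobs())
```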

Comment on lines +336 to +347
    value = sourceDictionary[key]
else:
    value = defaultDictionary[key]

if isinstance(value, (list, dict, set)):
    destinationDictionary[key] = deepcopy(value)
else:
    destinationDictionary[key] = value


# Keys in defaultSolution that contain list values
_SOLUTION_LIST_KEYS = frozenset({'WorkGroup', 'ThreadTile', 'MatrixInstruction'})
Contributor

So the idea here is to only deepcopy if the value is a container? That's a nice optimization, but we have to be very careful when eliminating deepcopies in this library as we have some very convoluted incidental data structures. I've been surprised in the past when making changes that seem OK on the surface, and then after scrutinizing the build you find missing symbols etc. We will have to do additional testing to be sure because the existing testing isn't sufficient for changes like these.
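The aliasing hazard behind that caution is easy to demonstrate: skipping deepcopy on a mutable value lets one solution's mutation silently leak into another. The dictionary names below echo the snippet above but the data is purely illustrative:

```python
from copy import deepcopy

defaultDictionary = {"WorkGroup": [16, 16, 1]}  # hypothetical default entry

# Without deepcopy, two "solutions" alias the same list object:
s1 = {"WorkGroup": defaultDictionary["WorkGroup"]}
s2 = {"WorkGroup": defaultDictionary["WorkGroup"]}
s1["WorkGroup"][0] = 32
print(s2["WorkGroup"][0])  # 32: s2 (and the default) were silently mutated

# deepcopy isolates each solution at the cost of extra allocations:
s3 = {"WorkGroup": deepcopy(defaultDictionary["WorkGroup"])}
s3["WorkGroup"][0] = 64
print(defaultDictionary["WorkGroup"][0])  # still 32, unaffected by s3
```

This is why the snippet above keeps deepcopy for list/dict/set values and only skips it for immutable scalars, which are safe to share.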

Author

The commit this was in had a marginal impact on memory, so if you think it's risky it's probably best if I drop that change.

@LunNova LunNova force-pushed the tcl-refactor-space-saving branch from f4681f4 to b0a1242 on October 17, 2025 15:03
- Drastically improves peak build directory space usage by unlinking assembly files sooner
- Improves build time by processing in chunks with lower submission overhead
100s -> 5s for aquavanjaram_Cijk_Ailk_Bljk_F8NH_HHS_BH_Bias_HAS_SAB_SAV_freesize_custom_GSUs
@LunNova LunNova force-pushed the tcl-refactor-space-saving branch from b0a1242 to c794f13 on October 17, 2025 15:06
@eiis1000

As someone currently sitting on 64GB of memory usage consumed by an infinitude of python3.13 processes while loading logics in an attempt to build hipblaslt-7.1.1, this would be a godsend :)

Contributor

GZGavinZhao commented Feb 20, 2026

I tried rebasing this PR on top of rocm-7.2.0 and patch tells me some of the changes have already been applied. On top of rocm-7.2.0, after the interning patch (GZGavinZhao/rocm-libraries@87ac38f) it gets down to a reasonable memory usage. I forgot how much memory it used, but IIRC it didn't exceed 128GB when running with 32 jobs.

Author

LunNova commented Feb 23, 2026

Yeah, looks like some parts of this have been upstreamed (or independently rederived, it's not rocket science!), e.g. f27f340

tensilelite isn't fun to work on and I haven't had time to progress this. It's possible to unlink the .o files too with proper tracking of the ones used twice, but the reuse kinda indicates something's fundamentally a bit weird with solution libraries having overlapping kernels, and I'm uneasy about it.


Labels

external contribution (code contribution from the user community) · project: hipblaslt


Development

Successfully merging this pull request may close these issues.

[Issue]: hipblaslt requires 240GB of space in build directory for all ISA build
[Issue]: failure to handle missing python module joblib

6 participants