[hipblaslt] Refactor Parallel.py to drop joblib, decimate resource usage #2073
LunNova wants to merge 12 commits into ROCm:develop
Conversation
Force-pushed from 98a6820 to 5660637
```python
# Build reference count map for .o files to handle shared object files
# FIXME: why are some .o files shared between multiple .co files?
objFileRefCount = collections.Counter()
for coFileRaw, objFiles in coFileMap.items():
    for objFile in objFiles:
        objFileRefCount[objFile] += 1

sharedObjFiles = {objFile: count for objFile, count in objFileRefCount.items() if count > 1}
if sharedObjFiles:
    print1(f"Found {len(sharedObjFiles)} .o files shared across multiple code objects:")
    for objFile, count in sharedObjFiles.items():
        print1(f"  {Path(objFile).name}: used by {count} code objects")
```
There are a lot of .o files that are included in multiple .co files. Is that expected?
If there weren't, we could simply unlink after calling linker() and drop this collections.Counter.
Haven't checked, but it should be one arch per .co, so the .o files shouldn't be shared.
Well that's not good 😅 I'm seeing thousands that are shared. Will investigate further.
My mistake, it's per problem type per arch for one co file. Shouldn't be shared either.
I think reusing some .o files is existing behavior:
The assembly files with .duplicate set to true are duplicates. This logic skips assembling and creating objects for them, and then in later steps they're used multiple times.
So I guess I should remove the verbose logging here and keep the Counter usage to allow unlinking on last use.
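If the Counter is kept, unlink-on-last-use could look something like this minimal sketch (function and variable names are hypothetical here, not the actual Parallel.py API):

```python
import collections
import os

def unlinkObjOnLastUse(objFileRefCount: collections.Counter, objFiles):
    """Decrement the refcount for each .o file and unlink it once no
    remaining .co file needs it. Illustrative sketch only: the real
    Parallel.py code may track and delete files differently."""
    for objFile in objFiles:
        objFileRefCount[objFile] -= 1
        if objFileRefCount[objFile] <= 0:
            os.unlink(objFile)       # last user: reclaim disk space now
            del objFileRefCount[objFile]
```

Each .co link step would call this with the .o files it just consumed, so disk usage peaks at the working set rather than the full build.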
No I think I got that wrong.
These duplicates look like they're from cases where the same solution exists in logic files for two different CU counts. There are a few of these for gfx90a for example.
```
Testing file: aldebaran_Cijk_Ailk_Bjlk_SB_Bias_HA_SAV.yaml
Loading from 104CU directory...
Loading from 110CU directory...
104CU solutions: 1122
110CU solutions: 1364
Solutions with matching str(): 2
Solutions only in 104CU: 1120
Solutions only in 110CU: 1362
Generating assembly for first common solution Cijk_Ailk_Bjlk_S_B_Bias_HA_S_SAV_UserArgs_MT128x128x16_MI32x32x1_SN_LDSB0_AA0_AFC1_AF1_AAIGTEn1_AAILTEn1_AFEM1_AFEM1_ASEM1_BL1_BS1_CLR1_CADS0_DSK0_DU16_DTL0_DTVA0_DTVB0_DTVSM0_EPS0_FDSI0_GRPM1_GRVWA4_GRVWB4_GSU1_GSUAMB_GSUC0_GSUWGMRR0_GLS0_ISA90a_IU1_IA0_KLA_LDSTI0_LBSPPA0_LBSPPB0_LBSPPM0_LPA0_LPB0_LPM0_LRVW1_LWPMn1_MIAV0_MIWT2_2_MDA2_MI32_32_2_1_MLDS65536_MO40_MPM0_NR0_NTn1_NTA0_NTB0_NTC0_NTD0_NTE0_NTM0_NTWS0_NEPBS2_NLCA1_NLCB1_ONLL1_PGR2_PLR1_PKA1_SGR1_SIA3_SLW1_SS1_SU0_SUM0_SUS64_SPO1_SRVW0_SSO0_SVW1_SK0_SKA0_SKFTR0_SKXCCM0_SNLL0_TT2_64_TLDS0_ULSGRO0_USL1_UCMLS0_UIOFGRO0_USFGROn1_VSn1_VWA1_VWB1_WSGRA0_WSGRB0_WSK0_WS64_WG64_4_1_WGM8_WGMXCC1_WGMXCCGn1_WGR0
104CU CUCount: 104
110CU CUCount: None
Solutions have differing _state entries (beyond CUCount):
SolutionIndex: 104CU=1122, 110CU=1362
Assembler params:
104CU: gfx=gfx90a, wavefront=64
110CU: gfx=gfx90a, wavefront=64
Assembly identical after filtering out random labels (905432 bytes)
```
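The "filtering out random labels" comparison can be sketched roughly like this (the label regex and function names are assumptions for illustration; the actual test script may differ):

```python
import re

def normalizeAsm(text: str) -> str:
    """Replace autogenerated local labels (e.g. .L_end_123) with a fixed
    placeholder so two assembly dumps can be compared structurally.
    The exact label pattern here is an assumption, not the real script's."""
    return re.sub(r'\.?L\w*_\w+', 'LBL', text)

def asmEquivalent(a: str, b: str) -> bool:
    # Equal after normalization means the only differences were label names.
    return normalizeAsm(a) == normalizeAsm(b)
```

This is how two solutions that differ only in CUCount/SolutionIndex can still be shown to produce byte-identical kernels.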
Thanks for the PR. Did you test the build time with this change?
Force-pushed from 16caf87 to 44ad192
Force-pushed from 44ad192 to f4681f4
cc @davidd-amd since you've attempted some of this before.
CC @AviralGoelAMD @vidyasagar-amd @spolifroni-amd for review. We need this downstream, as without it the build takes too much space on our builders.
```python
# 2. Parse MAKEFLAGS for -jN
makeflags = os.environ.get('MAKEFLAGS', '')
match = re.search(r'-j\s*(\d+)', makeflags)
if match:
    return int(match.group(1))
```
We aren't guaranteed to have MAKEFLAGS, e.g. if ninja or another generator is used. Why not require that the parallel level is set explicitly, so you don't have to reach into the environment for this information? We can easily accommodate this in the build system.
My goal here was to inherit from the build system if set, otherwise fall back to the existing logic. That way when I'm building with CMAKE_BUILD_PARALLEL_LEVEL=64 it doesn't try to run with 128 threads.
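A fallback chain matching that description could look like this sketch (function name hypothetical; the PR's actual ordering of sources may differ):

```python
import os
import re

def detectParallelism() -> int:
    """Pick a worker count, preferring what the build system says:
    1. honor CMAKE_BUILD_PARALLEL_LEVEL if set,
    2. otherwise parse -jN out of MAKEFLAGS,
    3. otherwise fall back to the machine's CPU count."""
    level = os.environ.get('CMAKE_BUILD_PARALLEL_LEVEL', '')
    if level.isdigit():
        return int(level)
    match = re.search(r'-j\s*(\d+)', os.environ.get('MAKEFLAGS', ''))
    if match:
        return int(match.group(1))
    return os.cpu_count() or 1
```

With this shape, a CMAKE_BUILD_PARALLEL_LEVEL=64 build never oversubscribes to the full hardware thread count.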
```python
    value = sourceDictionary[key]
else:
    value = defaultDictionary[key]

if isinstance(value, (list, dict, set)):
    destinationDictionary[key] = deepcopy(value)
else:
    # previously: destinationDictionary[key] = deepcopy(defaultDictionary[key])
    destinationDictionary[key] = value

# Keys in defaultSolution that contain list values
_SOLUTION_LIST_KEYS = frozenset({'WorkGroup', 'ThreadTile', 'MatrixInstruction'})
```
So the idea here is to only deepcopy when the value is a container? That's a nice optimization, but we have to be very careful when eliminating deepcopies in this library, as we have some very convoluted incidental data structures. I've been surprised in the past by changes like this that seem fine on the surface, and then after scrutinizing the build you find missing symbols, etc. We will have to do additional testing to be sure, because the existing testing isn't sufficient for changes like these.
The commit this was in had only a marginal impact on memory, so if you think it's risky it's probably best if I drop that change.
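For reference, the hazard being discussed: skipping deepcopy is only safe for immutable values, because a shared list would let a later mutation leak back into the shared defaults. A small self-contained illustration (names and keys hypothetical, not the real Tensile structures):

```python
from copy import deepcopy

# Hypothetical defaults: one mutable value (list), one immutable (str).
default = {'WorkGroup': [16, 16, 1], 'KernelLanguage': 'Assembly'}

def assignParam(dest, key, value):
    """Copy containers, share immutables: mutating the destination's
    list must not alias the shared default."""
    if isinstance(value, (list, dict, set)):
        dest[key] = deepcopy(value)
    else:
        dest[key] = value  # strings/ints are immutable; sharing is safe

solution = {}
assignParam(solution, 'WorkGroup', default['WorkGroup'])
assignParam(solution, 'KernelLanguage', default['KernelLanguage'])
solution['WorkGroup'][0] = 32  # must not change default['WorkGroup']
```

Without the isinstance check, sharing the list would silently corrupt the defaults for every later solution.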
Force-pushed from f4681f4 to b0a1242
- Drastically improves peak build directory space usage by unlinking assembly files sooner
- Improves build time by processing in chunks with lower submission overhead
  (100s -> 5s for aquavanjaram_Cijk_Ailk_Bljk_F8NH_HHS_BH_Bias_HAS_SAB_SAV_freesize_custom_GSUs)
Force-pushed from b0a1242 to c794f13
As someone currently sitting on 64GB of memory usage consumed by an infinitude of python3.13 processes while loading logics in an attempt to build hipblaslt-7.1.1, this would be a godsend :)
I tried rebasing this PR on top of
Yeah, looks like some parts of this have been upstreamed (or independently rederived, it's not rocket science!), e.g. f27f340. tensilelite isn't fun to work on and I haven't had time to progress it. It's possible to unlink the .o files too with proper tracking of them being used twice, but the reuse kind of indicates something's fundamentally a bit weird with solution libraries having overlapping kernels, and I am uneasy about it.
Fixes #2072
Fixes #288
Improves #316. Memory consumption is down by ~1/3rd but still too high.
Motivation
hipblaslt is too resource intensive to build.
This PR drastically improves peak build directory space usage by unlinking assembly and object files sooner.
Peak build dir space usage has been reduced from >240GB to 25GB for an all ISA build
The joblib dependency seems like overkill; multiprocessing is built into Python and is adequate as long as chunking is set appropriately. I recall an earlier discussion on one of the AMD issue trackers indicating you were open to removing joblib, but I don't recall where exactly.
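The joblib-to-stdlib swap described above can be sketched as follows (function names, the fork context, and the chunk size are illustrative, not the PR's exact code):

```python
import multiprocessing

def _assembleOne(task):
    # Stand-in for the real per-file assembly step.
    return task * task

def runInParallel(tasks, processes, chunkSize=64):
    """Replace joblib with a plain multiprocessing.Pool. An explicit
    chunksize batches many small tasks per worker submission, which is
    where the lower overhead comes from."""
    ctx = multiprocessing.get_context("fork")  # fork keeps this sketch simple
    with ctx.Pool(processes) as pool:
        return pool.map(_assembleOne, tasks, chunksize=chunkSize)
```

With thousands of tiny assemble jobs, a too-small chunksize makes IPC overhead dominate; batching is what makes the builtin Pool competitive with joblib here.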
Technical Details
Test Plan
Build hipblaslt for all ISAs as part of nixpkgs rocmPackage set.
Watch tensile build dir space usage.
Load logic from largest yaml file on each commit in patch series and compare time and peak memory.
Run full ISA TensileCreateLibrary.
Test Result
Builds successfully.
Peak space usage has been decimated to ≈25G just before TensileCreateLibrary finishes and deletes temporary files.
Large YAML Load Logic Perf
How expensive is it to load aquavanjaram_Cijk_Ailk_Bjlk_SB_Bias_HAS_SAV_UserArgs.yaml, which is 80MiB?

Full ISA TensileCreateLibrary
Time before
Time after
Submission Checklist