
Conversation

@achew010 (Contributor) commented on Jun 4, 2024

Description

This PR shifts all GPU memory computation from the end of each experiment to the end of the benchmarking script. This removes the need to rerun experiments: the raw values are saved per experiment, and the aggregated values are computed across all experiments at the end, in gather_report.
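
The idea, as a rough sketch (the file layout and field names below are illustrative assumptions, not the PR's actual code): each experiment only dumps its raw readings, and aggregation happens in a single pass at the end.

```python
import glob
import json

import pandas as pd


def gather_report(result_dir: str) -> pd.DataFrame:
    # Each experiment dumps only raw stats; nothing is aggregated
    # inside the experiment itself.
    rows = []
    for path in glob.glob(f"{result_dir}/*/raw_stats.json"):  # assumed layout
        with open(path) as f:
            rows.append(json.load(f))
    df = pd.DataFrame(rows)
    # Aggregated values are computed once, across all experiments.
    df["peak_gpu_mem_mean"] = df["peak_gpu_mem_per_device"].apply(
        lambda per_device: sum(per_device) / len(per_device)
    )
    return df
```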

@achew010 achew010 requested a review from fabianlim as a code owner June 4, 2024 03:01
gpu_logs = pd.read_csv(gpu_log_filename, skipinitialspace=True)
peak_nvidia_mem_by_device_id, device_name = get_peak_mem_usage_by_device_id(gpu_logs)
experiment_stats[tag].update({
    RESULT_FIELD_RESERVED_GPU_MEM: peak_nvidia_mem_by_device_id.mean(),

Review comment (Contributor):
needs a comment on what we are taking the mean over.
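
One way the requested comment could read (a sketch; this wording is mine, not the PR's eventual fix):

```python
experiment_stats[tag].update({
    # mean of the peak nvidia-smi memory readings, taken across the
    # GPU device ids used by this experiment (one peak value per device)
    RESULT_FIELD_RESERVED_GPU_MEM: peak_nvidia_mem_by_device_id.mean(),
```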

except FileNotFoundError:
    pass

if script_args['log_nvidia_smi'] is True:

Review comment (Contributor):
don't need `is True`
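
That is, a plain truthiness check suffices (sketch of the requested simplification):

```python
if script_args['log_nvidia_smi']:
```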

    RESULT_FIELD_DEVICE_NAME: device_name,
})

if script_args['log_memory_hf'] is True and tag in experiment_stats.keys():

Review comment (Contributor):
see above

    k: v for k, v in experiment_stats[tag].items()
    if any([prefix in k for prefix in memory_metrics_prefixes])
}
if len(memory_metrics.keys())>0:

Review comment (Contributor):
please lint the file with `tox -e lint`
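
For reference, a lint-clean form of the snippet above would typically look like this (assuming the dict is assigned to memory_metrics, as the later check suggests; any() takes a generator directly, and an empty dict is already falsy):

```python
memory_metrics = {
    k: v
    for k, v in experiment_stats[tag].items()
    if any(prefix in k for prefix in memory_metrics_prefixes)
}
if memory_metrics:
```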

@fabianlim fabianlim merged commit bfde526 into foundation-model-stack:dev Jun 7, 2024
@achew010 achew010 deleted the shifted_gpu_mem_compute branch July 26, 2024 04:05