Conversation

@wanglei19991004 (Contributor) commented on Oct 30, 2025

Overview

Performance Monitor is FlagScale's performance monitoring module. It tracks and records key performance metrics in real time during training.

Key Features

  • TFLOPS Calculation – Computes floating-point operations per second in real time during model training (see the sketch after this list)
  • Throughput Monitoring – Tracks metrics such as samples/sec and tokens/sec
  • Memory Tracking – Monitors GPU memory usage
  • Performance Breakdown – Records detailed timing for each phase, including forward and backward propagation
  • File Logging – Saves performance data separately from the main training logs to avoid interference
  • Multi-format Output – Supports both text logs and JSON outputs
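
The monitor computes model-specific FLOPS internally (see the review discussion below). As a rough, illustrative cross-check only (not the formula the module uses), dense decoder-only training needs on the order of 6 FLOPs per parameter per trained token:

# Back-of-the-envelope sketch, not the monitor's formula. Values are illustrative.
def approx_training_tflops_per_gpu(num_params: float, tokens_per_second: float, num_gpus: int) -> float:
    flops_per_second = 6.0 * num_params * tokens_per_second  # ~6 FLOPs/param/token for fwd+bwd
    return flops_per_second / num_gpus / 1e12

print(approx_training_tflops_per_gpu(1.5e9, 1.0e5, 8))  # ~112.5 TFLOPS/GPU for a 1.5B-parameter model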

Supported Models

  • GPT
  • LLaMA (supports GQA)
  • Qwen
  • Mixtral (MoE)
  • Aquila

Quick Start

1. Launch Using the run.py Script

Enable via Command-line Arguments (Add New Parameters with the + Prefix)

python run.py \
  --config-path ./examples/aquila/conf \
  --config-name train \
  action=run \
  train.data.data_path=../data/pile_wikipedia_demo \
  +train.system.enable_perf_monitor=true \
  +train.system.perf_log_interval=10 \
  +train.system.perf_log_dir=./outputs/logs/perf_monitor

2. Enable in the YAML Configuration File

system:
  # other system config...

  # performance monitor config
  enable_perf_monitor: True
  perf_log_interval: 10
  perf_log_dir: ./outputs/logs/perf_monitor
  perf_console_output: False
  perf_memory_tracking: True
  perf_breakdown: False
  perf_max_log_files: 10

Command-line Arguments

Argument                   Description                           Default
--enable-perf-monitor      Enable performance monitoring         False
--perf-log-interval N      Logging interval (in steps)           10
--perf-log-dir PATH        Directory for log files               logs/perf_monitor
--perf-console-output      Also output logs to console           False
--perf-memory-tracking     Enable memory tracking                True
--perf-breakdown           Show detailed performance breakdown   False
--perf-max-log-files N     Maximum number of log files to keep   10
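
These flags are registered by an argument-provider helper (the review below references add_performance_args). A minimal sketch of what such a provider could look like, assuming plain argparse and mirroring the table above; the PR's actual implementation may differ:

import argparse

def add_performance_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Register performance-monitor flags (illustrative sketch)."""
    group = parser.add_argument_group("performance monitor")
    group.add_argument("--enable-perf-monitor", action="store_true",
                       help="Enable performance monitoring")
    group.add_argument("--perf-log-interval", type=int, default=10,
                       help="Logging interval (in steps)")
    group.add_argument("--perf-log-dir", type=str, default="logs/perf_monitor",
                       help="Directory for log files")
    group.add_argument("--perf-console-output", action="store_true",
                       help="Also output logs to console")
    group.add_argument("--perf-memory-tracking", action="store_true", default=True,
                       help="Enable memory tracking")
    group.add_argument("--perf-breakdown", action="store_true",
                       help="Show detailed performance breakdown")
    group.add_argument("--perf-max-log-files", type=int, default=10,
                       help="Maximum number of log files to keep")
    return parser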

Log File Description

The performance monitor generates the following files:

logs/perf_monitor/
├── perf_metrics_20240129_103000.log # Performance log in text format
├── perf_summary_20240129_103000.json # Summary data in JSON format
└── perf_realtime.log # Real-time updating log file

Log Format Example

Text Log (perf_metrics_*.log):

================================================================================
Performance Monitor Session Started: 2024-01-29 10:30:00
================================================================================
Timestamp            Step     TFLOPS/GPU   TFLOPS     Samples/s    Tokens/s     Time(ms)   Memory(GB)
--------------------------------------------------------------------------------
2024-01-29 10:30:15  10       125.34       1002.72    512.0        1048576      235.5      42.50
2024-01-29 10:30:30  20       128.12       1024.96    520.0        1064960      230.2      42.75

JSON Summary (perf_summary_*.json):

{
  "session_info": {
    "start_time": "20240129_103000",
    "end_time": "2024-01-29T11:30:00",
    "total_iterations": 100
  },
  "final_statistics": {
    "avg_tflops_per_gpu": 127.5,
    "max_tflops_per_gpu": 135.2,
    "min_tflops_per_gpu": 120.1,
    "avg_throughput_tokens": 1050000,
    "peak_memory_gb": 45.2
  },
  "iteration_logs": [...]
}
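
The JSON summary can be consumed directly for offline analysis. A small sketch reading the fields shown above (the file name is the timestamped example from this description; substitute your own run's file):

import json
from pathlib import Path

summary_path = Path("logs/perf_monitor/perf_summary_20240129_103000.json")
with summary_path.open() as f:
    summary = json.load(f)

stats = summary["final_statistics"]
print(f"avg TFLOPS/GPU: {stats['avg_tflops_per_gpu']:.1f}")
print(f"peak memory:    {stats['peak_memory_gb']:.1f} GB")
print(f"iterations:     {summary['session_info']['total_iterations']}")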

@wanglei19991004 requested review from a team and aoyulong as code owners on October 30, 2025 10:49
@gemini-code-assist bot commented

Summary of Changes

Hello @wanglei19991004, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances FlagScale's training capabilities by introducing a dedicated performance monitoring system. The new module allows users to gain deep insights into their model's training efficiency by tracking crucial metrics like TFLOPS, throughput, and GPU memory consumption. It offers flexible logging, detailed breakdowns of computational operations, and supports a wide range of modern large language models, making it easier to optimize and understand training performance.

Highlights

  • New Performance Monitoring Module: Introduced a comprehensive performance monitoring module to track key training metrics in real-time.
  • Key Metrics Tracking: Monitors TFLOPS (Tera Floating Point Operations Per Second), throughput (samples/sec, tokens/sec), and GPU memory usage.
  • Detailed Performance Breakdown: Records detailed timing for various phases, including forward and backward propagation, to identify bottlenecks.
  • Flexible Logging Options: Supports outputting performance data to text logs and JSON summary files, with options for console output and log rotation.
  • Model Agnostic FLOPS Calculation: Includes a robust FLOPS calculator with formulas tailored for various large language models such as GPT, LLaMA (with GQA), Qwen, Mixtral (MoE), and Aquila.
  • Seamless Integration: Provides command-line arguments and YAML configuration options, along with integration hooks, to easily enable and configure the monitor within existing training loops.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive performance monitoring module to FlagScale, which is a valuable addition for tracking and optimizing model training. The module is well-structured, with separate components for FLOPS calculation, logging, and integration. The inclusion of examples and unit tests is also commendable. My review focuses on improving the accuracy of some FLOPS formulas, enhancing maintainability by addressing potential issues like global state and code duplication, and refining error handling. Overall, this is a strong contribution that significantly enhances the project's monitoring capabilities.

Comment on lines +231 to +259
    def layernorm_flops(batch_size: int, seq_length: int, hidden_size: int) -> float:
        """
        Calculate FLOPS for LayerNorm.
        LayerNorm involves:
        1. Computing mean and variance
        2. Normalization
        3. Scale and shift
        Args:
            batch_size: Batch size
            seq_length: Sequence length
            hidden_size: Hidden dimension
        Returns:
            Total FLOPS for LayerNorm
        """
        elements = batch_size * seq_length * hidden_size

        # Mean computation: hidden_size - 1 additions per element
        mean_flops = elements

        # Variance computation: 2 * hidden_size operations per element
        variance_flops = 2 * elements

        # Normalization: 2 operations per element (subtract mean, divide by std)
        norm_flops = 2 * elements

        # Scale and shift: 2 operations per element
        affine_flops = 2 * elements

        return mean_flops + variance_flops + norm_flops + affine_flops

Severity: medium

The calculation for layernorm_flops appears to be a rough approximation, particularly for mean_flops and variance_flops. For instance, mean_flops is set to elements, but a more precise calculation would be (batch_size * seq_length) * (hidden_size - 1). Similarly, variance_flops is 2 * elements, but it involves subtractions, squares, and a summation, which would be closer to 3 * elements. Using more accurate formulas will improve the fidelity of the overall FLOPS estimation.

    def layernorm_flops(batch_size: int, seq_length: int, hidden_size: int) -> float:
        """
        Calculate FLOPS for LayerNorm.
        LayerNorm involves:
        1. Computing mean and variance
        2. Normalization
        3. Scale and shift
        Args:
            batch_size: Batch size
            seq_length: Sequence length
            hidden_size: Hidden dimension
        Returns:
            Total FLOPS for LayerNorm
        """
        elements = batch_size * seq_length * hidden_size
        tokens = batch_size * seq_length

        # Mean: (H-1) adds per token.
        mean_flops = tokens * (hidden_size - 1)

        # Variance: H subs, H muls, (H-1) adds per token.
        variance_flops = tokens * (hidden_size + hidden_size + hidden_size - 1)

        # Normalization: H subs, H divs.
        norm_flops = elements * 2

        # Scale and shift: H muls, H adds.
        affine_flops = elements * 2

        return mean_flops + variance_flops + norm_flops + affine_flops
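
As a quick numeric check of the suggested formula with illustrative dimensions (not values from this PR):

# B=1, S=2048, H=4096 (illustrative only)
B, S, H = 1, 2048, 4096
tokens, elements = B * S, B * S * H
total = tokens * (H - 1) + tokens * (3 * H - 1) + 2 * elements + 2 * elements
print(total)         # 67104768 -> roughly 8 FLOPs per element
print(7 * elements)  # 58720256 -> the original approximation counts ~7 per element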

Comment on lines +327 to +337
def gradient_accumulation_factor(micro_batch_size: int, global_batch_size: int) -> float:
    """
    Calculate the gradient accumulation factor.
    This doesn't add FLOPS but affects memory usage.
    Args:
        micro_batch_size: Micro-batch size per GPU
        global_batch_size: Global batch size
    Returns:
        Number of accumulation steps
    """
    return global_batch_size / micro_batch_size

Severity: medium

The function name gradient_accumulation_factor is slightly misleading, as it calculates the number of accumulation steps, not a factor. Renaming it to gradient_accumulation_steps would improve clarity. Additionally, using float division / could produce a non-integer result if global_batch_size is not perfectly divisible by micro_batch_size. It would be more robust to use integer division // and return an int.

Suggested change
def gradient_accumulation_factor(micro_batch_size: int, global_batch_size: int) -> float:
    """
    Calculate the gradient accumulation factor.
    This doesn't add FLOPS but affects memory usage.
    Args:
        micro_batch_size: Micro-batch size per GPU
        global_batch_size: Global batch size
    Returns:
        Number of accumulation steps
    """
    return global_batch_size / micro_batch_size
def gradient_accumulation_steps(micro_batch_size: int, global_batch_size: int) -> int:
    """
    Calculate the number of gradient accumulation steps.
    This doesn't add FLOPS but affects memory usage.
    Args:
        micro_batch_size: Micro-batch size per GPU
        global_batch_size: Global batch size
    Returns:
        Number of accumulation steps
    """
    if micro_batch_size <= 0:
        return 1
    return global_batch_size // micro_batch_size
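
A quick usage check of the suggested helper (illustrative batch sizes, ignoring data-parallel size):

assert gradient_accumulation_steps(micro_batch_size=4, global_batch_size=64) == 16
assert gradient_accumulation_steps(micro_batch_size=0, global_batch_size=64) == 1  # guarded fallback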

from flagscale.runner.monitor.perf_metrics import FLOPSMeasurementCallback

# Global variable to store the performance monitor callback
_perf_monitor_callback: Optional[FLOPSMeasurementCallback] = None

Severity: medium

The use of a global variable _perf_monitor_callback introduces global state, which can make the code harder to maintain, reason about, and test. A more robust approach would be to pass the performance monitor instance explicitly where it's needed, for instance, through a context object or as a function argument. This would improve modularity and make dependencies clearer. The integration.py file already provides a good example of this pattern.
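
As an illustration of the suggested pattern (a sketch only, with hypothetical names; not code from this PR), the callback could be passed explicitly instead of living in module-level state:

from typing import Optional

class TrainingContext:
    """Hypothetical container that carries the monitor to wherever it is needed."""

    def __init__(self, perf_monitor: Optional["FLOPSMeasurementCallback"] = None):
        self.perf_monitor = perf_monitor

def training_step(ctx: TrainingContext, iteration: int) -> None:
    # ... forward / backward / optimizer step ...
    if ctx.perf_monitor is not None:
        ctx.perf_monitor.calculate_metrics(iteration)  # same call as in the training-loop hook below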

        )
        if json_file.exists():
            json_file.unlink()
    except Exception as e:

Severity: medium

Catching a broad Exception can hide unexpected errors and mask bugs. It's better to catch more specific exceptions that you anticipate might occur during file operations, such as OSError. This will make your error handling more robust and predictable.

Suggested change
except Exception as e:
except OSError as e:
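
For context, a cleanup routine along these lines would catch only the anticipated filesystem errors (a sketch, not the PR's exact code):

import logging
from pathlib import Path

def prune_old_logs(log_dir: Path, keep: int = 10) -> None:
    """Remove the oldest perf_summary_*.json files, keeping the newest `keep`."""
    files = sorted(log_dir.glob("perf_summary_*.json"))
    for json_file in files[: max(len(files) - keep, 0)]:
        try:
            json_file.unlink()
        except OSError as e:  # anticipated filesystem errors only
            logging.warning("Could not remove %s: %s", json_file, e)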

Comment on lines +437 to +494
    def get_flops_breakdown(self) -> Dict[str, float]:
        """
        Get detailed FLOPS breakdown by component.
        Returns:
            Dictionary with FLOPS for different components
        """
        if get_num_microbatches is not None:
            num_micro_batches = get_num_microbatches()
        else:
            num_micro_batches = getattr(
                self.args, 'num_micro_batches', getattr(self.args, 'gradient_accumulation_steps', 1)
            )
        micro_batch_size = getattr(self.args, 'micro_batch_size', 1)
        batch_size = micro_batch_size * num_micro_batches if micro_batch_size else num_micro_batches

        if self.model_type in ['gpt', 'llama', 'qwen', 'aquila']:
            args = self.args
            # Extract configuration with safe access
            seq_length = getattr(args, 'seq_length', 512)
            hidden_size = getattr(args, 'hidden_size', 768)
            num_layers = getattr(args, 'num_layers', 12)
            num_attention_heads = getattr(args, 'num_attention_heads', 12)

            # Calculate component FLOPS
            attention_flops = (
                self.formulas.attention_flops(
                    batch_size=batch_size,
                    seq_length=seq_length,
                    hidden_size=hidden_size,
                    num_attention_heads=num_attention_heads,
                )
                * num_layers
            )

            ffn_hidden_size = getattr(args, 'ffn_hidden_size', 4 * hidden_size)
            use_swiglu = getattr(args, 'swiglu', False)
            ffn_flops = (
                self.formulas.ffn_flops(
                    batch_size=batch_size,
                    seq_length=seq_length,
                    hidden_size=hidden_size,
                    ffn_hidden_size=ffn_hidden_size,
                    use_swiglu=use_swiglu,
                )
                * num_layers
            )

            total_forward = attention_flops + ffn_flops

            return {
                'attention': attention_flops,
                'ffn': ffn_flops,
                'forward': total_forward,
                'backward': total_forward * 2,  # Backward is approximately 2x forward
                'total': total_forward * 3,
            }

        return {}

Severity: medium

The get_flops_breakdown method provides a breakdown of FLOPS by component, but it omits the FLOPS from the embedding and output projection layers. These can contribute significantly to the total computation, especially for models with large vocabularies. Including them in the breakdown would offer a more complete performance picture.

    def get_flops_breakdown(self) -> Dict[str, float]:
        """
        Get detailed FLOPS breakdown by component.
        Returns:
            Dictionary with FLOPS for different components
        """
        if get_num_microbatches is not None:
            num_micro_batches = get_num_microbatches()
        else:
            num_micro_batches = getattr(
                self.args, 'num_micro_batches', getattr(self.args, 'gradient_accumulation_steps', 1)
            )
        micro_batch_size = getattr(self.args, 'micro_batch_size', 1)
        batch_size = micro_batch_size * num_micro_batches if micro_batch_size else num_micro_batches

        if self.model_type in ['gpt', 'llama', 'qwen', 'aquila']:
            args = self.args
            # Extract configuration with safe access
            seq_length = getattr(args, 'seq_length', 512)
            hidden_size = getattr(args, 'hidden_size', 768)
            num_layers = getattr(args, 'num_layers', 12)
            num_attention_heads = getattr(args, 'num_attention_heads', 12)
            vocab_size = getattr(args, 'vocab_size', getattr(args, 'padded_vocab_size', 50257))

            # Calculate component FLOPS
            attention_flops = (
                self.formulas.attention_flops(
                    batch_size=batch_size,
                    seq_length=seq_length,
                    hidden_size=hidden_size,
                    num_attention_heads=num_attention_heads,
                )
                * num_layers
            )

            ffn_hidden_size = getattr(args, 'ffn_hidden_size', 4 * hidden_size)
            use_swiglu = getattr(args, 'swiglu', False)
            ffn_flops = (
                self.formulas.ffn_flops(
                    batch_size=batch_size,
                    seq_length=seq_length,
                    hidden_size=hidden_size,
                    ffn_hidden_size=ffn_hidden_size,
                    use_swiglu=use_swiglu,
                )
                * num_layers
            )

            embedding_flops = self.formulas.embedding_flops(
                batch_size=batch_size,
                seq_length=seq_length,
                vocab_size=vocab_size,
                hidden_size=hidden_size,
            )

            total_forward = attention_flops + ffn_flops + embedding_flops

            return {
                'attention': attention_flops,
                'ffn': ffn_flops,
                'embedding': embedding_flops,
                'forward': total_forward,
                'backward': total_forward * 2,  # Backward is approximately 2x forward
                'total': total_forward * 3,
            }

        return {}

Comment on lines 1954 to 1978
if perf_monitor is not None:
    try:
        metrics = perf_monitor.calculate_metrics(iteration)
        if metrics.tflops_per_gpu > 0:
            log_string += f' TFLOPS/GPU (monitored): {metrics.tflops_per_gpu:.2f} |'
        if metrics.tokens_per_second > 0:
            log_string += f' tokens/sec: {metrics.tokens_per_second:.0f} |'

        # Log to TensorBoard/WandB
        if args.log_timers_to_tensorboard:
            if writer:
                writer.add_scalar('performance/tflops_per_gpu', metrics.tflops_per_gpu, iteration)
                writer.add_scalar('performance/tokens_per_second', metrics.tokens_per_second, iteration)
                if perf_monitor.peak_memory_gb > 0:
                    writer.add_scalar('memory/peak_gb', perf_monitor.peak_memory_gb, iteration)
            if wandb_writer:
                wandb_writer.log({
                    'performance/tflops_per_gpu': metrics.tflops_per_gpu,
                    'performance/tokens_per_second': metrics.tokens_per_second,
                    'memory/peak_gb': perf_monitor.peak_memory_gb
                }, iteration)
    except Exception as e:
        # Don't let monitoring errors affect training
        if torch.distributed.get_rank() == 0:
            print(f"Warning: Performance monitor error: {e}")

Severity: medium

The logic for logging performance metrics to TensorBoard and WandB appears to be duplicated here and within the PerformanceMonitor.log_metrics method. This redundancy can lead to inconsistencies if one implementation is updated but the other is not. To improve maintainability, consider centralizing the logging logic by calling perf_monitor.log_metrics(iteration, writer, wandb_writer) and allowing the PerformanceMonitor class to handle all logging aspects (file, console, TensorBoard, and WandB).
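
A sketch of the centralized version (the log_metrics signature is the one suggested above and is assumed here):

if perf_monitor is not None:
    try:
        metrics = perf_monitor.calculate_metrics(iteration)
        if metrics.tflops_per_gpu > 0:
            log_string += f' TFLOPS/GPU (monitored): {metrics.tflops_per_gpu:.2f} |'
        if metrics.tokens_per_second > 0:
            log_string += f' tokens/sec: {metrics.tokens_per_second:.0f} |'
        # One call replaces the duplicated TensorBoard/WandB blocks.
        perf_monitor.log_metrics(iteration, writer, wandb_writer)
    except Exception as e:
        # Don't let monitoring errors affect training.
        if torch.distributed.get_rank() == 0:
            print(f"Warning: Performance monitor error: {e}")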

Comment on lines +51 to +60
def add_extra_args(parser):
    """Add extra arguments including performance monitoring and ModelOpt."""
    # Add performance monitoring args
    parser = add_performance_args(parser)

    # Chain with ModelOpt args if available
    if has_nvidia_modelopt:
        parser = add_modelopt_args(parser)

    return parser

Severity: medium

The function add_extra_args is defined but not used anywhere in the codebase. Its implementation is very similar to combine_extra_args_providers, but it calls a different argument-adding function (add_performance_args vs. add_perf_monitor_args). This is confusing and creates redundant code. To improve clarity and maintainability, this unused function should be removed.

Comment on lines +347 to +359
# Initialize distributed if needed (mock for example)
if not dist.is_initialized():
    import os

    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    os.environ['RANK'] = '0'
    os.environ['WORLD_SIZE'] = '1'

    if torch.cuda.is_available():
        dist.init_process_group(backend='nccl', rank=0, world_size=1)
    else:
        dist.init_process_group(backend='gloo', rank=0, world_size=1)

Severity: medium

The setup for torch.distributed in this example is a bit confusing. It sets environment variables like RANK and WORLD_SIZE, but then calls dist.init_process_group with hardcoded values (rank=0, world_size=1), effectively ignoring the environment variables. For better clarity and correctness, it would be best to either use the environment variables consistently or remove them if they are not needed for this single-process example.

Suggested change
# Initialize distributed if needed (mock for example)
if not dist.is_initialized():
    import os
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    os.environ['RANK'] = '0'
    os.environ['WORLD_SIZE'] = '1'
    if torch.cuda.is_available():
        dist.init_process_group(backend='nccl', rank=0, world_size=1)
    else:
        dist.init_process_group(backend='gloo', rank=0, world_size=1)
if not dist.is_initialized():
    import os
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    # For this single-process example, we can hardcode rank and world_size.
    rank = 0
    world_size = 1
    if torch.cuda.is_available():
        dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
    else:
        dist.init_process_group(backend='gloo', rank=rank, world_size=world_size)

@CLAassistant commented on Nov 18, 2025

CLA assistant check
All committers have signed the CLA.
