I am currently running AlphaFold 3 (implemented in JAX), and I want to profile the whole project to find the bottleneck in it. Following the official documentation (https://jax.readthedocs.io/en/latest/profiling.html), I wrote code to trace the run, and it has worked fine since then.
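Roughly, it wraps inference in jax.profiler.trace, along the lines of this sketch (run_inference and fold_input are hypothetical stand-ins here, not the real AlphaFold 3 entry points):

import jax

def run_inference(x):
    # Hypothetical stand-in for the actual AlphaFold 3 inference call.
    return x @ x

fold_input = jax.random.normal(jax.random.key(0), (4096, 4096))

# The trace directory is what TensorBoard is pointed at later.
with jax.profiler.trace('/tmp/alphafold-run/jax-trace'):
    result = run_inference(fold_input)
    # Block so asynchronously dispatched device work lands inside the trace.
    jax.block_until_ready(result)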
However, when I used TensorBoard to visualize it, nothing showed up at all, unlike in the official documentation.
The running output is:
bash experiments/run_test.sh
2025-01-14 20:26:36.691284: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1736857596.708528 370661 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1736857596.713608 370661 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
I0114 20:26:41.098771 139841220543104 folding_input.py:1044] Detected /home/user3/alphafold/data/processed/301aa_3DB6.json is an AlphaFold 3 JSON since the top-level is not a list.
Running AlphaFold 3. Please note that standard AlphaFold 3 model parameters are
only available under terms of use provided at
https://github.com/google-deepmind/alphafold3/blob/main/WEIGHTS_TERMS_OF_USE.md.
If you do not agree to these terms and are using AlphaFold 3 derived model
parameters, cancel execution of AlphaFold 3 inference with CTRL-C, and do not
use the model parameters.
Skipping running the data pipeline.
Found local devices: [CudaDevice(id=0), CudaDevice(id=1)]
Building model from scratch...
Processing 1 fold inputs.
Processing fold input 301aa_3DB6
Checking we can load the model parameters...
2025-01-14 20:26:41.192283: W external/xla/xla/service/gpu/nvptx_compiler.cc:930] The NVIDIA driver's CUDA version is 12.2 which is older than the PTX compiler version 12.6.77. Because the driver is older than the PTX compiler version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
Skipping data pipeline...
Output directory: /home/user3/alphafold/workspaces/jyj/output/gpu/baseline/301aa_3db6
Writing model input JSON to /home/user3/alphafold/workspaces/jyj/output/gpu/baseline/301aa_3db6
Predicting 3D structure for 301aa_3DB6 for seed(s) (1,)...
Featurising data for seeds (1,)...
Featurising 301aa_3DB6 with rng_seed 1.
I0114 20:26:50.429166 139841220543104 pipeline.py:160] processing 301aa_3DB6, random_seed=1
Featurising 301aa_3DB6 with rng_seed 1 took 7.66 seconds.
Featurising data for seeds (1,) took 11.94 seconds.
Running model inference for seed 1...
Running model inference for seed 1 took 77.58 seconds.
Extracting output structures (one per sample) for seed 1...
/home/user3/alphafold/workspaces/jyj/alphafold3-3.0.0/src/alphafold3/model/confidences.py:332: RuntimeWarning: invalid value encountered in divide
return np.nanmean(value * mask_with_nan, axis=axis) / np.nanmean(
/home/user3/alphafold/workspaces/jyj/alphafold3-3.0.0/src/alphafold3/model/confidences.py:522: RuntimeWarning: Mean of empty slice
xchain = np.nanmean(
/home/user3/alphafold/workspaces/jyj/alphafold3-3.0.0/src/alphafold3/model/confidences.py:548: RuntimeWarning: Mean of empty slice
return np.nanmean(np.stack([xchain_row_agg, xchain_col_agg], axis=0), axis=0)
Extracting output structures (one per sample) for seed 1 took 0.38 seconds.
Running model inference and extracting output structures for seed 1 took 77.96 seconds.
Running model inference and extracting output structures for seeds (1,) took 77.96 seconds.
Writing outputs for 301aa_3DB6 for seed(s) (1,)...
Done processing fold input 301aa_3DB6.
Done processing 1 fold inputs.
When loading the profile data in TensorBoard, the output is:
tensorboard --logdir /tmp/alphafold-run/jax-trace/
2025-01-14 19:28:54.258701: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-01-14 19:28:54.275882: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1736854134.297137 353978 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1736854134.305643 353978 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-14 19:28:54.323284: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX512_FP16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
W0114 19:28:59.056233 139871098802816 server_ingester.py:187] Failed to communicate with data server at localhost:38081: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:59.79.4.198:41595: Endpoint is neither UDS or TCP loopback address."
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2025-01-14T19:28:59.055938619+08:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:59.79.4.198:41595: Endpoint is neither UDS or TCP loopback address."}"
>
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.18.0 at http://localhost:6006/ (Press CTRL+C to quit)
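One thing I notice in that error: 59.79.4.198 is not a loopback address, so maybe a proxy environment variable is redirecting TensorBoard's local gRPC connection to its data server? Just a guess on my part; this is how I checked what is set:

import os

# Print any proxy settings that gRPC might pick up; a non-empty value
# could explain why localhost traffic ends up routed to 59.79.4.198.
for var in ('http_proxy', 'https_proxy', 'grpc_proxy', 'no_proxy'):
    print(var, '=', os.environ.get(var))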
In TensorBoard it turns out like this:
As shown in the output, the inference time is 77 s and the preprocessing time is 7 + 12 s; however, in the figure above, the trace is a lot shorter...
In the op_profile tab, there is no FLOPS information at all:
The kernel stats seem to be fine, but I am not sure.
I really don't know what happened.
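In case it is useful, one thing I am considering is labelling the phases with jax.profiler.TraceAnnotation so they show up as named regions in the trace viewer. A sketch with hypothetical stand-in functions (not the real AlphaFold 3 code):

import jax

# Hypothetical stand-ins for the real featurisation / inference steps.
def featurise(x):
    return x * 2.0

def run_model(feats):
    return feats @ feats.T

fold_input = jax.random.normal(jax.random.key(1), (2048, 2048))

with jax.profiler.trace('/tmp/alphafold-run/jax-trace'):
    with jax.profiler.TraceAnnotation('featurisation'):
        feats = featurise(fold_input)
    with jax.profiler.TraceAnnotation('inference'):
        out = run_model(feats)
        jax.block_until_ready(out)  # keep the device work inside the trace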
For further validation, I also tested the example given in the JAX docs:
import jax

with jax.profiler.trace('/tmp/jax-trace'):
    # Run the operations to be profiled
    key = jax.random.key(0)
    x = jax.random.normal(key, (5000, 5000))
    y = x @ x
    y.block_until_ready()
Its running output is:
python ../test/test-tensorboard.py
2025-01-14 19:32:25.844744: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1736854345.862460 355628 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1736854345.867625 355628 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-14 19:32:27.389759: W external/xla/xla/service/gpu/nvptx_compiler.cc:930] The NVIDIA driver's CUDA version is 12.2 which is older than the PTX compiler version 12.6.77. Because the driver is older than the PTX compiler version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
Am I guessing right that I forgot the y.block_until_ready() part somewhere?
I am a rookie and don't know much about this, so I really don't know whether I am doing this right or not.
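From the asynchronous dispatch page in the JAX docs, I understand that operations return before the device finishes, which would explain a trace that looks shorter than the wall-clock time. A small timing sketch of my own to illustrate:

import time
import jax

key = jax.random.key(0)
x = jax.random.normal(key, (5000, 5000))
x.block_until_ready()  # make sure the input itself is ready first

t0 = time.perf_counter()
y = x @ x              # returns almost immediately: dispatch is asynchronous
t1 = time.perf_counter()
y.block_until_ready()  # wait for the device computation to actually finish
t2 = time.perf_counter()
print(f'dispatch: {t1 - t0:.4f} s, compute: {t2 - t1:.4f} s')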