
Conversation

@vrdn-23 (Contributor) commented Nov 7, 2025

From the discussion here: #1926 (comment)

So I tried a couple of things but wasn't making any progress, and then came across this post: https://users.rust-lang.org/t/high-performance-operations/19739

Essentially, changing the constant operation from a division to a multiplication by the reciprocal does seem to give some improvement (the gap between the two activations has dropped by about 20ms).
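Conceptually, the change looks something like the sketch below. This is an illustrative standalone version, not the actual candle kernel; it assumes the `libm` crate purely to have a scalar erf available. Division is generally a slower instruction than multiplication on modern CPUs, which is where the saving comes from.

```rust
// Illustrative sketch of the division-vs-reciprocal change (not the actual
// candle kernel). Assumes the `libm` crate only to get a scalar erf.
use std::f64::consts::{FRAC_1_SQRT_2, SQRT_2};

// Before: divide every element by sqrt(2).
fn gelu_erf_div(x: f64) -> f64 {
    0.5 * x * (1.0 + libm::erf(x / SQRT_2))
}

// After: multiply by the precomputed reciprocal 1/sqrt(2) instead.
fn gelu_erf_recip(x: f64) -> f64 {
    0.5 * x * (1.0 + libm::erf(x * FRAC_1_SQRT_2))
}

fn main() {
    // The two forms agree to within floating-point rounding.
    for &x in &[-2.0, -0.5, 0.0, 0.5, 2.0] {
        println!("{x:5}: {:.12} vs {:.12}", gelu_erf_div(x), gelu_erf_recip(x));
    }
}
```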

Including some benchmark results here:
1000 RUNS, 1 ITERATION:

Before PR:
Gelu_erf: mean = 0.117081s, std dev = 0.001087s
Gelu    : mean = 0.081724s, std dev = 0.000818s

After PR:
Gelu_erf: mean = 0.098340s, std dev = 0.001153s
Gelu    : mean = 0.081848s, std dev = 0.000844s

20 RUNS, 100 ITERATIONS:

Before PR:
Gelu_erf: mean = 11.707783s, std dev = 0.040589s
Gelu    : mean = 8.166275s, std dev = 0.015691s

After PR:
Gelu_erf: mean = 9.907764s, std dev = 0.059777s
Gelu    : mean = 8.227856s, std dev = 0.054679s

Script to reproduce the results:
use candle_core::{Device, Tensor};
use indicatif::ProgressBar;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    const NUM_RUNS: usize = 20;

    let mut gelu_erf_times: Vec<f64> = Vec::with_capacity(NUM_RUNS);
    let mut gelu_times: Vec<f64> = Vec::with_capacity(NUM_RUNS);

    let pb = ProgressBar::new(NUM_RUNS as u64);
    for _ in 0..NUM_RUNS {
        pb.inc(1);

        let a = Tensor::rand(0f64, 1.0, (32, 512, 768), &Device::Cpu)?;

        // Benchmark gelu_erf
        let start = std::time::Instant::now();
        for _ in 0..100 {
            let _ = a.gelu_erf();
        }
        let gelu_erf_duration = start.elapsed().as_secs_f64();
        gelu_erf_times.push(gelu_erf_duration);

        // Benchmark gelu
        let start = std::time::Instant::now();
        for _ in 0..100 {
            let _ = a.gelu();
        }
        let gelu_duration = start.elapsed().as_secs_f64();
        gelu_times.push(gelu_duration);
    }
    pb.finish();

    // Calculate statistics for gelu_erf
    let gelu_erf_mean = gelu_erf_times.iter().sum::<f64>() / NUM_RUNS as f64;
    let gelu_erf_variance = gelu_erf_times.iter()
        .map(|&x| (x - gelu_erf_mean).powi(2))
        .sum::<f64>() / NUM_RUNS as f64;
    let gelu_erf_std_dev = gelu_erf_variance.sqrt();

    // Calculate statistics for gelu
    let gelu_mean = gelu_times.iter().sum::<f64>() / NUM_RUNS as f64;
    let gelu_variance = gelu_times.iter()
        .map(|&x| (x - gelu_mean).powi(2))
        .sum::<f64>() / NUM_RUNS as f64;
    let gelu_std_dev = gelu_variance.sqrt();

    // Print results
    println!("\n=== Results over {} runs ===", NUM_RUNS);
    println!("Gelu_erf: mean = {:.6}s, std dev = {:.6}s", gelu_erf_mean, gelu_erf_std_dev);
    println!("Gelu    : mean = {:.6}s, std dev = {:.6}s", gelu_mean, gelu_std_dev);

    Ok(())
}
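For completeness, a minimal Cargo.toml that should be enough to run the script above might look like this (crate versions are illustrative, not pinned to what was used for the numbers in this PR):

```toml
[package]
name = "gelu-bench"
version = "0.1.0"
edition = "2021"

[dependencies]
# Versions are illustrative; any reasonably recent releases should work.
candle-core = "0.8"
indicatif = "0.17"
```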

@vrdn-23 (Contributor, Author) commented Nov 7, 2025

@ivarflakstad @EricLBuehler

I've added the changes and included the script I used to benchmark the results. I only ran the benchmarks on CPU, so if anyone can check the performance numbers in a CUDA setting, that would be great!

@vrdn-23 (Contributor, Author) commented Nov 7, 2025

So the next apparent bottleneck was the erf function itself. By switching to the implementation used by libm, I was able to get the Gelu_erf timings below those of the Gelu implementation. The libm version relies on shift/bitwise operations, which appear to be much better optimized than what statrs is doing, and it performs much faster!
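As a sanity check on that swap, here is a small comparison of the two scalar erf implementations. It assumes the `statrs` and `libm` crates as dependencies and is only meant to illustrate that their outputs match; it is not the candle change itself.

```rust
// Compare the statrs and libm scalar erf implementations (illustrative only;
// assumes `statrs` and `libm` as dependencies).
fn main() {
    for &x in &[-3.0f64, -0.5, 0.0, 0.5, 3.0] {
        let s = statrs::function::erf::erf(x); // statrs erf (the slower path mentioned above)
        let l = libm::erf(x);                  // libm erf (the faster path adopted here)
        // The two should agree to within floating-point noise.
        println!("x = {x:4}: statrs = {s:.15}, libm = {l:.15}, |diff| = {:.2e}", (s - l).abs());
    }
}
```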

New results below:
1000 RUNS, 1 ITERATION (for a tensor of shape 32x512x768):

Before PR:
Gelu_erf: mean = 0.119173s, std dev = 0.012756s
Gelu    : mean = 0.082936s, std dev = 0.003862s

After PR:
Gelu_erf: mean = 0.046703s, std dev = 0.001548s
Gelu    : mean = 0.082681s, std dev = 0.002030s


All the tests pass locally, so it does look like the change is functionally correct. Please let me know if there is anything else we need to verify, @ivarflakstad.

cc @LaurentMazare

@vrdn-23 changed the title from "Add sqrt2 as constant for gelu_erf" to "Add sqrt2 as constant for gelu_erf and use libm erf" on Nov 7, 2025
@ivarflakstad (Member) left a comment

This is great!
I especially like the libm usage. We should check whether there are other opportunities to use it as well.

@ivarflakstad ivarflakstad merged commit ade0918 into huggingface:main Nov 7, 2025
9 checks passed
@vrdn-23 (Contributor, Author) commented Nov 7, 2025

Thanks for getting back on this so quickly, @ivarflakstad! Is there an ETA on when the next tagged release is going to be cut?

@ivarflakstad (Member) commented
Very soon™️!
