
Conversation

@vrdn-23 (Contributor) commented Nov 7, 2025

From the discussion here: #1926 (comment)

So I tried a couple of things but wasn't making any progress, and then came across this post: https://users.rust-lang.org/t/high-performance-operations/19739

Essentially, changing the constant operation from a division to a multiplication by the reciprocal does seem to give some improvement (the gap between the two activations has dropped by about 20ms).
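Conceptually, the change looks something like the sketch below. This is an illustrative standalone version, not the actual candle kernel; it assumes the `libm` crate purely to have a scalar erf available. Division is generally a slower instruction than multiplication on modern CPUs, which is where the saving comes from.

```rust
// Illustrative sketch of the division-vs-reciprocal change (not the actual
// candle kernel). Assumes the `libm` crate only to get a scalar erf.
use std::f64::consts::{FRAC_1_SQRT_2, SQRT_2};

// Before: divide every element by sqrt(2).
fn gelu_erf_div(x: f64) -> f64 {
    0.5 * x * (1.0 + libm::erf(x / SQRT_2))
}

// After: multiply by the precomputed reciprocal 1/sqrt(2) instead.
fn gelu_erf_recip(x: f64) -> f64 {
    0.5 * x * (1.0 + libm::erf(x * FRAC_1_SQRT_2))
}

fn main() {
    // The two forms agree to within floating-point rounding.
    for &x in &[-2.0, -0.5, 0.0, 0.5, 2.0] {
        println!("{x:5}: {:.12} vs {:.12}", gelu_erf_div(x), gelu_erf_recip(x));
    }
}
```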

Including some benchmark results here:
1000 RUNS, 1 ITERATION:

Before PR:
Gelu_erf: mean = 0.117081s, std dev = 0.001087s
Gelu    : mean = 0.081724s, std dev = 0.000818s

After PR:
Gelu_erf: mean = 0.098340s, std dev = 0.001153s
Gelu    : mean = 0.081848s, std dev = 0.000844s

20 RUNS, 100 ITERATIONS:

Before PR:
Gelu_erf: mean = 11.707783s, std dev = 0.040589s
Gelu    : mean = 8.166275s, std dev = 0.015691s

After PR:
Gelu_erf: mean = 9.907764s, std dev = 0.059777s
Gelu    : mean = 8.227856s, std dev = 0.054679s

Script to reproduce the results:
use candle_core::{Device, Tensor};
use indicatif::ProgressBar;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    const NUM_RUNS: usize = 20;

    let mut gelu_erf_times: Vec<f64> = Vec::with_capacity(NUM_RUNS);
    let mut gelu_times: Vec<f64> = Vec::with_capacity(NUM_RUNS);

    let pb = ProgressBar::new(NUM_RUNS as u64);
    for _ in 0..NUM_RUNS {
        pb.inc(1);

        let a = Tensor::rand(0f64, 1.0, (32, 512, 768), &Device::Cpu)?;

        // Benchmark gelu_erf
        let start = std::time::Instant::now();
        for _ in 0..100 {
            let _ = a.gelu_erf();
        }
        let gelu_erf_duration = start.elapsed().as_secs_f64();
        gelu_erf_times.push(gelu_erf_duration);

        // Benchmark gelu
        let start = std::time::Instant::now();
        for _ in 0..100 {
            let _ = a.gelu();
        }
        let gelu_duration = start.elapsed().as_secs_f64();
        gelu_times.push(gelu_duration);
    }
    pb.finish();

    // Calculate statistics for gelu_erf
    let gelu_erf_mean = gelu_erf_times.iter().sum::<f64>() / NUM_RUNS as f64;
    let gelu_erf_variance = gelu_erf_times.iter()
        .map(|&x| (x - gelu_erf_mean).powi(2))
        .sum::<f64>() / NUM_RUNS as f64;
    let gelu_erf_std_dev = gelu_erf_variance.sqrt();

    // Calculate statistics for gelu
    let gelu_mean = gelu_times.iter().sum::<f64>() / NUM_RUNS as f64;
    let gelu_variance = gelu_times.iter()
        .map(|&x| (x - gelu_mean).powi(2))
        .sum::<f64>() / NUM_RUNS as f64;
    let gelu_std_dev = gelu_variance.sqrt();

    // Print results
    println!("\n=== Results over {} runs ===", NUM_RUNS);
    println!("Gelu_erf: mean = {:.6}s, std dev = {:.6}s", gelu_erf_mean, gelu_erf_std_dev);
    println!("Gelu    : mean = {:.6}s, std dev = {:.6}s", gelu_mean, gelu_std_dev);

    Ok(())
}
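For completeness, a minimal Cargo.toml that should be enough to run the script above might look like this (crate versions are illustrative, not pinned to what was used for the numbers in this PR):

```toml
[package]
name = "gelu-bench"
version = "0.1.0"
edition = "2021"

[dependencies]
# Versions are illustrative; any reasonably recent releases should work.
candle-core = "0.8"
indicatif = "0.17"
```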

@vrdn-23 (Contributor, Author) commented Nov 7, 2025

@ivarflakstad @EricLBuehler

I've added the changes and included the script I used to benchmark the results. I only ran the benchmarks on CPU, so if anyone can check the performance numbers in a CUDA setting, that would be great!

@vrdn-23 (Contributor, Author) commented Nov 7, 2025

So the next apparent bottleneck was the erf function itself. By switching to the implementation used by libm, I was able to get the Gelu_erf timings below those of the Gelu implementation. The libm version relies on shift/bitwise operations, which appear to be much better optimized than what statrs is doing, and it performs much faster!
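As a sanity check on that swap, here is a small comparison of the two scalar erf implementations. It assumes the `statrs` and `libm` crates as dependencies and is only meant to illustrate that their outputs match; it is not the candle change itself.

```rust
// Compare the statrs and libm scalar erf implementations (illustrative only;
// assumes `statrs` and `libm` as dependencies).
fn main() {
    for &x in &[-3.0f64, -0.5, 0.0, 0.5, 3.0] {
        let s = statrs::function::erf::erf(x); // statrs erf (the slower path mentioned above)
        let l = libm::erf(x);                  // libm erf (the faster path adopted here)
        // The two should agree to within floating-point noise.
        println!("x = {x:4}: statrs = {s:.15}, libm = {l:.15}, |diff| = {:.2e}", (s - l).abs());
    }
}
```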

New results below:
1000 RUNS, 1 ITERATION (for a tensor of shape 32x512x768):

Before PR:
Gelu_erf: mean = 0.119173s, std dev = 0.012756s
Gelu    : mean = 0.082936s, std dev = 0.003862s

After PR:
Gelu_erf: mean = 0.046703s, std dev = 0.001548s
Gelu    : mean = 0.082681s, std dev = 0.002030s


All the tests pass locally, so it does look like the change is functionally correct. Please let me know if there is anything else we need to verify, @ivarflakstad.

cc @LaurentMazare

@vrdn-23 changed the title from "Add sqrt2 as constant for gelu_erf" to "Add sqrt2 as constant for gelu_erf and use libm erf" on Nov 7, 2025
@ivarflakstad (Member) left a comment

This is great!
I especially like the libm usage. We should check whether there are other opportunities to use it as well.

@ivarflakstad ivarflakstad merged commit ade0918 into huggingface:main Nov 7, 2025
9 checks passed
@vrdn-23 (Contributor, Author) commented Nov 7, 2025

Thanks for getting back on this so quickly, @ivarflakstad! Is there an ETA on when the next tagged release is going to be cut?

@ivarflakstad (Member) commented
Very soon™️!
