
[X86] Worse runtime performance on Zen 4 CPU when optimizing for znver4 or skylake #91370

Open
Systemcluster opened this issue May 7, 2024 · 13 comments

@Systemcluster

The following code runs around 300% slower on Zen 4 when optimized for znver4 or skylake than when optimized for znver3 or other targets.

pub fn sum(a: &[i64]) -> i64 {
    let mut sum = 0;
    a.chunks_exact(8).for_each(|x| {
        for i in x {
            sum += i;
        }
    });
    sum
}
Full code
pub fn sum(a: &[i64]) -> i64 {
    let mut sum = 0;
    a.chunks_exact(8).for_each(|x| {
        for i in x {
            sum += i;
        }
    });
    sum
}

fn main() {
    let nums = std::hint::black_box(generate());
    let now = std::time::Instant::now();
    let sum = sum(&nums);
    println!("{:?} / {}", now.elapsed(), sum);
}

fn generate() -> Vec<i64> {
    let mut v = Vec::new();
    for i in 0..1000000000 {
        v.push(i);
    }
    v
}

Running on a Ryzen 7950X:

> rustc.exe -Ctarget-cpu=x86-64-v4 -Copt-level=3 .\src\main.rs && ./main.exe
138.7342ms / 499999999500000000

> rustc.exe -Ctarget-cpu=x86-64-v3 -Copt-level=3 .\src\main.rs && ./main.exe
136.2689ms / 499999999500000000

> rustc.exe -Ctarget-cpu=x86-64 -Copt-level=3 .\src\main.rs && ./main.exe   
136.0648ms / 499999999500000000

> rustc.exe -Ctarget-cpu=znver4 -Copt-level=3 .\src\main.rs && ./main.exe   
543.1562ms / 499999999500000000

> rustc.exe -Ctarget-cpu=znver3 -Copt-level=3 .\src\main.rs && ./main.exe   
137.4426ms / 499999999500000000

> rustc.exe -Ctarget-cpu=skylake -Copt-level=3 .\src\main.rs && ./main.exe
588.4743ms / 499999999500000000

> rustc.exe -Ctarget-cpu=haswell -Copt-level=3 .\src\main.rs && ./main.exe
138.5313ms / 499999999500000000

Disassembly here: https://godbolt.org/z/fzaGhGdWW

The tested optimization targets all generate different assembly with different levels of unrolling, but the znver4 and skylake targets seem to be outliers.

I don't know whether the skylake target has the same underlying issue or whether its slowdown is simply caused by the mismatch between optimization target and actual CPU, but both targets produce the long list of constant values in the disassembly and show similar runtime performance. I didn't test targets other than those listed above.

Split from #90985 (comment)

@llvmbot
Collaborator

llvmbot commented May 7, 2024

@llvm/issue-subscribers-backend-x86

Author: Chris (Systemcluster)


@chriselrod

chriselrod commented May 12, 2024

It is needlessly using gather instructions with -Ctarget-cpu=skylake and with -Ctarget-cpu=znver4.
It is not using gather instructions for any of the other tested architectures.

Gather instructions are extremely slow: 1 uop per loaded element.

Someone should look into why it is using them when it does not need to.
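
To make the difference concrete, here is a minimal sketch (hand-written AVX-512 intrinsics, compiled with -mavx512f; illustrative only, not the compiler's actual output). Both functions sum the same 8 consecutive i64 values, but the second fetches them through a gather the way the znver4/skylake codegen does:

#include <immintrin.h>
#include <cstdint>

// One 512-bit contiguous load: a single wide load covers all 8 elements.
int64_t sum8_load(const int64_t *a) {
    __m512i v = _mm512_loadu_si512(a);
    return _mm512_reduce_add_epi64(v);
}

// The same elements via vpgatherqq: roughly one load uop per lane,
// plus index setup -- needless when the data is contiguous.
int64_t sum8_gather(const int64_t *a) {
    __m512i idx = _mm512_set_epi64(7, 6, 5, 4, 3, 2, 1, 0);
    __m512i v = _mm512_i64gather_epi64(idx, a, /*scale=*/8);
    return _mm512_reduce_add_epi64(v);
}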

@Systemcluster
Author

Systemcluster commented May 12, 2024

It seems to generate gather instructions when enabling the avx512f target-feature with specific target-cpus. This reproduces the regression with e.g. the znver3 target: https://godbolt.org/z/48nq66rzc

> rustc.exe -Ctarget-cpu=znver3 -Ctarget-feature=+avx512f -Copt-level=3 .\src\main.rs && ./main.exe
548.638ms / 499999999500000000

It doesn't, however, when adding it to the x86-64-v4 target (which should already have avx512f enabled): https://godbolt.org/z/WWoTTs8b3

> rustc.exe -Ctarget-cpu=x86-64-v4 -Ctarget-feature=+avx512f -Copt-level=3 .\src\main.rs && ./main.exe
142.0749ms / 499999999500000000

But it does when adding it to x86-64-v3: https://godbolt.org/z/7YPr7Y6qE

> rustc.exe -Ctarget-cpu=x86-64-v3 -Ctarget-feature=+avx512f -Copt-level=3 .\src\main.rs && ./main.exe
562.7708ms / 499999999500000000

I assume the issue might be here:

int X86TTIImpl::getGatherOverhead() const {
// Some CPUs have more overhead for gather. The specified overhead is relative
// to the Load operation. "2" is the number provided by Intel architects. This
// parameter is used for cost estimation of Gather Op and comparison with
// other alternatives.
// TODO: Remove the explicit hasAVX512()?, That would mean we would only
// enable gather with a -march.
if (ST->hasAVX512() || (ST->hasAVX2() && ST->hasFastGather()))
return 2;
return 1024;
}

Every AVX512-supporting target seems to be treated as having a minuscule gather overhead, which must be wrong.

(I'm not sure why x86-64-v4 is treated differently; how are the x86-64-vN targets defined?)
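
As a rough illustration of the effect (hypothetical formula and numbers, not LLVM's actual cost computation; the quoted comment only states that the overhead is relative to the Load operation), the low return value makes a gather look nearly as cheap as plain vector loads:

// Hypothetical sketch of how getGatherOverhead() tips the scales;
// not LLVM's real cost code.
int gatherCostEstimate(int numElems, int loadCost, int overhead) {
    // Per-element cost relative to a plain load, scaled by the overhead.
    return numElems * loadCost * overhead;
}
// overhead = 2:    8 * 1 * 2    = 16   -> gathers look viable
// overhead = 1024: 8 * 1 * 1024 = 8192 -> gathers effectively forbidden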

@RKSimon
Collaborator

RKSimon commented May 12, 2024

The costs are low for CPUs we assume to have fast gather/scatter (including anything that uses avx512 instructions). The numbers are definitely closer to top-end Intel CPUs than Zen 4, but are optimistic even then. The overhead costs are just a fudge factor, and are likely to only help with throughput costs for vectorisation, not size/latency costs for unrolling etc.

@RKSimon
Collaborator

RKSimon commented May 12, 2024

@ganeshgit The znver4 scheduler model seems to be missing gather/scatter instruction entries. This alone won't fix the problem, but it helps with analysis to determine optimal cost-model table numbers.

@chriselrod

It generates gathers for skylake, which does not have AVX512, but not for skylake-avx512 (which does).
Skylake hasAVX2(), and I'm guessing hasFastGather().

Zen 4's gathers are just as fast as Skylake's: https://uops.info/html-instr/VGATHERQPD_YMM_VSIB_YMM_YMM.html, although the instruction requires 24 uops on Zen 4 vs. only 5 on Skylake.

At a throughput of 1 per 4 cycles, that is 8x slower than vmovupd (which runs at 2 per cycle).

Note that this means even the scalar vmovsd, at 2 loads per cycle, would take 2 cycles to load 4 doubles, giving it twice the load throughput of vgatherqpd on Skylake (4 cycles).
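
Spelling the load throughput out from those numbers (Skylake, 4 doubles per ymm register):

vgatherqpd ymm:  1 instruction / 4 cycles -> 1 double per cycle
vmovupd ymm:     2 loads / cycle          -> 8 doubles per cycle (8x)
vmovsd (scalar): 2 loads / cycle          -> 2 doubles per cycle (2x)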

If it speeds the rest of the code up by enabling vectorization, it can still be profitable. It is better than vmovsd plus all the shufflevectors needed to assemble a SIMD vector, especially if the loop is bottlenecked by compute even in the presence of gathers (in which case the gathers are essentially free).
But it should be heavily penalized.

I don't know if the loop vectorizer (without vplan) can compare different options at all, but if any vectorization options exist that avoid gather and scatter (as is the case here), those should be taken.
If vectorization is only possible in the presence of gather/scatter, ideally there'd be a check to see whether the code is bottlenecked by load/store throughput. If so, gather/scatter actually hurts -- even on fastGather() CPUs like Skylake -- and should likely still be avoided.

@phoebewang
Contributor

but if any vectorization options exist that avoid gather and scatter (as is the case here), those should be taken.

Do you mean these:

$ clang --help | grep -E '\-gather|scatter'
  -mno-gather             Disable generation of gather instructions in auto-vectorization(x86 only)
  -mno-scatter            Disable generation of scatter instructions in auto-vectorization(x86 only)
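
For example, with a hypothetical C++ analogue of the reproducer (illustrative only, not the original Rust code), the flag suppresses gather generation during auto-vectorization:

// sum.cpp -- hypothetical analogue of the Rust reproducer
#include <cstdint>
#include <cstddef>

int64_t sum(const int64_t *a, std::size_t n) {
    int64_t s = 0;
    for (std::size_t i = 0; i + 8 <= n; i += 8)
        for (std::size_t j = 0; j < 8; ++j)
            s += a[i + j];
    return s;
}

// clang -O3 -march=znver4 -mno-gather -c sum.cpp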

@ganeshgit
Contributor

Zen 4's gathers are just as fast as Skylake's: https://uops.info/html-instr/VGATHERQPD_YMM_VSIB_YMM_YMM.html, although the instruction requires 24 uops on Zen 4 vs. only 5 on Skylake.

It is a bit tricky how the latencies are measured and reported in uops.info! As you indicated, an instruction serviced from the microcode ROM/sequencer is costly and can be preferred only based on the instruction mix it operates with. With a high number of uops the pipelines get choked heavily, so it is always good to avoid the sequence unless a user knowingly wants gather instructions at any cost. So, fastGather must be disabled for zen3 & zen4.

@RKSimon
Collaborator

RKSimon commented May 13, 2024

OK, so this looks like we just need to update the TTI gather/scatter overhead costs to not implicitly assume TuningFastGather for avx512 targets (I'm assuming we can reuse TuningFastGather for scatter as well).

All recent Intel CPUs (Skylake onward) already have TuningFastGather in their tuning flags so I don't think anything needs updating there.

There is the question of whether x86-64-v4 should still set TuningFastGather or not.
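
One possible shape of that change (a sketch against the getGatherOverhead() snippet quoted above, not a committed patch) would be to key the low cost on the tuning flag alone:

int X86TTIImpl::getGatherOverhead() const {
  // Only trust gathers on CPUs that explicitly opt in via TuningFastGather,
  // instead of implicitly assuming every AVX512-capable CPU has fast gathers.
  if (ST->hasFastGather())
    return 2;
  return 1024;
}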

@phoebewang Do you know of any other side-effects we need to consider?

@phoebewang
Contributor

I don't know. CC @gpei-dev

@RKSimon
Collaborator

RKSimon commented May 13, 2024

An alternative is to just set the TuningPreferNoGather/TuningPreferNoScatter flags for znver4.

@ganeshgit
Contributor

An alternative is to just set the TuningPreferNoGather/TuningPreferNoScatter flags for znver4.

I will post the code changes for this. @phoebewang what is your take on having NoScatter and NoGather for x86-64-v4?

@gpei-dev
Contributor

Intel has been optimizing gather instructions for years; big cores' gather performance is better than simulated gather most of the time.
