Less than ideal codegen for iteration #351

Open
dragostis opened this issue Nov 2, 2023 · 2 comments
Comments

@dragostis

The way query iteration is currently implemented leads to inefficient code. On my machine, iterate_mut_100k runs in around 29 µs, whereas running the same iteration through explicit archetypes performs much better, at around 10 µs. Looking at the assembly, the difference comes down to loop unrolling and better auto-vectorization.
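
For reference, the query-based counterpart looks roughly like the following; this is only a sketch assuming hecs' query_mut API and the same Position/Velocity components, not the exact benchmark source:

fn iterate_mut_100k(b: &mut Bencher) {
    let mut world = World::new();
    for i in 0..100_000 {
        world.spawn((Position(-(i as f32)), Velocity(i as f32)));
    }
    b.iter(|| {
        // A single query iterator that internally advances across archetypes.
        for (_, (pos, vel)) in world.query_mut::<(&mut Position, &Velocity)>() {
            pos.0 += vel.0;
        }
    })
}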

From previous experience writing code like this in Rust, I'd say the problem is having to handle the case where iteration needs to move on to the next archetype mid-loop. I haven't tested it yet, but I'd expect similar or even worse performance if the archetypes were Iterator::chained together.

Here is the code that runs ~2.9x faster:

// Position(f32) and Velocity(f32) are assumed to be the newtype components
// used by the existing benchmarks; Bencher comes from the benchmark harness.
fn iterate_mut_100k_archetypes(b: &mut Bencher) {
    let mut world = World::new();
    for i in 0..100_000 {
        world.spawn((Position(-(i as f32)), Velocity(i as f32)));
    }
    b.iter(|| {
        // Walking the archetypes directly gives each inner loop two contiguous
        // columns to work through, with no per-item "next archetype?" branch.
        for archetype in world.archetypes() {
            if let (Some(mut pos), Some(vel)) =
                (archetype.get::<&mut Position>(), archetype.get::<&Velocity>())
            {
                for (pos, vel) in pos.iter_mut().zip(vel.iter()) {
                    pos.0 += vel.0;
                }
            }
        }
    })
}

I'm opening this issue to start a discussion about how some of this performance could be tapped into without having to rely on explicitly iterating through all the archetypes.

@Ralith
Owner

Ralith commented Nov 2, 2023

I'm not overly concerned about this because I think the difference will disappear outside of trivial loop bodies (i.e. in real use), but improvements would certainly be welcome.

@dragostis
Author

I mostly agree about real use, but there are still plenty of cases where this would be useful. For example, in forma there's a fairly complex rasterization pipeline that relies on rustc/LLVM auto-vectorizing every line (of which there can be many).

I'll try to see if I can get LLVM to spit out some interesting optimization remarks that might be useful. Thanks for the timely reply!
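
One way to surface those remarks is rustc's -C remark codegen option; a sketch, assuming the loop-vectorize pass is the interesting one (pass names follow LLVM's, and -C debuginfo is needed for the remarks to carry source locations):

    RUSTFLAGS="-C remark=loop-vectorize -C debuginfo=1" cargo bench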
