Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seemingly inefficient code generated to forward a parameter to a function #22891

Closed
rprichard opened this issue Feb 28, 2015 · 14 comments
Closed
Labels
A-codegen Area: Code generation

Comments

@rprichard
Copy link
Contributor

The generated code for passing arguments larger than a machine word looks inefficient.

Test case:

#[inline(never)]
pub fn bar(x: &str) { println!("{}", x) }
pub fn foo(x: &str) { bar(x); bar(x); }

On x86_64-unknown-linux-gnu, compiling with rustc test.rs -O -C no-stack-check --crate-type dylib --emit asm, I see this code for foo:

    .section    .text._ZN3foo20hb6f131ac36a30532PaaE,"ax",@progbits
    .globl  _ZN3foo20hb6f131ac36a30532PaaE
    .align  16, 0x90
    .type   _ZN3foo20hb6f131ac36a30532PaaE,@function
_ZN3foo20hb6f131ac36a30532PaaE:
    .cfi_startproc
    pushq   %rbx
.Ltmp4:
    .cfi_def_cfa_offset 16
    subq    $16, %rsp
.Ltmp5:
    .cfi_def_cfa_offset 32
.Ltmp6:
    .cfi_offset %rbx, -16
    movq    %rdi, %rbx
    movups  (%rbx), %xmm0
    movaps  %xmm0, (%rsp)
    leaq    (%rsp), %rdi
    callq   _ZN3bar20hf21270c370b3427feaaE@PLT
    movups  (%rbx), %xmm0
    movaps  %xmm0, (%rsp)
    leaq    (%rsp), %rdi
    callq   _ZN3bar20hf21270c370b3427feaaE@PLT
    addq    $16, %rsp
    popq    %rbx
    retq
.Ltmp7:
    .size   _ZN3foo20hb6f131ac36a30532PaaE, .Ltmp7-_ZN3foo20hb6f131ac36a30532PaaE
    .cfi_endproc

foo receives the address of the &str in %rdi. It copies it into a new stack location for each call, then passes the address of that location to bar.

Could foo forward the address of the &str along without making stack copies?

If I remove one of the bar calls from foo, then the function also ought to become a tail call, but it doesn't. Tail call optimization does occur if I replace the &str types with &&str.

The calling convention for passing &str (and other arguments larger than a machine word?) seems to be:

  1. Make a copy of the argument on the stack.
  2. Pass the address of the copy in the conventional manner (in a register or on the stack).
  3. The callee may modify the copy.

i.e. We seem to be passing values both by-value and by-reference.

With the current convention, I think we could get smaller code by eliding some of the copies. If the copies were instead immutable, I think we could elide more copies.

Compiler version:

rustc 1.0.0-nightly (b47aebe3f 2015-02-26) (built 2015-02-27)
binary: rustc
commit-hash: b47aebe3fc2da06c760fd8ea19f84cbc41d34831
commit-date: 2015-02-26
build-date: 2015-02-27
host: x86_64-unknown-linux-gnu
release: 1.0.0-nightly
@dotdash
Copy link
Contributor

dotdash commented Feb 28, 2015

I'm currently working on a PR that makes use of LLVM's byval attribute. That would at least make the code a bit shorter because the address of the copy is implicitly passed because it's a fixed offset from the stack pointer. With that, foo is translated to:

    subq    $24, %rsp
.Ltmp4:
    .cfi_def_cfa_offset 32
    movaps  32(%rsp), %xmm0
    movups  %xmm0, (%rsp)
    callq   _ZN3bar20h6b5879ee29f32fdeeaaE@PLT
    movaps  32(%rsp), %xmm0
    movups  %xmm0, (%rsp)
    callq   _ZN3bar20h6b5879ee29f32fdeeaaE@PLT
    addq    $24, %rsp
    retq

And some other improvements also happen, I'll detail them with the PR, once I get the last few bugs fixed.

In a later step, we could definitely look into additional improvements like just dropping the byval flag for functions that are known not to modify their argument.

@rprichard
Copy link
Contributor Author

Good to know. Does dropping byval mean that the value is passed by-reference?

@dotdash
Copy link
Contributor

dotdash commented Feb 28, 2015

Since the byval attribute replaces the calls to memcpy that we currently manually emit, yes, that would then completely remove the copy and make it a pass-by-reference.

@rprichard
Copy link
Contributor Author

@dotdash Does your PR also fix #22924?

@kmcallister kmcallister added the A-codegen Area: Code generation label Mar 2, 2015
@dotdash
Copy link
Contributor

dotdash commented Mar 4, 2015

The byval stuff is blocked on #5016

@comex
Copy link
Contributor

comex commented Mar 12, 2015

As mentioned in a recent Discourse thread - why does it have to go through a double indirection (pointer to fat pointer to string data) at all, whether or not that can be optimized in forwarding situations? Normally a small structure passed by value would be placed in registers.

@dotdash
Copy link
Contributor

dotdash commented Mar 12, 2015

I'm working on a patch that does that (pass fat pointers in registers), and a first iteration shows good results, but I'm still passing the fat pointers as a single struct { i8*, i64 } instead of splitting it into two arguments which probably isn't a good idea. Unfortunately, to split a single rust argument into two llvm arguments needs some refactorings, so it might take some time to be ready.

@comex
Copy link
Contributor

comex commented Mar 12, 2015

Why isn't it a good idea? I think ABIs are normally good at dealing with structs passed by value, although ARM has a hazard in the case of small structs returned by value (it's slower than returning an equivalently sized scalar).

@dotdash
Copy link
Contributor

dotdash commented Mar 12, 2015

The rust ABI isn't good at it, yet. That's why I'm working on the patch to change it.

The thing with the first iteration of my patch is that it translates fn foo(s: &str) to declare void @foo({ i8*, i64 }) instead of declare void @foo(i8*, i64). It works to some degree, because LLVM happens to pass each struct field as if it was a distinct argument, so at the asm level you get the desired results (at least on x86 and x86_64), but the LLVM function passes get almost 40% slower, which I suspect to happen due to LLVM not handling FCA arguments that well. We might also miss some optimizations that way. We dropped FCA loads/stores a while ago because of that.

@comex
Copy link
Contributor

comex commented Mar 13, 2015

LLVM 'happening to pass' (i.e. must pass using standard ABIs on most architectures) structs like multiple values is what I meant by ABIs being good at it. Quite interesting that it slows down LLVM, though.

@dotdash
Copy link
Contributor

dotdash commented Mar 13, 2015

Oh, now I see. No, LLVM doesn't automatically handle ABI compliance. It only handles calling conventions. For example, we used to pass a struct consisting of 4 u8 values (think RGBA) directly as the struct type. LLVM did pass the 4 values as 4 individual arguments. But the x86_64 SysV ABI (and by now the rust ABI, too) wants that struct to be passed as a single u32 value. And it's the frontend's (i.e. our) job to handle that.

@comex
Copy link
Contributor

comex commented Mar 13, 2015

Oh... wow, I'm really surprised by that. I'd love to know what motivated that decision, considering how close LLVM IR is semantically to C.

dotdash added a commit to dotdash/rust that referenced this issue Jun 20, 2015
This has a number of advantages compared to creating a copy in memory
and passing a pointer. The obvious one is that we don't have to put the
data into memory but can keep it in registers. Since we're currently
passing a pointer anyway (instead of using e.g. a known offset on the
stack, which is what the `byval` attribute would achieve), we only use a
single additional register for each fat pointer, but save at least two
pointers worth of stack in exchange (sometimes more because more than
one copy gets eliminated). On archs that pass arguments on the stack, we
save a pointer worth of stack even without considering the omitted
copies.

Additionally, LLVM can optimize the code a lot better, to a large degree
due to the fact that lots of copies are gone or can be optimized away.
Additionally, we can now emit attributes like nonnull on the data and/or
vtable pointers contained in the fat pointer, potentially allowing for
even more optimizations.

This results in LLVM passes being about 3-7% faster (depending on the
crate), and the resulting code is also a few percent smaller, for
example:

   text    data  filename
5671479 3941461  before/librustc-d8ace771.so
5447663 3905745  after/librustc-d8ace771.so

1944425 2394024  before/libstd-d8ace771.so
1896769 2387610  after/libstd-d8ace771.so

I had to remove a call in the backtrace-debuginfo test, because LLVM can
now merge the tails of some blocks when optimizations are turned on,
which can't correctly preserve line info.

Fixes rust-lang#22924

Cc rust-lang#22891 (at least for fat pointers the code is good now)
bors added a commit that referenced this issue Jun 20, 2015

This has a number of advantages compared to creating a copy in memory
and passing a pointer. The obvious one is that we don't have to put the
data into memory but can keep it in registers. Since we're currently
passing a pointer anyway (instead of using e.g. a known offset on the
stack, which is what the `byval` attribute would achieve), we only use a
single additional register for each fat pointer, but save at least two
pointers worth of stack in exchange (sometimes more because more than
one copy gets eliminated). On archs that pass arguments on the stack, we
save a pointer worth of stack even without considering the omitted
copies.

Additionally, LLVM can optimize the code a lot better, to a large degree
due to the fact that lots of copies are gone or can be optimized away.
Additionally, we can now emit attributes like nonnull on the data and/or
vtable pointers contained in the fat pointer, potentially allowing for
even more optimizations.

This results in LLVM passes being about 3-7% faster (depending on the
crate), and the resulting code is also a few percent smaller, for
example:

|text|data|filename|
|----|----|--------|
|5671479|3941461|before/librustc-d8ace771.so|
|5447663|3905745|after/librustc-d8ace771.so|
| | | |
|1944425|2394024|before/libstd-d8ace771.so|
|1896769|2387610|after/libstd-d8ace771.so|

I had to remove a call in the backtrace-debuginfo test, because LLVM can
now merge the tails of some blocks when optimizations are turned on,
which can't correctly preserve line info.

Fixes #22924

Cc #22891 (at least for fat pointers the code is good now)
@steveklabnik
Copy link
Member

Triage: it's not totally clear to me if this ticket is still valid. A lot has changed between now and then, and I get very different code, but I'm not sure if that's from all the other stuff that's gone on since pre-1.0.

@Mark-Simulacrum
Copy link
Member

Yeah as per the above it looks like this is probably not valid anymore, or at least I can't say either way -- closing. If that's not the case, please reopen; I apologize ahead of time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation
Projects
None yet
Development

No branches or pull requests

6 participants