Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite the pattern matching code in trans. #27050

Closed
wants to merge 3 commits into from

Conversation

eefriedman
Copy link
Contributor

The old code was not well structured, difficult to understand,
and buggy.

The new implementation is completely naive, so there may be a slight
runtime performance loss. That said, adding optimizations on top of
a clear and correct implementation seems easier than trying to
fix the old mess.

Fixes issue #19064.
Fixes issue #26989.
Fixes issue #26251.
Fixes issue #18060.
Fixes issue #24875.
Fixes issue #23311.
Fixes issue #20046.

The old code was not well structured, difficult to understand,
and buggy.

The new implementation is completely naive, so there may be a slight
runtime performance loss. That said, adding optimizations on top of
a clear and correct implementation seems easier than trying to
fix the old mess.

Fixes issue rust-lang#19064.
Fixes issue rust-lang#26989.
Fixes issue rust-lang#26251.
Fixes issue rust-lang#18060.
Fixes issue rust-lang#24875.
Fixes issue rust-lang#23311.
Fixes issue rust-lang#20046.
@rust-highfive
Copy link
Collaborator

r? @brson

(rust_highfive has picked a reviewer for you, use r? to override)

Store(bcx, info, expr::get_len(bcx, scratch.val));
field_datum = Datum::new(scratch.val, scratch.ty, Lvalue);
}
bcx = compile_pattern(bcx, field_datum, &field_pat.node.pat, failure_dest, pat_bindings);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line is too long and failed travis

@arielb1
Copy link
Contributor

arielb1 commented Jul 15, 2015

are there any performance implications?

@petrochenkov
Copy link
Contributor

@eefriedman
While you are here could you look at #23121?
It was mostly implemented here https://github.com/JakubAdamBukaj/rust/tree/array-pattern-changes, but had some ICEs originating exactly in trans code you rewrote.

@eefriedman
Copy link
Contributor Author

@petrochenkov I called one of those issues out with a FIXME: https://github.com/eefriedman/rust/blob/match-cleanup/src/librustc_trans/trans/_match.rs#L494 . The adjustments should be easy enough to handle.

@arielb1 I have a terrible setup for benchmarking... but at a high level the performance implication is that this version will duplicate checks in certain cases where the old code would not.

@brson
Copy link
Contributor

brson commented Jul 16, 2015

Wow, epic patch!

@brson
Copy link
Contributor

brson commented Jul 16, 2015

r? @pnkfelix

@rust-highfive rust-highfive assigned pnkfelix and unassigned brson Jul 16, 2015
}

/// Helper for converting from the ValueRef that we pass around in the match code, which is always
/// an lvalue, into a Datum. Eventually we should just pass around a Datum and be done with it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this extra newline is failing tidy

@brson
Copy link
Contributor

brson commented Jul 16, 2015

I wonder if leaning on llvm to optimize branches into switch statements offers enough performance opportunities for us. It seems very likely to make compilation slower, and istm there's a chance this could turn into a dead end if it turns out that writing the switch IR directly produces better code more often.

EDIT: I asked @sunfish about this strategy and he seemed enthusiastic. We should definitely do a bunch of measurement for regressions in both runtime and compile time before merging.

@eefriedman
Copy link
Contributor Author

Whether we generate switch instructions or branch instructions in the IR is not at all relevant; LLVM's SimplifyCFG pass runs very early in the optimization process.

The potential for redundant code is relevant... but I think optimizations along those lines can be written incrementally. This patch doesn't introduce anything along those lines because I didn't want to complicate the patch without any evidence that it's actually useful.

@eddyb
Copy link
Member

eddyb commented Jul 16, 2015

Given the upcoming MIR (rust-lang/rfcs#1211), I believe it's better to have a simpler and saner implementation which can be then lifted up to the HIR->MIR lowering pass and expanded upon.
Doing anything too clever during translation to LLVM is only asking for trouble, honestly.

@arielb1
Copy link
Contributor

arielb1 commented Jul 16, 2015

MIR will basically require _match to be reimplemented anyway (and @nikomatsakis already has such reimplementation) as part of the lowering step.

@pnkfelix
Copy link
Member

@eddyb I cannot tell if you are arguing that we should attempt to land this PR (on the basis that it will simplify future efforts to then move to a MIR based implementation), or if you (like @arielb1, I think?) are arguing that we need not invest effort trying to rewrite the existing trans/_match code because we will be better off just doing a more direct rewrite straight to the MIR based implementation?

@eddyb
Copy link
Member

eddyb commented Jul 17, 2015

@pnkfelix I am for this PR, even if only the tests remain from it in the long run, that alone will ensure the MIR does not regress back to the old broken state.

@Aatch
Copy link
Contributor

Aatch commented Jul 17, 2015

Hmm, I definitely want to see some performance stats for this. The old code may have been difficult to understand, but it used a fairly advanced method for compiling the match. I don't want to merge what could be a major performance regression just because the code is easier to read.

@dotdash
Copy link
Contributor

dotdash commented Jul 17, 2015

I'd also like to see some numbers/codegen results, but I'd not be surprised if in some cases, e.g. matches on tuples of enums like in PartialEq, we might see better results as LLVM had some difficulties optimizing the switch there.

@nikomatsakis
Copy link
Contributor

I don't think that a possible transition to MIR affects whether or not we should land this patch. There is going to be a fair amount of effort needed to port trans to use MIR, which will take some time, and having patterns work more correctly in the meantime is great! I agree with @Aatch that some performance data would be nice.

@eefriedman
Copy link
Contributor Author

I tried some benchmarking, but my numbers are too unreliable to actually be usable. Maybe someone else can give it a shot?

@bors
Copy link
Contributor

bors commented Jul 22, 2015

☔ The latest upstream changes (presumably #26683) made this pull request unmergeable. Please resolve the merge conflicts.

@pnkfelix
Copy link
Member

@eefriedman I will look into benchmarking a little; sorry for the radio silence, I have been pretty swamped lately.

@Manishearth
Copy link
Member

Updates on this?

@pnkfelix
Copy link
Member

So, I have done some preliminary profiling.

Executive summary: This new match implementation runs faster, but regresses quality of generated code. I don't think we can land it as is without addressing the last set of micro-benchmark results presented in the shell transcript at the bottom of this comment.


Longer story:

rustc compiling libsyntax data

On some "normal code" operating on a "normal input" (in this case, rustc compiling an expanded libsyntax.rs), the differences appear to be well within the noise.

  • reference data (google sheet), see sheet 1
    • a caveat: this set of data was gathered with some other load (namely a web browser editing above spreadsheet) on the machine in question. I discovered this had significant impact while working on the second data set below, and so I removed the extra load, but I have not gathered new numbers for the libsyntax.rs compiles yet.

rustc compiling many_small_arms data

On rustc operating on arguably abnormal input, namely a test case with a 5,000 tiny match arms, the new match code within trans is quite a bit faster than the old match code (within trans), perhaps a 4/3 speedup. (Don't get excited, there's huge drawback, discussed below.)

  • Note that I'm talking about codegen time here, measured by the difference between compiling to the end versus with -Z no-trans.
  • And this is codegen for an arguably ridiculous test case.
  • reference data (google sheet), see sheet 2
  • All of this data was gathered on an unloaded machine.

In both the above cases, I'm measuring codegen time, without passing -O to the compiler (because I was not interested in how long LLVM was taking to optimize the code it was given, though perhaps that would indeed be something of independent significance).

many_small_arms execution time data

BUT, this comes at a troubling cost.

I mentioned that test input with 5,000 tiny match arms above. It was designed to measure how fast the new match code was taking within trans.

But I then realized that I probably could (and should) be measuring the quality of the code being generated by the match in the test.

The test case above was a little bit too trivial for something like LLVM to optimize, and it also had a println in the code I would want to be an inner loop, so I modified it slightly (essentially to take and return data, and to stop printing the match result; its probably still quite easy for LLVM to optimize as written).

The new test case is here: 5,000 arm matches called 100,000 times with varying inputs.

After doing this, I discovered an unfortunate drawback to the new code in this PR. Compiling with -O has almost no impact on the execution time, but if you leave off -O (as many people do during development), the runtime for this test case goes from 0.27 seconds to 9.32 seconds, a more than 30x slowdown.

% cd ../../rust-baseline/objdir-opt/
% git log -1 --oneline
6800538 Auto merge of #26984 - nham:errorck-ignore-long-diag, r=brson
% ./x86_64-apple-darwin/stage2/bin/rustc --version
rustc 1.3.0-dev (680053848 2015-07-13)
% time ./x86_64-apple-darwin/stage2/bin/rustc /tmp/driver_run.rs -o driver_run 

real    0m14.264s
user    0m14.073s
sys 0m0.187s
% time ./driver_run 

real    0m0.279s
user    0m0.262s
sys 0m0.003s
% cd ../../rust-rw-match/objdir-opt/
% git log -1 --oneline
54b4a0f Fix lint.
% ./x86_64-apple-darwin/stage2/bin/rustc --version
rustc 1.3.0-dev (54b4a0f5b 2015-07-16)
% time ./x86_64-apple-darwin/stage2/bin/rustc /tmp/driver_run.rs -o driver_run 

real    0m13.745s
user    0m13.523s
sys 0m0.219s
% time ./driver_run 

real    0m9.235s
user    0m9.219s
sys 0m0.006s
% 

(Above shell transcript was gathered on an unloaded machine.)

Now, I know this is a totally artificial micro-benchmark, but

  1. I think the benchmark remains somewhat demonstrative of real-world dispatch code, and
  2. this is one of those cases that motivates using a micro-benchmark -- to try to tease out performance cliffs like the one shown here.

@eefriedman
Copy link
Contributor Author

Hmm... addressing that particular micro-benchmark requires generating a binary search tree or jump table (probably using an LLVM switch instruction). Conceptually, it's pretty simple: basically, it's just a matter of splitting the match statement based on integer fields. In practice, there are a lot of details... especially if we want a decent heuristic to decide which field to split on first. Not sure it's worthwhile, given that this code is going to get killed by MIR.

@ranma42
Copy link
Contributor

ranma42 commented Sep 2, 2015

cc @ranma42

@aturon
Copy link
Member

aturon commented Sep 4, 2015

ping @rust-lang/compiler, would be great to give @eefriedman a definitive answer here.

@nikomatsakis
Copy link
Contributor

@aturon yes. @eefriedman and I were discussing over e-mail and I think we reached the conclusion it made more sense to wait for the MIR to land.

@nikomatsakis
Copy link
Contributor

Closing in favor of MIR transition, as discussed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.