
gh-104909: Split BINARY_OP into micro-ops #104910

Merged: 8 commits into python:main, May 31, 2023

Conversation

@gvanrossum (Member) commented May 24, 2023

This includes some necessary changes to the cases generator.

The diff is larger than it could be because I moved things around so the specializations are grouped by type first, then by operation (since the guards go with the type).

This version doesn't yet do ugly things to access oparg (as described in the original ideas issue) since we don't have a tier-2 interpreter yet.

BINARY_OP_INPLACE_ADD_UNICODE remains an odd duck.
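As a purely illustrative model of the split being described (not CPython's actual code; all names and the simplified type checks here are hypothetical), a specialized instruction becomes a type-guard micro-op that can deoptimize, followed by an action micro-op:

```python
# Hypothetical Python sketch: a specialized instruction such as
# BINARY_OP_ADD_INT is split into a guard micro-op plus an action micro-op.

class Deopt(Exception):
    """Raised by a guard; the real interpreter would fall back to BINARY_OP."""

def guard_both_int(stack):
    # Analogue of a "both operands are int" guard: peek at the two operands,
    # deoptimize on a type mismatch. Stack effect (left, right -- left, right):
    # nothing is popped or pushed.
    left, right = stack[-2], stack[-1]
    if not (type(left) is int and type(right) is int):
        raise Deopt

def binary_op_add_int(stack):
    # The action micro-op: pop both operands, push their sum.
    right = stack.pop()
    left = stack.pop()
    stack.append(left + right)

def run_binary_op_add_int(stack):
    # The macro instruction is just its micro-ops run in sequence.
    guard_both_int(stack)
    binary_op_add_int(stack)

stack = [3, 4]
run_binary_op_add_int(stack)
print(stack)  # [7]
```

The point of the split is that the guard and the action become separately addressable units, which a later optimizer could rearrange or eliminate.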

To begin, we split BINARY_OP.
@gvanrossum gvanrossum marked this pull request as draft May 25, 2023 03:22
@gvanrossum gvanrossum requested a review from brandtbucher May 25, 2023 03:34
@gvanrossum (Member Author)

@brandtbucher Could you give me your professional opinion on this? These are relatively straightforward instruction splits, but I didn't really solve the problem of the specializing op requiring access to both a cache entry and the oparg (see faster-cpython/ideas#592 (comment)).

@Fidget-Spinner (Member) left a comment


LGTM. I wonder if the pyperformance numbers are neutral on this? I would expect them to be.

@gvanrossum (Member Author)

> LGTM. I wonder if the pyperformance numbers are neutral on this? I would expect them to be.

Okay, I've started a benchmark run. Results in a few hours.

@gvanrossum
Copy link
Member Author

> Okay, I've started a benchmark run. Results in a few hours.

Results: "1.00x slower". The plot says 0.9991, which is well within the margin of error. I'll un-draft it; hopefully @brandtbucher will review.

@gvanrossum gvanrossum marked this pull request as ready for review May 26, 2023 00:27
@brandtbucher (Member)

Thanks, I’ll take a look tomorrow!

@brandtbucher (Member) left a comment


Cool! I left some comments below (they're on specific lines in the diff, but many of them apply generally to most of the instructions changed).

At a higher level, I'm worried that this will be pretty cumbersome to carry further if we don't design a good solution for sharing locals/opargs/etc. across uops, or maybe even redesign the instruction format so the parts of the stream appear in the order they're used.

I don't have any key insights here, but I sense that we'll be sort of fighting with the current generator/instruction format otherwise.

Comment on lines -282 to +289

```c
BINARY_OP_ADD_FLOAT,
BINARY_OP_MULTIPLY_INT,
BINARY_OP_ADD_INT,
BINARY_OP_ADD_UNICODE,
// BINARY_OP_INPLACE_ADD_UNICODE, // This is an odd duck.
BINARY_OP_SUBTRACT_INT,
BINARY_OP_MULTIPLY_FLOAT,
BINARY_OP_MULTIPLY_INT,
BINARY_OP_ADD_FLOAT,
BINARY_OP_SUBTRACT_FLOAT,
BINARY_OP_SUBTRACT_INT,
BINARY_OP_ADD_UNICODE,
// BINARY_OP_INPLACE_ADD_UNICODE, // See comments at that opcode.
```
Member:

Minor nit: these were alphabetized before, and now they're not. I think it makes sense to reorder the implementations below, but I'm not sure there's also value in reordering these.

Member Author:

I'm not a big fan of alphabetization any more (just search :-), and I tried to let the grouping match the ordering of the definitions below, with the exception of BINARY_OP itself, which is somewhere else entirely.

TBH I'm not sure that there's a single organizing principle in the ordering in this file any more; I like to keep families together, but I also don't like to move code around unnecessarily.

If you insist I can undo this chunk.

Comment on lines 292 to 295

```c
op(_BINARY_OP_INT_GUARD, (left, right -- left, right)) {
    DEOPT_IF(!PyLong_CheckExact(left), BINARY_OP);
    DEOPT_IF(!PyLong_CheckExact(right), BINARY_OP);
}
```
Member:

It seems to me that there might be value in splitting these types of checks up: one for guarding the top of stack, and one for guarding the second item on the stack.

That way we could successfully remove uops for cases like 1 + rhs or lhs += 1, where the type of one argument is known (or inferred).
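A toy sketch of the guard elimination being proposed (names like _CHECK_INT come from the straw-man syntax discussed in this thread; the optimizer itself is hypothetical): with one guard per operand, a trace optimizer that already knows one operand's type can drop that operand's guard.

```python
# Illustrative only: eliminate per-operand type guards whose stack slot
# is already known to hold an int.

def optimize(trace, known_int_slots):
    """Remove _CHECK_INT guards for slots already known to be int."""
    out = []
    for op, arg in trace:
        if op == "_CHECK_INT" and arg in known_int_slots:
            continue  # guard is redundant, drop it
        out.append((op, arg))
    return out

# For something like `lhs + 1`, one operand is a known int constant,
# so its guard can go away while the other guard stays.
trace = [("_CHECK_INT", 1), ("_CHECK_INT", 2), ("_BINARY_OP_ADD_INT", None)]
print(optimize(trace, known_int_slots={1}))
# [('_CHECK_INT', 2), ('_BINARY_OP_ADD_INT', None)]
```

With a single combined guard for both operands, this per-operand elimination is not possible, which is the trade-off the thread goes on to discuss.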

@Fidget-Spinner (Member) May 26, 2023:

Just a heads up: in the very near future, for lazy basic block versioning, we plan to just have a generic CHECK_INT x that checks that the xth stack entry is indeed an int. The main reason is that it would save us an opcode.

@brandtbucher (Member) May 26, 2023:

That makes sense. Just need space for a couple extra opargs. I wonder if it would make sense to have "hardcoded" or "shared" opargs as part of macro instructions.

So we could have (with straw-man syntax):

```c
macro(BINARY_OP, oparg) = _BINARY_OP_SPECIALIZE(oparg) + _BINARY_OP_ACTION(oparg);
macro(BINARY_OP_ADD_INT, oparg) = _CHECK_INT(1) + _CHECK_INT(2) + _BINARY_OP_ADD_INT(oparg);
```

I sort of like it, and maybe the scheme could be extended to share other locals or caches too.
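A toy expansion of the straw-man syntax above (hypothetical; the real cases generator works on the C DSL, not on Python data structures): a macro is a list of micro-ops, each carrying either a hardcoded argument or a shared `oparg` placeholder that gets bound at expansion time.

```python
# Illustrative macro table: OPARG marks "substitute the instruction's oparg";
# integers are hardcoded arguments (e.g. which stack entry to check).

OPARG = object()  # sentinel for the shared oparg

MACROS = {
    "BINARY_OP": [("_BINARY_OP_SPECIALIZE", OPARG),
                  ("_BINARY_OP_ACTION", OPARG)],
    "BINARY_OP_ADD_INT": [("_CHECK_INT", 1), ("_CHECK_INT", 2),
                          ("_BINARY_OP_ADD_INT", OPARG)],
}

def expand(instr, oparg):
    # Bind shared opargs; pass hardcoded arguments through untouched.
    return [(uop, oparg if arg is OPARG else arg)
            for uop, arg in MACROS[instr]]

print(expand("BINARY_OP_ADD_INT", oparg=0))
# [('_CHECK_INT', 1), ('_CHECK_INT', 2), ('_BINARY_OP_ADD_INT', 0)]
```

The same binding scheme could plausibly be extended to share other locals or cache entries, as the comment suggests.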

Member:

@gvanrossum, thoughts on something like the above?

Member:

I forgot to also mention that we encountered many situations where one type is known but the other isn't, so it helps to have some granularity.

Member Author:

These are all very interesting proposals, and I'll hold up this PR until we've got some agreement on what we eventually want to do.

  • Splitting the type-checking guard into atomic pieces: this makes sense from the POV of machine code generation (fewer, simpler templates) but it would make the Tier-2 interpreter slower, due to increased opcode decoding overhead (unless the guards can be eliminated, of course). There are various possible ways to address that though (possibly even super-micro-ops :-), so maybe we should just go for it, if we can agree on a syntax.
  • Having micro-ops with parameters, notably _CHECK_INT(1). I don't know how much effort this would be in the generator, but I expect it'll be easy enough, and if we are indeed going with the smallest possible uops, we should just do it.
  • Being explicit about which uops use oparg. This feels like it might require more effort in the generator, but it also sounds like an interesting way to go.

All in all, I think we should probably have an in-person discussion about this (earliest I'm available is the week of June 5th).

Member:

I'd keep the checks for pairs of values for now.
Even if we want to separate them for optimization, we are likely to want to combine them into superinstructions for interpretation, so we might as well keep them for now.

We can always split them up later.

Member Author:

Okay, I think I'll follow Mark's guidance, and keep the current structure.

Comment on lines +308 to +309

```c
_Py_DECREF_SPECIALIZED(right, (destructor)PyObject_Free);
_Py_DECREF_SPECIALIZED(left, (destructor)PyObject_Free);
```
Member:

This might benefit in the future from being two additional uops too, since they can be removed for known immortal values (although mechanically that might be difficult right now, given that we don't have a good way of sharing locals across uops).

Member Author:

Yeah, once we're over our concern for dispatch overhead in the Tier-2 interpreter, that makes a lot of sense. But I'm not quite over that yet (I would still like to see a Tier-2 interpreter that is not slower than Tier-1, even without all conceivable optimizations). So maybe we should just postpone this (maybe we need a new Ideas issue about the granularity of uops).

Member:

Let's not worry about refcount stuff for now, we can always do it later.
The important thing for now is that the interpreter generator can understand instructions made up of a sequence of micro-ops. With that we can break down instructions as much as we like later on.

```c
};


inst(BINARY_OP_MULTIPLY_INT, (unused/1, left, right -- prod)) {
op(_BINARY_OP_INT_GUARD, (left, right -- left, right)) {
```
Member:

Also, I think maybe the name could just be _GUARD_INT or something, since this could potentially be reused for things like COMPARE_OP.

Member Author:

We could rename it once we get to that bridge... But we'd probably end up moving it around again too. Anyway, it looks like there are other concerns for these guards.

Member:

Suggested name GUARD_BOTH_INTS as it checks that both are ints, regardless of how they are used.

Member Author:

I'll use _GUARD_BOTH_INT, _GUARD_BOTH_FLOAT, and _GUARD_BOTH_UNICODE.

```c
        return 2;
    case BINARY_OP_MULTIPLY_FLOAT:
        return 2;
        return 2+2;
```
Member:

I see a lot of these values changing. That seems unexpected.

@carljm (Member) May 26, 2023:

I think that's because this is implemented by counting the total stack inputs consumed by all the constituent uops, and similarly (but separately, in a separate switch) the total outputs pushed to stack by all constituent uops. So when an instruction gets split into two uops, and the first one pushes two (internal) "outputs" to stack that are consumed as inputs by the second one, you end up adding two to both "inputs consumed" and "outputs pushed", but that cancels out so the overall stack effect of the opcode remains the same.
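The cancellation described above can be made concrete with illustrative numbers (the per-uop effects below follow the stack comments shown elsewhere in this PR; the exact totals are for illustration): summing popped/pushed over the constituent micro-ops inflates both totals, but the net stack effect of the whole instruction is unchanged.

```python
# (popped, pushed) per micro-op for a split BINARY_OP_ADD_INT:
uops = [
    (2, 2),  # guard: (left, right -- left, right), re-"pushes" its inputs
    (2, 1),  # action: (left, right -- sum)
]

popped = sum(p for p, _ in uops)   # 4, vs. 2 for the unsplit instruction
pushed = sum(q for _, q in uops)   # 3, vs. 1 for the unsplit instruction
net = pushed - popped              # -1 either way
print(popped, pushed, net)  # 4 3 -1
```

So the totals are only correct taken together: each one individually overstates traffic by the two "internal" values handed from the guard to the action.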

Member:

Yeah, but I would expect this value not to exceed the total stack size at that point. Previously, _PyOpcode_num_popped and _PyOpcode_num_pushed could be used individually, but now they are only correct when used together, back to back, to calculate a net stack effect.

For instance, I'm pretty sure these functions can currently be used to track data-dependencies between instructions in a CFG without resorting to full abstract interpretation. This change breaks that.

@carljm (Member) May 26, 2023:

I think it should be possible to update the cases generator so it cancels out the "internal" stack effects of macro instructions... internally, and they don't show up in _PyOpcode_num_popped and _PyOpcode_num_pushed.

There are already macro instructions that currently break this, though (LOAD_NAME is one, since the PEP 695 implementation was merged), so it's not new with this PR, and changing it could probably be considered separately.

@gvanrossum (Member Author) May 27, 2023:

Thanks for calling this out. This is indeed unexpected. I'll look into how this is computed in the generator (it's been a while since I was in that code). If it's simple to fix I'll do it here, otherwise I'll plan a separate PR. (EDIT: I'm now pretty sure I took a shortcut when first writing this code.) (EDIT 2: It's not a simple fix; I've filed #105034 and will address this separately.)

Member:

I've commented on #105034.

Member Author:

I have now fixed this. @brandtbucher Can you re-review?

Co-authored-by: Brandt Bucher <[email protected]>
@gvanrossum (Member Author) left a comment

> At a higher level, I'm worried that this will be pretty cumbersome to carry further if we don't design a good solution for sharing locals/opargs/etc. across uops, or maybe even redesign the instruction format so the parts of the stream appear in the order they're used.
>
> I don't have any key insights here, but I sense that we'll be sort of fighting with the current generator/instruction format otherwise.

Thanks for this observation -- one reason I started pulling on this particular thread now was to make sure that things like this would surface sooner rather than later.

I am beginning to think that one thing we're running into here is that the needs of the optimizer, the machine code generator, and the Tier-2 interpreter are all somewhat different. My own focus has been largely on the Tier-2 interpreter, since I'm more confident that I can build one. Assuming we keep it a stack-based VM using the same stack as the base interpreter (not a foregone conclusion yet, but it makes some things easier), the only way for uops to pass things to each other is via the stack. The generator has a strategy for this in the base interpreter where such items aren't really pushed onto the evaluation stack but stored in local variables, with the hope that the C compiler does something semi-intelligent with them (see e.g. how _tmp_1 and _tmp_2 are used in the generated code for BINARY_OP_ADD_INT, or _tmp_1 in LOAD_LOCALS).
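A rough Python analogue of that _tmp_ strategy (hypothetical; the real generator emits C): values handed from one micro-op to the next live in locals named like the generated temporaries, and the evaluation stack is never touched in between.

```python
# Illustrative fused form of guard + action, with the intermediate
# "stack items" kept in locals instead of being materialized on the stack.

def binary_op_add_int_fused(stack):
    # Conceptually the guard "pushes" left/right and the action "pops"
    # them, but the generated code just shares these temporaries.
    _tmp_2 = stack.pop()   # right operand
    _tmp_1 = stack.pop()   # left operand
    # --- guard micro-op: reads the temporaries, stack untouched ---
    assert type(_tmp_1) is int and type(_tmp_2) is int
    # --- action micro-op: consumes the temporaries ---
    stack.append(_tmp_1 + _tmp_2)

stack = [20, 22]
binary_op_add_int_fused(stack)
print(stack)  # [42]
```

The hope expressed above is that a C compiler keeps such temporaries in registers, so the split costs nothing in the interpreter.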

This strategy may also work for the optimizer -- I assume it'll be maintaining some kind of internal representation of what's on the stack anyway. But for the machine code generator we would need something different -- in fact a register-based VM might make more sense here.

In any case, I'd love to hear from @markshannon about this, which probably means we'll have to wait until he and I are both back from our respective vacations.

@@ -3302,7 +3324,7 @@ dummy_func(

```c
    top = Py_NewRef(bottom);
}

inst(BINARY_OP, (unused/1, lhs, rhs -- res)) {
op(_BINARY_OP_SPECIALIZE, (unused/1, lhs, rhs -- lhs, rhs)) {
```
Member:

This should be left alone.
We don't need or want to break up all instructions, just the specialized ones.
In fact, next_instr will not be available to the tier 2 optimizer, so it will have to reject this instruction.

Member Author:

Good point. In tier 2 we have to set ENABLE_SPECIALIZATION to false, since specializing a tier 2 instruction makes no sense -- that's the job of tier 1.

@markshannon (Member)

Looks good.
The generated pop/push numbers seem correct, and this fixes the values for LOAD_NAME.
The generated code looks clean.

@gvanrossum gvanrossum merged commit df396b5 into python:main May 31, 2023
@gvanrossum gvanrossum deleted the split-ops branch May 31, 2023 15:09
@markshannon (Member) commented May 31, 2023

This seems to be consistently a bit slower.

Not sure why. It might be something to do with the generated instructions, or it might just be a layout issue.
I suspect it is just a layout/cache issue, but it might be worth checking the size of _PyEval_EvalFrameDefault() to double-check.

@gvanrossum (Member Author)

Oh. When I benchmarked this before it was 0.9991x as fast, so about 0.1% slower. I'm not sure how to compare the function sizes.

@markshannon (Member)

Obviously you want a non-debug build.
I use nm, but I'm sure there are other tools.

```shell
nm -S -td ./python | grep PyEval_EvalFrameDefault
```

The first number is the address, the second is the size.

@gvanrossum (Member Author)

I couldn't be bothered with PGO/LTO or even building ./python, so I measured the size in the context of ceval.o only. At this commit:

```
[root@codespaces-c47a11 cpython]# nm -S -td Python/ceval.o | grep PyEval_EvalFrameDefault
0000000000001008 0000000000068923 T _PyEval_EvalFrameDefault
0000000000000032 0000000000002048 r _PyEval_EvalFrameDefault.opcode_targets
```

At the previous commit:

```
[root@codespaces-c47a11 cpython]# nm -S -td Python/ceval.o | grep PyEval_EvalFrameDefault
0000000000001008 0000000000068939 T _PyEval_EvalFrameDefault
0000000000000032 0000000000002048 r _PyEval_EvalFrameDefault.opcode_targets
```

That looks like it actually became slightly smaller (old: 68939, new: 68923, a reduction of 16 bytes).
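The size comparison above can be automated with a small helper (not from the PR; a sketch assuming the `nm -S -td` output format shown in the listings, i.e. decimal address, decimal size, symbol type, symbol name):

```python
# Parse `nm -S -td` style output and compare a symbol's size across builds.

def symbol_size(nm_output, symbol):
    """Return the size column for `symbol` from nm -S -td output."""
    for line in nm_output.splitlines():
        parts = line.split()
        # Expected columns: address, size, type, name.
        if len(parts) == 4 and parts[3] == symbol:
            return int(parts[1])
    raise LookupError(symbol)

old = "0000000000001008 0000000000068939 T _PyEval_EvalFrameDefault"
new = "0000000000001008 0000000000068923 T _PyEval_EvalFrameDefault"
sym = "_PyEval_EvalFrameDefault"
delta = symbol_size(old, sym) - symbol_size(new, sym)
print(delta)  # 16
```

This reproduces the 16-byte reduction computed by hand above.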
