
gh-104909: Split BINARY_OP into micro-ops #104910

Merged: 8 commits into python:main, May 31, 2023

Conversation

@gvanrossum (Member) commented May 24, 2023

This includes some necessary changes to the cases generator.

The diff is larger than it could be because I moved things around so the specializations are grouped by type first, then by operation (since the guards go with the type).

This version doesn't yet do ugly things to access oparg (as described in the original ideas issue) since we don't have a tier-2 interpreter yet.

BINARY_OP_INPLACE_ADD_UNICODE remains an odd duck.
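As a purely illustrative model of the split being described (not CPython's actual code; all names and the simplified type checks here are hypothetical), a specialized instruction becomes a type-guard micro-op that can deoptimize, followed by an action micro-op:

```python
# Hypothetical Python sketch: a specialized instruction such as
# BINARY_OP_ADD_INT is split into a guard micro-op plus an action micro-op.

class Deopt(Exception):
    """Raised by a guard; the real interpreter would fall back to BINARY_OP."""

def guard_both_int(stack):
    # Analogue of a "both operands are int" guard: peek at the two operands,
    # deoptimize on a type mismatch. Stack effect (left, right -- left, right):
    # nothing is popped or pushed.
    left, right = stack[-2], stack[-1]
    if not (type(left) is int and type(right) is int):
        raise Deopt

def binary_op_add_int(stack):
    # The action micro-op: pop both operands, push their sum.
    right = stack.pop()
    left = stack.pop()
    stack.append(left + right)

def run_binary_op_add_int(stack):
    # The macro instruction is just its micro-ops run in sequence.
    guard_both_int(stack)
    binary_op_add_int(stack)

stack = [3, 4]
run_binary_op_add_int(stack)
print(stack)  # [7]
```

The point of the split is that the guard and the action become separately addressable units, which a later optimizer could rearrange or eliminate.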

To begin, we split BINARY_OP.
@gvanrossum gvanrossum marked this pull request as draft May 25, 2023 03:22
@gvanrossum gvanrossum requested a review from brandtbucher May 25, 2023 03:34
@gvanrossum (Member Author)

@brandtbucher Could you give me your professional opinion on this? These are relatively straightforward instruction splits, but I didn't really solve the problem of the specializing op requiring access to both a cache entry and the oparg (see faster-cpython/ideas#592 (comment)).

@Fidget-Spinner (Member) left a comment


LGTM. I wonder if the pyperformance numbers are neutral on this? I would expect them to be.

@gvanrossum (Member Author)

> LGTM. I wonder if the pyperformance numbers are neutral on this? I would expect them to be.

Okay, I've started a benchmark run. Results in a few hours.

@gvanrossum
Copy link
Member Author

> Okay, I've started a benchmark run. Results in a few hours.

Results: "1.00x slower". The plot says 0.9991, which is well within the margin of error. I'll un-draft it; hopefully @brandtbucher will review.

@gvanrossum gvanrossum marked this pull request as ready for review May 26, 2023 00:27
@brandtbucher (Member)

Thanks, I’ll take a look tomorrow!

@brandtbucher (Member) left a comment


Cool! I left some comments below (they're on specific lines in the diff, but many of them apply generally to most of the instructions changed).

At a higher level, I'm worried that this will be pretty cumbersome to carry further if we don't design a good solution for sharing locals/opargs/etc. across uops, or maybe even redesign the instruction format so the parts of the stream appear in the order they're used.

I don't have any key insights here, but I sense that we'll be sort of fighting with the current generator/instruction format otherwise.

Comment on lines -282 to +289

```c
BINARY_OP_ADD_FLOAT,
BINARY_OP_MULTIPLY_INT,
BINARY_OP_ADD_INT,
BINARY_OP_ADD_UNICODE,
// BINARY_OP_INPLACE_ADD_UNICODE, // This is an odd duck.
BINARY_OP_SUBTRACT_INT,
BINARY_OP_MULTIPLY_FLOAT,
BINARY_OP_MULTIPLY_INT,
BINARY_OP_ADD_FLOAT,
BINARY_OP_SUBTRACT_FLOAT,
BINARY_OP_SUBTRACT_INT,
BINARY_OP_ADD_UNICODE,
// BINARY_OP_INPLACE_ADD_UNICODE, // See comments at that opcode.
```
Member:

Minor nit: these were alphabetized before, and now they're not. I think it makes sense to reorder the implementations below, but I'm not sure there's also value in reordering these.

Member Author:

I'm not a big fan of alphabetization any more (just search :-), and I tried to let the grouping match the ordering of the definitions below, with the exception of BINARY_OP itself, which is somewhere else entirely.

TBH I'm not sure that there's a single organizing principle in the ordering in this file any more; I like to keep families together, but I also don't like to move code around unnecessarily.

If you insist I can undo this chunk.

Comment on lines 292 to 295

```c
op(_BINARY_OP_INT_GUARD, (left, right -- left, right)) {
    DEOPT_IF(!PyLong_CheckExact(left), BINARY_OP);
    DEOPT_IF(!PyLong_CheckExact(right), BINARY_OP);
}
```
Member:

It seems to me that there might be value in splitting these types of checks up: one for guarding the top of stack, and one for guarding the second item on the stack.

That way we could successfully remove uops for cases like 1 + rhs or lhs += 1, where the type of one argument is known (or inferred).
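A toy sketch of the guard elimination being proposed (names like _CHECK_INT come from the straw-man syntax discussed in this thread; the optimizer itself is hypothetical): with one guard per operand, a trace optimizer that already knows one operand's type can drop that operand's guard.

```python
# Illustrative only: eliminate per-operand type guards whose stack slot
# is already known to hold an int.

def optimize(trace, known_int_slots):
    """Remove _CHECK_INT guards for slots already known to be int."""
    out = []
    for op, arg in trace:
        if op == "_CHECK_INT" and arg in known_int_slots:
            continue  # guard is redundant, drop it
        out.append((op, arg))
    return out

# For something like `lhs + 1`, one operand is a known int constant,
# so its guard can go away while the other guard stays.
trace = [("_CHECK_INT", 1), ("_CHECK_INT", 2), ("_BINARY_OP_ADD_INT", None)]
print(optimize(trace, known_int_slots={1}))
# [('_CHECK_INT', 2), ('_BINARY_OP_ADD_INT', None)]
```

With a single combined guard for both operands, this per-operand elimination is not possible, which is the trade-off the thread goes on to discuss.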

@Fidget-Spinner (Member) May 26, 2023:

Just a heads up: in the very near future, for lazy basic block versioning, we plan to just have a generic CHECK_INT x that checks that the xth stack entry is indeed an int. The main reason is that it would save us an opcode.

@brandtbucher (Member) May 26, 2023:

That makes sense. Just need space for a couple extra opargs. I wonder if it would make sense to have "hardcoded" or "shared" opargs as part of macro instructions.

So we could have (with straw-man syntax):

```c
macro(BINARY_OP, oparg) = _BINARY_OP_SPECIALIZE(oparg) + _BINARY_OP_ACTION(oparg);
macro(BINARY_OP_ADD_INT, oparg) = _CHECK_INT(1) + _CHECK_INT(2) + _BINARY_OP_ADD_INT(oparg);
```

I sort of like it, and maybe the scheme could be extended to share other locals or caches too.
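A toy expansion of the straw-man syntax above (hypothetical; the real cases generator works on the C DSL, not on Python data structures): a macro is a list of micro-ops, each carrying either a hardcoded argument or a shared `oparg` placeholder that gets bound at expansion time.

```python
# Illustrative macro table: OPARG marks "substitute the instruction's oparg";
# integers are hardcoded arguments (e.g. which stack entry to check).

OPARG = object()  # sentinel for the shared oparg

MACROS = {
    "BINARY_OP": [("_BINARY_OP_SPECIALIZE", OPARG),
                  ("_BINARY_OP_ACTION", OPARG)],
    "BINARY_OP_ADD_INT": [("_CHECK_INT", 1), ("_CHECK_INT", 2),
                          ("_BINARY_OP_ADD_INT", OPARG)],
}

def expand(instr, oparg):
    # Bind shared opargs; pass hardcoded arguments through untouched.
    return [(uop, oparg if arg is OPARG else arg)
            for uop, arg in MACROS[instr]]

print(expand("BINARY_OP_ADD_INT", oparg=0))
# [('_CHECK_INT', 1), ('_CHECK_INT', 2), ('_BINARY_OP_ADD_INT', 0)]
```

The same binding scheme could plausibly be extended to share other locals or cache entries, as the comment suggests.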

Member:

@gvanrossum, thoughts on something like the above?

Member:

I forgot to also mention that we encountered many situations where one type is known but the other isn't, so it helps to have some granularity.

Member Author:

These are all very interesting proposals, and I'll hold up this PR until we've got some agreement on what we eventually want to do.

  • Splitting the type-checking guard into atomic pieces: this makes sense from the POV of machine code generation (fewer, simpler templates) but it would make the Tier-2 interpreter slower, due to increased opcode decoding overhead (unless the guards can be eliminated, of course). There are various possible ways to address that though (possibly even super-micro-ops :-), so maybe we should just go for it, if we can agree on a syntax.
  • Having micro-ops with parameters, notably _CHECK_INT(1). I don't know how much effort this would be in the generator, but I expect it'll be easy enough, and if we are indeed going with the smallest possible uops, we should just do it.
  • Being explicit about which uops use oparg. This feels like it might require more effort in the generator, but it also sounds like an interesting way to go.

All in all, I think we should probably have an in-person discussion about this (earliest I'm available is the week of June 5th).

Member:

I'd keep the checks for pairs of values for now.
Even if we want to separate them for optimization, we are likely to want to combine them into superinstructions for interpretation, so we might as well keep them for now.

We can always split them up later.

Member Author:

Okay, I think I'll follow Mark's guidance, and keep the current structure.

Comment on lines +308 to +309

```c
_Py_DECREF_SPECIALIZED(right, (destructor)PyObject_Free);
_Py_DECREF_SPECIALIZED(left, (destructor)PyObject_Free);
```
Member:

This might benefit in the future from being two additional uops too, since they can be removed for known immortal values (although mechanically that might be difficult right now, given that we don't have a good way of sharing locals across uops).

Member Author:

Yeah, once we're over our concern for dispatch overhead in the Tier-2 interpreter, that makes a lot of sense. But I'm not quite over that yet (I would still like to see a Tier-2 interpreter that is not slower than Tier-1, even without all conceivable optimizations). So maybe we should just postpone this (maybe we need a new Ideas issue about the granularity of uops).

Member:

Let's not worry about refcount stuff for now, we can always do it later.
The important thing for now is that the interpreter generator can understand instructions made up of a sequence of micro-ops. With that we can break down instructions as much as we like later on.

```c
};


inst(BINARY_OP_MULTIPLY_INT, (unused/1, left, right -- prod)) {
op(_BINARY_OP_INT_GUARD, (left, right -- left, right)) {
```
Member:

Also, I think maybe the name could just be _GUARD_INT or something, since this could potentially be reused for things like COMPARE_OP.

Member Author:

We could rename it once we get to that bridge... But we'd probably end up moving it around again too. Anyway, it looks like there are other concerns for these guards.

Member:

Suggested name GUARD_BOTH_INTS as it checks that both are ints, regardless of how they are used.

Member Author:

I'll use _GUARD_BOTH_INT, _GUARD_BOTH_FLOAT, and _GUARD_BOTH_UNICODE.

```c
        return 2;
    case BINARY_OP_MULTIPLY_FLOAT:
        return 2;
        return 2+2;
```
Member:

I see a lot of these values changing. That seems unexpected.

@carljm (Member) May 26, 2023:

I think that's because this is implemented by counting the total stack inputs consumed by all the constituent uops, and similarly (but separately, in a separate switch) the total outputs pushed to stack by all constituent uops. So when an instruction gets split into two uops, and the first one pushes two (internal) "outputs" to stack that are consumed as inputs by the second one, you end up adding two to both "inputs consumed" and "outputs pushed", but that cancels out so the overall stack effect of the opcode remains the same.
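The cancellation described above can be made concrete with illustrative numbers (the per-uop effects below follow the stack comments shown elsewhere in this PR; the exact totals are for illustration): summing popped/pushed over the constituent micro-ops inflates both totals, but the net stack effect of the whole instruction is unchanged.

```python
# (popped, pushed) per micro-op for a split BINARY_OP_ADD_INT:
uops = [
    (2, 2),  # guard: (left, right -- left, right), re-"pushes" its inputs
    (2, 1),  # action: (left, right -- sum)
]

popped = sum(p for p, _ in uops)   # 4, vs. 2 for the unsplit instruction
pushed = sum(q for _, q in uops)   # 3, vs. 1 for the unsplit instruction
net = pushed - popped              # -1 either way
print(popped, pushed, net)  # 4 3 -1
```

So the totals are only correct taken together: each one individually overstates traffic by the two "internal" values handed from the guard to the action.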

Member:

Yeah, but I would expect this value not to exceed the total stack size at that point. Previously, _PyOpcode_num_popped and _PyOpcode_num_pushed could be used individually, but now they are only correct when used together, back to back, to calculate a net stack effect.

For instance, I'm pretty sure these functions can currently be used to track data-dependencies between instructions in a CFG without resorting to full abstract interpretation. This change breaks that.

@carljm (Member) May 26, 2023:

I think it should be possible to update the cases generator so it cancels out the "internal" stack effects of macro instructions... internally, and they don't show up in _PyOpcode_num_popped and _PyOpcode_num_pushed.

There are already macro instructions that currently break this, though (LOAD_NAME is one, since the PEP 695 implementation was merged), so it's not new with this PR, and changing it could probably be considered separately.

@gvanrossum (Member Author) May 27, 2023:

Thanks for calling this out. This is indeed unexpected. I'll look into how this is computed in the generator (it's been a while since I was in that code). If it's simple to fix I'll do it here, otherwise I'll plan a separate PR. (EDIT: I'm now pretty sure I took a shortcut when first writing this code.) (EDIT 2: It's not a simple fix; I've filed #105034 and will address this separately.)

Member:

I've commented on #105034.

Member Author:

I have now fixed this. @brandtbucher Can you re-review?

Co-authored-by: Brandt Bucher <[email protected]>
@gvanrossum (Member Author) left a comment

> At a higher level, I'm worried that this will be pretty cumbersome to carry further if we don't design a good solution for sharing locals/opargs/etc. across uops, or maybe even redesign the instruction format so the parts of the stream appear in the order they're used.
>
> I don't have any key insights here, but I sense that we'll be sort of fighting with the current generator/instruction format otherwise.

Thanks for this observation -- one reason I started pulling on this particular thread now was to make sure that things like this would surface sooner rather than later.

I am beginning to think that one thing we're running into here is that the needs of the optimizer, the machine code generator, and the Tier-2 interpreter are all somewhat different. My own focus has been largely on the Tier-2 interpreter, since I'm more confident that I can build one. Assuming we keep it a stack-based VM using the same stack as the base interpreter (not a foregone conclusion yet, but it makes some things easier), the only way for uops to pass things to each other is via the stack. The generator has a strategy for this in the base interpreter where such items aren't really pushed onto the evaluation stack but stored in local variables, with the hope that the C compiler does something semi-intelligent with them (see e.g. how _tmp_1 and _tmp_2 are used in the generated code for BINARY_OP_ADD_INT, or _tmp_1 in LOAD_LOCALS).
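A rough Python analogue of that _tmp_ strategy (hypothetical; the real generator emits C): values handed from one micro-op to the next live in locals named like the generated temporaries, and the evaluation stack is never touched in between.

```python
# Illustrative fused form of guard + action, with the intermediate
# "stack items" kept in locals instead of being materialized on the stack.

def binary_op_add_int_fused(stack):
    # Conceptually the guard "pushes" left/right and the action "pops"
    # them, but the generated code just shares these temporaries.
    _tmp_2 = stack.pop()   # right operand
    _tmp_1 = stack.pop()   # left operand
    # --- guard micro-op: reads the temporaries, stack untouched ---
    assert type(_tmp_1) is int and type(_tmp_2) is int
    # --- action micro-op: consumes the temporaries ---
    stack.append(_tmp_1 + _tmp_2)

stack = [20, 22]
binary_op_add_int_fused(stack)
print(stack)  # [42]
```

The hope expressed above is that a C compiler keeps such temporaries in registers, so the split costs nothing in the interpreter.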

This strategy may also work for the optimizer -- I assume it'll be maintaining some kind of internal representation of what's on the stack anyway. But for the machine code generator we would need something different -- in fact a register-based VM might make more sense here.

In any case, I'd love to hear from @markshannon about this, which probably means we'll have to wait until he and I are both back from our respective vacations.

@@ -3302,7 +3324,7 @@ dummy_func(

```c
    top = Py_NewRef(bottom);
}

inst(BINARY_OP, (unused/1, lhs, rhs -- res)) {
op(_BINARY_OP_SPECIALIZE, (unused/1, lhs, rhs -- lhs, rhs)) {
```
Member:

This should be left alone.
We don't need or want to break up all instructions, just the specialized ones.
In fact, next_instr will not be available to the tier 2 optimizer, so it will have to reject this instruction.

Member Author:

Good point. In tier 2 we have to set ENABLE_SPECIALIZATION to false, since specializing a tier 2 instruction makes no sense -- that's the job of tier 1.

@markshannon (Member)

Looks good.
The generated pop/push numbers seem correct, and this fixes the values for LOAD_NAME.
The generated code looks clean.

@gvanrossum gvanrossum merged commit df396b5 into python:main May 31, 2023
@gvanrossum gvanrossum deleted the split-ops branch May 31, 2023 15:09
@markshannon (Member) commented May 31, 2023

This seems to be consistently a bit slower.

Not sure why. It might be something to do with the generated instructions, or it might just be a layout issue.
I suspect it is just a layout/cache issue, but it might be worth checking the size of _PyEval_EvalFrameDefault() to double-check.

@gvanrossum (Member Author)

Oh. When I benchmarked this before it was 0.9991x as fast, so about 0.1% slower. I'm not sure how to compare the function sizes.

@markshannon (Member)

Obviously you want a non-debug build.
I use nm, but I'm sure there are other tools.

```shell
nm -S -td ./python | grep PyEval_EvalFrameDefault
```

The first number is the address, the second is the size.

@gvanrossum (Member Author)

I couldn't be bothered with PGO/LTO or even building ./python, so I measured the size in the context of ceval.o only. At this commit:

```
[root@codespaces-c47a11 cpython]# nm -S -td Python/ceval.o | grep PyEval_EvalFrameDefault
0000000000001008 0000000000068923 T _PyEval_EvalFrameDefault
0000000000000032 0000000000002048 r _PyEval_EvalFrameDefault.opcode_targets
```

At the previous commit:

```
[root@codespaces-c47a11 cpython]# nm -S -td Python/ceval.o | grep PyEval_EvalFrameDefault
0000000000001008 0000000000068939 T _PyEval_EvalFrameDefault
0000000000000032 0000000000002048 r _PyEval_EvalFrameDefault.opcode_targets
```

That looks like it actually became slightly smaller (old: 68939, new: 68923, a reduction of 16 bytes).
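The size comparison above can be automated with a small helper (not from the PR; a sketch assuming the `nm -S -td` output format shown in the listings, i.e. decimal address, decimal size, symbol type, symbol name):

```python
# Parse `nm -S -td` style output and compare a symbol's size across builds.

def symbol_size(nm_output, symbol):
    """Return the size column for `symbol` from nm -S -td output."""
    for line in nm_output.splitlines():
        parts = line.split()
        # Expected columns: address, size, type, name.
        if len(parts) == 4 and parts[3] == symbol:
            return int(parts[1])
    raise LookupError(symbol)

old = "0000000000001008 0000000000068939 T _PyEval_EvalFrameDefault"
new = "0000000000001008 0000000000068923 T _PyEval_EvalFrameDefault"
sym = "_PyEval_EvalFrameDefault"
delta = symbol_size(old, sym) - symbol_size(new, sym)
print(delta)  # 16
```

This reproduces the 16-byte reduction computed by hand above.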
