feat(benchmark): add benchmark_test test type #1945
base: main
Conversation
Force-pushed from 641036c to af00ec2
Force-pushed from af00ec2 to de7f485
There are some issues in generating the fixture. I compared the newly created fixture to the original one, and its size is much larger. This should not happen: the content should be the same, so the size should be the same. But this is not a big problem for now. The major issue now is to resolve the failing test in CI, which I cannot reproduce locally.
This can come in handy for benchmark tests, as they basically force the consumption of all the gas available. That condition forces us to implement padding techniques to consume EXACTLY all the gas available in a block, when in reality, for a benchmark, we don't care about this at all.
@CPerezz I think this is still necessary for the Nethermind team (increasing the gas limit) and the zkEVM team (proving the entire block)? For gas limit testing, I am not sure if they can run only one tx and then derive the entire block execution time from it.
But you can emit a warning if needed. Why does it need to be a failure not to spend ALL the gas exactly? I agree it has to be within a bound, sure. But precision down to the unit is really different, especially when you have to account for memory expansion and other costs. It's almost impossible not to need padding. I'm not advocating to remove this completely, but maybe to relax it. Or at least, it would be useful to know why it needs to fail specifically. When and why was this introduced?
@CPerezz Thank you for the explanation, it is very clear! I will review the included features again and discuss with the team. As you see this is still a draft and we welcome any feedback. We also want to know what the stateless client team needs for benchmarking: what are your considerations when benchmarking?
@LouisTsai-Csie I'm just speaking in regard to the "State bottlenecks" project, which is within the stateless-consensus team. Our goal is to measure how different client implementations behave under heavy load and different state sizes, among other things. For that, we need these kinds of benchmarks. But it turns out to be quite tricky to match the gas spent perfectly, and it's not required at all. 1% of wiggle room is enough to consider the benchmark useful even if it doesn't spend all the gas of the block.
src/ethereum_test_specs/benchmark.py (Outdated)
pre: Alloc
post: Alloc
tx: Optional[Transaction] = None
blocks: Optional[List[Block]] = None
Re #2112, I think we could perhaps have `setup_tx` and `setup_blocks`, which contain transactions that are specifically part of the benchmark setup.
The main problem I see is that currently we do `pre.fund_eoa` for both (1) accounts that send these setup transactions and (2) accounts that send the actual benchmarking workload transactions, and they are indistinguishable at the moment.
One option could be to add a field to `pre.fund_eoa` that indicates whether the account is meant to send setup transactions or workload transactions, so we can fund this account only in the setup phase of `execute`:
setup_account = pre.fund_eoa(account_type="setup")
Downside being that the test writer needs to be cognizant of this and properly label all accounts.
Just spitballing here but what if we have context managers manage each phase for benchmark tests?
@pytest.mark.benchmark
def test_some_benchmark(benchmark, pre, blockchain_test):
    with benchmark.setup():  # Auto-tagged as setup
        setup_contract = pre.deploy_contract(...)
        contract_under_test = pre.deploy_contract(code=..., storage=..., stub="...")
        setup_acct = pre.fund_eoa()
        setup_block = Block(txs=[
            Transaction(...),
            Transaction(...),
        ])

    with benchmark.execution():  # Auto-tagged as execution
        acct1 = pre.fund_eoa()
        # for execute remote this is the seed / private key sender?
        execution_block = Block(txs=[
            Transaction(...),
        ])

    blockchain_test(...)
One possible way I've used this in the past is tracking certain contexts with `ContextVar`. This can be reset with every test and could be used in a `try / finally` sort of block. Downside (but maybe a plus?) is you also have to be explicit about each phase and this may not always work out to be so deterministic 🤔. These are things that would have to be determined anyway though, I think, with any sort of phase management.
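For what it's worth, a minimal sketch of that `ContextVar` idea (all names here are hypothetical, not part of the PR): the current phase lives in a `ContextVar`, a context manager sets it on entry, and a `try / finally` restores it so each test starts from a clean state.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from typing import Iterator

# Hypothetical module-level phase tracker; defaults to the execution phase.
_current_phase: ContextVar[str] = ContextVar("benchmark_phase", default="execution")


@contextmanager
def phase(name: str) -> Iterator[None]:
    """Tag everything created inside the block as belonging to the given phase."""
    token = _current_phase.set(name)
    try:
        yield
    finally:
        _current_phase.reset(token)  # always restored, even if the test body raises


# Objects created inside the block can read the phase to tag themselves.
with phase("setup"):
    assert _current_phase.get() == "setup"
assert _current_phase.get() == "execution"
```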
This would be a very nice solution. If we could make it so that the default context is `execution` (or `workload` perhaps?) I think that would be great.
I like this approach! And making `execution` the default phase is a good idea.
Force-pushed from de7f485 to 688e861
After going through the current implementation and thinking about it, I think this PR is mostly on the right track.
My suggestions would be:
- We have a single new spec, `benchmark_tests`, that receives `setup_txs` and `workload_txs`, or a `generator`.
- We have multiple generator subclasses, all of which subclass `BenchmarkCodeGenerator` and implement `generate_setup_txs` and `generate_workload_txs` (and perhaps `deploy_contracts`).
- Internally, `benchmark_tests` takes `setup_txs` (or calls `generator.generate_setup_txs()`) and, if any, generates a first setup block, then takes `workload_txs` (or calls `generator.generate_workload_txs()`) and puts them in a different block (see the sketch below).
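For illustration only, a rough sketch of how that spec/generator split could look; the method names, signatures, and the fork-per-call style are assumptions, not the PR's actual interface:

```python
from __future__ import annotations

from abc import ABC, abstractmethod

# Fork, Alloc and Transaction stand for the existing EEST types; they are referenced only in
# annotations here (not imported) so this remains a runnable sketch.


class BenchmarkCodeGenerator(ABC):
    """Produces the setup and workload transactions consumed by the benchmark spec."""

    @abstractmethod
    def deploy_contracts(self, *, pre: Alloc, fork: Fork) -> None:
        """Deploy any contracts the benchmark needs (setup phase)."""

    @abstractmethod
    def generate_setup_txs(self, *, pre: Alloc, fork: Fork) -> list[Transaction]:
        """Transactions placed in the first (setup) block, if any."""

    @abstractmethod
    def generate_workload_txs(self, *, pre: Alloc, fork: Fork) -> list[Transaction]:
        """Transactions placed in the separate workload block."""
```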
I'm leaning more towards removing `benchmark_state` and leaving only `benchmark`, because it feels like the state format is heavily constrained by the transaction gas limit cap, it's simply more work to introduce two different formats, and it's also confusing to testers who would have to know which one to use each time.
I would like to remove the `benchmark_state` wrapper! The reason I added it, and the only concern now, is this:

Spencer: Thanks for adding the issue. Dan briefly spoke to the geth team and I think they wanted to keep state_test. We could still remove it nonetheless.

We could later ask the geth team what they need.
I see. I feel like we should try to convince them that the state format is more suitable for consensus tests, and for benchmarking tests we should prefer the blockchain test format, because we can fit many transactions in it for a more realistic scenario IMO.
class BenchmarkCodeGenerator(ABC):
    """Abstract base class for generating benchmark bytecode."""

    def __init__(
        self,
        fork: Fork,
        attack_block: Bytecode,
        setup: Optional[Bytecode] = None,
    ):
        """Initialize with fork, attack block, and optional setup bytecode."""
        self.fork = fork
        self.setup = setup or Bytecode()
        self.attack_block = attack_block
If we decide to stick with this kind of abstract class, we can refactor this to be a `dataclass`.
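A minimal sketch of what that `dataclass` refactor could look like, keeping the same fields as the `__init__` above (import paths are assumed to match the rest of the module):

```python
from abc import ABC
from dataclasses import dataclass, field

from ethereum_test_forks import Fork
from ethereum_test_vm.bytecode import Bytecode


@dataclass
class BenchmarkCodeGenerator(ABC):
    """Abstract base class for generating benchmark bytecode."""

    fork: Fork
    attack_block: Bytecode
    setup: Bytecode = field(default_factory=Bytecode)
```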
Force-pushed from f99f318 to 30b4f76
I refactored the helper functions and added the context manager feature. During the update, some questions and todos came to mind:
Regarding the questions you have:

“Maybe we could move the abstract class”

Sgtm, I'm still open to being convinced that we indeed need it.

That might be out of scope for this PR and we should leave that for the PR that touches the …
Looking really good! I think the code generators are fantastic, and the only part I feel we should take out and move into another PR is the `BenchmarkManager`.
src/ethereum_test_specs/benchmark.py (Outdated)
model_config = ConfigDict(extra="forbid")

pre: Alloc
post: Alloc
We can add a default here because most of the time we don't specify one for benchmark tests.
- post: Alloc
+ post: Alloc = Field(default_factory=Alloc)
src/ethereum_test_specs/benchmark.py (Outdated)
pre: Alloc
post: Alloc
tx: Optional[Transaction] = None
- tx: Optional[Transaction] = None
+ tx: Transaction | None = None

Use `X | None` instead of `Optional`; same for the rest of the fields.
src/ethereum_test_specs/benchmark.py (Outdated)
expected_benchmark_gas_used: int | None = None
gas_benchmark_value: int
benchmark_manager: Optional[Any] = Field(default=None, exclude=True)
code_generator: Optional[Any] = Field(default=None, exclude=True)
- code_generator: Optional[Any] = Field(default=None, exclude=True)
+ code_generator: Optional[BenchmarkCodeGenerator] = Field(default=None, exclude=True)

And to be able to do this, we need to bring the abstract class definition to this file, while leaving the other classes in benchmark_code_generator.py.
I updated the implementation to:

- code_generator: Optional[Any] = Field(default=None, exclude=True)
+ code_generator: BenchmarkCodeGenerator | None = None

However, I kept running into issues with Pydantic model validation. The main problem is that the `Bytecode` type doesn't natively support Pydantic validation. There are a couple of possible solutions, and I went with the second one:

1. Update the `BenchmarkTest` class to allow arbitrary types:

model_config = ConfigDict(extra="forbid", arbitrary_types_allowed=True)  # Allowing arbitrary types seems unsafe

2. Add Pydantic type validation support directly to the `Bytecode` type:

@classmethod
def __get_pydantic_core_schema__(
    cls, source_type: Any, handler: GetCoreSchemaHandler
) -> PlainValidatorFunctionSchema:
    """Provide Pydantic core schema for Bytecode serialization and validation."""
    return no_info_plain_validator_function(
        cls,
        serialization=handler.generate_schema(bytes),
    )
The same change applies to `BenchmarkManager`.
pre=pre,
post={},
tx=tx,
code_generator=JumpLoopGenerator(fork, Op.JUMPDEST),
Ideally we should not need to pass the fork here, because `benchmark_test` is always going to receive it from the filler:

- code_generator=JumpLoopGenerator(fork, Op.JUMPDEST),
+ code_generator=JumpLoopGenerator(Op.JUMPDEST),

We should instead have `fork` as a parameter in all functions of `BenchmarkCodeGenerator`.
pre=pre,
post={},
tx=tx,
code_generator=JumpLoopGenerator(fork, Op.JUMPDEST),
gas_benchmark_value=gas_benchmark_value,
A trick to avoid having to pass this value every time would be to add "gas_benchmark_value" to this constant here:

execution-spec-tests/src/pytest_plugins/shared/execute_fill.py, lines 24 to 27 in a7544eb:

ALL_FIXTURE_PARAMETERS = {
    "genesis_environment",
    "env",
}

Then `fill` and `execute` know that if the value was not passed, we have to use the fixture value instead.
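If I read the suggestion correctly, the change would just be adding the new key to that set (a sketch; the surrounding plugin wiring is untouched):

```python
# src/pytest_plugins/shared/execute_fill.py (sketch of the suggested addition)
ALL_FIXTURE_PARAMETERS = {
    "genesis_environment",
    "env",
    "gas_benchmark_value",  # fall back to the fixture value when the spec doesn't pass it
}
```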
This is so nice, I also want to avoid passing `gas_benchmark_value`.
from ethereum_test_vm.opcode import Opcodes as Op


@dataclass
- @dataclass
+ @dataclass(kw_only=True)
Just to force specifying the argument names when instantiating the class and make the code more readable, i.e.
JumpLoopGenerator(fork=fork, attack_block=Op.JUMPDEST)
nit: I see this was added to the ABC. We can remove the `@dataclass` here and for other `BenchmarkCodeGenerator` implementations, as this is inherited.
src/ethereum_test_specs/benchmark.py (Outdated)
_current_phase: ContextVar[Optional[BenchmarkPhase]] = ContextVar("benchmark_phase", default=None)


class BenchmarkManager:
I'm not completely convinced that this is the approach we should take, mainly because I feel that we can make this more generalized and seamless, for example by having a `TestPhaseManager` that lives directly in `ethereum_test_types`; classes like `Transaction` or `Block` could use it when being instantiated to mark themselves automatically as part of a given test phase, and then we could pick up this same field during the appropriate phases in the `fill` and `execute` commands.
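To make that idea concrete, a hypothetical sketch (none of these names exist yet): a phase manager in `ethereum_test_types` exposes the current phase, and objects record it when instantiated so `fill`/`execute` can group them later.

```python
from contextvars import ContextVar
from dataclasses import dataclass, field


class TestPhaseManager:
    """Tracks which phase of a test is currently constructing objects."""

    _phase: ContextVar[str] = ContextVar("test_phase", default="execution")

    @classmethod
    def current(cls) -> str:
        return cls._phase.get()


@dataclass
class Transaction:  # stand-in for ethereum_test_types.Transaction
    data: bytes = b""
    # Recorded automatically at instantiation; fill/execute could group txs by this field.
    phase: str = field(default_factory=TestPhaseManager.current)


tx = Transaction()
assert tx.phase == "execution"
```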
I think we could extract everything related to `BenchmarkManager` from this PR and move it to a separate follow-up PR, just to get this one merged, because I feel like it's going to become too big otherwise, wdyt?
I agree, I will work on a follow-up PR for this!
Created a follow-up PR #2157 for this.
I think we could extract everything related to BenchmarkManager from this PR and move it to a separate follow-up PR, just to get this one merged because I feel like it's going to become too big otherwise, wdyt?
@LouisTsai-Csie, we should remove this logic from this PR, right? Or does this affect the other changes here too much to warrant this and we should refactor it in #2157? If so, that PR needs to be rebased from this one eventually right? If we can remove it here that might be nice.
Thanks, I removed the BenchmarkManager and moved it to PR #2157.
Force-pushed from e803c98 to 1da86fa
Unsure if this is somehow related, but JIC mentioning it here. In #2090 we arrived at the following conclusion:
For that reason, we realized that … Therefore, the way I found to profit off of this dual mode is to allow fill-mode to take care of the pre-state deployment/generation (making sure it doesn't run in case it identifies that the state it is about to deploy already exists). LMK what you think @LouisTsai-Csie @fselmo. If this approach doesn't make sense, could you let me know what's the best way to bring all Bloatnet benchmarks into EEST?
I honestly don't have a lot to add here, this looks amazing 🔥. Really elegant approach. I added a lot of words (that's just my nature 😆) but there's really just some minor miscalculations that we should have sanity checks for anyhow. Otherwise this is looking excellent!
Major question I have is whether this will all still work if we rip out the phase manager and leave it for another PR. I think we can... is there a reason to keep it?
"""Provide Pydantic core schema for Bytecode serialization and validation.""" | ||
return no_info_plain_validator_function( | ||
cls, | ||
serialization=to_string_ser_schema(), |
Should this use something else here? I think this uses the opcode name. What is the intended goal with this change and where does this show up?
I ran this test and command as an experiment:
uv run fill -v tests/benchmark/test_worst_compute.py::test_worst_swap -m benchmark --clean --gas-benchmark-values 1 -k "SWAP16"
This is intended for Pydantic model validation (please correct me if I'm wrong). If I remove the `__get_pydantic_core_schema__` function here, I get the following issue:
File "/Users/caijiacheng/Documents/Ethereum/execution-spec-tests/.venv/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py", line 639, in _unknown_type_schema
raise PydanticSchemaGenerationError(
pydantic.errors.PydanticSchemaGenerationError: Unable to generate pydantic-core schema for <class 'ethereum_test_vm.bytecode.Bytecode'>. Set `arbitrary_types_allowed=True` in the model_config to ignore this error or implement `__get_pydantic_core_schema__` on your type to fully support it.
If you got this error by calling handler(<some type>) within `__get_pydantic_core_schema__` then you likely need to call `handler.generate_schema(<some type>)` since we do not call `__get_pydantic_core_schema__` on `<some type>` otherwise to avoid infinite recursion.
Based on my understanding, this happens because `Bytecode` does not follow some Pydantic model rules? I explored different approaches to handle this:
- Add `arbitrary_types_allowed=True` in `model_config`: this does not seem to be a good approach.
- Review other existing type objects, like `HexNumber` and `Bytes` in `base_types.py`, follow their pattern, and add the `__get_pydantic_core_schema__` function.

I am not sure if this is the standard approach for such a scenario; please let me know if there is a better way to do so. I would also like to understand more about how the Pydantic model works here.
Sorry, my fault for not explaining. I meant specifically the use of `to_string_ser_schema()`. This seems like it uses the opcode name (e.g. PUSH1). I think we may want something else here, but I'm not immediately sure when this is used at the moment. I feel like we should use the hex bytes here instead? Thoughts @LouisTsai-Csie.
cc: @spencer-tb (tagging because you wanted to review this PR before merging)
Force-pushed from 1da86fa to 50ec823
Changed the title: `benchmark_test` and `benchmark_state_test` test type → `benchmark_test` test type
blocks=generated_blocks,
)

elif self.blocks is not None:
I just thought about this... but if we define both `blocks` and a `code_generator`, we should probably not allow it, as we will silently end up returning up top with generated blocks. I think we should have a sanity check at the top of this method so that only one path can be set here, either:

- self.code_generator
- self.blocks
- self.tx

That way we raise if any two of these are set, instead of silently letting the tester think it's one or the other when the real order of precedence happens behind the scenes here. Thoughts?
I think a check like this at the top should work well with a good error message:
set_props = [
    name for name, val in [
        ("code_generator", self.code_generator),
        ("blocks", self.blocks),
        ("tx", self.tx),
    ] if val is not None
]
if len(set_props) != 1:
    raise ValueError(
        f"Exactly one must be set, but got {len(set_props)}: {', '.join(set_props)}"
    )
🗒️ Description
As EIP-7825 is introduced in the Fusaka upgrade, most of the legacy test cases would fail. This PR adds two test wrappers, `benchmark_test` and `benchmark_state_test`, to replace the plain `blockchain_test` and `state_test` test types.

🔗 Related Issues or PRs

Issue #1896

✅ Checklist

- Ran the `tox` checks to avoid unnecessary CI fails, see also Code Standards and Enabling Pre-commit Checks: `uvx --with=tox-uv tox -e lint,typecheck,spellcheck,markdownlint`
- The PR title follows the `type(scope):` convention.
- Ran `mkdocs serve` locally and verified the auto-generated docs for new tests in the Test Case Reference are correctly formatted.
- Ported tests are tagged with the `@ported_from` marker.