Skip to content

Add SealingState; don't prepare block when not ready.#103

Merged
vkomenda merged 1 commit into
aura-posfrom
afck-seals-internally
May 8, 2019
Merged

Add SealingState; don't prepare block when not ready.#103
vkomenda merged 1 commit into
aura-posfrom
afck-seals-internally

Conversation

@afck
Copy link
Copy Markdown
Collaborator

@afck afck commented Feb 26, 2019

This prevents the miner from creating new blocks for sealing even if it's not our turn to seal them.

However, it likely does not detect all cases yet where it's not our turn.

(Closes #70.)

@afck
Copy link
Copy Markdown
Collaborator Author

afck commented Feb 26, 2019

In my local test, it worked as expected, and it seemed to prevent most of the cases where we prepared unsealable blocks, but not all.

This looks like a safe change to me, but I wonder why seals_internally wasn't doing this in the first place. Let me know if you see any potential problems with this!

@DemiMarie
Copy link
Copy Markdown

Can we propose this upstream?

@DemiMarie
Copy link
Copy Markdown

I suspect the problem is missing synchronization.

Can we place a global lock around the engine and miner?

@afck
Copy link
Copy Markdown
Collaborator Author

afck commented Feb 27, 2019

Yes, I suspect that, too: Ideally, we do want to start preparing a block, even if our turn has not quite started yet, so that when it starts, we're ready to seal it. So basically we want to prepare one in our predecessor's turn as soon as we've received our predecessor's block, i.e. it shouldn't depend on time at all, but only on the latest imported block. So maybe this PR is the wrong approach after all?

Can we place a global lock around the engine and miner?

Somehow it seems in Parity every struct has all of its fields locked internally. It may be worth a try, but I suspect that global locks around engine and miner would cause them to deadlock, since there are places where (directly or indirectly) they call each other in both directions. 😬

@DemiMarie
Copy link
Copy Markdown

I think so. I suspect that the real answer is that anything we need to do in a block we are sealing, needs to be done in the actual function that seals the block.

@afck
Copy link
Copy Markdown
Collaborator Author

afck commented Feb 27, 2019

anything we need to do in a block we are sealing, needs to be done in the actual function that seals the block.

So you're essentially saying the issue doesn't have a solution, and we always need to prepare the block and try to seal it, like we currently do?

@afck afck force-pushed the afck-seals-internally branch from 36d9a4c to 296a483 Compare February 28, 2019 12:14
@afck
Copy link
Copy Markdown
Collaborator Author

afck commented Feb 28, 2019

I force-pushed a new approach that doesn't use self.step at all. (Not sure if it should.) Instead it only checks whether the block after the latest one is our turn.

Comment thread ethcore/src/engines/authority_round/mod.rs
@afck
Copy link
Copy Markdown
Collaborator Author

afck commented Feb 28, 2019

I also addressed the TODO: In on_prepare_block, we are always the block's author, so we don't have to take our address from the engine signer.

@DemiMarie
Copy link
Copy Markdown

@afck In retrospect, what I am saying is: on_prepare_block must be idempotent.

Copy link
Copy Markdown

@DemiMarie DemiMarie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will happen if we are selected to validate for the first block after genesis? This should be checked.

Otherwise, this code looks good to me.

Comment thread ethcore/src/engines/authority_round/mod.rs
Comment thread ethcore/src/engines/authority_round/mod.rs
};

let parent_step = header_step(&parent, self.empty_steps_transition)
.expect("Header has been verified; qed");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this fail due to an I/O error involving the database?

@phahulin
Copy link
Copy Markdown

phahulin commented Mar 4, 2019

If I'm not mistaken, @DemiMarie 's remark can be extended to the case when at least one of the validators is down and skips blocks - then the network gets stuck because step is not advanced

2019-03-04 19:41:45  IO Worker #2 TRACE engine  Fetched proposer for step 310343469: 0xbbca…4c78
2019-03-04 19:41:45  IO Worker #2 TRACE engine  Not preparing block: not our turn. (Our address: 0x75df…9441)
...
2019-03-04 19:41:50  IO Worker #1 TRACE engine  Fetched proposer for step 310343469: 0xbbca…4c78
2019-03-04 19:41:50  IO Worker #1 TRACE engine  Not preparing block: not our turn. (Our address: 0x75df…9441)
...
2019-03-04 19:41:55  IO Worker #1 TRACE engine  Fetched proposer for step 310343469: 0xbbca…4c78
2019-03-04 19:41:55  IO Worker #1 TRACE engine  Not preparing block: not our turn. (Our address: 0x75df…9441)

@afck
Copy link
Copy Markdown
Collaborator Author

afck commented Mar 4, 2019

then the network gets stuck because step is not advanced

You're right! 😩 I don't have an idea yet how to solve this. But somehow generate_seal gets around that issue, so seals_internally should be able to do that, too…
I'll give it another try tomorrow.

@DemiMarie
Copy link
Copy Markdown

@afck is it in clear_empty_steps()?

@DemiMarie
Copy link
Copy Markdown

Fixed. We were using the wrong step value.

@DemiMarie
Copy link
Copy Markdown

@phahulin can you test this?

@phahulin
Copy link
Copy Markdown

phahulin commented Mar 5, 2019

@DemiMarie couldn't get past the first few (1-2) blocks with all nodes up:

2019-03-05 11:00:20  IO Worker #3 TRACE miner  update_sealing
2019-03-05 11:00:20  IO Worker #3 TRACE miner  requires_reseal: sealing enabled
2019-03-05 11:00:20  IO Worker #3 TRACE engine  Fetched proposer for step 310354564: 0x75df…9441
2019-03-05 11:00:20  IO Worker #3 TRACE miner  requires_reseal: should_disable_sealing=false; forced=true, has_local=false, internal=Some(true), had_requests=false
2019-03-05 11:00:20  IO Worker #3 TRACE miner  update_sealing: preparing a block
2019-03-05 11:00:20  IO Worker #3 TRACE miner  prepare_block: Already have previous work; updating and returning
2019-03-05 11:00:20  IO Worker #3 DEBUG miner  Attempting to push 1 transactions.
2019-03-05 11:00:20  IO Worker #3 DEBUG miner  Adding tx 0xb214dc5e9ee5c97ff243c2832db03a0c5ca414ec545f09406f036b770447e181 took 50 ms
2019-03-05 11:00:20  IO Worker #3 DEBUG miner  Pushed 0 transactions in 51 ms
2019-03-05 11:00:20  IO Worker #3 TRACE reward  Block reward benefactors: [(0x75df42383afe6bf5194aa8fa0e9b3d5f9e869441, Author)]
2019-03-05 11:00:20  IO Worker #3 TRACE engine  Fetched proposer for step 310354564: 0x75df…9441
2019-03-05 11:00:20  IO Worker #3 TRACE miner  update_sealing: engine indicates internal sealing
2019-03-05 11:00:20  IO Worker #3 TRACE miner  seal_block_internally: attempting internal seal.
2019-03-05 11:00:20  IO Worker #3 DEBUG engine  Aborting seal generation. The step or empty_steps have changed in the meantime. 340282366920938463463374607431768211454 != 340282366920938463463374607431768211443

I've seen Aborting seal... step or empty_steps have changed... before, but now it happens every step.

@afck
Copy link
Copy Markdown
Collaborator Author

afck commented Mar 5, 2019

I can confirm that it fails for me, too, both with and without the last commit, if I disable nodes 2 and 5.
With aura-pos, it works.

@afck
Copy link
Copy Markdown
Collaborator Author

afck commented Mar 5, 2019

…but disabling the sealing queue (i.e. explicitly never taking that branch) fixes it for me!
I'll try to find out why we ended up there in the first place; I don't understand when that block was prepared and queued…

@afck afck changed the title Return false from seals_internally if not our turn. WIP: Return false from seals_internally if not our turn. Mar 5, 2019
@afck
Copy link
Copy Markdown
Collaborator Author

afck commented Mar 5, 2019

With this change it works for me.

Maybe Engine needs yet another getter so that we're able to either prevent this prepare_block call or to return None from prepare_block itself if the engine isn't ready for sealing.

@DemiMarie
Copy link
Copy Markdown

@afck It might, but it is probably a good idea to merge it as-is, unless @vkomenda or @phahulin say otherwise.

@phahulin
Copy link
Copy Markdown

phahulin commented Mar 6, 2019

IMO we shouldn't merge it yet, because the issue itself is not a blocker, and it's not yet clear (at least to me), what can we break by disabling the sealing queue.

@afck
Copy link
Copy Markdown
Collaborator Author

afck commented Mar 6, 2019

I agree; let's not merge something where we don't know what we're doing:
At least we should understand why this doesn't work with the sealing queue. Why doesn't the same problem already occur on aura-pos or stable…?

@afck
Copy link
Copy Markdown
Collaborator Author

afck commented Mar 6, 2019

OK, I think aura-pos and stable also don't use the sealing queue!

But I'm not sure yet what the proper solution is.

@afck afck changed the title WIP: Return false from seals_internally if not our turn. Return false from seals_internally if not our turn. Mar 6, 2019
@DemiMarie
Copy link
Copy Markdown

@afck SealingState::Prepare is dead code. Do you plan to change that?

@afck
Copy link
Copy Markdown
Collaborator Author

afck commented Mar 6, 2019

Good point… so maybe we don't need SealingState at all, and just need to change the effect of Some(false) to what's now SealingState::None… I'll have a closer look tomorrow.

@DemiMarie
Copy link
Copy Markdown

DemiMarie commented Mar 6, 2019

I think that we should keep the SealingState enum, since it is much more descriptive than Option<bool>. Perhaps

enum SealingState {
    Ready,
    NotReady,
    External,
}

would be a good fit?

Also, if we never use the sealing queue, we can delete all of its supporting infrastructure.

Copy link
Copy Markdown

@DemiMarie DemiMarie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not include SealingState::Prepare but never construct it. Either there should be a use for it, or it should be removed.

Also, we should panic!() if we hit a branch that is unreachable.

Comment thread ethcore/src/miner/miner.rs Outdated
Comment thread ethcore/src/miner/miner.rs Outdated
@afck
Copy link
Copy Markdown
Collaborator Author

afck commented Mar 7, 2019

Thanks, I renamed the variants as you suggested, and removed Prepare.

I had originally meant Prepare to be what was Some(false) before, but it looks like Some(false) was only returned by some engines if there was no mining address. Since that means they can't ever seal a block anyway, I changed their return value to NotReady.

Copy link
Copy Markdown

@DemiMarie DemiMarie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panic message needs to be updated. Otherwise, looks good.

Comment thread ethcore/src/miner/miner.rs Outdated
@afck afck force-pushed the afck-seals-internally branch from 38fdfc9 to ae424a0 Compare March 7, 2019 13:10
@DemiMarie DemiMarie force-pushed the afck-seals-internally branch from ae424a0 to c6b0193 Compare March 7, 2019 18:52
Copy link
Copy Markdown

@vkomenda vkomenda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Some integration testing might be required as a separate task perhaps.

@afck afck force-pushed the afck-seals-internally branch from c6b0193 to 9549426 Compare April 18, 2019 12:49
@afck afck changed the title Return false from seals_internally if not our turn. Add SealingState; don't prepare block when not ready. Apr 18, 2019
@afck
Copy link
Copy Markdown
Collaborator Author

afck commented Apr 18, 2019

I squashed and rebased.

@afck afck force-pushed the afck-seals-internally branch from 9549426 to 389c520 Compare April 18, 2019 13:35
@varasev
Copy link
Copy Markdown
Member

varasev commented May 3, 2019

@afck Please clarify how this can be tested? If I add a node which is not a validator to posdao-test-setup and launch this, there shouldn't be any warnings in its logs, right? So, if I don't see warnings for such a node, can this PR be considered as successful?

In other words, how can I understand that these changes work fine?

@afck
Copy link
Copy Markdown
Collaborator Author

afck commented May 3, 2019

I think it rather affects nodes that are validators: Without this change, they print messages that they are not ready to seal, because they prepare blocks all the time, even if it's not their turn:

TRACE engine  generate_seal: 0xbbca…4c78 not a proposer for step 309905993

But for some reason they're not even warnings, just traces, even though it seems like a big waste of CPU time.
So that's the part that's easy to test: That those messages are gone.

The hard part is: Could this PR cause any other problems in practice? That's why I'd really like someone to review it who is very familiar with the inner workings of Parity and Aura, and it's why I also made an upstream PR: openethereum#10529
(But nobody has looked at it yet.)

@varasev
Copy link
Copy Markdown
Member

varasev commented May 3, 2019

Ok, I left ping in that PR, hope the guys will help to check the code.

@varasev
Copy link
Copy Markdown
Member

varasev commented May 7, 2019

Since Parity approved the same changes for upstream, I'll try to launch the build based on the afck-seals-internally branch and if it's ok, we will merge it.

@varasev
Copy link
Copy Markdown
Member

varasev commented May 8, 2019

I tried to launch it and don't see the trace message TRACE engine generate_seal: 0xbbca…4c78 not a proposer for step ... anymore. Also, the CPU time seems to be reduced 👍 So, I think we can merge it.

@vkomenda vkomenda merged commit d4d5db2 into aura-pos May 8, 2019
@afck afck deleted the afck-seals-internally branch May 9, 2019 08:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Don't prepare blocks when not a validator.

5 participants