core: Collapse DBIC into HostController #1186
Conversation
Please feel free to suggest naming / terminology changes. I felt uninspired, so I stuck with ecru.

Also, I am unsure if we tracked energy in standalone before, or if I introduced it accidentally.
crates/standalone/src/lib.rs
```rust
};
let database_id = database.id;
let database_addr = database.address;
self.control_db.update_database(database.clone())?;
```
The logic did not change in this patch, but I'll point out that it is flawed:
If we fail to run the update, the new database will keep running.
There is no obvious solution for that within the current event-driven design, because it is impossible to guarantee that we can restore the previous state.
It will need to be revised in a later patch.
@kazimuth @RReverser It is somewhat mysterious to me how the repeated …
Injected panics into a few places in `durability::imp::local` and published a module. Without `RUST_BACKTRACE`, saw:
```
Created new database with domain: chat, address: 93dda09db9a56d8fa6c024d843e805d8
thread 'tokio-runtime-worker' panicked at crates/durability/src/imp/local.rs:180:9:
FlushAndSyncTask::run: Random panic with probability 0.1 fires on sample 0.048514128
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
With `RUST_BACKTRACE=1`, saw:
```
thread 'tokio-runtime-worker' panicked at crates/durability/src/imp/local.rs:180:9:
PersisterTask::run: Random panic with probability 0.1 fires on sample 0.016180456
stack backtrace:
Updated database with domain: chat, address: 93dda09db9a56d8fa6c024d843e805d8
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: spacetimedb_durability::imp::local::maybe_panic
   3: spacetimedb_durability::imp::local::PersisterTask<T>::run::{{closure}}::{{closure}}
   4: spacetimedb_durability::imp::local::PersisterTask<T>::run::{{closure}}
   5: tokio::runtime::task::core::Core<T,S>::poll
   6: tokio::runtime::task::harness::Harness<T,S>::poll
   7: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   8: tokio::runtime::scheduler::multi_thread::worker::Context::run
   9: tokio::runtime::context::runtime::enter_runtime
  10: tokio::runtime::scheduler::multi_thread::worker::run
  11: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  12: tokio::runtime::task::core::Core<T,S>::poll
  13: tokio::runtime::task::harness::Harness<T,S>::poll
  14: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
In both cases, the database was inaccessible after the panic.
These messages are far from ideal, but they're sufficient.
Code looks reasonable. I'm not sure what the best way to address the test failure with repeated `StandaloneEnv::init` is, but my first thought would be to add a global `static STANDALONE_INIT: std::sync::Once` and do `STANDALONE_INIT.call_once(init_with_global_config)`.
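A minimal sketch of that suggestion, for illustration only (the `Once` pattern is from the standard library; `init_with_global_config` and the surrounding test setup are assumptions carried over from the comment above, not code from this PR):

```rust
use std::sync::Once;

static STANDALONE_INIT: Once = Once::new();

// Hypothetical one-time setup, e.g. installing a global config for tests.
fn init_with_global_config() {
    // ... initialize global state exactly once ...
}

fn standalone_env_init() {
    // The first caller runs `init_with_global_config`; concurrent callers
    // block until it completes, and all later callers skip it entirely.
    STANDALONE_INIT.call_once(init_with_global_config);
}
```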
That sounds wrong, it should come up again (lazily). Will investigate… Thanks for testing!
Force-pushed from 4018d12 to 7ea0b1a
Force-pushed from 64e46dc to 7962d00
```
@@ -457,15 +457,6 @@ impl ControlDb {
    Ok(id)
}

pub fn update_database_instance(&self, database_instance: DatabaseInstance) -> Result<()> {
```
Turns out the `DatabaseInstance` never actually changes, so this is not needed.
Disregard. It does in fact seem to come up again, I was just mistaking my …
Force-pushed from 1b7e5d2 to 19af629
```rust
/// If the computation `F` panics, the host is removed from this controller,
/// releasing its resources.
#[tracing::instrument(skip_all)]
pub async fn using_database<F, T>(&self, database: Database, instance_id: u64, f: F) -> anyhow::Result<T>
```
Less than pleasant, arguably. Can't think of a better way, though, because relational db is not `UnwindSafe`.
You could just use `AssertUnwindSafe` - we don't much care about unwinding anyway.
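As a rough illustration of that suggestion (a sketch, not code from this PR; `run_guarded` and its error type are made up):

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Run `f` against some shared state, turning a panic into an error
// instead of propagating the unwind.
fn run_guarded<S, T>(state: &S, f: impl FnOnce(&S) -> T) -> Result<T, &'static str> {
    // `&S` is not `UnwindSafe` when `S` has interior mutability, so we
    // assert it ourselves. That is acceptable when the caller discards
    // the state (e.g. drops the whole host) whenever this returns `Err`.
    catch_unwind(AssertUnwindSafe(|| f(state))).map_err(|_| "computation panicked")
}
```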
Maybe, but that would still mean catching unwinds in every single method of relational db. That seems silly.
Just out of pure curiosity, what makes `RelationalDB` not `UnwindSafe`? Keep in mind I know very little about Rust stack unwinding.
`parking_lot` locks used in the datastore keep their data in a literal `UnsafeCell`. Catching a panic means code can continue to run (e.g. destructors), and the compiler cannot prove anything about the memory safety of an `UnsafeCell` (because that is what it's for :)). At least that's how I understand it.
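A toy example of that point (illustrative only, not from the codebase): a shared reference to anything containing an `UnsafeCell` fails the `UnwindSafe` bound.

```rust
use std::cell::UnsafeCell;
use std::panic::UnwindSafe;

// Stand-in for a datastore whose locks keep their data in an
// `UnsafeCell`, as `parking_lot`'s do.
struct Datastore {
    inner: UnsafeCell<u64>,
}

fn assert_unwind_safe<T: UnwindSafe>() {}

fn main() {
    assert_unwind_safe::<Datastore>(); // OK: owning the value is fine
    // assert_unwind_safe::<&Datastore>(); // error: `UnsafeCell<u64>` is
    // `!RefUnwindSafe`, so a shared reference is not `UnwindSafe`
}
```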
I believe what Noa is saying is that our goal is to drop the whole `Host` when there is a panic, so unless we have `Drop` impls which do silly things, this should be safe.
Thinking about it some more, it might be possible and sufficient to only catch a panicking `durability.append_tx` in `RelationalDB::commit_tx`. This is the place where we deliberately panic, and once the method has panicked it will continue to panic (to signal that the durability layer is in an irrecoverably broken state). Any other panic (e.g. a bad unwrap) will make the current request panic, but the next one is likely to succeed. This still requires asserting unwind safety, so I'll try it in a follow-up for easier review.
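A rough sketch of that idea (hypothetical signatures; the real `commit_tx` and `append_tx` differ):

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Guard only the durability append -- the one deliberate, "sticky"
// panic site -- and surface the panic as an error to the caller.
fn commit_tx_sketch(append_tx: impl FnOnce()) -> anyhow::Result<()> {
    catch_unwind(AssertUnwindSafe(append_tx))
        .map_err(|_| anyhow::anyhow!("durability layer is irrecoverably broken"))
}
```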
This is quite a positive improvement. I also like your naming scheme. I thought a bit and I could not come up with anything better than `Host`. I think it works.

I left one nit about renaming `modules: Modules` to `hosts: Hosts`. Could you make that change before merging?
```rust
    /// Storage inside the [`RelationalDB`] itself, but with a retrieval
    /// function chosen by the user.
    SameDb(Arc<SameDbStorage>),
}
```
Ah this is cool
```rust
    pub energy_monitor: Arc<dyn EnergyMonitor>,
    /// Map of all modules managed by this controller,
    /// keyed by database instance id.
    modules: Modules,
```
I would rename this to `hosts: Hosts` for consistency.
Minor stuff
```
@@ -26,11 +24,9 @@ pub mod util;
///
/// Types returned here should be considered internal state and **never** be
/// surfaced to the API.
#[async_trait]
pub trait NodeDelegate: Send + Sync {
```
cc @lcodes, relevant changes wrt. your PR.
```rust
let module = host
    .get_or_launch_module_host(database, instance_id)
    .await
    .map_err(log_and_500)?;
```
Saw this 5x; can this be refactored to its own function over `host, db, inst_id`?
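Something along these lines, perhaps (purely illustrative; the helper's name and error type are made up, while `get_or_launch_module_host` and `log_and_500` are from the PR):

```rust
// Hypothetical helper over (host, db, inst_id); the concrete types are
// assumed from the surrounding routes code rather than spelled out here.
async fn module_host_or_500(
    host: &HostController,
    database: Database,
    instance_id: u64,
) -> Result<ModuleHost, ErrorResponse> {
    host.get_or_launch_module_host(database, instance_id)
        .await
        .map_err(log_and_500)
}
```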
At least I have reduced the line count of the 4 occurrences by ~50% each :P

I think it is a courtesy to #1147 to not make those lines conflict in a non-obvious way. The whole `routes` module tree could certainly benefit from a grand cleanup, tho.
```rust
    })
    .collect::<Vec<_>>()
});
let json = worker_ctx
```
cc @lcodes
```rust
/// The registry of all running modules.
type Modules = Arc<Mutex<IntMap<u64, HostCell>>>;

type ExternalStorage = dyn Fn(&Hash) -> anyhow::Result<Option<AnyBytes>> + Send + Sync + 'static;
```
Could you type alias `anyhow::Result<Option<AnyBytes>>`?
To what?
E.g., `type StorageResult = anyhow::Result<Option<AnyBytes>>;` or some such.
Maybe it's just me, but I find it common that the idiom `Result<Option<T>>` denotes "may error, or not found". Obscuring that for the sake of fewer angle brackets doesn't help with recognizing the pattern imho, so I'll ask for a more precise name. Not sure if I could be convinced to name everything sentence-like, like `MaybeErrorOrNotFound`.
just fyi, I've got a version of #1147 ready for when this merges.
Make it so `HostController` manages both the module host (wasm machinery) and the database (`RelationalDB` / `DatabaseInstanceContext`) of spacetime databases deployed to a server. The `DatabaseInstanceContextController` (DBIC) is removed in the process.

This allows making database accesses panic-safe, in that uncaught panics will cause all resources to be released and the database to be restarted on subsequent access. This is a prerequisite for #985.

It also allows moving towards storage of the module binary directly in the database / commitlog. This patch, however, makes some contortions in order to **not** introduce a breaking change just yet.
Expected complexity level and risk
2.5
Testing

- A `panic!` in the `durability` crate's background tasks, which is triggered with some probability.
- Deploy a test module and access it using the CLI (call reducer, get logs, sql).
- Observe that the panic is triggered (in the server logs).
- Run another command and observe that it returns an error.
- Run it again and observe that the database restarts and the command succeeds.
- On a scale from 1-10, rate how confusing the error messages are.
- On a scale from 1-10, rate how confusing it is that the database does not eagerly restart.