core: Collapse DBIC into HostController #1186

Merged
merged 1 commit into master from kim/collapse-dbic on May 21, 2024
Conversation

@kim (Contributor) commented Apr 30, 2024

Make it so HostController manages both the module host (wasm machinery) and the database (RelationalDB / DatabaseInstanceContext) of spacetime databases deployed to a server.

The DatabaseInstanceContextController (DBIC) is removed in the process.

This makes database accesses panic-safe, in that uncaught panics cause all resources to be released and the database to be restarted on subsequent access. This is a prerequisite for #985.

It also allows moving towards storing the module binary directly in the database / commitlog. This patch, however, makes some contortions in order to not introduce a breaking change just yet.
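
As a rough sketch of the panic-safety idea described above (an illustration only, not the code from this patch: the Host type, the map layout, and the use of spawn_blocking are all assumptions; the actual using_database signature appears in the review thread below):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins: the real controller keys hosts by database
// instance id; `Host` here stands for the wasm machinery plus the database.
struct Host;
type Hosts = Arc<Mutex<HashMap<u64, Arc<Host>>>>;

struct HostController {
    hosts: Hosts,
}

impl HostController {
    /// Run `f` against the host for `instance_id`. If `f` panics, the host
    /// is removed from the map, releasing its resources; the next access
    /// would relaunch it lazily.
    async fn using_database<F, T>(&self, instance_id: u64, f: F) -> anyhow::Result<T>
    where
        F: FnOnce(&Host) -> T + Send + 'static,
        T: Send + 'static,
    {
        // Grab a handle to the host without holding the lock across `f`.
        let host = self
            .hosts
            .lock()
            .unwrap()
            .get(&instance_id)
            .cloned()
            .ok_or_else(|| anyhow::anyhow!("host not launched"))?;
        // A blocking task catches the panic and reports it via the JoinError.
        match tokio::task::spawn_blocking(move || f(&host)).await {
            Ok(value) => Ok(value),
            Err(err) if err.is_panic() => {
                self.hosts.lock().unwrap().remove(&instance_id);
                anyhow::bail!("database access panicked; host released")
            }
            Err(err) => Err(err.into()),
        }
    }
}
```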

Expected complexity level and risk

2.5

Testing

  • All API things should still work without changes.
  • Insert a panic! into the durability crate's background tasks, triggered with some probability (a sketch of such an injection follows after this list).
    Deploy a test module and access it using the CLI (call reducer, get logs, sql).
    Observe that the panic is triggered (in the server logs).
    Run another command and observe that it returns an error.
    Run it again and observe that the database restarts and the command succeeds.
    On a scale from 1-10, rate how confusing the error messages are.
    On a scale from 1-10, rate how confusing it is that the database does not eagerly restart.
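
A minimal sketch of the kind of panic injection meant in the second bullet (the name maybe_panic matches the backtraces later in this thread; the probability constant, the use of rand, and the call sites are assumptions):

```rust
use rand::Rng;

/// Probability with which the injected panic fires (value assumed for testing).
const PANIC_PROBABILITY: f64 = 0.1;

/// Panics with the given probability. Intended to be sprinkled into the
/// durability crate's background tasks (e.g. the flush/sync and persister
/// loops) to exercise the panic-recovery path.
fn maybe_panic(task: &str) {
    let sample: f64 = rand::thread_rng().gen();
    if sample < PANIC_PROBABILITY {
        panic!("{task}: Random panic with probability {PANIC_PROBABILITY} fires on sample {sample}");
    }
}
```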

@kim (Contributor, Author) commented Apr 30, 2024

Please feel free to suggest naming / terminology changes. I felt uninspired so stuck with ecru.

@kim (Contributor, Author) commented Apr 30, 2024

Also, I am unsure if we tracked energy in standalone before, or if I introduced it accidentally.

};
let database_id = database.id;
let database_addr = database.address;
self.control_db.update_database(database.clone())?;
Contributor Author

The logic did not change in this patch, but I'll point out that it is flawed:

If we fail to run the update, the new database will keep running.
There is no obvious solution for that within the current event-driven design, because it is impossible to guarantee that we can restore the previous state.

It will need to be revised in a later patch.

@kim (Contributor, Author) commented Apr 30, 2024

@kazimuth @RReverser It is somewhat mysterious to me how the repeated StandaloneEnv::init with the same database directory could possibly work. Is there something I'm missing, or was it just incidental that it worked?

@gefjon (Contributor) left a comment

Injected panics into a few places in durability::imp::local and published a module. Without RUST_BACKTRACE, saw:

Created new database with domain: chat, address: 93dda09db9a56d8fa6c024d843e805d8
thread 'tokio-runtime-worker' panicked at crates/durability/src/imp/local.rs:180:9:
FlushAndSyncTask::run: Random panic with probability 0.1 fires on sample 0.048514128
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

With RUST_BACKTRACE=1, saw:

thread 'tokio-runtime-worker' panicked at crates/durability/src/imp/local.rs:180:9:
PersisterTask::run: Random panic with probability 0.1 fires on sample 0.016180456
stack backtrace:
Updated database with domain: chat, address: 93dda09db9a56d8fa6c024d843e805d8
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: spacetimedb_durability::imp::local::maybe_panic
   3: spacetimedb_durability::imp::local::PersisterTask<T>::run::{{closure}}::{{closure}}
   4: spacetimedb_durability::imp::local::PersisterTask<T>::run::{{closure}}
   5: tokio::runtime::task::core::Core<T,S>::poll
   6: tokio::runtime::task::harness::Harness<T,S>::poll
   7: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   8: tokio::runtime::scheduler::multi_thread::worker::Context::run
   9: tokio::runtime::context::runtime::enter_runtime
  10: tokio::runtime::scheduler::multi_thread::worker::run
  11: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  12: tokio::runtime::task::core::Core<T,S>::poll
  13: tokio::runtime::task::harness::Harness<T,S>::poll
  14: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

In both cases, the database was inaccessible after the panic.

These messages are far from exceptional, but they're sufficient.

Code looks reasonable. I'm not sure what the best way is to address the test failure with repeated StandaloneEnv::init, but my first thought would be to add a global static STANDALONE_INIT: std::sync::Once and do STANDALONE_INIT.call_once(init_with_global_config).
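
A minimal sketch of that suggestion, assuming a free function init_with_global_config that performs the one-time setup (both names come from the comment above; everything else is an assumption):

```rust
use std::sync::Once;

// Global guard ensuring the standalone environment is initialized at most
// once per process, even if tests construct `StandaloneEnv` repeatedly
// against the same database directory.
static STANDALONE_INIT: Once = Once::new();

fn init_with_global_config() {
    // ... one-time setup: open the database directory, set global config ...
}

fn standalone_env_init() {
    // `call_once` runs the closure on the first call only; later calls are no-ops.
    STANDALONE_INIT.call_once(init_with_global_config);
}
```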

@kim (Contributor, Author) commented Apr 30, 2024

In both cases, the database was inaccessible after the panic.

That sounds wrong; it should come up again (lazily). Will investigate…

Thanks for testing!

@kim force-pushed the kim/collapse-dbic branch 2 times, most recently from 4018d12 to 7ea0b1a on May 2, 2024 09:59
@kim mentioned this pull request May 2, 2024 (2 tasks)
@kim force-pushed the kim/collapse-dbic branch 2 times, most recently from 64e46dc to 7962d00 on May 2, 2024 11:54
@@ -457,15 +457,6 @@ impl ControlDb {
Ok(id)
}

pub fn update_database_instance(&self, database_instance: DatabaseInstance) -> Result<()> {
Contributor Author

Turns out the DatabaseInstance never actually changes, so this is not needed.

@gefjon (Contributor) commented May 2, 2024

In both cases, the database was inaccessible after the panic.

That sounds wrong; it should come up again (lazily). Will investigate…

Disregard. It does in fact come up again; I was just misled by my quickstart-chat being unable to connect due to stale credentials.

@bfops added the release-any (To be landed in any release window) label on May 13, 2024
@kim force-pushed the kim/collapse-dbic branch 2 times, most recently from 1b7e5d2 to 19af629 on May 14, 2024 11:26
/// If the computation `F` panics, the host is removed from this controller,
/// releasing its resources.
#[tracing::instrument(skip_all)]
pub async fn using_database<F, T>(&self, database: Database, instance_id: u64, f: F) -> anyhow::Result<T>
Contributor Author

Less than pleasant, arguably.
Can't think of a better way, though, because relational db is not UnwindSafe.

Collaborator

You could just use AssertUnwindSafe - we don't much care about unwinding anyway

Contributor Author

Maybe, but that would still mean catching unwinds in every single method of RelationalDB. That seems silly.

Contributor

Just out of pure curiosity, what makes RelationalDB not UnwindSafe? Keep in mind I know very little about Rust stack unwinding

Contributor Author

parking_lot locks used in the datastore keep their data in a literal UnsafeCell. Catching a panic means code can continue to run (e.g. destructors), and the compiler cannot prove anything about the memory safety of an UnsafeCell (because that is what it's for :)). At least that's how I understand it.

I believe what Noa is saying is that our goal is to drop the whole Host when there is a panic, so unless we have Drop impls which do silly things, this should be safe.

Thinking about it some more, it might be possible and sufficient to only catch panics from durability.append_tx in RelationalDB::commit_tx. This is the place where we deliberately panic, and once the method has panicked it will continue to panic (to signal that the durability layer is in an irrecoverably broken state). Any other panic (e.g. a bad unwrap) will make the current request panic, but the next one is likely to succeed.

This still requires asserting unwind safety, so I'll try it in a follow-up for easier review.
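
A minimal sketch of that follow-up idea (the Durability and TxData stand-ins and the exact signatures are assumptions; only the "catch the append_tx panic inside commit_tx" shape comes from the comment above):

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Hypothetical stand-ins for the durability layer and transaction data.
struct Durability;
struct TxData;

impl Durability {
    /// Appends a transaction to the commitlog; assumed to panic when the
    /// durability layer is in an irrecoverably broken state.
    fn append_tx(&self, _tx: TxData) {}
}

/// Catch only the panic from `append_tx`, asserting unwind safety, so a
/// broken durability layer surfaces as an error on this commit and every
/// subsequent one instead of unwinding through the caller.
fn commit_tx(durability: &Durability, tx: TxData) -> anyhow::Result<()> {
    catch_unwind(AssertUnwindSafe(|| durability.append_tx(tx)))
        .map_err(|_| anyhow::anyhow!("durability layer is in an irrecoverably broken state"))
}
```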

@cloutiertyler (Contributor) left a comment

This is quite a positive improvement. I also like your naming scheme. I thought a bit and I could not come up with anything better than Host. I think it works.

I left one nit about renaming the field modules: Modules to hosts: Hosts. Could you make that change before merging?

/// Storage inside the [`RelationalDB`] itself, but with a retrieval
/// function chosen by the user.
SameDb(Arc<SameDbStorage>),
}
Contributor

Ah this is cool

pub energy_monitor: Arc<dyn EnergyMonitor>,
/// Map of all modules managed by this controller,
/// keyed by database instance id.
modules: Modules,
Contributor

I would rename this to hosts: Hosts for consistency


@Centril (Contributor) left a comment

Minor stuff

@@ -26,11 +24,9 @@ pub mod util;
///
/// Types returned here should be considered internal state and **never** be
/// surfaced to the API.
#[async_trait]
pub trait NodeDelegate: Send + Sync {
Contributor

cc @lcodes, relevant changes wrt. your PR.

Comment on lines +91 to +94
let module = host
.get_or_launch_module_host(database, instance_id)
.await
.map_err(log_and_500)?;
Contributor

Saw this 5x; can this be refactored to its own function over host, db, inst_id?

Contributor Author

At least I have reduced the line count of the 4 occurrences by ~50% each :P

I think it is a courtesy to #1147 not to make those lines conflict in a non-obvious way.

The whole routes module tree could certainly benefit from a grand cleanup, though.

})
.collect::<Vec<_>>()
});
let json = worker_ctx
Contributor

cc @lcodes

/// The registry of all running modules.
type Modules = Arc<Mutex<IntMap<u64, HostCell>>>;

type ExternalStorage = dyn Fn(&Hash) -> anyhow::Result<Option<AnyBytes>> + Send + Sync + 'static;
Contributor

Could you type alias anyhow::Result<Option<AnyBytes>>?

Contributor Author

To what?

Contributor

E.g., type StorageResult = anyhow::Result<Option<AnyBytes>>; or some such.

Contributor Author

Maybe it's just me, but I find it common that the idiom Result<Option<T>> denotes "may error, or not found". Obscuring that for the sake of fewer angle brackets doesn't help with recognizing the pattern, imho, so I'd ask for a more precise name. Not sure I could be convinced to name everything sentence-like, like MaybeErrorOrNotFound.
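
For reference, an illustration of the shape under discussion (lookup_program and the in-memory store are made up; only the Result<Option<_>> shape mirrors the ExternalStorage alias above):

```rust
use std::collections::HashMap;

/// The `Result<Option<T>>` idiom: the outer `Result` means "the lookup itself
/// can fail", the inner `Option` means "the lookup ran but found nothing".
fn lookup_program(store: &HashMap<u64, Vec<u8>>, hash: u64) -> anyhow::Result<Option<Vec<u8>>> {
    // A real backend might hit disk or the network and return Err;
    // this in-memory stand-in never errors.
    Ok(store.get(&hash).cloned())
}
```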

@coolreader18 (Collaborator)

just fyi, I've got a version of #1147 ready for when this merges.

@kim added this pull request to the merge queue May 21, 2024
Merged via the queue into master with commit 2de1475 on May 21, 2024
6 checks passed
@kim deleted the kim/collapse-dbic branch May 21, 2024 17:49
Labels
release-any To be landed in any release window
6 participants