-
Notifications
You must be signed in to change notification settings - Fork 115
Implement excision. (#21) #766
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking of migrating forward, at what point do we start worrying about that? Once we release something for public consumption, once mentat is used in the wild in a what that matters to us, once some form of syncing is in place (certainly then!), etc..
db/src/excision.rs
Outdated
} | ||
|
||
let target = avs.get(&pair(entids::DB_EXCISE)?) | ||
.and_then(|ars| ars.add.iter().next().cloned()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excising multiple targets would require multiple excisions, right? E.g. there's no way to do something like [{:db/excise [42 43 44]}]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excising multiple targets would require multiple excisions, right? E.g. there's no way to do something like [{:db/excise [42 43 44]}]?
No, although I see now that I may have implemented this incorrectly. :db/excise
is :db/cardinality :db.cardinality/many
(which I assume we culted from Datomic), so the expression you give is accepted by the transactors. I will go add tests to ensure I'm doing reasonable things -- I feel like I might have assumed that the entity corresponds to only one target, which isn't true if this is cardinality many.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did implement this incorrectly. To capture things correctly I had to make target
be targets
and change the underlying materialized view.
db/src/excision.rs
Outdated
let partition = partition_map.partition_for_entid(target) | ||
.ok_or_else(|| DbError::BadExcision("target has no partition".into()))?; // TODO: more details. | ||
// Right now, Mentat only supports `:db.part/{db,user,tx}`, and tests hack in `:db.part/fake`. | ||
if partition == ":db.part/db" || partition == ":db.part/tx" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explicitly testing for partitions that are allowed in excisions seems more future-proof.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I'll grow an excision_allowed()
flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated to use this.
db/src/db.rs
Outdated
|
||
// Excisions are transacted as data, so they end up in the transactions table; this | ||
// materializes that view. Excisions are never removed from the transactions (or datoms) | ||
// table so it's not required to include the `added` column, but let's do it so that if we |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment about 'added' seems out of date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right.
db/src/db.rs
Outdated
{:db/id "e" :db/excise 300} | ||
]"#); | ||
// This is implementation specific, but it should be deterministic. | ||
assert_matches!(tempids(&report), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth it to call out how that e
is derived (user partition start + some number of ts...) to make tests easier to modify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied the marker comment; I think it's clear enough, should we ever wish to adjust things.
db/src/excision.rs
Outdated
} | ||
|
||
// TODO: filter by attrs. | ||
let mut stmt: rusqlite::Statement = conn.prepare(format!("WITH ids AS (SELECT d.rowid FROM datoms AS d, excisions AS e WHERE e.status > 0 AND (e.target IS d.e OR (e.target IS d.v AND d.a IS NOT {}))) DELETE FROM datoms WHERE rowid IN ids", entids::DB_EXCISE).as_ref())?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reasoning behind splitting excision into two steps? e.g. first delete stuff from datoms, then on a separate call from transactions. It seems backwards to me (but perhaps understanding the reason for the split will help here) - e.g. we first clean out a materialization of our transactions (datoms), and only then the "source of truth". I'd expect a two-step operation to start with the source, and finish at the edges?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a choice, suggested by Richard in #21 (comment).
Excision is very expensive, so we want to be able to amortize the cost over time. The initial implementation won't do much toward amortizing, but I want to put the framework in place for it. The reason to start at the materialization (datoms
, the edge) is that immediately after an excision the world presented to queries should change. You could imagine an alternate implementation where the expensive writes to the datoms
table are themselves queued, and instead the query engine grows knowledge of "excised entities" and filters them from query results as appropriate. You could even do this with a SQLite view: if you had an index that made it fast to look up whether a particular entity was pending excision, like
CREATE INDEX idx_excisions_pending ON excisions (target) WHERE status IS NOT 0
then you could
CREATE VIEW datoms AS
SELECT * FROM datoms_before_excisions
WHERE NOT EXISTS
(SELECT 1 FROM excisions
WHERE
excisions.status IS NOT 0 AND
((datoms.e IS excisions.target) OR (datoms.v IS excisions.target AND datoms.value_type_tag = 13))
That might be a good idea, since most of the time the idx_excisions_pending
partial index will be empty and the view will be as performant (I think) as querying the underlying datoms directly.
However, the really expensive part is rewriting all of the transaction log (and, later, rewriting the shared transaction log on a remote service). We want to be able to do that a little at a time; hence the machinery. So, really, we're splitting excision into datoms
and transactions
, and then splitting transactions
into lots of little re-writing steps that can proceed incrementally.
Only time will tell which of these approaches is best, and whether the machinery is worth the effort.
db/src/excision.rs
Outdated
@@ -127,3 +133,108 @@ pub(crate) fn excisions<'schema>(partition_map: &'schema PartitionMap, schema: & | |||
Ok(Some(excisions)) | |||
} | |||
} | |||
|
|||
pub(crate) fn enqueue_pending_excisions(conn: &rusqlite::Connection, schema: &Schema, tx_id: Entid, excisions: ExcisionMap) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this is called "enqueue", the function actually mutates state beyond just making a note of an excision-to-be-processed. Not in a permanent sense (we can always derive erased datoms if the ensure_no_pending_excisions
step doesn't run, but it still does more than it lets on, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fair comment. Maybe begin_excisions
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're doing a bunch of stuff with SQL as separate statements, all wrapped in individual implicit transactions.
I expected to see as many of these individual operations wrapped together in a single transaction so that the whole thing is atomic. Is there a reason that's not feasible?
db/src/excision.rs
Outdated
stmt1.execute(&[&entid, &excision.target, &excision.before_tx, &excision.before_tx.unwrap_or(tx_id)])?; // XXX | ||
if let Some(attrs) = excision.attrs { | ||
// println!("attrs {:?}", attrs); | ||
let status = excision.before_tx.unwrap_or(tx_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The column name status
seems a bit misleading and its use lossy. Its use as a "do i need to process anything" means that we don't record the "beforeT" attribute (like we do for other excision parameters like attributes). It is in the transaction table, yes, but the difference in handling seems inconsistent to me.
The "misleading" bit - status
implies "do i have something to do?", but here its also used to mean "what do i need to do?" at the same time, and the meaning of the column changes as we step through the steps.
I'm guessing you're optimizing away an extra column, essentially?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing you're optimizing away an extra column, essentially?
Yes.
I pushed this up before revisiting this with a helpful comment. You deduced most of the assumptions and asked questions about the rest (great!); but I'll update this comment and add a class comment as well.
The layer that makes the whole thing atomic is higher up the chain -- it's the |
This makes `Datoms` own more of its own sorting and display (as `edn::Value`). The goal is to make it pleasant to create `debug::Datoms` instances that don't originate from an SQL query; the immediate consumer is a `TransactWatcher` that collects witnessed datoms... into a `debug::Datoms` instance, ready to compare with `assert_matches!`.
This adds testing machinery and uses it to at least assert the existing behaviour. (Whether the existing behaviour is sensible is a topic for another time.)
Thinking about migrating forward makes my head hurt, so I'm ignoring it for now.
I've elected to do this by turning the AEV trie into an EAV trie. This is expensive and not always necessary, but I'm going for correctness first, performance second. I intend to follow-up by using the same EAV trie for the schema detail extraction, which right now is a little ad-hoc.
…x y]}`). This was a fundamental error in the initial implementation; I made an error and didn't realize `:db/excise` was cardinality many.
…ions. This is extraordinarily expensive, but it's also the only way to ensure that excised data doesn't remain in the SQLite database files after excision. We might want to make this a flag on the connection: an encrypted database file (i.e., one using SQLCipher) might not need vacuuming for some use cases.
This is awkward and expensive but an expedient way to support `Store` instances with active caches. There are other ways to update active caches: we could drop all caches entirely and repopulate them, or we could try to associate caches to impacted entities. All such things are follow-up work as we solidify our approach to cache invalidation and try to improve performance.
My WIP towards excision.