This repository has been archived by the owner on Sep 12, 2018. It is now read-only.

Implement excision. (#21) #766

Open · wants to merge 18 commits into base: master
Conversation

@ncalexan (Member):

My WIP towards excision.

@grigoryk (Contributor) left a comment:

Speaking of migrating forward, at what point do we start worrying about that? Once we release something for public consumption, once Mentat is used in the wild in a way that matters to us, once some form of syncing is in place (certainly then!), etc.?

}

let target = avs.get(&pair(entids::DB_EXCISE)?)
.and_then(|ars| ars.add.iter().next().cloned())
@grigoryk (Contributor), Jun 26, 2018:

Excising multiple targets would require multiple excisions, right? E.g. there's no way to do something like [{:db/excise [42 43 44]}]?

@ncalexan (Member, author):

> Excising multiple targets would require multiple excisions, right? E.g. there's no way to do something like [{:db/excise [42 43 44]}]?

No, although I see now that I may have implemented this incorrectly. :db/excise is :db/cardinality :db.cardinality/many (which I assume we cargo-culted from Datomic), so the expression you give is accepted by the transactor. I will go add tests to ensure I'm doing reasonable things -- I feel like I might have assumed that the entity corresponds to only one target, which isn't true if this is cardinality many.

@ncalexan (Member, author):

I did implement this incorrectly. To capture things correctly I had to make target be targets and change the underlying materialized view.

let partition = partition_map.partition_for_entid(target)
.ok_or_else(|| DbError::BadExcision("target has no partition".into()))?; // TODO: more details.
// Right now, Mentat only supports `:db.part/{db,user,tx}`, and tests hack in `:db.part/fake`.
if partition == ":db.part/db" || partition == ":db.part/tx" {
@grigoryk (Contributor):

Explicitly testing for partitions that are allowed in excisions seems more future-proof.

@ncalexan (Member, author):

You're right. I'll grow an excision_allowed() flag.

@ncalexan (Member, author):

I've updated to use this.

db/src/db.rs Outdated

// Excisions are transacted as data, so they end up in the transactions table; this
// materializes that view. Excisions are never removed from the transactions (or datoms)
// table so it's not required to include the `added` column, but let's do it so that if we
@grigoryk (Contributor):

Comment about 'added' seems out of date.

@ncalexan (Member, author):

You're right.

db/src/db.rs Outdated
{:db/id "e" :db/excise 300}
]"#);
// This is implementation specific, but it should be deterministic.
assert_matches!(tempids(&report),
@grigoryk (Contributor):

Might be worth it to call out how that e is derived (user partition start + some number of ts...) to make tests easier to modify.

@ncalexan (Member, author):

I copied the marker comment; I think it's clear enough, should we ever wish to adjust things.

}

// TODO: filter by attrs.
let mut stmt: rusqlite::Statement = conn.prepare(format!("WITH ids AS (SELECT d.rowid FROM datoms AS d, excisions AS e WHERE e.status > 0 AND (e.target IS d.e OR (e.target IS d.v AND d.a IS NOT {}))) DELETE FROM datoms WHERE rowid IN ids", entids::DB_EXCISE).as_ref())?;
@grigoryk (Contributor):

What's the reasoning behind splitting excision into two steps? e.g. first delete stuff from datoms, then on a separate call from transactions. It seems backwards to me (but perhaps understanding the reason for the split will help here) - e.g. we first clean out a materialization of our transactions (datoms), and only then the "source of truth". I'd expect a two-step operation to start with the source, and finish at the edges?

@ncalexan (Member, author):

This is a choice, suggested by Richard in #21 (comment).

Excision is very expensive, so we want to be able to amortize the cost over time. The initial implementation won't do much toward amortizing, but I want to put the framework in place for it. The reason to start at the materialization (datoms, the edge) is that immediately after an excision the world presented to queries should change. You could imagine an alternate implementation where the expensive writes to the datoms table are themselves queued, and instead the query engine grows knowledge of "excised entities" and filters them from query results as appropriate. You could even do this with a SQLite view: if you had an index that made it fast to look up whether a particular entity was pending excision, like

CREATE INDEX idx_excisions_pending ON excisions (target) WHERE status IS NOT 0

then you could

CREATE VIEW datoms AS
SELECT * FROM datoms_before_excisions AS datoms
WHERE NOT EXISTS
 (SELECT 1 FROM excisions
  WHERE
  excisions.status IS NOT 0 AND
  ((datoms.e IS excisions.target) OR (datoms.v IS excisions.target AND datoms.value_type_tag = 13)));
That might be a good idea, since most of the time the idx_excisions_pending partial index will be empty and the view will be as performant (I think) as querying the underlying datoms directly.

However, the really expensive part is rewriting all of the transaction log (and, later, rewriting the shared transaction log on a remote service). We want to be able to do that a little at a time; hence the machinery. So, really, we're splitting excision into datoms and transactions, and then splitting transactions into lots of little re-writing steps that can proceed incrementally.

Only time will tell which of these approaches is best, and whether the machinery is worth the effort.

@@ -127,3 +133,108 @@ pub(crate) fn excisions<'schema>(partition_map: &'schema PartitionMap, schema: &
Ok(Some(excisions))
}
}

pub(crate) fn enqueue_pending_excisions(conn: &rusqlite::Connection, schema: &Schema, tx_id: Entid, excisions: ExcisionMap) -> Result<()> {
@grigoryk (Contributor):

While this is called "enqueue", the function actually mutates state beyond just making a note of an excision-to-be-processed. Not in a permanent sense (we can always derive erased datoms if the ensure_no_pending_excisions step doesn't run), but it still does more than it lets on, I think.

@ncalexan (Member, author):

This is a fair comment. Maybe begin_excisions?

@grigoryk (Contributor) left a comment:

You're doing a bunch of stuff with SQL as separate statements, all wrapped in individual implicit transactions.

I expected to see as many of these individual operations wrapped together in a single transaction so that the whole thing is atomic. Is there a reason that's not feasible?

stmt1.execute(&[&entid, &excision.target, &excision.before_tx, &excision.before_tx.unwrap_or(tx_id)])?; // XXX
if let Some(attrs) = excision.attrs {
// println!("attrs {:?}", attrs);
let status = excision.before_tx.unwrap_or(tx_id);
@grigoryk (Contributor):

The column name status seems a bit misleading, and its use lossy. Using it as a "do I need to process anything?" flag means we don't record the beforeT attribute (as we do for other excision parameters, like attributes). It is in the transaction table, yes, but the difference in handling seems inconsistent to me.

The "misleading" bit: status implies "do I have something to do?", but here it's also used to mean "what do I need to do?", and the meaning of the column changes as we step through the steps.

I'm guessing you're optimizing away an extra column, essentially?

@ncalexan (Member, author):

> I'm guessing you're optimizing away an extra column, essentially?

Yes.

I pushed this up before revisiting it with a helpful comment. You deduced most of the assumptions and asked questions about the rest (great!), but I'll update this comment and add a class comment as well.

@ncalexan (Member, author):

> You're doing a bunch of stuff with SQL as separate statements, all wrapped in individual implicit transactions.
>
> I expected to see as many of these individual operations wrapped together in a single transaction so that the whole thing is atomic. Is there a reason that's not feasible?

The layer that makes the whole thing atomic is higher up the chain -- it's the Conn that manages all of this. The Conn also needs to be involved to handle transaction watchers and cache invalidation. That's coming shortly.

This makes `Datoms` own more of its own sorting and display (as
`edn::Value`).  The goal is to make it pleasant to create
`debug::Datoms` instances that don't originate from an SQL query; the
immediate consumer is a `TransactWatcher` that collects witnessed
datoms... into a `debug::Datoms` instance, ready to compare with
`assert_matches!`.
This adds testing machinery and uses it to at least assert the
existing behaviour.  (Whether the existing behaviour is sensible is a
topic for another time.)
Thinking about migrating forward makes my head hurt, so I'm ignoring
it for now.
I've elected to do this by turning the AEV trie into an EAV trie.
This is expensive and not always necessary, but I'm going for
correctness first, performance second.

I intend to follow-up by using the same EAV trie for the schema detail
extraction, which right now is a little ad-hoc.
…x y]}`).

This was a fundamental error in the initial implementation: I didn't realize `:db/excise` was cardinality many.
…ions.

This is extraordinarily expensive, but it's also the only way to
ensure that excised data doesn't remain in the SQLite database files
after excision.  We might want to make this a flag on the connection:
an encrypted database file (i.e., one using SQLCipher) might not need
vacuuming for some use cases.
This is awkward and expensive but an expedient way to support `Store`
instances with active caches.

There are other ways to update active caches: we could drop all caches
entirely and repopulate them, or we could try to associate caches to
impacted entities.  All such things are follow-up work as we solidify
our approach to cache invalidation and try to improve performance.