Copy subgraph data in batches when grafting #2293

Merged: 11 commits from lutter/graft into master on Apr 21, 2021
Conversation

@lutter (Collaborator) commented on Mar 19, 2021:

These changes make the data copy for grafting happen in batches to avoid long-running transactions. It is now also permissible to graft subgraphs across shards.

Note that this PR sits on top of #2268, since it relies on its changes.

@leoyvens (Collaborator) left a comment:

This foreign schema thing is some dark sorcery!

.filter(nsp::nspname.eq(namespace.as_str()))
.count()
.get_result::<i64>(conn)?
> 0)

@leoyvens (Collaborator):

You can do diesel::dsl::exists(nsp::table.filter(...)).

@lutter (Collaborator, author):

Neat! Changed.
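
For illustration, a minimal sketch of what the exists-based check could look like, assuming a table! stand-in named nsp for the pg_namespace catalog view and Diesel 1.x connection handling (the actual mapping in the PR may differ):

use diesel::dsl::exists;
use diesel::pg::PgConnection;
use diesel::prelude::*;
use diesel::QueryResult;

// Illustrative stand-in; the real code maps pg_catalog.pg_namespace
// through Diesel's table! macro.
diesel::table! {
    nsp (nspname) {
        nspname -> Text,
    }
}

// True when a schema with the given name exists, using diesel::dsl::exists
// instead of counting rows and comparing the count to zero.
fn namespace_exists(conn: &PgConnection, namespace: &str) -> QueryResult<bool> {
    diesel::select(exists(nsp::table.filter(nsp::nspname.eq(namespace)))).get_result(conn)
}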

.filter(ft::foreign_table_schema.eq(src.namespace.as_str()))
.count()
.get_result::<i64>(conn)?
> 0;

@leoyvens (Collaborator):

here too.

@lutter (Collaborator, author):

Changed too

.filter(cs::src.eq(self.src.site.id))
.filter(cs::finished_at.is_null())
.count()
.get_result::<i64>(conn)?;

@leoyvens (Collaborator):

Again exists.

@lutter (Collaborator, author):

Changed

out.push_sql(" where vid >= ");
out.push_bind_param::<BigInt, _>(&self.first_vid)?;
out.push_sql(" and vid < ");
out.push_bind_param::<BigInt, _>(&self.last_vid)?;

@leoyvens (Collaborator):

last_vid isn't really the last vid but one more than the last. I think this would be more readable if the condition were vid <= last_vid and last_vid were passed in as the actual last vid.

@lutter (Collaborator, author):

True; I changed that.
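
After the change, the batch is a closed interval on vid, roughly like this (a sketch of the query-builder fragment from the snippet above, with last_vid now holding the actual last vid to copy):

// Inside the query builder: both bounds are now inclusive, so last_vid
// is the actual last vid that this batch copies.
out.push_sql(" where vid >= ");
out.push_bind_param::<BigInt, _>(&self.first_vid)?;
out.push_sql(" and vid <= ");
out.push_bind_param::<BigInt, _>(&self.last_vid)?;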

.load::<MaxVid>(conn)?
.first()
.map(|v| v.max_vid)
.unwrap_or(0);

@leoyvens (Collaborator):

If the table is empty, I believe the target_vid should be -1.

@lutter (Collaborator, author) commented on Mar 25, 2021:

True; that saves an unnecessary attempt to copy vid = 0
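
A sketch of the adjusted fallback, based on the snippet above (max_vid_query stands in for the max(vid) query that loads into the MaxVid helper):

// An empty source table has nothing to copy; defaulting to -1 instead of 0
// lets the copy loop finish without attempting to copy vid = 0.
let target_vid: i64 = max_vid_query
    .load::<MaxVid>(conn)?
    .first()
    .map(|v| v.max_vid)
    .unwrap_or(-1);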

}

fn finished(&self) -> bool {
self.next_vid >= self.target_vid

@leoyvens (Collaborator):

My understanding is that next_vid has not yet been copied, and we want to copy up to and including target_vid, so we're only finished after next_vid > target_vid.

@lutter (Collaborator, author):

Yes, I made changes to that logic yesterday; I went from current_vid to next_vid because I thought it would be clearer, but didn't catch all the places that change affected. Changed.
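
With next_vid meaning the first vid that has not been copied yet and target_vid the last vid that must be copied, the corrected check would look along these lines:

fn finished(&self) -> bool {
    // Copying is only done once next_vid has moved past target_vid;
    // at next_vid == target_vid one more batch is still needed.
    self.next_vid > self.target_vid
}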

self.dst.as_ref(),
&self.src,
self.next_vid,
self.next_vid + self.batch_size,

@leoyvens (Collaborator):

We're not limiting by target_vid here; I think that would be good to make the behavior more predictable.

@lutter (Collaborator, author):

Yes, that's actually a pretty serious bug since we would end up copying data we should not be copying.
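
A sketch of the clamping, using the field names from the snippet above and assuming next_vid, batch_size, and target_vid are all i64 (the exact fix in the PR may differ):

// Never copy past target_vid: clamp the inclusive upper bound of the batch
// and pass last_vid instead of next_vid + batch_size.
let last_vid = (self.next_vid + self.batch_size - 1).min(self.target_vid);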

dst.revert_block(&conn, &dst.site.deployment, block_to_revert)?;
Layout::revert_metadata(&conn, &dst.site.deployment, block_to_revert)?;
info!(logger, "Rewound subgraph to block {}", block.number;
"time_ms" => start.elapsed().as_millis());

@leoyvens (Collaborator):

It'd be good to see more justification for this step in the comment.

@lutter (Collaborator, author):

I've actually been meaning to get rid of this and simply forgot. It was left over from when we copied everything and then got rid of the excess because copying the complex metadata structure up to just some block was too cumbersome. I've removed that in favor of copying just what we need.

@lutter (Collaborator, author):

As it turns out, reverting the data is important since it needs to unclamp versions that are deleted in the future. In theory, CopyEntityBatchQuery could do that during copying, but the SQL to do that becomes pretty unwieldy. Maybe as an optimization in the future.

@leoyvens (Collaborator):

Ooh tricky

  }
  None => None,
};
- store.start_subgraph(logger, site, graft_base)
+ store.start_subgraph(logger, site.clone(), graft_base)?;
+ self.primary_conn()?.copy_finished(site.as_ref())

@leoyvens (Collaborator):

It reads a bit weird to execute this unconditionally even though most subgraphs won't be copied when deployed; maybe this copy_finished call can be moved into start_subgraph?

@lutter (Collaborator, author):

Unfortunately, this can't go into start_subgraph since it manipulates a different database.

@lutter (Collaborator, author) commented on Mar 26, 2021:

Added more commits to address all the review comments.

let start = Instant::now();
let block_to_revert: BlockNumber = (block.number + 1)
.try_into()
.expect("block numbers fit into an i32");
dst.revert_block(&conn, &dst.site.deployment, block_to_revert)?;
Layout::revert_metadata(&conn, &dst.site.deployment, block_to_revert)?;

@leoyvens (Collaborator):

Note to self: Remember this interaction if we ever add support for removing data sources.

lutter force-pushed the lutter/deployment2 branch from 6bb91b5 to 42d1960 on March 26, 2021 at 22:20
lutter force-pushed the lutter/graft branch 3 times, most recently from 405ec3a to 69fd7f3 on March 27, 2021 at 02:12

@lutter (Collaborator, author) commented on Mar 27, 2021:

I made one very small change to this PR: moving active_copies to the public database schema, and adding a cancelled_at column so that ongoing copy processes can be signalled to stop copying. The actual use of that is in #2313.

lutter force-pushed the lutter/deployment2 branch from 42d1960 to ba77f20 on March 29, 2021 at 17:53
lutter force-pushed the lutter/deployment2 branch from ba77f20 to 46b19f4 on March 30, 2021 at 23:58
lutter changed the base branch from lutter/deployment2 to master on March 31, 2021 at 00:00
lutter force-pushed the lutter/graft branch 2 times, most recently from 0533d5a to 4ff0f40 on April 1, 2021 at 02:15

@lutter (Collaborator, author) commented on Apr 1, 2021:

Fixed a mistake in the copy_table_state table where next_vid and target_vid were declared as int instead of int8.

lutter force-pushed the lutter/graft branch 3 times, most recently from 70062ae to d20df11 on April 7, 2021 at 22:40
lutter merged commit 674efa6 into master on Apr 21, 2021
lutter deleted the lutter/graft branch on April 21, 2021 at 22:27