Reduce memory consumption for prune operation #329

Merged: 2 commits into gilbertchen:master on Mar 20, 2018

Conversation

@pdf (Contributor) commented Jan 14, 2018

This is a first pass at fixing prune memory consumption. I'm not as familiar with the codebase as I would like, but I think this is in a state that's okay for review. I couldn't get any of the tests to work locally though, so I'm not sure how to test further except to run against valid data (which I've done for the standard prune, but not for exhaustive/exclusive).

Implements non-caching variants of GetSnapshotChunks/LoadChunks. All existing call sites are migrated to use CacheSnapshotChunks and CacheChunks, which are the renamed (previously existing) caching implementations. This allows individual callers to be updated to use the non-caching version where that is deemed appropriate. This is a breaking change for any third-party tooling, as the return signature for LoadChunks has changed and GetSnapshotChunks no longer populates snapshot.ChunkHashes.
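To make the split concrete, here is a minimal, self-contained sketch of the two variants using simplified stand-in types; the bodies are illustrative only and do not reproduce the PR's actual implementation.

package main

import "fmt"

// Simplified stand-ins for the real types; illustrative only.
type Snapshot struct {
	ID          string
	ChunkHashes []string
}

type SnapshotManager struct{}

// listChunks stands in for the real storage walk that enumerates the chunks
// referenced by a snapshot (hypothetical helper).
func (m *SnapshotManager) listChunks(s *Snapshot) []string {
	return []string{"chunk-a", "chunk-b"}
}

// GetSnapshotChunks (non-caching variant): returns the chunk hashes without
// populating snapshot.ChunkHashes, so the list is not pinned on the snapshot.
func (m *SnapshotManager) GetSnapshotChunks(s *Snapshot) []string {
	return m.listChunks(s)
}

// CacheSnapshotChunks (renamed caching variant): also stores the hashes on the
// snapshot object for callers that need to revisit them.
func (m *SnapshotManager) CacheSnapshotChunks(s *Snapshot) []string {
	s.ChunkHashes = m.listChunks(s)
	return s.ChunkHashes
}

func main() {
	m := &SnapshotManager{}
	s := &Snapshot{ID: "host-1"}
	fmt.Println(m.GetSnapshotChunks(s), len(s.ChunkHashes))   // [chunk-a chunk-b] 0
	fmt.Println(m.CacheSnapshotChunks(s), len(s.ChunkHashes)) // [chunk-a chunk-b] 2
}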

Updates PruneSnapshots to use the non-caching chunk-fetching functions. For a non-exhaustive prune, it considers only the chunks referenced by the target snapshots instead of mapping all chunks in the repository into memory.
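As an illustration of that change (reusing the simplified types from the sketch above, with hypothetical names), a non-exhaustive prune would only visit the snapshots under consideration and discard each per-snapshot chunk list after use:

// collectReferencedChunks builds the set of chunks still referenced by the
// target snapshots, rather than mapping every chunk in the repository.
func collectReferencedChunks(m *SnapshotManager, targets []*Snapshot) map[string]bool {
	referenced := make(map[string]bool)
	for _, snapshot := range targets {
		// Non-caching call: the per-snapshot slice becomes garbage after this loop body.
		for _, hash := range m.GetSnapshotChunks(snapshot) {
			referenced[hash] = true
		}
	}
	return referenced
}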

Fixes #146

PS - I should also note that there are a lot of unhandled errors in the existing code-base, plus variable shadowing, etc. In the code that I've touched here, I've made sure to at least log errors where I wasn't 100% certain that failing was correct, and fixed lint where found. I'd highly recommend incorporating some linters into your workflow to catch these sorts of things.

@pdf force-pushed the prune_memory branch 4 times, most recently from 2291a28 to f66bb9d on January 14, 2018 02:00
@pdf (Contributor, Author) commented Jan 14, 2018

I apologise for the somewhat hard to follow git diff in the second commit - it comes mostly from refactoring out the actual prune operations into pruneSnapshots and pruneSnapshotsExhaustive, rather than keeping them inside the huge if/else statement (and making that conditional even larger, with the alternative snapshot traversals implemented here for each prune type).

@pdf (Contributor, Author) commented Jan 14, 2018

I've done some further testing and profiling. Looks like mostly by virtue of using the non-caching chunk methods, exhaustive prune memory usage tops out at around 80MB, and standard prune at around 65MB for my dataset. The latter at least is down from gigabytes on master.

If we determine that this code is correct, it's probably worth examining some of the other usages to determine where chunk caching is actually beneficial, as there are likely other places where significant savings could be made just by switching to the non-caching methods.

@gilbertchen (Owner) left a comment

I can make these changes if you want (and if I have the permission to update this PR).

// GetSnapshotChunks returns all chunks referenced by a given snapshot.
func (manager *SnapshotManager) GetSnapshotChunks(snapshot *Snapshot) (chunks []string) {
// CacheSnapshotChunks returns all chunks referenced by a given snapshot.
func (manager *SnapshotManager) CacheSnapshotChunks(snapshot *Snapshot) (chunks []string) {
@gilbertchen (Owner):

I think instead of keeping two versions of GetSnapshotChunks and LoadChunks, it is possible to add an argument to the original GetSnapshotChunks that indicates if snapshot.ChunkHashes needs to be cleared at the end of the call, like:

func (manager *SnapshotManager) GetSnapshotChunks(snapshot *Snapshot, clearChunkHashes bool) (chunks []string) {

The only place where clearChunkHashes should be false is inside ShowStatisticsTabular where GetSnapshotChunks is called twice on all snapshots. In all other places I believe it is safe to set clearChunkHashes to true.
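For comparison, a minimal sketch of this single-function approach (again using the simplified stand-in types from the earlier sketch; the body is illustrative, not the actual duplicacy code):

// GetSnapshotChunks with a clearChunkHashes flag: the hashes are loaded onto the
// snapshot during the call, then the snapshot's reference is dropped at the end
// unless the caller asks to keep it (e.g. ShowStatisticsTabular).
func (m *SnapshotManager) GetSnapshotChunks(s *Snapshot, clearChunkHashes bool) []string {
	s.ChunkHashes = m.listChunks(s) // the real code would call LoadChunks here
	chunks := s.ChunkHashes
	if clearChunkHashes {
		// The caller still holds the returned slice; only the long-lived
		// snapshot object stops pinning it.
		s.ChunkHashes = nil
	}
	return chunks
}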

@pdf (Contributor, Author) commented Jan 16, 2018

I considered both approaches (parameter vs separate methods).

  • The logic for GetSnapshotChunks using a parameter is not too complicated. I'd probably name the parameter something like cacheChunks and reverse the logic, though, since the zero value for bool is false and the default behaviour should arguably be not to cache the chunks in memory.
  • The problem for me is the usage of LoadChunks, since it is called directly in a few places. If it is retained in its original state, storing data on the snapshot object, it becomes the responsibility of every call site to nil that field if it doesn't want to cache the chunks, which is just asking for unexpected memory usage. LoadChunks also doesn't lend itself well to a caching parameter, since the non-caching version needs to return the chunks, while the caching version also has to store them on the snapshot.

If you would prefer not to retain two implementations, one option would be to replace the caching variant of LoadChunks with the implementation in this PR, which will return the chunks without storing them, then store them on the snapshot inside GetSnapshotChunks, based on a parameter. The reason I didn't consider this route is because I wasn't sure how LoadChunks was used in places like DownloadSnapshotSequence, and I didn't immediately have time to work back up the call chain(s) for these other methods.
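A rough sketch of that alternative, again using the simplified types from the earlier sketch and hypothetical bodies:

// LoadChunks returns the chunk hashes without storing them on the snapshot
// (a reshaping of the original caching behaviour, as described above).
func (m *SnapshotManager) LoadChunks(s *Snapshot) []string {
	return m.listChunks(s)
}

// GetSnapshotChunks decides, via a parameter, whether the result should also be
// kept on the snapshot object.
func (m *SnapshotManager) GetSnapshotChunks(s *Snapshot, keepChunkHashes bool) []string {
	chunks := m.LoadChunks(s)
	if keepChunkHashes {
		s.ChunkHashes = chunks
	}
	return chunks
}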

@gilbertchen (Owner):

What I thought was to keep LoadChunks as it is. Other than GetSnapshotChunks, it is only called by DownloadSnapshotSequence, which in turn is called by DownloadSnapshotContents. It should be okay to keep snapshot.ChunkHashes in `DownloadSnapshotContents`, I think.

Feel free to reverse the logic for clearChunkHashes and give it a different name; just avoid using "cache" in the name, as that would create some confusion with the snapshot cache.

logFile.Close()
cerr := logFile.Close()
if cerr != nil {
LOG_ERROR("LOG_FILE", "Could not close log file %s: %v", logFileName, cerr)
@gilbertchen (Owner):

This should be LOG_WARN, because this defer func can be called when a panic occurs and LOG_ERROR would cause a new panic.
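For illustration, the suggested shape of the deferred close (a fragment mirroring the quoted diff, not the final commit):

defer func() {
	// LOG_WARN instead of LOG_ERROR: if this deferred function runs while a
	// panic is already unwinding, LOG_ERROR would raise a second panic.
	if cerr := logFile.Close(); cerr != nil {
		LOG_WARN("LOG_FILE", "Could not close log file %s: %v", logFileName, cerr)
	}
}()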

@pdf (Contributor, Author):

Ack


referencedChunks := make(map[string]bool)
var success bool
@gilbertchen (Owner):

Thank you for splitting this into separate functions! I have been too lazy and too afraid of breaking something. Don't worry about the git diff results. I'm sure a manual diff of the relevant parts should be good.

@pdf (Contributor, Author):

No problem - it made particular sense to do it now, as the two paths now also differ in how they walk the snapshots.

@pdf (Contributor, Author) commented Jan 16, 2018

If you'd prefer to run with this yourself, you're quite welcome to; maintainer edits are enabled on the PR. However, I'm happy to keep working on it based on your feedback, as time allows.

@gilbertchen (Owner):

Ok, please keep working on this. I really appreciate your help!

@pdf (Contributor, Author) commented Jan 19, 2018

Updated based on feedback, ready for review again.

@pdf (Contributor, Author) commented Jan 19, 2018

I need to do some more profiling on this - I'm not certain that this results in the same level of memory reduction as the previous implementation.

EDIT: Seems to be close enough in my tests, considering the ratios we're talking about.

@pdf (Contributor, Author) commented Jan 20, 2018

This code is not sufficient: there are locations that assume stored hashes which are not handled here; I'll take another pass.

pdf added 2 commits on January 21, 2018 10:11:
For non-exhaustive prune, consider only target chunks instead of mapping all chunks in repository.

@pdf (Contributor, Author) commented Jan 21, 2018

Okay, I think it's just *BackupManager.Backup that requires keepChunkHashes enabled, but I've run out of time trying to back-track all the references to *Snapshot.ChunkHashes. It might be safer just to enable keeping chunks everywhere initially, until someone who understands all the code paths better than me can critically review and disable caching for each call to GetSnapshotChunks?

In the meantime, I've enabled keeping chunk hashes only for Backup and ShowStatisticsTabular.
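Concretely, the intended call-site pattern would look something like this fragment (names illustrative; parameter name taken from the discussion):

// Backup and ShowStatisticsTabular revisit the chunk list, so the hashes stay
// on the snapshot:
statsChunks := manager.GetSnapshotChunks(snapshot, true) // keepChunkHashes = true

// Prune and other one-off callers let the snapshot drop its reference:
pruneChunks := manager.GetSnapshotChunks(snapshot, false) // keepChunkHashes = false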

@leftytennis (Contributor):

I'm all for any improvements that can be made, obviously, but pruning is such an important function that I hope this is heavily tested before it's committed to master... :)

Has any testing been done on two different repositories to ensure that prune operations using the current master code and using these modifications produce exactly the same results?

Thanks

@gilbertchen (Owner):

@pdf Thank you for your contribution! This is a good piece of work and may have fixed a previously unknown bug!

@jt70471 I tested on my own repository and made sure they pruned the same sets of chunks.

@gilbertchen merged commit 13fffc2 into gilbertchen:master on Mar 20, 2018
@kairisku:

It seems the emptying of ChunkHashes causes backup statistics to show incorrect data (0 total):

File chunks: 0 total, 19,552M bytes; 102 new, 82,075K bytes, 49,195K bytes uploaded

@gilbertchen (Owner):

The 0 file chunk bug should be fixed by 5d2242d.

Successfully merging this pull request may close these issues: Prune excessive memory consumption.
4 participants