Skip to content

Conversation

@runcom
Copy link
Member

@runcom runcom commented Feb 11, 2019

pkg/daemon: add atomic files writing and remove file system abstraction

This patch makes sure we write files and ssh (file)
atomically so, if and only if, we'll be able to reconcile a failing machine,
we don't end up having old files lying around.

Last thing this patch does is getting rid of the FileSystemClient abstraction.
I've replaced that with normal 1st class function assignements so we don't
regress in testing. We would need to move those tests to e2e anyway.

Closes #101

@cgwalters @kikisdeliveryservice @ashcrow ptal

Signed-off-by: Antonio Murdaca [email protected]

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 11, 2019
@runcom runcom changed the title pkg/daemon: fix files leak pkg/daemon: atomically write files Feb 11, 2019
@runcom runcom changed the title pkg/daemon: atomically write files pkg/daemon: fix files leak Feb 11, 2019
@ashcrow
Copy link
Member

ashcrow commented Feb 11, 2019

I'm cool with replacing the abstraction with functions that can be mocked for testing.

@ashcrow
Copy link
Member

ashcrow commented Feb 11, 2019

/cc @kikisdeliveryservice

@runcom runcom changed the title pkg/daemon: fix files leak pkg/daemon: fix potential files leak Feb 11, 2019
@runcom runcom force-pushed the file-leak-fix branch 3 times, most recently from b039868 to 3eeea91 Compare February 11, 2019 17:19
@cgwalters
Copy link
Member

"fix files leak" is a bit understated of a commit message title 😉

"Remove filesystem mock, fix bugs"? dunno.

Personally I am not a huge fan of the current unit testing we have...I tend to prefer "real" testing and make that as convenient as possible. When the mocking starts to get nontrivial (as is the case not only with the filesystem but also our mock kube client stuff) I think the value proposition gets a lot more uncertain.

OTOH it can be easier to test error conditions via mocking. On the other other hand...a few strategic points to inject errors can go a long way.

I think though I'd say we probably shouldn't do this change right now...the bug isn't worth the code churn? Let's fix the leak in the existing code and keep this around for later?

@cgwalters
Copy link
Member

cgwalters commented Feb 11, 2019

Something that did change of course since the unit testing here was introduced is that we have our own e2e-aws-op where we can test functionality "for real". EDIT: And another thing that changed is clusters install reliably so one can use them for testing.

@runcom
Copy link
Member Author

runcom commented Feb 11, 2019

"fix files leak" is a bit understated of a commit message title

added potential, even if, on any error, we end up leaving files around :)

Personally I am not a huge fan of the current unit testing we have...I tend to prefer "real" testing and make that as convenient as possible. When the mocking starts to get nontrivial (as is the case not only with the filesystem but also our mock kube client stuff) I think the value proposition gets a lot more uncertain.

I belive the abstraction introduced was because there was no e2e, not that we have it, we can indeed test it for real

I think though I'd say we probably shouldn't do this change right now...the bug isn't worth the code churn? Let's fix the leak in the existing code and keep this around for later?

I'm not sure about, this mainly adds atomic file writing to prevent files laying around, the defer file.Close() isn't really doing nothing per-se (I spotted that later, was already adding atomic file write).

@ashcrow
Copy link
Member

ashcrow commented Feb 11, 2019

I think though I'd say we probably shouldn't do this change right now...the bug isn't worth the code churn? Let's fix the leak in the existing code and keep this around for later?

That sounds reasonable to me.

@runcom
Copy link
Member Author

runcom commented Feb 11, 2019

Something that did change of course since the unit testing here was introduced is that we have our own e2e-aws-op where we can test functionality "for real".

this is indeed why I've added Closes #101 as well in the first comment

@runcom
Copy link
Member Author

runcom commented Feb 11, 2019

#402

@runcom runcom changed the title pkg/daemon: fix potential files leak pkg/daemon: add atomic files writing and remove file system abstraction Feb 11, 2019
@runcom runcom force-pushed the file-leak-fix branch 2 times, most recently from 71657cc to 1a089bf Compare February 11, 2019 17:54
@runcom
Copy link
Member Author

runcom commented Feb 11, 2019

/retest

@runcom
Copy link
Member Author

runcom commented Feb 11, 2019

e2e(s) rock 🚀

--- FAIL: TestMCDeployed (308.89s)
	mcd_test.go:146: machine config didn't result in file being on any worker: timed out waiting for the condition

I0211 18:44:43.251341    5348 daemon.go:657] Unable to apply update: rename /tmp/file-write126687226 /etc/mytestconf: invalid cross-device link
E0211 18:44:43.251467    5348 writer.go:91] Marking degraded due to: rename /tmp/file-write126687226 /etc/mytestconf: invalid cross-device link

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 12, 2019
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain what this is for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a lightweight version of the filesystem client, as in, the client existed mainly to be able to unit test the flow as we shouldn't wrap the file management API of Golang. By having a function which deals with writes only, we're able to mock it out in unit test till we have proper e2e in place for these scenarios.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah thanks! I missed the note on line 87.

@runcom
Copy link
Member Author

runcom commented Feb 17, 2019

Unit flake tracked here: #449

/retest

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 19, 2019
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 19, 2019
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2019
@runcom runcom changed the base branch from master to master-4.1 March 6, 2019 09:53
@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 6, 2019
@runcom runcom changed the base branch from master-4.1 to master March 6, 2019 09:54
@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 6, 2019
@runcom
Copy link
Member Author

runcom commented Mar 6, 2019

I'm moving this to master-4.1 once I understand how I can rebase that branch -.-

@runcom
Copy link
Member Author

runcom commented Mar 6, 2019

/retest

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 13, 2019
This patch makes sure we write files
atomically so, if and only if, we'll be able to reconcile a failing machine,
we don't end up having old files lying around.

Last thing this patch does is getting rid of the FileSystemClient abstraction.
I've replaced that with normal 1st class function assignements so we don't
regress in testing. We would need to move those tests to e2e anyway.

Signed-off-by: Antonio Murdaca <[email protected]>
@cgwalters
Copy link
Member

Can you push the rebased commit here?

@runcom
Copy link
Member Author

runcom commented Mar 14, 2019

Can you push the rebased commit here?

just done

@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 14, 2019
@cgwalters
Copy link
Member

/approve

This LGTM.

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@runcom
Copy link
Member Author

runcom commented Mar 14, 2019

/approve

This LGTM.

I guess we can let this go in with #548 as that is based on this now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants