Conversation

@wking commented Jan 15, 2019

From @dgoodwin:

The two use cases were (1) service delivery will start receiving telemetry for the cluster while it's installing, but they have no knowledge of the UUID, which is a problem for them, and (2) if Hive fails to upload that UUID after install, we have an orphaned cluster that can't be cleaned up automatically. Writing the metadata.json as an asset is a perfect solution: we can upload once ready, and if it fails, no harm done, we'll just keep retrying.
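
A minimal sketch of what such an asset can look like, assuming the installer's asset.Asset and asset.WritableAsset interfaces of this era; the type name, wiring, and placeholder values are illustrative, not the merged code:

package metadata

import (
    "encoding/json"

    "github.com/openshift/installer/pkg/asset"
)

// Metadata writes metadata.json during asset generation, so the cluster
// UUID is on disk before the cluster exists. Hive can upload it once
// ready; if the upload fails, nothing is lost, and it just retries.
type Metadata struct {
    file *asset.File
}

// Name implements asset.Asset.
func (m *Metadata) Name() string { return "Metadata" }

// Dependencies would name the cluster-ID and install-config assets.
func (m *Metadata) Dependencies() []asset.Asset { return nil }

// Generate marshals the metadata; a real asset pulls values from parents.
func (m *Metadata) Generate(parents asset.Parents) error {
    data, err := json.Marshal(map[string]string{
        "clusterName": "example-cluster", // illustrative placeholder
        "clusterID":   "example-uuid",    // the UUID Hive needs early
    })
    if err != nil {
        return err
    }
    m.file = &asset.File{Filename: "metadata.json", Data: data}
    return nil
}

// Files implements asset.WritableAsset.
func (m *Metadata) Files() []*asset.File {
    if m.file == nil {
        return nil
    }
    return []*asset.File{m.file}
}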

/hold

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 15, 2019
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 15, 2019
@dgoodwin (Contributor)

Looks like it will solve the problem nicely, thanks. If I can lend a hand in requesting the exception, let me know.

@abhinavdahiya (Contributor)

With metadata being a separate asset, we need this to be part of the create cluster target too, right?

@wking commented Jan 15, 2019

With metadata being a separate asset, we need this to be part of the create cluster target too, right?

Should be fixed with b09f8c8 -> 76e67a4.
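
For readers following along, the change amounts to wiring the new asset into the cluster target's asset list as well. A hedged sketch with hypothetical names (the real target definitions live in cmd/openshift-install/create.go and may be shaped differently):

// Hypothetical wiring, not the actual diff in b09f8c8 -> 76e67a4.
var clusterTarget = target{
    name: "cluster",
    assets: []asset.WritableAsset{
        &kubeconfig.Admin{},           // existing cluster-target assets...
        &cluster.TerraformVariables{},
        &cluster.Metadata{},           // new: metadata.json is written here too
        &cluster.Cluster{},
    },
}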

@wking wking added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 15, 2019
@staebler (Contributor) left a comment

Some changes are needed to support the None platform.

@dgoodwin commented Jan 16, 2019

Looks like this crashes when you move on to "create cluster":

level=debug msg="Reusing previously-fetched \"Metadata\""
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xf0e9e5]                                                                                                                            
goroutine 1 [running]:
github.com/openshift/installer/pkg/asset.PersistToFile(0x5151a40, 0x8556248, 0x7fffcef0d631, 0x7, 0x0, 0x0)                                                                                       
        /go/src/github.com/openshift/installer/pkg/asset/asset.go:50 +0xb5                                                                                                                        
main.runTargetCmd.func1(0x7fffcef0d631, 0x7, 0xc42092eb80, 0xc420891c00)                                                                                                                          
        /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:157 +0x194
main.runTargetCmd.func2(0x8533be0, 0xc4202699c0, 0x0, 0x4)
        /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:177 +0x81
github.com/openshift/installer/vendor/github.com/spf13/cobra.(*Command).execute(0x8533be0, 0xc420269980, 0x4, 0x4, 0x8533be0, 0xc420269980)                                                       
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:766 +0x2c1                                                                                                
github.com/openshift/installer/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc4206bd180, 0x0, 0xc4207d6500, 0xc4206bd2d0)                                                                   
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:852 +0x30a                                                                                                
github.com/openshift/installer/vendor/github.com/spf13/cobra.(*Command).Execute(0xc4206bd180, 0xc420891ec8, 0x1)                                                                                  
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:800 +0x2b
main.installerMain()
        /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:50 +0x1ba
main.main()
        /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:34 +0x39

From Devan Goodwin [1]:

  The two use cases were (1) service delivery will start receiving
  telemetry for the cluster while it's installing, but they have no
  knowledge of the UUID which is a problem for them, and (2) if Hive
  fails to upload that UUID after install we have an orphaned cluster
  that can't be cleaned up automatically.  Writing the metadata.json
  as an asset is a perfect solution, we can upload once ready and if
  it fails, no harm done, we'll just keep retrying.

Matthew recommended the no-op load [2]:

  My suggestion is that, for now, Load should return false always.
  The installer will ignore any changes to metadata.json.  In the
  future, perhaps we should introduce a read-only asset that would
  cause the installer to warn (or fail) in the face of changes.

[1]: openshift#1057 (comment)
[2]: openshift#1070 (comment)
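
Matthew's suggestion reduces to a one-line method. A minimal sketch against the WritableAsset interface, where FileFetcher is the installer's on-disk state reader:

// Load always reports "not found", so the installer regenerates the
// asset and ignores any user edits to metadata.json on disk. A future
// read-only asset kind could warn (or fail) on changes instead.
func (m *Metadata) Load(f asset.FileFetcher) (found bool, err error) {
    return false, nil
}
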
@wking commented Jan 16, 2019

panic: runtime error: invalid memory address or nil pointer dereference

I think I fixed this with 76e67a4 -> a9afdc9. At least, I can no longer reproduce your panic. Can you check to confirm?
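
The trace points at pkg/asset.PersistToFile dereferencing a nil pointer while persisting the metadata asset. Purely as an illustration of the class of fix (the real change is whatever landed in a9afdc9), a defensive guard there might look like:

package asset

import (
    "io/ioutil"
    "os"
    "path/filepath"
)

// Hypothetical guard, not the actual a9afdc9 diff: skipping nil file
// entries keeps a partially-populated asset from panicking the writer.
func PersistToFile(asset WritableAsset, directory string) error {
    for _, f := range asset.Files() {
        if f == nil {
            continue // nothing was generated for this slot
        }
        path := filepath.Join(directory, f.Filename)
        if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
            return err
        }
        if err := ioutil.WriteFile(path, f.Data, 0644); err != nil {
            return err
        }
    }
    return nil
}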

@aaronlevy (Contributor)

Approved from the perspective of feature freeze (this causes a significant bug for Hive). I'll let others on the team do code approval.

@dgoodwin (Contributor)

Panic is fixed, thx!

@wking commented Jan 16, 2019

e2e-aws:

level=error msg="1 error occurred:"
level=error msg="\t* module.vpc.aws_lb.api_external: 1 error occurred:"
level=error msg="\t* aws_lb.api_external: timeout while waiting for state to become 'active' (last state: 'provisioning', timeout: 10m0s)"

There's suspicion that these failures are due to openshift/cluster-ingress-operator#105, leftovers from other clusters in the account.

/retest

@abhinavdahiya (Contributor)

/hold cancel

Approved from the perspective of feature freeze (this causes a significant bug for Hive). I'll let others on the team do code approval.

Deferring to @staebler for /lgtm.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2019
@wking commented Jan 17, 2019

e2e-aws:

level=error msg="Error: Error applying plan:"
level=error
level=error msg="1 error occurred:"
level=error msg="\t* module.vpc.aws_route_table_association.route_net[2]: 1 error occurred:"
level=error msg="\t* aws_route_table_association.route_net.2: timeout while waiting for state to become 'success' (timeout: 5m0s)"

/retest

@wking commented Jan 17, 2019

e2e-aws:

    expected pod "pod-subpath-test-hostpath-cwdd" success: pod "pod-subpath-test-hostpath-cwdd" failed with status: {Phase:Failed Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-01-17 05:39:12 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-01-17 05:38:59 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [test-container-subpath-hostpath-cwdd test-container-volume-hostpath-cwdd]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:0001-01-01 00:00:00 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [test-container-subpath-hostpath-cwdd test-container-volume-hostpath-cwdd]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-01-17 05:38:59 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.151.63 PodIP:10.128.2.12 StartTime:2019-01-17 05:38:59 +0000 UTC InitContainerStatuses:[{Name:init-volume-hostpath-cwdd State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2019-01-17 05:39:10 +0000 UTC,FinishedAt:2019-01-17 05:39:10 +0000 UTC,ContainerID:cri-o://482e7aef17d5c14162207495faade7cb317817d62209d7e0ce5d4899a6efe9a6,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:docker.io/library/busybox:latest ImageID:docker.io/library/busybox@sha256:bbb143159af9eabdf45511fd5aab4fd2475d4c0e7fd4a5e154b98e838488e510 ContainerID:cri-o://482e7aef17d5c14162207495faade7cb317817d62209d7e0ce5d4899a6efe9a6}] ContainerStatuses:[{Name:test-container-subpath-hostpath-cwdd State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,Message:,StartedAt:2019-01-17 05:39:16 +0000 UTC,FinishedAt:2019-01-17 05:39:26 +0000 UTC,ContainerID:cri-o://73c308be4d4c317f9ca4099c5e0ec65d39b2d821b4a75ce16cf6101b8a7b0e30,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:gcr.io/kubernetes-e2e-test-images/mounttest-amd64:1.0 ImageID:gcr.io/kubernetes-e2e-test-images/mounttest-amd64@sha256:e3e75014e6df02dc21e6fb95f93b989a2ff8a91f36ae88d74eccbabaa21fc211 ContainerID:cri-o://73c308be4d4c317f9ca4099c5e0ec65d39b2d821b4a75ce16cf6101b8a7b0e30} {Name:test-container-volume-hostpath-cwdd State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2019-01-17 05:39:24 +0000 UTC,FinishedAt:2019-01-17 05:39:24 +0000 UTC,ContainerID:cri-o://6899ca1973f793b17a373449cf3cec5c1c528e7cd7d825e881571138cc003254,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:gcr.io/kubernetes-e2e-test-images/mounttest-amd64:1.0 ImageID:gcr.io/kubernetes-e2e-test-images/mounttest-amd64@sha256:e3e75014e6df02dc21e6fb95f93b989a2ff8a91f36ae88d74eccbabaa21fc211 ContainerID:cri-o://6899ca1973f793b17a373449cf3cec5c1c528e7cd7d825e881571138cc003254}] QOSClass:BestEffort}
not to have occurred

Jan 17 05:38:40.287 I ns=openshift-operator-lifecycle-manager pod=packageserver-fdb989b6b-tc72b Successfully pulled image "registry.svc.ci.openshift.org/ci-op-q432vnpn/stable@sha256:aee3a3f6325e61597f98d63da4bfb047e248f777f8d87c780073e093dee04d18" count(1)
Jan 17 05:38:40.588 I ns=openshift-operator-lifecycle-manager pod=packageserver-fdb989b6b-tc72b Created container count(1)
Jan 17 05:38:40.588 I ns=openshift-operator-lifecycle-manager pod=packageserver-fdb989b6b-tc72b Started container count(1)

failed: (1m13s) 2019-01-17T05:39:51 "[sig-storage] Subpath [Volume type: hostPath] should support readOnly directory specified in the volumeMount [Suite:openshift/conformance/parallel] [Suite:k8s]"
...
Flaky tests:

[Feature:DeploymentConfig] deploymentconfigs with custom deployments [Conformance] should run the custom deployment steps [Suite:openshift/conformance/parallel/minimal]
[Feature:DeploymentConfig] deploymentconfigs with test deployments [Conformance] should run a deployment to completion and then scale to zero [Suite:openshift/conformance/parallel/minimal]
[sig-storage] Subpath [Volume type: hostPath] should support readOnly directory specified in the volumeMount [Suite:openshift/conformance/parallel] [Suite:k8s]

Failing tests:

[Feature:Builds] build with empty source  started build should build even with an empty source in build config [Suite:openshift/conformance/parallel]

/retest

installConfig := &installconfig.InstallConfig{}
parents.Get(clusterID, installConfig)

if installConfig.Config.Platform.None != nil {

@staebler (Contributor), on the hunk above:

With these changes, metadata.json has become information and not just a file used by the destroy command. I would rather see consistency, where the file is generated even when the platform is None. With that said, I can live with the changes as they are, and we can address what to do about the None platform in later releases.
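
The "information, not just a destroy input" point is easier to see from the file's shape. A hedged sketch, with field names that are assumptions rather than verified against this PR: a platform-agnostic core plus optional platform sections, under which None could still emit the name and UUID:

// Illustrative shape only. The platform-agnostic fields are what
// service delivery and Hive need; the platform sections are what the
// destroy command needs, and could legitimately be absent for None.
type ClusterMetadata struct {
    ClusterName string `json:"clusterName"`
    ClusterID   string `json:"clusterID"` // the UUID Hive uploads

    AWS     *AWSMetadata     `json:"aws,omitempty"`
    Libvirt *LibvirtMetadata `json:"libvirt,omitempty"`
}

type AWSMetadata struct {
    Region     string              `json:"region"`
    Identifier []map[string]string `json:"identifier"` // tags destroy filters on
}

type LibvirtMetadata struct {
    URI string `json:"uri"`
}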

@staebler (Contributor)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 17, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: staebler, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking commented Jan 17, 2019

e2e-aws:

Failing tests:

The bootstrap user should successfully login with password decoded from kubeadmin secret [Suite:openshift/conformance/parallel]
[Area:Networking] NetworkPolicy when using a plugin that implements NetworkPolicy should enforce multiple, stacked policies with overlapping podSelectors [Feature:OSNetworkPolicy] [Suite:openshift/conformance/parallel]
[Area:Networking] NetworkPolicy when using a plugin that implements NetworkPolicy should enforce policy based on NamespaceSelector [Feature:OSNetworkPolicy] [Suite:openshift/conformance/parallel]
[Area:Networking] NetworkPolicy when using a plugin that implements NetworkPolicy should enforce policy based on NamespaceSelector and PodSelector [Feature:OSNetworkPolicy] [Suite:openshift/conformance/parallel]
[Area:Networking] NetworkPolicy when using a plugin that implements NetworkPolicy should enforce policy based on PodSelector [Feature:OSNetworkPolicy] [Suite:openshift/conformance/parallel]
...

and many, many more. I think something is busted in CI, so I'm not going to kick this again yet.

@wking commented Jan 17, 2019

Actually, I must just be misreading that summary, because:

passed: (53.9s) 2019-01-17T18:25:15 "[Area:Networking] NetworkPolicy when using a plugin that implements NetworkPolicy should enforce multiple, stacked policies with overlapping podSelectors [Feature:OSNetworkPolicy] [Suite:openshift/conformance/parallel]"

/retest

@wking commented Jan 17, 2019

e2e-aws:

fail [github.com/openshift/origin/test/extended/deployments/deployments.go:391]: Expected
    <string>: --> pre: Running hook pod ...
    test pre hook executed
    --> pre: Success
    --> Scaling up deployment-test-3 from 0 to 1, scaling down deployment-test-2 from 0 to 0 (keep 1 pods available, don't exceed 2 pods)
        Scaling deployment-test-3 up to 1
to contain substring
    <string>: --> Success

...
failed: (2m45s) 2019-01-17T21:06:54 "[Feature:DeploymentConfig] deploymentconfigs with test deployments [Conformance] should run a deployment to completion and then scale to zero [Suite:openshift/conformance/parallel/minimal]"

and more, although that one has been killing us in CI recently.

/retest

@wking commented Jan 18, 2019

e2e-aws:

fail [k8s.io/kubernetes/test/e2e/kubectl/portforward.go:515]: Jan 17 23:27:44.956: Missing "^Accepted client connection$" from log: 

...

failed: (41.1s) 2019-01-17T23:28:12 "[sig-cli] Kubectl Port forwarding [k8s.io] With a server listening on 0.0.0.0 should support forwarding over websockets [Suite:openshift/conformance/parallel] [Suite:k8s]"

and more.

/retest

@wking commented Jan 18, 2019

e2e-aws:

fail [k8s.io/kubernetes/test/e2e/storage/persistent_volumes-local.go:1257]: Expected error:
    <*errors.errorString | 0xc420a73bc0>: {
        s: "failed running \"mkdir -p /tmp/local-volume-test-ea3bbe98-1ae6-11e9-9337-0a58ac1064d6 && dd if=/dev/zero of=/tmp/local-volume-test-ea3bbe98-1ae6-11e9-9337-0a58ac1064d6/file bs=512 count=20480 && E2E_LOOP_DEV=$(sudo losetup -f) && echo ${E2E_LOOP_DEV} && sudo losetup ${E2E_LOOP_DEV} /tmp/local-volume-test-ea3bbe98-1ae6-11e9-9337-0a58ac1064d6/file\": <nil> (exit code 1)",
    }
    failed running "mkdir -p /tmp/local-volume-test-ea3bbe98-1ae6-11e9-9337-0a58ac1064d6 && dd if=/dev/zero of=/tmp/local-volume-test-ea3bbe98-1ae6-11e9-9337-0a58ac1064d6/file bs=512 count=20480 && E2E_LOOP_DEV=$(sudo losetup -f) && echo ${E2E_LOOP_DEV} && sudo losetup ${E2E_LOOP_DEV} /tmp/local-volume-test-ea3bbe98-1ae6-11e9-9337-0a58ac1064d6/file": <nil> (exit code 1)
not to have occurred

failed: (31.7s) 2019-01-18T06:04:40 "[sig-storage] PersistentVolumes-local  [Volume type: blockfs] Set fsGroup for local volume should set different fsGroup for second pod if first pod is deleted [Suite:openshift/conformance/parallel] [Suite:k8s]"

/retest

@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

@wking commented Jan 19, 2019

e2e-aws:

level=error msg="\t* aws_route53_record.etcd_a_nodes[0]: 1 error occurred:"
level=error msg="\t* aws_route53_record.etcd_a_nodes.0: [ERR]: Error building changeset: timeout while waiting for state to become 'accepted' (timeout: 5m0s)"

/retest

@openshift-merge-robot openshift-merge-robot merged commit 3711aae into openshift:master Jan 19, 2019
@wking wking deleted the metadata branch January 19, 2019 07:26