@lihongan lihongan commented Jul 14, 2025

changes:

  • if there are no DNS zones in dns.config, skip the specific tests that need to create/publish DNSRecords (e.g. for a shard ingresscontroller or gateway)
  • update the shard ingresscontroller to use a different domain to avoid overlapping with the default one, *.apps.<baseDomain>

update:
The gcp-user-provisioned-dns cluster sets LoadBalancer.DNSManagementPolicy to UnmanagedLoadBalancerDNS, which is why DNSManaged is false. Having no DNS zones is just one of the conditions that set DNSManaged to false, so it cannot be used to detect this case.

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jul 14, 2025
@openshift-ci-robot

@lihongan: This pull request references Jira Issue OCPBUGS-59176, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci openshift-ci bot requested review from gcs278 and grzpiotrowski July 14, 2025 10:33
@lihongan
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jul 14, 2025
@openshift-ci-robot

@lihongan: This pull request references Jira Issue OCPBUGS-59176, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jul 14, 2025

openshift-ci bot commented Jul 14, 2025

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: lihongan.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

		return false, nil
	}
	if dns.Spec.PublicZone == nil && dns.Spec.PrivateZone == nil {
		noDNSZones = true
Contributor Author

The condition dns.Spec.PublicZone == nil && dns.Spec.PrivateZone == nil cannot be used to detect the GCP user-provisioned DNS cluster. Instead, the cluster sets LoadBalancer.DNSManagementPolicy as shown below:

			if platformStatus.GCP != nil && platformStatus.GCP.CloudLoadBalancerConfig != nil &&
				platformStatus.GCP.CloudLoadBalancerConfig.DNSType == configv1.ClusterHostedDNSType {
				effectiveStrategy.LoadBalancer.DNSManagementPolicy = operatorv1.UnmanagedLoadBalancerDNS
			}

see https://github.com/openshift/cluster-ingress-operator/blob/cbc0b217b655f1f0ce0becc9145c2a6042beabea/pkg/operator/controller/ingress/controller.go#L478-L492

We might need to update the func to check whether the default ingresscontroller's OperatorCondition of Type "DNSManaged" is ConditionFalse.
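
A minimal sketch of such a check, assuming the condition list has already been read from the default ingresscontroller's status. The `operatorCondition` type is a local stand-in for the real OperatorCondition type so the example runs without cluster dependencies, and `dnsManaged` is a hypothetical helper name, not the function used in this PR.

```go
package main

import "fmt"

// operatorCondition is a local stand-in for the operator API's
// OperatorCondition, reduced to the two fields this check needs.
type operatorCondition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// dnsManaged reports whether the DNSManaged condition is "True".
// found is false when the condition is absent, so callers can keep polling
// instead of treating a not-yet-reported condition as unmanaged DNS.
func dnsManaged(conditions []operatorCondition) (managed, found bool) {
	for _, c := range conditions {
		if c.Type == "DNSManaged" {
			return c.Status == "True", true
		}
	}
	return false, false
}

func main() {
	// Conditions as they might look on a cluster with unmanaged load balancer DNS.
	conds := []operatorCondition{
		{Type: "Available", Status: "True"},
		{Type: "DNSManaged", Status: "False"},
	}
	managed, found := dnsManaged(conds)
	fmt.Printf("found=%t managed=%t\n", found, managed)
}
```

Returning `found` separately is a design choice for this sketch: it lets a caller distinguish "condition says False" from "condition not reported yet".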

@lihongan lihongan changed the title OCPBUGS-59176: skip specific tests if no DNS zones OCPBUGS-59176: skip specific tests if DNSManaged is false Jul 15, 2025
@openshift-ci-robot

@lihongan: This pull request references Jira Issue OCPBUGS-59176, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.

openshift-ci bot commented Jul 15, 2025

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: lihongan.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

openshift-trt bot commented Jul 15, 2025

Job Failure Risk Analysis for sha: 49d13c3

Job Name | Failure Risk
--- | ---
pull-ci-openshift-origin-main-e2e-gcp-disruptive | IncompleteTests
Tests for this run (32) are below the historical average (1239): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

@alebedev87
Contributor

/assign

Comment on lines 202 to 204
	if !isDNSManaged {
		g.Skip("Skipping on this cluster since DNSManaged is false")
	}
Contributor

Is the DNSRecord still created if there are no zones in DNSConfig? If so, I think that we should not skip but do a dedicated assertDNSRecord (instead of the current assertDNSRecordStatus). WDYT?

Contributor Author

It looks like the DNSRecord is always created, but its .status might be nil (when dns.Spec.PublicZone == nil && dns.Spec.PrivateZone == nil) or show the Published type as False (when PublicZone == {} or PrivateZone == {}).

Do you mean we should first check dnses.config.openshift.io/v1 and then decide whether to call assertDNSRecord or assertDNSRecordStatus?

Contributor Author

I removed the skip and added new logic to assertDNSRecordStatus so that it does not check the status if a zone in dns.config is nil or empty.

Contributor

I removed the skip and added new logic in assertDNSRecordStatus to not check status if zone of dns.config is nil or empty.

What I meant is that assertDNSRecordStatus should check that the DNSRecord's Published == False and Status is nil if there are no DNS zones in the DNS config, just like you mentioned in the comment before your latest one:

Looks DNSRecord is always created but the .status might be nil (dns.Spec.PublicZone == nil && dns.Spec.PrivateZone == nil) or shows Published type as False (PublicZone == {} or PrivateZone == {}).
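
The two cases quoted above can be sketched as a small decision function. This is an illustration under assumptions, not the logic in this PR: `dnsZone` is a local stand-in for configv1.DNSZone, the function and constant names are hypothetical, and treating "any configured zone is empty" as "Published is False" is a simplification of what may be per-zone status.

```go
package main

import "fmt"

// dnsZone is a local stand-in for configv1.DNSZone.
type dnsZone struct {
	ID   string
	Tags map[string]string
}

// empty reports whether the zone is set but carries no ID and no tags,
// i.e. the "{}" case from the discussion. A nil zone is not "empty".
func (z *dnsZone) empty() bool {
	return z != nil && z.ID == "" && len(z.Tags) == 0
}

// Hypothetical labels for what a status assertion could expect.
const (
	expectNilStatus    = "status is nil"
	expectNotPublished = "Published is False"
	expectPublished    = "Published is True"
)

// expectedRecordState maps the zones in dns.config to the DNSRecord state
// an assertion could expect.
func expectedRecordState(public, private *dnsZone) string {
	switch {
	case public == nil && private == nil:
		return expectNilStatus
	case public.empty() || private.empty():
		return expectNotPublished
	default:
		return expectPublished
	}
}

func main() {
	fmt.Println(expectedRecordState(nil, nil))
	fmt.Println(expectedRecordState(&dnsZone{}, nil))
	fmt.Println(expectedRecordState(&dnsZone{ID: "example-zone"}, nil))
}
```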

Comment on lines +56 to +62
	isDNSManaged, err := isDNSManaged(oc, time.Minute)
	if err != nil {
		e2e.Failf("Failed to get default ingresscontroller DNSManaged status: %v", err)
	}
	if !isDNSManaged {
		g.Skip("Skipping on this cluster since DNSManaged is false")
	}
Contributor

I'm wondering why only GRPC interoperability test should be skipped. We have a lot more tests in extended/router suite, do they all work when there are no DNS zones for the cluster?

Contributor Author

@lihongan lihongan Jul 21, 2025

First, there are three conditions behind the DNSManaged status, and if any of them is unsatisfied the status is set to False; having DNS zones in dns.config is just one of those conditions. See https://github.com/openshift/api/blob/fed56f2794b13fb3691cee40d0b429c2111d0e7f/operator/v1/types_ingress.go#L2060-L2065

Second, in the "e2e-gcp-user-provisioned-dns" job we see only the 4 failing tests, and the cluster sets LoadBalancer.DNSManagementPolicy = UnmanagedLoadBalancerDNS rather than relying on DNS zones. See https://github.com/openshift/cluster-ingress-operator/blob/cbc0b217b655f1f0ce0becc9145c2a6042beabea/pkg/operator/controller/ingress/controller.go#L478-L492

Third, I believe most tests in the extended/router suite only exercise the default ingresscontroller, with requests going through the default router pod, whereas the GRPC/http2 tests create a separate shard ingresscontroller to test the feature and rely on cloud DNS to resolve the shard ingresscontroller's domain.

Contributor

@alebedev87, @lihongan: just checking to see what the resolution of this is. We need this fix so that the "e2e-gcp-user-provisioned-dns" periodic jobs are not red because of these tests, which would force this feature to be reverted.

Contributor

Are we planning to make this a permanent skip? I don't agree with that.

Contributor Author

@lihongan lihongan Aug 18, 2025

As discussed with Andrey, I plan to amend the logic for sending HTTP requests so that they target the load balancer's IP address when DNS is not managed. With this trick we don't need to skip the tests.

openshift-ci bot commented Jul 31, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lihongan
Once this PR has been reviewed and has the lgtm label, please ask for approval from alebedev87. For more information see the Code Review Process.

openshift-ci bot commented Jul 31, 2025

@lihongan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/e2e-aws-ovn-serial-publicnet-1of2 | 49d13c3 | link | false | /test e2e-aws-ovn-serial-publicnet-1of2
ci/prow/okd-e2e-gcp | 49d13c3 | link | false | /test okd-e2e-gcp
ci/prow/e2e-azure-ovn-etcd-scaling | 49d13c3 | link | false | /test e2e-azure-ovn-etcd-scaling
ci/prow/e2e-aws-ovn-etcd-scaling | 49d13c3 | link | false | /test e2e-aws-ovn-etcd-scaling
ci/prow/e2e-gcp-fips-serial-2of2 | 49d13c3 | link | false | /test e2e-gcp-fips-serial-2of2
ci/prow/e2e-gcp-disruptive | 49d13c3 | link | false | /test e2e-gcp-disruptive
ci/prow/e2e-vsphere-ovn-dualstack-primaryv6 | 49d13c3 | link | false | /test e2e-vsphere-ovn-dualstack-primaryv6
ci/prow/e2e-vsphere-ovn-etcd-scaling | 49d13c3 | link | false | /test e2e-vsphere-ovn-etcd-scaling
ci/prow/e2e-gcp-ovn-etcd-scaling | 49d13c3 | link | false | /test e2e-gcp-ovn-etcd-scaling
ci/prow/e2e-gcp-fips-serial-1of2 | 49d13c3 | link | false | /test e2e-gcp-fips-serial-1of2
ci/prow/e2e-aws-ovn-serial-publicnet-2of2 | 49d13c3 | link | false | /test e2e-aws-ovn-serial-publicnet-2of2
ci/prow/e2e-openstack-serial | 49d13c3 | link | false | /test e2e-openstack-serial
ci/prow/e2e-metal-ipi-ovn-dualstack | e18b0fa | link | false | /test e2e-metal-ipi-ovn-dualstack
ci/prow/e2e-aws-proxy | e18b0fa | link | false | /test e2e-aws-proxy
ci/prow/e2e-agnostic-ovn-cmd | e18b0fa | link | false | /test e2e-agnostic-ovn-cmd
ci/prow/e2e-openstack-ovn | e18b0fa | link | false | /test e2e-openstack-ovn
ci/prow/e2e-metal-ipi-serial-ovn-ipv6-1of2 | e18b0fa | link | false | /test e2e-metal-ipi-serial-ovn-ipv6-1of2
ci/prow/e2e-metal-ipi-ovn-ipv6 | e18b0fa | link | true | /test e2e-metal-ipi-ovn-ipv6
ci/prow/okd-scos-e2e-aws-ovn | e18b0fa | link | false | /test okd-scos-e2e-aws-ovn
ci/prow/e2e-metal-ipi-ovn | e18b0fa | link | false | /test e2e-metal-ipi-ovn
ci/prow/e2e-aws-disruptive | e18b0fa | link | false | /test e2e-aws-disruptive
ci/prow/e2e-vsphere-ovn-upi | e18b0fa | link | true | /test e2e-vsphere-ovn-upi
ci/prow/e2e-metal-ipi-ovn-dualstack-local-gateway | e18b0fa | link | false | /test e2e-metal-ipi-ovn-dualstack-local-gateway
ci/prow/e2e-aws-ovn-single-node-upgrade | e18b0fa | link | false | /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-aws-ovn-single-node-serial | e18b0fa | link | false | /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-gcp-ovn-rt-upgrade | e18b0fa | link | false | /test e2e-gcp-ovn-rt-upgrade
ci/prow/e2e-aws-ovn-single-node | e18b0fa | link | false | /test e2e-aws-ovn-single-node
ci/prow/e2e-metal-ipi-serial-1of2 | e18b0fa | link | false | /test e2e-metal-ipi-serial-1of2
ci/prow/e2e-vsphere-ovn | e18b0fa | link | true | /test e2e-vsphere-ovn
ci/prow/e2e-gcp-ovn-techpreview | e18b0fa | link | false | /test e2e-gcp-ovn-techpreview
ci/prow/e2e-azure | e18b0fa | link | false | /test e2e-azure
ci/prow/e2e-metal-ipi-serial-ovn-ipv6-2of2 | e18b0fa | link | false | /test e2e-metal-ipi-serial-ovn-ipv6-2of2
ci/prow/e2e-metal-ipi-ovn-kube-apiserver-rollout | e18b0fa | link | false | /test e2e-metal-ipi-ovn-kube-apiserver-rollout
ci/prow/e2e-metal-ipi-virtualmedia | e18b0fa | link | false | /test e2e-metal-ipi-virtualmedia
ci/prow/e2e-aws-ovn-fips | e18b0fa | link | true | /test e2e-aws-ovn-fips
ci/prow/e2e-metal-ipi-serial-2of2 | e18b0fa | link | false | /test e2e-metal-ipi-serial-2of2
ci/prow/e2e-gcp-ovn | e18b0fa | link | true | /test e2e-gcp-ovn
ci/prow/e2e-gcp-ovn-techpreview-serial-2of2 | e18b0fa | link | false | /test e2e-gcp-ovn-techpreview-serial-2of2

openshift-trt bot commented Jul 31, 2025

Job Failure Risk Analysis for sha: e18b0fa

Job Name | Failure Risk
--- | ---
pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-upgrade | IncompleteTests

@alebedev87
Contributor

As discussed, even if DNS is not managed on the cluster, an IngressController or Gateway can still have internet-facing load balancers (provisioned for their publishing services). We can amend the logic for sending HTTP requests to target the load balancer's DNS name when DNS is not managed. This would also require setting the Host header to the HTTPRoute's or Route's hostname (goclient example of how to do this).
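
A minimal sketch of that idea using only the Go standard library: the request URL points at the load balancer while `Request.Host` carries the route's hostname. The helper names and the example hostname are hypothetical, and a local `httptest` server stands in for the load balancer. For HTTPS, the TLS server name would likely need the same treatment (for example via `tls.Config.ServerName`), which this sketch leaves out.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// getViaLoadBalancer sends a GET to lbURL (scheme plus the load balancer's
// address) while presenting routeHost in the Host header, so the router can
// match the route even though routeHost does not resolve in DNS.
func getViaLoadBalancer(client *http.Client, lbURL, routeHost, path string) (int, string, error) {
	req, err := http.NewRequest(http.MethodGet, lbURL+path, nil)
	if err != nil {
		return 0, "", err
	}
	// For client requests, Request.Host overrides the Host header that would
	// otherwise be derived from the URL.
	req.Host = routeHost
	resp, err := client.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, "", err
	}
	return resp.StatusCode, string(body), nil
}

// newEchoHostServer starts a local server that stands in for the load
// balancer and replies with the Host value it received.
func newEchoHostServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Host)
	}))
}

func main() {
	lb := newEchoHostServer()
	defer lb.Close()

	status, seenHost, err := getViaLoadBalancer(lb.Client(), lb.URL, "app.shard.example.com", "/")
	if err != nil {
		panic(err)
	}
	fmt.Println(status, seenHost)
}
```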

sadasu commented Aug 13, 2025

As discussed, even if the DNS is not managed on cluster, an IngressController or Gateway can still have internet facing load balancers (provisioned for their publishing services). We can amend the logic of sending of HTTP requests to target the load balancer's DNS name in case the DNS is not managed. This would also need the Host header to be set to the HTTPRoute or Route's hostname (goclient example of how to do this).

@alebedev87 would you like to see these changes done as part of this fix, or as a follow-up after skipping the tests temporarily?

@alebedev87
Contributor

would like to see these changes done as part of this fix or as a follow up to skipping the tests temporarily.

@sadasu: We discussed this with Hongan and thought that we could give the implementation a chance. A follow-up PR was discussed too, but we were not sure about the priority. Is it needed before the 4.20 branch cut?

candita commented Aug 15, 2025

This is more likely due to the new ClusterHostedDNS feature being tested on a nightly.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 15, 2025

// skip status check if privateZone/publicZone in dns.config is nil or empty
emptyDNSZone := &configv1.DNSZone{}
dns, err := oc.AdminConfigClient().ConfigV1().DNSes().Get(context, "cluster", metav1.GetOptions{})
Contributor

Can this be checked outside the 10 minute loop, or do we need to check it every time?
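
If the DNS config can be assumed not to change while the test runs, the lookup could be hoisted out of the polling loop. The sketch below shows that pattern with a hand-rolled `poll` helper and hypothetical function names; it does not use the test framework's real polling utilities or the cluster client.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// poll calls cond every interval until it returns true, returns an error,
// or the timeout elapses.
func poll(interval, timeout time.Duration, cond func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := cond()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for condition")
		}
		time.Sleep(interval)
	}
}

// pollWithHoistedLookup runs lookup once, then passes its result to every
// iteration of check instead of repeating the lookup inside the loop.
func pollWithHoistedLookup(lookup func() (bool, error), check func(zonesConfigured bool) (bool, error), interval, timeout time.Duration) error {
	zonesConfigured, err := lookup()
	if err != nil {
		return err
	}
	return poll(interval, timeout, func() (bool, error) {
		return check(zonesConfigured)
	})
}

// countCalls simulates a check that succeeds on its third attempt and
// returns how often the lookup and the check ran.
func countCalls() (lookups, checks int, err error) {
	err = pollWithHoistedLookup(
		func() (bool, error) { lookups++; return true, nil },
		func(bool) (bool, error) { checks++; return checks == 3, nil },
		time.Millisecond, time.Second,
	)
	return lookups, checks, err
}

func main() {
	lookups, checks, err := countCalls()
	fmt.Println(lookups, checks, err)
}
```

Whether hoisting is safe depends on the assumption stated above; if the zones could change mid-test, the lookup would have to stay inside the loop.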

@alebedev87
Contributor

This is more likely due to the new ClusterHostedDNS feature being tested on a nightly.

Ack. This explains some things. From the customer configured DNS EP:

OpenShift will start static CoreDNS pods to provide DNS resolution for API, Internal API and Ingress services that are essential for cluster creation. In order for a master or worker node to come online, they need the machine config server via the api-int domain. So, it is essential for the api-int domain to be resolvable on the bootstrap node. The static CoreDNS pod started on the bootstrap node needs entries for just api-int resolution and the static CoreDNS pod started on the control plane nodes needs to be able to resolve api, api-int and *.apps domains. Although not identical, this approach leverages learnings from OpenShift's approach to providing in-cluster DNS for on-prem platforms.

After cluster deployment is completed, the customer will update their external DNS solution with the same assigned LB IP addresses used for the configuration of the internal CoreDNS instance. OpenShift will not delete the CoreDNS pod even after cluster installation completes.

If the user successfully configures their external DNS service with api, api-int and *.apps services, then they could optionally delete the in-cluster CoreDNS pod and the cluster is expected to function fully as expected. This is a completely optional step with this design. If the customer does configure their custom DNS solution and leave the in-cluster CoreDNS pod running, all in-cluster entities will end up using the CoreDNS pod's DNS services and all out of cluster requests will be handled by the external DNS.

The http2 and grpc tests use dedicated ingresscontrollers, so the default wildcard created by the new static CoreDNS won't work for those tests.

@lihongan:

  1. Can you please update the description of the bug to mention that the failure was found during the testing of feature gates for the promotion to GA? Currently the bug does not name the feature gates and doesn't mention that they are from the TechPreviewNoUpgrade feature set.
  2. Can you please update the PR description and the commit message to explain in detail why we change the domain and logic only for the GRPC/HTTP2 test cases (explanation above)?

Also, it seems like the GatewayAPI use case was overlooked. We have to follow up on the final decision, so far I see 2 PRs for this:

@lihongan
Contributor Author

Thanks @alebedev87, I updated the bug description and posted a new PR to try to fix the tests instead of skipping them.

@lihongan
Contributor Author

/close
superseded by #30131

@openshift-ci openshift-ci bot closed this Aug 19, 2025
@openshift-ci-robot

@lihongan: This pull request references Jira Issue OCPBUGS-59176. The bug has been updated to no longer refer to the pull request using the external bug tracker.

openshift-ci bot commented Aug 19, 2025

@lihongan: Closed this PR.
