
Upstream + hybrid-overlay merge 2019-12-28#70

Merged
dcbw merged 109 commits into openshift:master from dcbw:upstream-2019-12-20
Dec 29, 2019

Conversation


@dcbw dcbw commented Dec 21, 2019

dcbw and others added 30 commits November 15, 2019 08:59
Signed-off-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
AddFilteredEndpointsHandler must take label selector like other handlers
build: fix 'make lint' when GOPATH isn't explicitly set
When handling the scheme:address:port URLs given to OVN for
configuring how to reach OVN services, properly handle IPv6 addresses
by not assuming we can just split on ":" across the whole string.

Also use JoinHostPort to properly join a host and port for both IPv4
and IPv6 cases.
Fix parsing of IPv6 addresses in ovn URLs
…ic event notifications to watchers

Improving debugging for failing tests
So, we have registered 9409 and 9410 port numbers for ovnkube-master
and ovnkube-node here:

https://github.com/prometheus/prometheus/wiki/Default-port-allocations

Change the current port numbers to use the reserved port numbers.
Furthermore, with the current port numbers -- 9101 and 9102 -- the
node_exporter daemonset is crashing because it uses one of the
above ports.

Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
the test case was passing config.GatewayModeLocal for the shared
gateway mode instead of config.GatewayModeShared.

Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
the boolean argument that determined whether a localnet logical switch
port was required was only needed for spare gateway mode. the two
gateway modes we support today will always have a localnet logical
switch port, so remove that redundant argument

Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
Fixes: c3def15 ("Add multicast support.")
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Pulls in changes to support multiple subnets and to support IPv6:

openshift/sdn#66
Sync SubnetAllocator from openshift/sdn
Enable IGMP Querier only if a source IPv4 is available.
The pod network info of IP, MAC, Gateway, and Routes are under
'ovn' annotation. We need to move it under 'k8s.ovn.org' namespace.

The new annotation is called 'pod-networks', and it is going to be a
map of 'network_name' to pod's IP information on that network. For
example: ("default" refers to the first OVN interface to the Pod)

    {
        "default": {
            "gateway_ip": "192.168.2.1",
            "ip_address": "192.168.2.3/24",
            "mac_address": "8a:24:f4:a8:02:04"
        }
    }

The change assumes that the master is upgraded first. It continues to
write both the old/new annotation names to facilitate yet-to-be
upgraded ovnkube nodes. In the next release of ovn-kubernetes, we can
remove the code that adds `legacy` annotation.

Signed-off-by: Yun Zhou <yunz@nvidia.com>
Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
Move pod annotation under k8s.ovn.org namespace
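
A rough sketch of reading and writing such an annotation is shown below. The struct, helper names, and the full key `k8s.ovn.org/pod-networks` (the namespace joined with the annotation name) are assumptions made for illustration; the field names follow the JSON example in the commit message, and routes are omitted because the example does not show their shape.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Assumed full annotation key: namespace "k8s.ovn.org" plus "pod-networks".
const podNetworksAnnotation = "k8s.ovn.org/pod-networks"

// podNetwork mirrors the per-network fields from the commit's JSON example.
type podNetwork struct {
	GatewayIP  string `json:"gateway_ip"`
	IPAddress  string `json:"ip_address"`
	MACAddress string `json:"mac_address"`
}

// marshalPodNetworks encodes a network-name-to-info map as the annotation value.
func marshalPodNetworks(nets map[string]podNetwork) (string, error) {
	b, err := json.Marshal(nets)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// unmarshalPodNetworks decodes the annotation value from a pod's annotations.
func unmarshalPodNetworks(annotations map[string]string) (map[string]podNetwork, error) {
	raw, ok := annotations[podNetworksAnnotation]
	if !ok {
		return nil, fmt.Errorf("annotation %s not found", podNetworksAnnotation)
	}
	nets := map[string]podNetwork{}
	if err := json.Unmarshal([]byte(raw), &nets); err != nil {
		return nil, err
	}
	return nets, nil
}

func main() {
	value, err := marshalPodNetworks(map[string]podNetwork{
		"default": {
			GatewayIP:  "192.168.2.1",
			IPAddress:  "192.168.2.3/24",
			MACAddress: "8a:24:f4:a8:02:04",
		},
	})
	if err != nil {
		panic(err)
	}
	annotations := map[string]string{podNetworksAnnotation: value}

	nets, err := unmarshalPodNetworks(annotations)
	if err != nil {
		panic(err)
	}
	fmt.Println(nets["default"].IPAddress)
}
```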
The current test annotates the node upfront and later checks to see if
the node has correct subnet information. This is not right. We need to
start with no subnet annotation and then later check if the node has
subnet annotation.

Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
As with all the other Poll*() functions, don't return an error if
all we want to do is just check again at the next interval.

Signed-off-by: Dan Williams <dcbw@redhat.com>
the MAC address for node's management port is randomly chosen. this
address is then added to node's annotation. the master reads the
address and creates a corresponding logical switch port using this
address.

now when node reboots, the mac address of the management port on the
node changes. this changed address is then reflected on node's
annotation and then in the UpdateFunc callback handler for the node
resource, we update the MAC address of the logical switch port.

this is all unnecessary complexity, so a better way is to just
persist the initial MAC for the management port in the interface's
MAC column

Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
use ahostsv4 database to ensure we get IPv4 address always
Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
the MAC address of br-nexthop port is re-generated upon every reboot.
OVN SB remembers the old MAC address in its MAC_Binding table and this
causes communication issues. just like how physical NICs have fixed MAC
addresses, create these interfaces with the fixed MAC address of
00:00:a9:fe:21:01, where the last 4 hex octets correspond to
169.254.33.1

Fixes openshift#946

Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
add a switch to flip on/off multicast support (is disabled by default)
Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
with map[string]interface{} we can have the value to be `nil` and that
can be used to remove an annotation from the node.

Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
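
One way a `nil` value can express removal is through a JSON merge patch, where a `null` member deletes the key. The sketch below assumes that mechanism; the helper name `buildAnnotationPatch` and the `example.org/...` keys are hypothetical, and the commit itself may apply the map differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildAnnotationPatch wraps annotation changes in a merge-patch body.
// A nil value is encoded as JSON null, which a JSON merge patch treats
// as "remove this key". A map[string]string could not express that.
func buildAnnotationPatch(annotations map[string]interface{}) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": annotations,
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := buildAnnotationPatch(map[string]interface{}{
		"example.org/keep":   "value", // set or update this annotation
		"example.org/remove": nil,     // remove this annotation
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```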

dcbw commented Dec 21, 2019

/test e2e-aws-ovn

set other_config:hwaddr on br-local before you add br-nexthop

dcbw commented Dec 21, 2019

Last "failure" was actually success except for [Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel/minimal] which we know is not quite right...


dcbw commented Dec 21, 2019

/test e2e-aws-ovn


dcbw commented Dec 22, 2019

Another "pass" except for the Prometheus alert issue.

/test e2e-aws-ovn


dcbw commented Dec 22, 2019

Another "pass" except for the Prometheus alert issue.

/test e2e-aws-ovn

girishmg and others added 4 commits December 23, 2019 17:39
With 400+ odd nodes, the current ManagementPortReady() function is not
scaling. The ovn-nbctl calls are timing out. When we have a way to find
out that the data path for the management port is ready by checking
for OpenFlow rules on the integration bridge we should make use of it.

Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
With 400+ odd nodes, the current GatewayReady() function is not
scaling. The ovn-nbctl calls are timing out. When we have a way to find
out that the data path for the L3Gateway is ready by checking
for OpenFlow rules on the integration bridge we should make use of it.

Adding SNAT rules is the last thing we do while building the logical
topology. So, check for the SNAT rule in table 65 in the integration
bridge

Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
scale: ascertain management port readiness by checking OpenFlow rules
scale: ascertain gateway readiness by checking OpenFlow rules
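
The idea behind these two scale commits, inferring readiness from flows already programmed on the local integration bridge instead of calling ovn-nbctl, might look roughly like the following. This is only a sketch: `hasFlowInTable` is a hypothetical helper, the sample text is made up to resemble `ovs-ofctl dump-flows` output, and the real checks match specific rules rather than any flow in a table.

```go
package main

import (
	"fmt"
	"strings"
)

// hasFlowInTable reports whether any flow line in dumpFlows belongs to
// the given OpenFlow table. dumpFlows is expected to resemble the text
// printed by "ovs-ofctl dump-flows <bridge>".
func hasFlowInTable(dumpFlows string, table int) bool {
	want := fmt.Sprintf("table=%d", table)
	for _, line := range strings.Split(dumpFlows, "\n") {
		// Flow fields are separated by commas and spaces.
		fields := strings.FieldsFunc(line, func(r rune) bool {
			return r == ',' || r == ' '
		})
		for _, f := range fields {
			if f == want {
				return true
			}
		}
	}
	return false
}

func main() {
	// Made-up sample output, for illustration only.
	sample := ` cookie=0x0, duration=12.3s, table=0, n_packets=5, n_bytes=420, priority=100,in_port=1 actions=resubmit(,8)
 cookie=0x0, duration=12.1s, table=65, n_packets=0, n_bytes=0, priority=100,reg15=0x1,metadata=0x1 actions=output:2`

	fmt.Println("table 65 programmed:", hasFlowInTable(sample, 65))
	fmt.Println("table 66 programmed:", hasFlowInTable(sample, 66))
}
```

Because the check reads local OpenFlow state, its cost does not grow with the number of nodes querying the northbound database, which is the scaling concern the commit messages raise.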

dcbw commented Dec 23, 2019

Another "pass" except for the Prometheus alert issue. Other failure is the openshift-apiserver failing with:

    1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://etcd.openshift-etcd.svc:2379 0 <nil>}]
    F1222 22:45:09.504287 1 openshift_apiserver.go:420] context deadline exceeded

/test e2e-aws-ovn

russellb and others added 3 commits December 23, 2019 09:57
ovn-kubernetes was already setting ovn-remote-probe-interval.  This
patch follows the same pattern for ovn-openflow-probe-interval, and
does it for the same reasons.

The default value for this option is 5 seconds. On a large cluster,
this can cause excessive CPU consumption in ovn-controller.  If it
takes ovn-controller 5 seconds to do a full state computation, then
you'll see ovn-controller end up in effectively a busy loop, because
it isn't able to keep up with this probe interval.

The openflow probe is even less interesting than the OVSDB
remote probe.  At least the ovsdb connection is to something remote.
The openflow connection is always local, so this is unlikely to be a
problem.  We now set it to 3 minutes by default, just in case, instead
of disabling it completely.

Signed-off-by: Russell Bryant <russell@ovn.org>
ovn-controller: Set ovn-openflow-probe-interval
the ovnkube-master.log file, with 290K lines of log messages, had close to
221K lines of '... UPDATE for event handler X' log messages that don't
provide any meaningful information. in fact, in that noise we might miss
an important log message. so remove these debug messages.

Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>

dcbw commented Dec 23, 2019

Another "pass" except for the Prometheus alert issue.

/test e2e-aws-ovn

dcbw and others added 3 commits December 23, 2019 13:31
remove unwanted debug log messages in factory.go
currently, that function gets other-config to ascertain that the
logical switch is created for a node and continues. later on, we make
another call to get other-config:subnet. instead, check for
other-config:subnet itself and avoid an unnecessary call.

Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
scale: waitForNodeLogicalSwitch() should get other-config:subnet itself

dcbw commented Dec 26, 2019

/test e2e-aws-ovn


dcbw commented Dec 26, 2019

ovnkube masters do provide metrics on 0.0.0.0:9102:

    # HELP ovnkube_master_pod_creation_latency_seconds The latency between pod creation and setting the OVN annotations
    # TYPE ovnkube_master_pod_creation_latency_seconds histogram
    ovnkube_master_pod_creation_latency_seconds_bucket{le="0.1"} 0
    ovnkube_master_pod_creation_latency_seconds_bucket{le="0.2"} 2
    ovnkube_master_pod_creation_latency_seconds_bucket{le="0.4"} 6
    ovnkube_master_pod_creation_latency_seconds_bucket{le="0.8"} 30
    ovnkube_master_pod_creation_latency_seconds_bucket{le="1.6"} 57

so perhaps the problem is either getting those metrics to prometheus, or the prometheus alert itself?


dcbw commented Dec 26, 2019

And a success without the prometheus metric issue.

/test e2e-aws-ovn


dcbw commented Dec 28, 2019

Prometheus alert issue again, otherwise good.

/test e2e-aws-ovn


dcbw commented Dec 28, 2019

/test e2e-aws-ovn


dcbw commented Dec 29, 2019

/test e2e-aws-ovn


dcbw commented Dec 29, 2019

Fixes for prometheus alert failures are openshift/cluster-network-operator#435 and openshift/cluster-network-operator#436

@dcbw dcbw changed the title from "Upstream + hybrid-overlay merge 2019-12-20" to "Upstream + hybrid-overlay merge 2019-12-28" Dec 29, 2019
@dcbw dcbw merged commit 13d85c0 into openshift:master Dec 29, 2019
@dcbw dcbw mentioned this pull request Dec 29, 2019

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
