
[k8s] The k8s integration tests are failing #33520

Closed · TylerHelmuth opened this issue Jun 12, 2024 · 24 comments
Labels: ci-cd (CI, CD, testing, build issues) · flaky test (a test is flaky) · help wanted (Extra attention is needed) · internal/k8stest · processor/k8sattributes (k8s Attributes processor) · receiver/k8scluster · receiver/k8sobjects · receiver/kubeletstats · release:blocker (The issue must be resolved before cutting the next release)

Comments

@TylerHelmuth (Member)

Component(s)

processor/k8sattributes, receiver/k8scluster, receiver/k8sobjects, receiver/kubeletstats

Describe the issue you're reporting

The k8s integration tests have started failing. See https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/workflows/e2e-tests.yml?query=branch%3Amain.

TylerHelmuth added the help wanted, ci-cd, flaky test, and internal/k8stest labels on Jun 12, 2024
@TylerHelmuth (Member, Author) commented Jun 12, 2024

I have been unable to reproduce the issues locally and reverting #33415 did not help (according to the CI jobs on main that was the first commit where things started to flake).

Looking at the workflow, it seems all versions are pinned, so I don't think we suddenly started using a new action, kind version, etc.


Pinging code owners for internal/k8stest: @crobert-1. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@TylerHelmuth (Member, Author)

@jinja2 @fatsheep9146 any guesses?

TylerHelmuth added the release:blocker label on Jun 12, 2024

Pinging code owners for receiver/k8sobjects: @dmitryax @hvaghani221 @TylerHelmuth. See Adding Labels via Comments if you do not have permissions to add labels yourself.


Pinging code owners for processor/k8sattributes: @dmitryax @rmfitzpatrick @fatsheep9146 @TylerHelmuth. See Adding Labels via Comments if you do not have permissions to add labels yourself.


Pinging code owners for receiver/k8scluster: @dmitryax @TylerHelmuth @povilasv. See Adding Labels via Comments if you do not have permissions to add labels yourself.


Pinging code owners for receiver/kubeletstats: @dmitryax @TylerHelmuth. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@axw (Contributor) commented Jun 13, 2024

I had reproduced the error locally yesterday (or at least something that looked the same), but had to switch focus before I could find the root cause. Now I can't reproduce it :(

One thing I did notice was in the collector logs there were errors about not being able to connect to kind-control-plane. Perhaps the e2e workflow should capture the pod logs before tearing down, to make debugging easier.

@fatsheep9146 (Contributor)

> I had reproduced the error locally yesterday (or at least something that looked the same), but had to switch focus before I could find the root cause. Now I can't reproduce it :(
>
> One thing I did notice was in the collector logs there were errors about not being able to connect to kind-control-plane. Perhaps the e2e workflow should capture the pod logs before tearing down, to make debugging easier.

I also could not reproduce the same error as the GitHub Action locally, which is really weird. But your suggestion to capture the pod logs (both the collector's and telemetrygen's) in the workflow to help with debugging is a good one. @axw

@ChrsMark (Member) commented Jun 13, 2024

Not sure if there is a cleaner way to get access to the Pods' logs, but I tried something quick and dirty to capture them: #33538.
Let's see if this gives us some insight.

@ChrsMark (Member)

Got some interesting "connection refused" errors: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9497224255/job/26173693278?pr=33538#step:11:225

2024-06-13T09:44:56.953Z	info	exporterhelper/retry_sender.go:118	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "traces", "name": "otlp", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused\"", "interval": "7.546970563s"}
2024-06-13T09:44:57.064Z	info	exporterhelper/retry_sender.go:118	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "otlp", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused\"", "interval": "7.612004411s"}
2024-06-13T09:44:57.486Z	info	exporterhelper/retry_sender.go:118	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "traces", "name": "otlp", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused\"", "interval": "6.403460654s"}

@fatsheep9146 (Contributor)

@ChrsMark It seems that the logic for getting the hostEndpoint is the root cause, and that logic differs between macOS and Linux.
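
For context, the host-endpoint lookup on Linux is typically done by inspecting the `kind` Docker network and returning its gateway address, while macOS uses Docker Desktop's well-known host name. The rough sketch below uses the Docker Go SDK; the function name and details are illustrative and are not the exact internal/k8stest code.

```go
package k8stest

import (
	"context"
	"errors"
	"runtime"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// hostEndpoint resolves an address on the host that pods inside the kind
// cluster can reach: Docker Desktop's well-known DNS name on macOS, the
// gateway of the "kind" bridge network on Linux.
func hostEndpoint(ctx context.Context) (string, error) {
	if runtime.GOOS == "darwin" {
		return "host.docker.internal", nil
	}

	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		return "", err
	}
	defer cli.Close()

	network, err := cli.NetworkInspect(ctx, "kind", types.NetworkInspectOptions{})
	if err != nil {
		return "", err
	}
	cfgs := network.IPAM.Config
	if len(cfgs) == 0 {
		return "", errors.New("kind network has no IPAM config")
	}
	// Note: with newer Docker versions the first IPAM entry can be an IPv6
	// subnet that has no Gateway, so blindly returning cfgs[0].Gateway
	// yields an empty endpoint, which later surfaces as the
	// "connection refused" errors seen in the collector logs.
	return cfgs[0].Gateway, nil
}
```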

@fatsheep9146 (Contributor)

@ChrsMark In your latest PR the hostEndpoint is empty; I think this is the root cause:
https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9498361987/job/26177138997?pr=33538

@fatsheep9146 (Contributor)

I suspect this is due to https://github.com/actions/runner-images/pull/10039/files: the ubuntu-latest image we use in GitHub Actions was updated to a newer version of Docker.

@ChrsMark (Member) commented Jun 13, 2024

Sounds plausible @fatsheep9146, I will try to upgrade Docker on my machine to 26.x.x as well and see if I can reproduce it.

Update:

I was able to reproduce this locally with Docker 26.1.4 (Ubuntu machine).
Collector Pod logs:

2024-06-13T12:42:55.052Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #2 SubChannel #8]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused"	{"grpc_log": true}
2024-06-13T12:42:55.052Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #2 SubChannel #8]grpc: addrConn.createTransport failed to connect to {Addr: "[::1]:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp [::1]:4317: connect: connection refused"	{"grpc_log": true}
2024-06-13T12:42:55.316Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #1 SubChannel #9]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused"	{"grpc_log": true}
2024-06-13T12:42:55.316Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #1 SubChannel #9]grpc: addrConn.createTransport failed to connect to {Addr: "[::1]:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp [::1]:4317: connect: connection refused"	{"grpc_log": true}
2024-06-13T12:42:57.265Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #4 SubChannel #11]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused"	{"grpc_log": true}
2024-06-13T12:42:57.265Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #4 SubChannel #11]grpc: addrConn.createTransport failed to connect to {Addr: "[::1]:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp [::1]:4317: connect: connection refused"	{"grpc_log": true}

@fatsheep9146 (Contributor)

@ChrsMark I'm trying to update the Docker SDK version to see if it fixes the problem.

@ChrsMark (Member)

@fatsheep9146 thanks! FYI, while debugging this I spotted that

network, err := client.NetworkInspect(ctx, "kind", types.NetworkInspectOptions{})

is failing with context deadline exceeded, but the weird thing is that this error is for some reason "muted".

Hopefully the lib upgrade can solve this.

@ChrsMark (Member)

I had a successful run at #33548. I'm going to enable the rest of the tests and check again.

@fatsheep9146 (Contributor)

> I had a successful run at #33548. I'm going to enable the rest of the tests and check again.

@ChrsMark

Yes, I found that updating the Docker SDK library is blocked for a couple of reasons:
#32614
#31989

So I also tried another way to get the right host endpoint:
https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9501492668/job/26187569925?pr=33542

I think we can try both ways and get more opinions from others.

@ChrsMark (Member)

I hit an additional error in the k8scluster receiver. It seems that some image names have changed as well: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9501435003/job/26187307997?pr=33548#step:11:35

Potential fix: c87a639

@fatsheep9146 (Contributor)

> I hit an additional error in the k8scluster receiver. It seems that some image names have changed as well: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9501435003/job/26187307997?pr=33548#step:11:35
>
> Potential fix: c87a639

I think this may be due to the newer version of kind.

@ChrsMark (Member)

@fatsheep9146 the e2e tests passed at #33548. I'm opening that one for review since it offers a fix anyway. I'll be out tomorrow (Friday), so feel free to pick up the gateway check and proceed with yours if people find that approach more suitable. I'm fine either way, as long as we solve the issue :).

TylerHelmuth pushed a commit that referenced this issue Jun 13, 2024
**Description:**
Only return an address that is not empty for the `kind` network. This started affecting the e2e tests, possibly because of the `ubuntu-latest` Docker version update mentioned at #33520 (comment). Relates to #33520.

/cc @fatsheep9146 

Sample `kind` network:

```console
curl --unix-socket /run/docker.sock http://docker/networks/kind | jq              
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   841  100   841    0     0   821k      0 --:--:-- --:--:-- --:--:--  821k
{
  "Name": "kind",
  "Id": "801d2abe204253cbd5d1d135f111a7fb386b830382bde79a699fb4f9aaf674b1",
  "Created": "2024-06-13T15:31:57.738509232+03:00",
  "Scope": "local",
  "Driver": "bridge",
  "EnableIPv6": true,
  "IPAM": {
    "Driver": "default",
    "Options": {},
    "Config": [
      {
        "Subnet": "fc00:f853:ccd:e793::/64"
      },
      {
        "Subnet": "172.18.0.0/16",
        "Gateway": "172.18.0.1"
      }
    ]
  },
  "Internal": false,
  "Attachable": false,
  "Ingress": false,
  "ConfigFrom": {
    "Network": ""
  },
  "ConfigOnly": false,
  "Containers": {
    "db113750635782bc1bfdf31e5f62af3c63f02a9c8844f7fe9ef045b5d9b76d12": {
      "Name": "kind-control-plane",
      "EndpointID": "8b15bb391109ca1ecfbb4bf7a96060b01e3913694d34e23d67eec22684f037bb",
      "MacAddress": "02:42:ac:12:00:02",
      "IPv4Address": "172.18.0.2/16",
      "IPv6Address": "fc00:f853:ccd:e793::2/64"
    }
  },
  "Options": {
    "com.docker.network.bridge.enable_ip_masquerade": "true",
    "com.docker.network.driver.mtu": "1500"
  },
  "Labels": {}
}
```


---------

Signed-off-by: ChrsMark <[email protected]>
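
A minimal sketch of the non-empty-address selection that the commit description refers to, assuming the Docker Go SDK's IPAM types (the helper name is illustrative, not the exact code in the PR): entries whose Gateway is empty, such as the IPv6 subnet in the sample above, are skipped instead of being returned as the host endpoint.

```go
package k8stest

import (
	"errors"

	"github.com/docker/docker/api/types/network"
)

// gatewayAddr returns the first non-empty gateway among the "kind"
// network's IPAM configs. With newer Docker versions the IPv6 subnet can
// be listed first and carries no gateway, so empty entries are skipped.
func gatewayAddr(configs []network.IPAMConfig) (string, error) {
	for _, cfg := range configs {
		if cfg.Gateway != "" {
			return cfg.Gateway, nil
		}
	}
	return "", errors.New("no gateway found for the kind network")
}
```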
@crobert-1 (Member)

Resolved by #33548

@crobert-1 (Member)

Thanks for addressing and fixing so quickly @ChrsMark and @fatsheep9146!
