HDDS-6264. Bump centos to 7.9.2009, dependencies and tools in ozone-runner #10
adoroszlai
Thanks a lot @smengcl for upgrading the runner.
I ran complete acceptance test locally and found the following:
1. Several tests fail due to: sh: diff: command not found
2. The ozonescripts environment has some problem after successful scm --init, log below.
+ docker-compose exec -T scm /opt/hadoop/sbin/start-ozone.sh
Starting datanodes
dc6e551d3ca0: Warning: Permanently added 'dc6e551d3ca0,172.30.0.2' (ECDSA) to the list of known hosts.
dc6e551d3ca0: "System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8)."
dc6e551d3ca0: Connection closed by 172.30.0.2 port 22
Starting Ozone Manager nodes [om]
om: Warning: Permanently added 'om,172.30.0.3' (ECDSA) to the list of known hosts.
om: "System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8)."
om: Connection closed by 172.30.0.3 port 22
Starting storage container manager nodes [scm]
scm: Warning: Permanently added 'scm,172.30.0.4' (ECDSA) to the list of known hosts.
scm: "System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8)."
scm: Connection closed by 172.30.0.4 port 22
Thanks @adoroszlai for reviewing this. Which command do you use to run those acceptance tests locally with the locally built dev image? I will try to run the whole acceptance test suite locally (on a Linux x86-64 box) as well.
Edit: Looks like
For 1, it looks like we need to explicitly yum install |
This overrides the defaults coming from
The |
|
The root cause for issue (2) is that centos 8.4's
[root@d6aa10d75824 /]# cat /etc/pam.d/sshd
#%PAM-1.0
auth substack password-auth
auth include postlogin
account required pam_sepermit.so
account required pam_nologin.so
account include password-auth
password include password-auth
# pam_selinux.so close should be the first session rule
session required pam_selinux.so close
session required pam_loginuid.so
# pam_selinux.so open should only be followed by sessions to be executed in the user context
session required pam_selinux.so open env_params
session required pam_namespace.so
session optional pam_keyinit.so force revoke
session optional pam_motd.so
session include password-auth
session include postlogin
[root@d6aa10d75824 /]# rpm -q --whatprovides /etc/pam.d/sshd
openssh-server-8.0p1-10.el8.x86_64
Therefore, when
Turns out I forgot to rebuild the ozone-runner-scripts image used by
$ docker-compose exec scm bash
bash-4.4$ cat /etc/*release
CentOS Linux release 8.4.2105
NAME="CentOS Linux"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
CentOS Linux release 8.4.2105
CentOS Linux release 8.4.2105
bash-4.4$ ls -l /run/nologin
ls: cannot access '/run/nologin': No such file or directory
bash-4.4$ ssh om
Warning: Permanently added 'om,172.26.0.3' (ECDSA) to the list of known hosts.
[hadoop@2a810ad3016a ~]$
[hadoop@2a810ad3016a ~]$ logout
Connection to om closed.
bash-4.4$ ssh datanode
Warning: Permanently added 'datanode,172.26.0.4' (ECDSA) to the list of known hosts.
[hadoop@c847a4f84b94 ~]$
[hadoop@c847a4f84b94 ~]$ logout
Connection to datanode closed.
bash-4.4$ |
Install diffutils for diff command; Remove /run/nologin to allow ssh as non-root user.
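The two fixes in this commit can be sanity-checked inside a built image or running container. Below is a minimal sketch, assuming bash is available; `check_runner_image` and its optional root-directory argument are hypothetical names introduced here for illustration and are not part of ozone-runner.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ozone-runner): reports whether the two
# fixes described above are in effect. The optional first argument is the
# filesystem root to inspect; it defaults to "/".
check_runner_image() {
  local root="${1:-/}"
  local status=0

  # Fix 1: the diff command is provided by the diffutils package.
  if command -v diff >/dev/null 2>&1; then
    echo "diff: found"
  else
    echo "diff: missing (install diffutils)"
    status=1
  fi

  # Fix 2: pam_nologin refuses non-root logins while this file exists.
  if [ -e "${root%/}/run/nologin" ]; then
    echo "nologin: present (non-root ssh will be refused)"
    status=1
  else
    echo "nologin: absent"
  fi

  return "$status"
}
```

Run it without arguments inside the container (for example after `docker-compose exec scm bash`); a non-zero exit status means at least one of the two checks failed.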
|
With the patch above, there is seemingly only one issue remaining (all acceptance test suites up to
Can be reproduced by running
Note that I have modded
diff --git a/hadoop-ozone/dist/src/main/compose/testlib.sh b/hadoop-ozone/dist/src/main/compose/testlib.sh
index 9c3d6c49c1..abad8c03dd 100755
--- a/hadoop-ozone/dist/src/main/compose/testlib.sh
+++ b/hadoop-ozone/dist/src/main/compose/testlib.sh
@@ -204,6 +204,7 @@ execute_robot_test(){
set -e
if [[ ${rc} -gt 0 ]]; then
+ generate_report
stop_docker_env
fi
Back to the problem analysis, the test
New client (centos 8.4):
Old client 1.0.0 (centos 7.6):
Let's just say the test case assumption (that
One hacky solution (without touching the ozone repo) might be to intentionally overwrite the /etc/passwd in the new image. And I'm not sure the centos 8.4 image would still work with this change.
Proper solution:
|
|
Thanks @smengcl for updating the patch. Great job analysing the failure in
I vote for option 2, this is a bug in the test. Note that Ozone is not automatically updated to any new
I have encountered another problem in some of the secure environments locally. Kerberos in the new image seems to be more strict about host names (fully qualified vs. not).
|
Yup I will file a jira to fix that.
I can observe the same issue as well once I boot up the
On another note, I just realized that we should use the centos 7.9 image, which has arm64 images as well, mainly because centos 8 is EOL while centos 7 still has a little over 2 years of official support (maintenance updates until 2024-06-30, after which we still have the choice to move to almalinux). And moving from 7.6 to 7.9 avoids a lot of those compatibility issues (but it was still fun digging into those problems; I learnt a whole lot from them, and they might still become a problem if moving to almalinux). I will push a commit to use centos 7.9.2009 instead (uses |
…b-init 1.2.5 (was 1.2.0); byteman 4.0.9 (was 4.0.4); async-profiler-2.6 (was 2.0); goofys 0.24.0 (was 0.20.0)
|
Filed HDDS-6270 to fix the xcompat acceptance test in Ozone. |
|
After some digging, I realized that the reason curl SPNEGO is throwing gss error
# Ran on scm1.org: kinit -k HTTP/[email protected] -t /etc/security/keytabs/HTTP.keytab
kdc_1 | Feb 07 10:53:46 kdc krb5kdc[7](info): AS_REQ (8 etypes {aes256-cts-hmac-sha1-96(18), aes128-cts-hmac-sha1-96(17), aes256-cts-hmac-sha384-192(20), aes128-cts-hmac-sha256-128(19), DEPRECATED:des3-cbc-sha1(16), DEPRECATED:arcfour-hmac(23), camellia128-cts-cmac(25), camellia256-cts-cmac(26)}) 172.25.0.116: ISSUE: authtime 1644231226, etypes {rep=aes256-cts-hmac-sha1-96(18), tkt=aes256-cts-hmac-sha1-96(18), ses=aes256-cts-hmac-sha1-96(18)}, HTTP/[email protected] for krbtgt/[email protected]
# Ran on scm1.org: curl -v --negotiate -u : -I http://scm1.org:9876/
# First tried HTTP/scm1.org -- no match
kdc_1 | Feb 07 10:54:03 kdc krb5kdc[7](info): TGS_REQ (8 etypes {aes256-cts-hmac-sha1-96(18), aes128-cts-hmac-sha1-96(17), aes256-cts-hmac-sha384-192(20), aes128-cts-hmac-sha256-128(19), DEPRECATED:des3-cbc-sha1(16), DEPRECATED:arcfour-hmac(23), camellia128-cts-cmac(25), camellia256-cts-cmac(26)}) 172.25.0.116: LOOKING_UP_SERVER: authtime 0, etypes {rep=UNSUPPORTED:(0)} HTTP/[email protected] for HTTP/[email protected], Server not found in Kerberos database
# Fallback to HTTP/ORG -- still no match. Thrown to the client (curl)
kdc_1 | Feb 07 10:54:03 kdc krb5kdc[7](info): TGS_REQ (8 etypes {aes256-cts-hmac-sha1-96(18), aes128-cts-hmac-sha1-96(17), aes256-cts-hmac-sha384-192(20), aes128-cts-hmac-sha256-128(19), DEPRECATED:des3-cbc-sha1(16), DEPRECATED:arcfour-hmac(23), camellia128-cts-cmac(25), camellia256-cts-cmac(26)}) 172.25.0.116: UNKNOWN_SERVER: authtime 0, etypes {rep=UNSUPPORTED:(0)} HTTP/[email protected] for krbtgt/[email protected], Server not found in Kerberos database
Then we would observe curl (with verbose) printing this error message:
However, this issue exists even in the current image version (
Nonetheless, this seems to be silently breaking other HTTP communication between recon and om1/2/3, as I have observed in the KDC log as well, which might need to be fixed in ozone-docker-testkrb5. Should just be some simple line addition in |
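To see how often the KDC rejects a service principal lookup, the KDC log can be summarized by failure status. This is only a sketch: `kdc_lookup_failures` is a hypothetical helper name, and it assumes the log lines keep the shape shown above (a LOOKING_UP_SERVER or UNKNOWN_SERVER status followed by "Server not found in Kerberos database").

```shell
#!/usr/bin/env bash
# Hypothetical helper: count failed principal lookups per status in KDC log
# lines. Reads the files given as arguments, or stdin when none are given.
kdc_lookup_failures() {
  awk '/Server not found in Kerberos database/ {
         # Record which lookup stage reported the failure.
         if (match($0, /(LOOKING_UP_SERVER|UNKNOWN_SERVER)/))
           count[substr($0, RSTART, RLENGTH)]++
       }
       END { for (s in count) print s, count[s] }' "$@" | sort
}
```

For example, `docker-compose logs kdc | kdc_lookup_failures` should print one line per status with its count, assuming the KDC service is named kdc in the compose file as the log prefix above suggests.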
|
Environment
$ docker version
Client:
Cloud integration: v1.0.22
Version: 20.10.12
API version: 1.41
Go version: go1.16.12
Git commit: e91ed57
Built: Mon Dec 13 11:46:56 2021
OS/Arch: darwin/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.12
API version: 1.41 (minimum version 1.12)
Go version: go1.16.12
Git commit: 459d0df
Built: Mon Dec 13 11:43:56 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.12
GitCommit: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
runc:
Version: 1.0.2
GitCommit: v1.0.2-0-g52b36a2
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Steps
$ docker build -t apache/ozone-runner:dev .
$ mvn clean install -Pdist -DskipTests -e -Dmaven.javadoc.skip=true -Ddocker.ozone-runner.version=dev
$ cd hadoop-ozone/dist/target/ozone-*-SNAPSHOT/compose/ozonesecure-ha/
$ ./test.sh
$ cd hadoop-ozone/dist/target/ozone-*-SNAPSHOT/compose/ozonesecure-ha/
$ docker-compose up -d
$ docker-compose exec scm1.org bash
bash-4.2$ cat /etc/*release
CentOS Linux release 7.9.2009 (Core)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
CentOS Linux release 7.9.2009 (Core)
CentOS Linux release 7.9.2009 (Core)
Here is the raw https://gist.github.com/smengcl/29dfc9c53d31f11c137533bf4d92e571
And here is the packed result directory after the
@adoroszlai Do you want to try running |
Do you have any idea why it passes in CI using that image? |
Hi @adoroszlai. Sorry, I should have clarified. This issue (
The issue should only happen when accessing:
And none of the above cases are covered in the existing robot tests. To be honest, this is not a bug in the Ozone code, but an issue of a misconfigured KDC. This issue happens because we haven't properly assigned SPNEGO service principals (namely those principals beginning with
The other issue you mentioned as broken in the centos 8.4 image might have been triggered by something else, likely some new config additions and possibly a default behavior change of hostname canonicalization from krb5 1.15 to 1.18. For krb5 1.18 we might need to set |
|
To put that into reproducible steps:
Prepare
# change directory to ozone project root
$ mvn clean install -Pdist -DskipTests -e -Dmaven.javadoc.skip=true -Ddocker.ozone-runner.version=dev
$ cd ./hadoop-ozone/dist/target/ozone-1.3.0-SNAPSHOT/compose/ozonesecure-ha
$ docker-compose up -d
# Wait for > 20 sec for the cluster to boot up
$ docker-compose exec scm1.org bash
In scm1.org
bash-4.2$ hostname
scm1.org
bash-4.2$ kinit -k HTTP/[email protected] -t /etc/security/keytabs/HTTP.keytab
bash-4.2$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: HTTP/[email protected]
Valid starting Expires Service principal
02/09/22 20:48:34 02/10/22 20:48:34 krbtgt/[email protected]
renew until 02/16/22 20:48:34
Auth via SPNEGO to S3G web endpoint (SUCCESS)
Got 302 (as expected)
bash-4.2$ curl -v --negotiate -u : http://s3g:9878/
* About to connect() to s3g port 9878 (#0)
* Trying 172.25.0.114...
* Connected to s3g (172.25.0.114) port 9878 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: s3g:9878
> Accept: */*
>
< HTTP/1.1 302 Found
< Date: Wed, 09 Feb 2022 20:49:21 GMT
< Cache-Control: no-cache
< Expires: Wed, 09 Feb 2022 20:49:21 GMT
< Date: Wed, 09 Feb 2022 20:49:21 GMT
< Pragma: no-cache
< Content-Type: text/plain;charset=utf-8
< X-Content-Type-Options: nosniff
< X-FRAME-OPTIONS: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< Location: http://s3g:9878/static/
< Content-Length: 0
<
* Connection #0 to host s3g left intact
Got 401, then 200 (as expected)
bash-4.2$ curl -v --negotiate -u : -I http://s3g:9878/static/index.html
* About to connect() to s3g port 9878 (#0)
* Trying 172.25.0.114...
* Connected to s3g (172.25.0.114) port 9878 (#0)
> HEAD /static/index.html HTTP/1.1
> User-Agent: curl/7.29.0
> Host: s3g:9878
> Accept: */*
>
< HTTP/1.1 401 Authentication required
HTTP/1.1 401 Authentication required
< X-Content-Type-Options: nosniff
X-Content-Type-Options: nosniff
< X-FRAME-OPTIONS: SAMEORIGIN
X-FRAME-OPTIONS: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
X-XSS-Protection: 1; mode=block
< WWW-Authenticate: Negotiate
WWW-Authenticate: Negotiate
< Set-Cookie: hadoop.auth=; Path=/; HttpOnly
Set-Cookie: hadoop.auth=; Path=/; HttpOnly
< Cache-Control: must-revalidate,no-cache,no-store
Cache-Control: must-revalidate,no-cache,no-store
< Content-Type: text/html;charset=iso-8859-1
Content-Type: text/html;charset=iso-8859-1
< Content-Length: 459
Content-Length: 459
<
* Connection #0 to host s3g left intact
* Issue another request to this URL: 'http://s3g:9878/static/index.html'
* Found bundle for host s3g: 0x1c78fa0
* Re-using existing connection! (#0) with host s3g
* Connected to s3g (172.25.0.114) port 9878 (#0)
* Server auth using GSS-Negotiate with user ''
> HEAD /static/index.html HTTP/1.1
> Authorization: Negotiate YIICZAYJKoZIhvcSAQICAQBuggJTMIICT6ADAgEFoQMCAQ6iBwMFACAAAACjggFjYYIBXzCCAVugAwIBBaENGwtFWEFNUExFLkNPTaIWMBSgAwIBA6ENMAsbBEhUVFAbA3MzZ6OCASswggEnoAMCARKhAwIBAaKCARkEggEVA1aHDpCGx0z92AGtdDVaTTtYLV6kx71gw7ctlXzjpZ7Qi4ovEbtdrTahf9tsff6o+uOH8QIa5t0FxCAjOh7XqxpW7iQqTXIGXni3zOXzvb1vAvvOHafKNwkL7I3msMh2EqG09QIHgAPaccjXVr3/fX5CJjyfY+Hl1bmfrcHRpreXHgx98JIZusIgbYEYqGWxMRpzRNdkOScxTUj1OXY6V5urZqjJGbl8UdUZW1sW9V5ZO/IPgrYglP6PddGt5khii6UgqjeX3gq3XAniQT0PGiYOjUa9Rmr4PjZginT13mYjagf09tHBkmzFOUSPRWttJWWxqVVJZykF1uKsdWutDPsyHZm10EK8r0I0rcTlqm3kOl/6U6SB0jCBz6ADAgESooHHBIHEM20mz4GQOFbbpotjrdAJO9mQa4kTZJ0hLYjq3QzNAqQhHppYV/FEp69MXLBZNlJCBMcp1bHMfbR6TBNnSlOsCJP9C+udYIjDPaAuAY/EJXxktg/lMlzl1eZSBV+q66Dm7pJq0sxk/nZLv/plUJYZGqrfJTFj1JeFcVN1ukc6gGGqR4doFQq/pbwUSzTbCvfzOmH65GPsbXEyTVt7Lx5KhdZ5wGlh2UuQ6073sLNz7uziVPYv4LAzHJ1OjQjN7JokjX3EyQ==
> User-Agent: curl/7.29.0
> Host: s3g:9878
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Wed, 09 Feb 2022 20:50:13 GMT
Date: Wed, 09 Feb 2022 20:50:13 GMT
< Content-Type: text/html
Content-Type: text/html
< X-Content-Type-Options: nosniff
X-Content-Type-Options: nosniff
< X-FRAME-OPTIONS: SAMEORIGIN
X-FRAME-OPTIONS: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
X-XSS-Protection: 1; mode=block
< WWW-Authenticate: Negotiate YGoGCSqGSIb3EgECAgIAb1swWaADAgEFoQMCAQ+iTTBLoAMCARKiRARCtXuBXd5k+Bf8joVJCUHitxob0QQ6HN0Q9zhCyqZaNAV9qzWKbbArpj5UE8u2wy82jitx4ORmbNskWQjX0xNwThSE
WWW-Authenticate: Negotiate YGoGCSqGSIb3EgECAgIAb1swWaADAgEFoQMCAQ+iTTBLoAMCARKiRARCtXuBXd5k+Bf8joVJCUHitxob0QQ6HN0Q9zhCyqZaNAV9qzWKbbArpj5UE8u2wy82jitx4ORmbNskWQjX0xNwThSE
< Set-Cookie: hadoop.auth="u=root&p=HTTP/[email protected]&t=kerberos&e=1644475813277&s=oxSVRo1+T5bFFmluutqHdv/0vRm1xGcApucXw9gix/s="; Path=/; HttpOnly
Set-Cookie: hadoop.auth="u=root&p=HTTP/[email protected]&t=kerberos&e=1644475813277&s=oxSVRo1+T5bFFmluutqHdv/0vRm1xGcApucXw9gix/s="; Path=/; HttpOnly
< Last-Modified: Wed, 09 Feb 2022 12:15:20 GMT
Last-Modified: Wed, 09 Feb 2022 12:15:20 GMT
< Accept-Ranges: bytes
Accept-Ranges: bytes
< Content-Length: 3106
Content-Length: 3106
<
* Closing connection 0
Auth via SPNEGO to scm1.org web endpoint (401 only, FAIL)
bash-4.2$ curl -v --negotiate -u : http://scm1.org:9876/
* About to connect() to scm1.org port 9876 (#0)
* Trying 172.25.0.116...
* Connected to scm1.org (172.25.0.116) port 9876 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: scm1.org:9876
> Accept: */*
>
< HTTP/1.1 401 Authentication required
< Pragma: no-cache
< X-Content-Type-Options: nosniff
< X-FRAME-OPTIONS: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< Pragma: no-cache
< X-Content-Type-Options: nosniff
< X-FRAME-OPTIONS: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
* gss_init_sec_context() failed: : Server krbtgt/[email protected] not found in Kerberos database
< WWW-Authenticate: Negotiate
< Set-Cookie: hadoop.auth=; Path=/; HttpOnly
< Cache-Control: must-revalidate,no-cache,no-store
< Content-Type: text/html;charset=iso-8859-1
< Content-Length: 442
<
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 401 Authentication required</title>
</head>
<body><h2>HTTP ERROR 401 Authentication required</h2>
<table>
<tr><th>URI:</th><td>/</td></tr>
<tr><th>STATUS:</th><td>401</td></tr>
<tr><th>MESSAGE:</th><td>Authentication required</td></tr>
<tr><th>SERVLET:</th><td>org.eclipse.jetty.servlet.DefaultServlet-37986daf</td></tr>
</table>
</body>
</html>
* Connection #0 to host scm1.org left intact
Same response when trying to access http://scm1.org:9876/conf , http://scm1.org:9876/jmx etc. The root cause is explained in the previous comment. |
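The manual curl checks above can be wrapped into a small probe that reports only the final HTTP status of each endpoint. This is a sketch under assumptions: `spnego_verdict` and `spnego_probe` are hypothetical helper names, and treating 200 and 302 as success simply mirrors the outcomes labelled "as expected" in the transcript above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify the final HTTP status of a SPNEGO request.
spnego_verdict() {
  case "$1" in
    200|302) echo "SUCCESS" ;;
    401)     echo "FAIL (negotiation did not complete)" ;;
    *)       echo "UNKNOWN ($1)" ;;
  esac
}

# Hypothetical helper: print the final status code of a SPNEGO request.
# Needs a valid ticket (kinit) and network access to the endpoint.
spnego_probe() {
  local url="$1" code
  # -w '%{http_code}' prints the status of the last response curl received.
  code="$(curl -s -o /dev/null -w '%{http_code}' --negotiate -u : "$url")"
  echo "$url -> $code: $(spnego_verdict "$code")"
}
```

Usage, after kinit inside scm1.org: `spnego_probe http://s3g:9878/static/index.html` and `spnego_probe http://scm1.org:9876/`. With the behaviour described above, the first should report 200 and the second 401.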
|
Thanks @smengcl for continuing work on this. Re-tested
I'm still running into this problem with
I have posted a patch for Ozone to allow using custom |
|
@adoroszlai Thanks for the reply. I ran
diff --git a/hadoop-ozone/dist/src/main/compose/testlib.sh b/hadoop-ozone/dist/src/main/compose/testlib.sh
index 9c3d6c49c1..abad8c03dd 100755
--- a/hadoop-ozone/dist/src/main/compose/testlib.sh
+++ b/hadoop-ozone/dist/src/main/compose/testlib.sh
@@ -204,6 +204,7 @@ execute_robot_test(){
set -e
if [[ ${rc} -gt 0 ]]; then
+ generate_report
stop_docker_env
fi
|
Thanks for the work! Triggered here https://github.com/smengcl/hadoop-ozone/commits/HDDS-6293 |
adoroszlai
Looks like acceptance/kubernetes checks passed.
https://github.com/smengcl/hadoop-ozone/actions/runs/1830510656
|
Thanks @adoroszlai for the review. I will merge this shortly. |
What changes were proposed in this pull request?
- Incorporated HDDS-6239 into Dockerfile (centos vault repo url change). No longer the case as centos 7 is not EOL, unlike centos 8.
- yum install will fail without it.
- bzip2-devel, gcc48-c++, lz4-devel, snappy-devel, zlib-devel, git in second stage (stage 1).
- gflags to 2.2.2 (was 2.0.0)
- zstd to 1.5.2 (was 1.1.3)
- rocksdb to 6.28.2 (was 6.8.1)
- make -j$(nproc) to speed up the build process
- dumb-init to 1.2.5 (was 1.2.0)
- byteman to 4.0.9 (was 4.0.4)
- async-profiler to 2.6 (was 2.0)
- goofys to 0.24.0 (was 0.20.0)
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-6264
How was this patch tested?
Followed https://github.com/apache/ozone-docker-runner#development for local testing, on x86_64 (amd64 / x64) Linux only:
- docker build -t apache/ozone-runner:dev .
- mvn clean verify -DskipTests -Dskip.npx -DskipShade -Ddocker.ozone-runner.version=dev on the ozone master branch (e47b6f0c35)
- cd hadoop-ozone/dist/target/ozone-*/compose/ozone && docker-compose up.
- compose/test-all.sh, all acceptance tests passed. (Took 53m55s on my x64 Linux box, or 48m23s excluding suite setups and teardowns)
- ozonesecure-ha might still be buggy in the setup stage (which didn't show up in the report at all) as I tried on my Linux box.