@smengcl
Contributor

@smengcl smengcl commented Feb 4, 2022

What changes were proposed in this pull request?

  • Bump golang base image to 1.17.6 (was 1.17.3)
  • Bump centos base image to 7.9.2009 (was 7.6.1810)
    • Incorporating HDDS-6239 into the Dockerfile (centos vault repo URL change, without which yum install fails) is no longer necessary, as centos 7 is not EOL, unlike centos 8.
  • Remove unused dependencies bzip2-devel, gcc48-c++, lz4-devel, snappy-devel, zlib-devel, git in second stage (stage 1).
  • Bump gflags to 2.2.2 (was 2.0.0)
  • Bump zstd to 1.5.2 (was 1.1.3)
  • Bump rocksdb to 6.28.2 (was 6.8.1)
  • Use make -j$(nproc) to speed up the build process
  • Bump dumb-init to 1.2.5 (was 1.2.0)
  • Bump byteman to 4.0.9 (was 4.0.4)
  • Bump async-profiler to 2.6 (was 2.0)
  • Bump goofys to 0.24.0 (was 0.20.0)
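
The make -j$(nproc) item above depends on nproc being available in the build stage. A minimal sketch of picking the job count with a fallback, assuming a POSIX shell; the build_jobs helper name is hypothetical and is not part of the Dockerfile in this PR:

```shell
#!/bin/sh
# Hypothetical helper: print the number of parallel make jobs to use.
build_jobs() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  else
    # Fallback when nproc is unavailable; default to a single job if getconf fails too.
    getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
  fi
}

# Illustrative usage inside a build step:
#   make -j"$(build_jobs)"
```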

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6264

How was this patch tested?

Followed https://github.com/apache/ozone-docker-runner#development for local testing, on x86_64 (amd64 / x64) Linux only:

  • Built image using docker build -t apache/ozone-runner:dev .
Successfully tagged apache/ozone-runner:dev
  • Ran mvn clean verify -DskipTests -Dskip.npx -DskipShade -Ddocker.ozone-runner.version=dev on the ozone master branch (e47b6f0c35)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  42.393 s
[INFO] Finished at: 2022-02-04T07:24:13-08:00
[INFO] ------------------------------------------------------------------------
$ docker-compose ps
      Name                    Command               State                            Ports                         
-------------------------------------------------------------------------------------------------------------------
ozone_datanode_1   /usr/local/bin/dumb-init - ...   Up      0.0.0.0:49160->9864/tcp,:::49160->9864/tcp,            
                                                            0.0.0.0:49159->9882/tcp,:::49159->9882/tcp             
ozone_om_1         /usr/local/bin/dumb-init - ...   Up      0.0.0.0:9862->9862/tcp,:::9862->9862/tcp,              
                                                            0.0.0.0:9874->9874/tcp,:::9874->9874/tcp               
ozone_recon_1      /usr/local/bin/dumb-init - ...   Up      0.0.0.0:9888->9888/tcp,:::9888->9888/tcp               
ozone_s3g_1        /usr/local/bin/dumb-init - ...   Up      0.0.0.0:9878->9878/tcp,:::9878->9878/tcp               
ozone_scm_1        /usr/local/bin/dumb-init - ...   Up      0.0.0.0:9860->9860/tcp,:::9860->9860/tcp,              
                                                            0.0.0.0:9876->9876/tcp,:::9876->9876/tcp
  • Ran compose/test-all.sh; all acceptance tests passed. (Took 53m55s on my x64 Linux box, or 48m23s excluding suite setups and teardowns.)
    • [!] Note: ozonesecure-ha might still be buggy in the setup stage (which didn't show up in the report at all) when I tried it on my Linux box.
      • But it works just fine when I run it on an Intel Mac, and all ha tests passed. Hmm.

@smengcl smengcl requested a review from adoroszlai February 4, 2022 15:53
Contributor

@adoroszlai adoroszlai left a comment

Thanks a lot @smengcl for upgrading the runner.

I ran complete acceptance test locally and found the following:

  1. Several tests fail due to sh: diff: command not found
  2. ozonescripts environment has some problem after successful scm --init, log below.
+ docker-compose exec -T scm /opt/hadoop/sbin/start-ozone.sh
Starting datanodes
dc6e551d3ca0: Warning: Permanently added 'dc6e551d3ca0,172.30.0.2' (ECDSA) to the list of known hosts.
dc6e551d3ca0: "System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8)."
dc6e551d3ca0: Connection closed by 172.30.0.2 port 22
Starting Ozone Manager nodes [om]
om: Warning: Permanently added 'om,172.30.0.3' (ECDSA) to the list of known hosts.
om: "System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8)."
om: Connection closed by 172.30.0.3 port 22
Starting storage container manager nodes [scm]
scm: Warning: Permanently added 'scm,172.30.0.4' (ECDSA) to the list of known hosts.
scm: "System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8)."
scm: Connection closed by 172.30.0.4 port 22

@smengcl
Contributor Author

smengcl commented Feb 5, 2022

> Thanks a lot @smengcl for upgrading the runner.
>
> I ran complete acceptance test locally and found the following:
>
>   1. Several tests fail due to sh: diff: command not found
>   2. ozonescripts environment has some problem after successful scm --init, log below.

Thanks @adoroszlai for reviewing this.

Which command do you use to run those acceptance tests locally with the locally built dev image? I will try to run the whole acceptance test suite locally (in a Linux x86-64 box) as well.

-- Edit: Looks like mvn clean verify -DskipTests -Dskip.npx -DskipShade -Ddocker.ozone-runner.version=dev is enough to generate the .env / compose yaml files pointing to the dev image. Now I'm trying to figure out how to get the log.html / report.html generated when running a single test suite like compose/ozone/test.sh instead of running all of them with compose/test-all.sh. compose/test-single.sh looks promising?

For 1, it looks like we need to explicitly yum install diffutils for centos 8.
For 2, I don't have much of a clue at the moment. With a cursory search, it looks like the warning Unprivileged users are not permitted to log in yet comes from /run/nologin / /usr/sbin/nologin.
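
A quick way to catch a missing tool such as diff before running the full acceptance suite is to probe PATH inside the image. The following is a minimal sketch, assuming a POSIX shell; the missing_commands helper is hypothetical and not part of the Ozone scripts:

```shell
#!/bin/sh
# Hypothetical helper: print each of the given commands that cannot be found on PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}
```

Inside a container this could be invoked as missing_commands diff ssh; any name it prints would need its package installed in the Dockerfile (diffutils in the case of diff, as noted above).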

@adoroszlai
Contributor

> Which command do you use to run those acceptance tests locally with the locally built dev image?
> -- Edit: Looks like mvn clean verify -DskipTests -Dskip.npx -DskipShade -Ddocker.ozone-runner.version=dev is enough to generate the .env / compose yaml files pointing to the dev image.

export OZONE_RUNNER_VERSION=dev

This overrides the defaults coming from .env files. So you don't need to rebuild the whole project.
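
The precedence described here (an exported variable wins over the .env default) can be sketched as follows. This is only an illustration, assuming the .env file contains a line of the form OZONE_RUNNER_VERSION=<value>; the runner_version helper is hypothetical and does not exist in the Ozone repo:

```shell
#!/bin/sh
# Hypothetical helper: resolve the runner image version.
# An exported OZONE_RUNNER_VERSION takes precedence; otherwise the .env file value is used.
runner_version() {
  env_file="$1"
  if [ -n "${OZONE_RUNNER_VERSION:-}" ]; then
    printf '%s\n' "$OZONE_RUNNER_VERSION"
  else
    # Print the value part of the KEY=VALUE line (last occurrence wins).
    sed -n 's/^OZONE_RUNNER_VERSION=//p' "$env_file" | tail -n 1
  fi
}
```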

> Now I'm trying to figure out how to get the log.html / report.html generated when running a single test suite like compose/ozone/test.sh instead of running all of them with compose/test-all.sh.

The generate_report call at the end of test.sh should create log.html (if you have Robot Framework installed). I think the script exits before that if there are test failures, though.

@smengcl
Contributor Author

smengcl commented Feb 5, 2022

The root cause for issue (2) is that centos 8.4's openssh-server-8.0p1-10.el8.x86_64 package installed by the ozone repo's compose/ozonescripts/Dockerfile has pam_nologin.so listed as required in /etc/pam.d/sshd, and /run/nologin is somehow not removed:

[root@d6aa10d75824 /]# cat /etc/pam.d/sshd
#%PAM-1.0
auth       substack     password-auth
auth       include      postlogin
account    required     pam_sepermit.so
account    required     pam_nologin.so
account    include      password-auth
password   include      password-auth
# pam_selinux.so close should be the first session rule
session    required     pam_selinux.so close
session    required     pam_loginuid.so
# pam_selinux.so open should only be followed by sessions to be executed in the user context
session    required     pam_selinux.so open env_params
session    required     pam_namespace.so
session    optional     pam_keyinit.so force revoke
session    optional     pam_motd.so
session    include      password-auth
session    include      postlogin
[root@d6aa10d75824 /]# rpm -q --whatprovides /etc/pam.d/sshd
openssh-server-8.0p1-10.el8.x86_64

Therefore, when the /opt/hadoop/sbin/start-ozone.sh script starts to ssh (as a non-root user) into other containers, the ssh server in those containers checks for the existence of /run/nologin, which should have been removed by some systemd service at startup but hasn't been, and eventually the ssh login is rejected.

The solution is to add a line in Ozone repo's compose/ozonescripts/Dockerfile to either:
1) Remove /run/nologin; or
2) Remove the line account required pam_nologin.so from /etc/pam.d/sshd

Turns out I forgot to rebuild the ozone-runner-scripts image used by ozonescripts. So I can actually just remove /run/nologin in the Dockerfile for ozone-docker-runner here. Problem solved!

$ docker-compose exec scm bash
bash-4.4$ cat /etc/*release
CentOS Linux release 8.4.2105
NAME="CentOS Linux"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
CentOS Linux release 8.4.2105
CentOS Linux release 8.4.2105
bash-4.4$ ls -l /run/nologin
ls: cannot access '/run/nologin': No such file or directory
bash-4.4$ ssh om
Warning: Permanently added 'om,172.26.0.3' (ECDSA) to the list of known hosts.
[hadoop@2a810ad3016a ~]$
[hadoop@2a810ad3016a ~]$ logout
Connection to om closed.
bash-4.4$ ssh datanode
Warning: Permanently added 'datanode,172.26.0.4' (ECDSA) to the list of known hosts.
[hadoop@c847a4f84b94 ~]$
[hadoop@c847a4f84b94 ~]$ logout
Connection to datanode closed.
bash-4.4$
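
The two options discussed above (deleting the flag file, or dropping the pam_nologin line) can be sketched as shell helpers. This is a hedged illustration only; the function names are hypothetical, and the actual change in this PR is a line in the Dockerfile:

```shell
#!/bin/sh
# Option 1: remove the flag file that pam_nologin checks.
# Defaults to /run/nologin, the path named in the discussion above.
remove_nologin_flag() {
  rm -f "${1:-/run/nologin}"
}

# Option 2: print a PAM config file without its pam_nologin.so line.
# Output goes to stdout, so the original file is left untouched.
without_pam_nologin() {
  grep -v 'pam_nologin\.so' "$1"
}
```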

Install diffutils for diff command;
Remove /run/nologin to allow ssh as non-root user.
@smengcl
Contributor Author

smengcl commented Feb 6, 2022

With the patch above, there is seemingly only one issue remaining (all acceptance test suites up to compose/upgrade test have passed locally on my Intel Mac) that might need to be fixed in ozone repo:

Can be reproduced by running compose/xcompat/test.sh solely:

==============================================================================
xcompat-cluster-1.1.0-client-1.0.0-read-1.0.0 :: Read Compatibility
==============================================================================
Key Can Be Read                                                       | PASS |
------------------------------------------------------------------------------
Dir Can Be Listed                                                     | PASS |
------------------------------------------------------------------------------
File Can Be Get                                                       | PASS |
------------------------------------------------------------------------------
xcompat-cluster-1.1.0-client-1.0.0-read-1.0.0 :: Read Compatibility   | PASS |
3 critical tests, 3 passed, 0 failed
3 tests total, 3 passed, 0 failed
==============================================================================
Output:  /tmp/smoketest/xcompat/result/robot-xcompat-xcompat-write-new_client-3.xml
==============================================================================
xcompat-cluster-1.1.0-client-1.0.0-read-1.1.0 :: Read Compatibility
==============================================================================
Key Can Be Read                                                       | FAIL |
'False' should be true.
------------------------------------------------------------------------------
Dir Can Be Listed                                                     | PASS |
------------------------------------------------------------------------------
File Can Be Get                                                       | PASS |
------------------------------------------------------------------------------
xcompat-cluster-1.1.0-client-1.0.0-read-1.1.0 :: Read Compatibility   | FAIL |
3 critical tests, 2 passed, 1 failed
3 tests total, 2 passed, 1 failed
==============================================================================
Output:  /tmp/smoketest/xcompat/result/robot-xcompat-xcompat-write-new_client-4.xml
Log:     /Users/smeng/repo/ozone/hadoop-ozone/dist/target/ozone-1.3.0-SNAPSHOT/compose/xcompat/result/log.html
Report:  /Users/smeng/repo/ozone/hadoop-ozone/dist/target/ozone-1.3.0-SNAPSHOT/compose/xcompat/result/report.html

Note that I have modded compose/testlib.sh to generate the report even when some tests fail. And as a (good) side effect the cluster stays up so I can poke around. Simply add this line:

diff --git a/hadoop-ozone/dist/src/main/compose/testlib.sh b/hadoop-ozone/dist/src/main/compose/testlib.sh
index 9c3d6c49c1..abad8c03dd 100755
--- a/hadoop-ozone/dist/src/main/compose/testlib.sh
+++ b/hadoop-ozone/dist/src/main/compose/testlib.sh
@@ -204,6 +204,7 @@ execute_robot_test(){
   set -e
 
   if [[ ${rc} -gt 0 ]]; then
+    generate_report
     stop_docker_env
   fi
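
The tweak above follows a general shell pattern: capture the test's exit code, produce the report regardless, then propagate the failure. A minimal sketch of that pattern, assuming a POSIX shell; generate_report here is a stand-in stub and run_with_report is a hypothetical name, neither is the actual testlib.sh code:

```shell
#!/bin/sh
# Stand-in for the real testlib.sh function (hypothetical body, for illustration only).
generate_report() { echo "report generated"; }

# Run a test command, always produce the report, and still return the test's exit code.
run_with_report() {
  rc=0
  "$@" || rc=$?   # capture the exit code without tripping `set -e`
  generate_report
  return "$rc"
}
```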

Back to the problem analysis, the test Key Can Be Read seemingly assumed that the file /etc/passwd in container ozone-1.0.0 doesn't change with switching to the new client version, which is no longer the case (centos 8.4 based runner image has a different /etc/passwd than the previous centos 7.6 based one):

New client (centos 8.4):

$ docker exec -it xcompat_new_client_1 bash

bash-4.4$ ozone fs -ls ofs://om/vol1/bucket1/
Found 5 items
drwxrwxrwx   - hadoop hadoop          0 2022-02-06 07:52 ofs://om/vol1/bucket1/dir-1.0.0
drwxrwxrwx   - hadoop hadoop          0 2022-02-06 07:51 ofs://om/vol1/bucket1/dir-1.1.0
-rw-rw-rw-   3 hadoop hadoop        671 2022-02-06 07:52 ofs://om/vol1/bucket1/key-1.0.0
-rw-rw-rw-   3 hadoop hadoop        744 2022-02-06 07:51 ofs://om/vol1/bucket1/key-1.1.0
drwxrwxrwx   - hadoop hadoop          0 2022-02-06 07:51 ofs://om/vol1/bucket1/warmup

bash-4.4$ ls -l /etc/passwd
-rw-r--r-- 1 root root 744 Feb  6 03:22 /etc/passwd

bash-4.4$ cat /etc/*release
CentOS Linux release 8.4.2105
...
bash-4.4$ cat /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:65534:65534:Kernel Overflow User:/:/sbin/nologin
dbus:x:81:81:System message bus:/:/sbin/nologin
systemd-coredump:x:999:997:systemd Core Dumper:/:/sbin/nologin
systemd-resolve:x:193:193:systemd Resolver:/:/sbin/nologin
hadoop:x:1000:100::/opt/hadoop:/bin/bash

Old client 1.0.0 (centos 7.6):

$ docker exec -it xcompat_old_client_1_0_0_1 bash
bash-4.2$ ls -l /etc/passwd
-rw-r--r-- 1 root root 671 Nov  7  2019 /etc/passwd
bash-4.2$ cat /etc/*release
CentOS Linux release 7.6.1810 (Core)
...
bash-4.2$ cat /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:99:99:Nobody:/:/sbin/nologin
systemd-network:x:192:192:systemd Network Management:/:/sbin/nologin
dbus:x:81:81:System message bus:/:/sbin/nologin
hadoop:x:1000:100::/opt/hadoop:/bin/bash

Let's just say the test case assumption (that /etc/passwd doesn't change between old and new ozone client images) is not ideal.

One hacky solution (without touching the ozone repo) might be to intentionally overwrite the /etc/passwd in the new image. And I'm not sure centos 8.4 image would still work with this change.

Proper solution:

  1. Rebuild ozone-1.0.0 client image and push a centos 8.4 based one to Docker Hub, which should be done at some point as well. No further code change in Dockerfile is necessary in this case; and/or:
  2. Use another file rather than /etc/passwd for the testing (e.g. echo testing > /tmp/testkey then upload this instead) -- which needs to be done in ozone repo first.
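
Option 2 can be sketched as follows: write a fixed payload that is identical in every image, and compare the downloaded key against it. This is only an illustration; make_test_key and key_matches are hypothetical names, and the upload and download steps with the ozone client are left out. Checksums are compared with cksum so the check itself does not depend on diffutils:

```shell
#!/bin/sh
# Hypothetical helper: write a fixed payload whose content does not depend on the base image.
make_test_key() {
  printf 'testing\n' > "$1"
}

# Hypothetical check: succeed only if both files have the same CRC and size.
key_matches() {
  [ "$(cksum < "$1")" = "$(cksum < "$2")" ]
}
```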

@adoroszlai
Contributor

Thanks @smengcl for updating the patch. Great job analysing the failure in xcompat.

>   1. Rebuild ozone-1.0.0 client image and push a centos 8.4 based one to Docker Hub, which should be done at some point as well. No further code change in Dockerfile is necessary in this case; and/or:
>   2. Use another file rather than /etc/passwd for the testing (e.g. echo testing > /tmp/testkey then upload this instead) -- which needs to be done in ozone repo first.

I vote for option 2, this is a bug in the test. Note that Ozone is not automatically updated to any new ozone-runner version, it requires an explicit change. We can make any fixes necessary for the image upgrade as part of that commit.

I have encountered another problem in some of the secure environments locally. Kerberos in the new image seems to be more strict about host names (fully qualified vs. not).

  1. In ozonesecure-ha environment SCM hosts have .org domain for some reason. s3/webui.robot fails with: gss_init_sec_context() failed: Server krbtgt/[email protected] not found in Kerberos database. I think this can be fixed by running the S3 tests on s3g host instead of scm1.org.
  2. ozonesecure-mr test fails with: IOException: Couldn't set up IO streams: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: jhs/[email protected], expecting: jhs/[email protected]. I have no idea how to fix this.

@smengcl
Contributor Author

smengcl commented Feb 7, 2022

> I vote for option 2, this is a bug in the test. Note that Ozone is not automatically updated to any new ozone-runner version, it requires an explicit change. We can make any fixes necessary for the image upgrade as part of that commit.

Yup I will file a jira to fix that.

> I have encountered another problem in some of the secure environments locally. Kerberos in the new image seems to be more strict about host names (fully qualified vs. not).
>
>   1. In ozonesecure-ha environment SCM hosts have .org domain for some reason. s3/webui.robot fails with: gss_init_sec_context() failed: Server krbtgt/[email protected] not found in Kerberos database. I think this can be fixed by running the S3 tests on s3g host instead of scm1.org.
>   2. ozonesecure-mr test fails with: IOException: Couldn't set up IO streams: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: jhs/[email protected], expecting: jhs/[email protected]. I have no idea how to fix this.

I can observe the same issue as well once I boot up the ozonesecure-ha compose cluster. After a bunch of reading and checking, I suspect this is a bug in the krb5-libs version centos 8.4 uses (krb5-libs-1.18.2-8.el8.x86_64): scm's krbtgt service seemingly ignores the dns_canonicalize_hostname = false setting in /etc/krb5.conf and insists on using a somewhat truncated FQDN as the instance name.

On another note, I just realized that we should use the centos 7.9 image, which has arm64 images as well, mainly because centos 8 is EOL while centos 7 still has a little over 2 years of official support (maintenance updates until 2024-06-30, after which we still have the choice to move to almalinux). And moving from 7.6 to 7.9 avoids a lot of those compatibility issues (but it was still fun digging into those problems; I learnt a whole lot from them, and they might still become a problem if we move to almalinux).

I will push a commit to use centos 7.9.2009 instead (uses krb5-libs-1.15.1-51.el7_9.x86_64, centos 7.6.1810 uses krb5-libs-1.15.1-34.el7.x86_64). Initial testing locally shows the problem of scm node's weird krbtgt service principal instance name is gone.

…b-init 1.2.5 (was 1.2.0); byteman 4.0.9 (was 4.0.4); async-profiler-2.6 (was 2.0); goofys 0.24.0 (was 0.20.0)
@smengcl smengcl changed the title from "HDDS-6264. Bump centos to 8.4.2105 and some dependencies in ozone-runner" to "HDDS-6264. Bump centos to 7.9.2009 and dependencies/tools in ozone-runner" Feb 7, 2022
@smengcl smengcl changed the title from "HDDS-6264. Bump centos to 7.9.2009 and dependencies/tools in ozone-runner" to "HDDS-6264. Bump centos to 7.9.2009, dependencies and tools in ozone-runner" Feb 7, 2022
@smengcl
Contributor Author

smengcl commented Feb 7, 2022

Filed HDDS-6270 to fix the xcompat acceptance test in Ozone.

@smengcl
Contributor Author

smengcl commented Feb 7, 2022

After some digging, I realized that the reason curl SPNEGO is throwing gss error HTTP/ORG is that it actually first fails to auth with principal HTTP/scm1.org. Then as a fallback it tries HTTP/ORG. The fallback instance name (ORG) is the DNS search domain on the node, which can be verified by running hostname -d. But only the final error is thrown back to the client (curl in this case), which is a bit misleading. Such behavior can be observed in KDC log:

# Ran on scm1.org: kinit -k HTTP/[email protected] -t /etc/security/keytabs/HTTP.keytab

kdc_1        | Feb 07 10:53:46 kdc krb5kdc[7](info): AS_REQ (8 etypes {aes256-cts-hmac-sha1-96(18), aes128-cts-hmac-sha1-96(17), aes256-cts-hmac-sha384-192(20), aes128-cts-hmac-sha256-128(19), DEPRECATED:des3-cbc-sha1(16), DEPRECATED:arcfour-hmac(23), camellia128-cts-cmac(25), camellia256-cts-cmac(26)}) 172.25.0.116: ISSUE: authtime 1644231226, etypes {rep=aes256-cts-hmac-sha1-96(18), tkt=aes256-cts-hmac-sha1-96(18), ses=aes256-cts-hmac-sha1-96(18)}, HTTP/[email protected] for krbtgt/[email protected]

# Ran on scm1.org: curl -v --negotiate -u : -I http://scm1.org:9876/

# First tried HTTP/scm1.org -- no match
kdc_1        | Feb 07 10:54:03 kdc krb5kdc[7](info): TGS_REQ (8 etypes {aes256-cts-hmac-sha1-96(18), aes128-cts-hmac-sha1-96(17), aes256-cts-hmac-sha384-192(20), aes128-cts-hmac-sha256-128(19), DEPRECATED:des3-cbc-sha1(16), DEPRECATED:arcfour-hmac(23), camellia128-cts-cmac(25), camellia256-cts-cmac(26)}) 172.25.0.116: LOOKING_UP_SERVER: authtime 0, etypes {rep=UNSUPPORTED:(0)} HTTP/[email protected] for HTTP/[email protected], Server not found in Kerberos database

# Fallback to HTTP/ORG -- still no match. Thrown to the client (curl)
kdc_1        | Feb 07 10:54:03 kdc krb5kdc[7](info): TGS_REQ (8 etypes {aes256-cts-hmac-sha1-96(18), aes128-cts-hmac-sha1-96(17), aes256-cts-hmac-sha384-192(20), aes128-cts-hmac-sha256-128(19), DEPRECATED:des3-cbc-sha1(16), DEPRECATED:arcfour-hmac(23), camellia128-cts-cmac(25), camellia256-cts-cmac(26)}) 172.25.0.116: UNKNOWN_SERVER: authtime 0, etypes {rep=UNSUPPORTED:(0)} HTTP/[email protected] for krbtgt/[email protected], Server not found in Kerberos database

Then we would observe curl (with verbose) printing this error message:

* gss_init_sec_context() failed: : Server krbtgt/[email protected] not found in Kerberos database
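
To illustrate why a host named scm1.org yields ORG as the fallback instance name while a short name like s3g yields nothing, here is a small string-manipulation sketch. It only mimics the domain part that hostname -d would report, upper-cased as it appears in the error; it is not the actual krb5 logic, and domain_instance is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: print the upper-cased domain part of a host name
# (everything after the first dot), or an empty line for a short name.
domain_instance() {
  host="$1"
  case "$host" in
    *.*) printf '%s\n' "${host#*.}" | tr '[:lower:]' '[:upper:]' ;;
    *)   printf '\n' ;;   # short name: no domain part
  esac
}
```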

However, this issue exists even in the current image version (OZONE_RUNNER_VERSION=20211202-1 with centos 7.6). So this is not likely the root cause of ozonesecure-ha initialization timeout (that I am observing on my Linux box) when running its test.sh. There might be something else blocking it.

Nonetheless, this seems to be silently breaking other HTTP communication between recon and om1/2/3, as I have observed in the KDC log as well. This might need to be fixed in ozone-docker-testkrb5. It should just be a simple line addition in init.sh, followed by similar lines in update-keytabs.sh.

@smengcl
Contributor Author

smengcl commented Feb 7, 2022

The ozonesecure-ha test suite has passed on my Intel Mac, and ./test-all.sh is green as well.

Environment

  • macOS 12.1 (21C52)
  • Docker Desktop 4.4.2 (73305)
$ docker version
Client:
 Cloud integration: v1.0.22
 Version:           20.10.12
 API version:       1.41
 Go version:        go1.16.12
 Git commit:        e91ed57
 Built:             Mon Dec 13 11:46:56 2021
 OS/Arch:           darwin/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.12
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.12
  Git commit:       459d0df
  Built:            Mon Dec 13 11:43:56 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.12
  GitCommit:        7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Steps

  1. In ozone-docker-runner repo, HEAD is at 9eb62af (clean). Ran:
$ docker build -t apache/ozone-runner:dev .
  2. In ozone repo, HEAD is at f757d9929c (clean). Ran:
$ mvn clean install -Pdist -DskipTests -e -Dmaven.javadoc.skip=true -Ddocker.ozone-runner.version=dev
$ cd hadoop-ozone/dist/target/ozone-*-SNAPSHOT/compose/ozonesecure-ha/
$ ./test.sh
  3. Checked that the ozonesecure-ha test suite is indeed running on centos 7.9:
$ cd hadoop-ozone/dist/target/ozone-*-SNAPSHOT/compose/ozonesecure-ha/
$ docker-compose up -d
$ docker-compose exec scm1.org bash
bash-4.2$ cat /etc/*release
CentOS Linux release 7.9.2009 (Core)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

CentOS Linux release 7.9.2009 (Core)
CentOS Linux release 7.9.2009 (Core)

Here is the raw test.sh script output:

https://gist.github.com/smengcl/29dfc9c53d31f11c137533bf4d92e571

And here is packed result directory after the ozonesecure-ha run with centos 7.9 base image:

result.zip

@adoroszlai Do you want to try running ozonesecure-ha on your side again with the latest changeset? Thanks!

@adoroszlai
Contributor

> However, this issue exists even in the current image version (OZONE_RUNNER_VERSION=20211202-1 with centos 7.6).

Do you have any idea why it passes in CI using that image?

@smengcl
Contributor Author

smengcl commented Feb 9, 2022

> > However, this issue exists even in the current image version (OZONE_RUNNER_VERSION=20211202-1 with centos 7.6).
>
> Do you have any idea why it passes in CI using that image?

Hi @adoroszlai. Sorry, I should have clarified. This issue (gss_init_sec_context() failed) is not triggered in s3/webui.robot, because that robot test only accesses the S3 Gateway endpoint (which doesn't have HA and its hostname is always s3g). I found this when manually trying to access scm1.org's web endpoint from the same container.

The issue should only happen when accessing:

  1. SCM web endpoints when SCM HA is enabled; and/or
  2. OM web endpoints when OM HA is enabled

And none of the above cases are covered in the existing robot tests. And to be honest this is not a bug in the Ozone code, but an issue of misconfigured KDC.

This issue happens because we haven't properly assigned SPNEGO service principals (namely those principals beginning with HTTP/) in the KDC for those services when HA is enabled. When the SPNEGO client (curl) tries to access scm1.org, it will try the HTTP/[email protected] principal, while we only have HTTP/[email protected] on the KDC. That is a mismatch.
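
The mismatch can be checked mechanically: build the principal a SPNEGO client would request for a host, then look for it among the principals actually present. A minimal sketch under stated assumptions; both helper names are hypothetical, the realm is passed in as a parameter, and the principal list is expected on stdin (for example from klist -kt run against a keytab on the node):

```shell
#!/bin/sh
# Hypothetical helper: build the SPNEGO service principal for a host and realm.
spnego_principal() {
  printf 'HTTP/%s@%s\n' "$1" "$2"
}

# Hypothetical helper: succeed if the given principal appears in the list read from stdin.
has_principal() {
  grep -qF -- "$1"
}
```

With these, a check such as klist -kt /etc/security/keytabs/HTTP.keytab | has_principal "$(spnego_principal scm1.org "$REALM")" would fail in the situation described above, where REALM is a placeholder for the cluster's realm.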

>   1. In ozonesecure-ha environment SCM hosts have .org domain for some reason. s3/webui.robot fails with: gss_init_sec_context() failed: Server krbtgt/[email protected] not found in Kerberos database. I think this can be fixed by running the S3 tests on s3g host instead of scm1.org.

The other issue you mentioned, which broke in the centos 8.4 image, might have been triggered by something else, likely some new config additions and possibly a default behavior change in hostname canonicalization from krb5 1.15 to 1.18. For krb5 1.18 we might need to set qualify_shortname = "" in krb5.conf to fix this in centos 8.4. But now that we are no longer using centos 8.4, that shouldn't be an immediate concern.
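
When comparing images, it can help to read the relevant settings straight out of krb5.conf. The following sketch assumes simple key = value lines and ignores section boundaries and include directives; krb5_setting is a hypothetical helper, not an existing tool:

```shell
#!/bin/sh
# Hypothetical helper: print the value of a `key = value` setting from a krb5.conf-style file.
# If the key appears more than once, the last occurrence is printed.
krb5_setting() {
  file="$1"
  key="$2"
  sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1
}
```

For example, krb5_setting /etc/krb5.conf dns_canonicalize_hostname could be run in both the centos 7.9 and centos 8.4 containers to confirm the setting is the same in each.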

@smengcl
Contributor Author

smengcl commented Feb 9, 2022

To put that into reproducible steps:

  • ozone-docker-runner HEAD at 9eb62af (the latest)
  • ozone HEAD at f757d9929c (master branch. no longer the latest but should still be relevant)

Prepare

# change directory to ozone project root
$ mvn clean install -Pdist -DskipTests -e -Dmaven.javadoc.skip=true -Ddocker.ozone-runner.version=dev
$ cd ./hadoop-ozone/dist/target/ozone-1.3.0-SNAPSHOT/compose/ozonesecure-ha
$ docker-compose up -d
# Wait for > 20 sec for the cluster to boot up
$ docker-compose exec scm1.org bash
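
Instead of waiting a fixed 20 seconds, the boot-up wait in the steps above could be replaced by polling. A minimal retry sketch, assuming a POSIX shell; wait_for is a hypothetical helper, and the readiness command to pass to it is left to the reader since the discussion does not specify one:

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or the attempts run out.
# Usage: wait_for <attempts> <delay_seconds> <command...>
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```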

In scm1.org

bash-4.2$ hostname
scm1.org
bash-4.2$ kinit -k HTTP/[email protected] -t /etc/security/keytabs/HTTP.keytab
bash-4.2$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: HTTP/[email protected]

Valid starting     Expires            Service principal
02/09/22 20:48:34  02/10/22 20:48:34  krbtgt/[email protected]
	renew until 02/16/22 20:48:34

Auth via SPNEGO to S3G web endpoint (SUCCESS)

Got 302 (as expected)

bash-4.2$ curl -v --negotiate -u : http://s3g:9878/
* About to connect() to s3g port 9878 (#0)
*   Trying 172.25.0.114...
* Connected to s3g (172.25.0.114) port 9878 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: s3g:9878
> Accept: */*
>
< HTTP/1.1 302 Found
< Date: Wed, 09 Feb 2022 20:49:21 GMT
< Cache-Control: no-cache
< Expires: Wed, 09 Feb 2022 20:49:21 GMT
< Date: Wed, 09 Feb 2022 20:49:21 GMT
< Pragma: no-cache
< Content-Type: text/plain;charset=utf-8
< X-Content-Type-Options: nosniff
< X-FRAME-OPTIONS: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< Location: http://s3g:9878/static/
< Content-Length: 0
<
* Connection #0 to host s3g left intact

Got 401, then 200 (as expected)

bash-4.2$ curl -v --negotiate -u : -I http://s3g:9878/static/index.html
* About to connect() to s3g port 9878 (#0)
*   Trying 172.25.0.114...
* Connected to s3g (172.25.0.114) port 9878 (#0)
> HEAD /static/index.html HTTP/1.1
> User-Agent: curl/7.29.0
> Host: s3g:9878
> Accept: */*
>
< HTTP/1.1 401 Authentication required
HTTP/1.1 401 Authentication required
< X-Content-Type-Options: nosniff
X-Content-Type-Options: nosniff
< X-FRAME-OPTIONS: SAMEORIGIN
X-FRAME-OPTIONS: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
X-XSS-Protection: 1; mode=block
< WWW-Authenticate: Negotiate
WWW-Authenticate: Negotiate
< Set-Cookie: hadoop.auth=; Path=/; HttpOnly
Set-Cookie: hadoop.auth=; Path=/; HttpOnly
< Cache-Control: must-revalidate,no-cache,no-store
Cache-Control: must-revalidate,no-cache,no-store
< Content-Type: text/html;charset=iso-8859-1
Content-Type: text/html;charset=iso-8859-1
< Content-Length: 459
Content-Length: 459

<
* Connection #0 to host s3g left intact
* Issue another request to this URL: 'http://s3g:9878/static/index.html'
* Found bundle for host s3g: 0x1c78fa0
* Re-using existing connection! (#0) with host s3g
* Connected to s3g (172.25.0.114) port 9878 (#0)
* Server auth using GSS-Negotiate with user ''
> HEAD /static/index.html HTTP/1.1
> Authorization: Negotiate YIICZAYJKoZIhvcSAQICAQBuggJTMIICT6ADAgEFoQMCAQ6iBwMFACAAAACjggFjYYIBXzCCAVugAwIBBaENGwtFWEFNUExFLkNPTaIWMBSgAwIBA6ENMAsbBEhUVFAbA3MzZ6OCASswggEnoAMCARKhAwIBAaKCARkEggEVA1aHDpCGx0z92AGtdDVaTTtYLV6kx71gw7ctlXzjpZ7Qi4ovEbtdrTahf9tsff6o+uOH8QIa5t0FxCAjOh7XqxpW7iQqTXIGXni3zOXzvb1vAvvOHafKNwkL7I3msMh2EqG09QIHgAPaccjXVr3/fX5CJjyfY+Hl1bmfrcHRpreXHgx98JIZusIgbYEYqGWxMRpzRNdkOScxTUj1OXY6V5urZqjJGbl8UdUZW1sW9V5ZO/IPgrYglP6PddGt5khii6UgqjeX3gq3XAniQT0PGiYOjUa9Rmr4PjZginT13mYjagf09tHBkmzFOUSPRWttJWWxqVVJZykF1uKsdWutDPsyHZm10EK8r0I0rcTlqm3kOl/6U6SB0jCBz6ADAgESooHHBIHEM20mz4GQOFbbpotjrdAJO9mQa4kTZJ0hLYjq3QzNAqQhHppYV/FEp69MXLBZNlJCBMcp1bHMfbR6TBNnSlOsCJP9C+udYIjDPaAuAY/EJXxktg/lMlzl1eZSBV+q66Dm7pJq0sxk/nZLv/plUJYZGqrfJTFj1JeFcVN1ukc6gGGqR4doFQq/pbwUSzTbCvfzOmH65GPsbXEyTVt7Lx5KhdZ5wGlh2UuQ6073sLNz7uziVPYv4LAzHJ1OjQjN7JokjX3EyQ==
> User-Agent: curl/7.29.0
> Host: s3g:9878
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Wed, 09 Feb 2022 20:50:13 GMT
Date: Wed, 09 Feb 2022 20:50:13 GMT
< Content-Type: text/html
Content-Type: text/html
< X-Content-Type-Options: nosniff
X-Content-Type-Options: nosniff
< X-FRAME-OPTIONS: SAMEORIGIN
X-FRAME-OPTIONS: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
X-XSS-Protection: 1; mode=block
< WWW-Authenticate: Negotiate YGoGCSqGSIb3EgECAgIAb1swWaADAgEFoQMCAQ+iTTBLoAMCARKiRARCtXuBXd5k+Bf8joVJCUHitxob0QQ6HN0Q9zhCyqZaNAV9qzWKbbArpj5UE8u2wy82jitx4ORmbNskWQjX0xNwThSE
WWW-Authenticate: Negotiate YGoGCSqGSIb3EgECAgIAb1swWaADAgEFoQMCAQ+iTTBLoAMCARKiRARCtXuBXd5k+Bf8joVJCUHitxob0QQ6HN0Q9zhCyqZaNAV9qzWKbbArpj5UE8u2wy82jitx4ORmbNskWQjX0xNwThSE
< Set-Cookie: hadoop.auth="u=root&p=HTTP/[email protected]&t=kerberos&e=1644475813277&s=oxSVRo1+T5bFFmluutqHdv/0vRm1xGcApucXw9gix/s="; Path=/; HttpOnly
Set-Cookie: hadoop.auth="u=root&p=HTTP/[email protected]&t=kerberos&e=1644475813277&s=oxSVRo1+T5bFFmluutqHdv/0vRm1xGcApucXw9gix/s="; Path=/; HttpOnly
< Last-Modified: Wed, 09 Feb 2022 12:15:20 GMT
Last-Modified: Wed, 09 Feb 2022 12:15:20 GMT
< Accept-Ranges: bytes
Accept-Ranges: bytes
< Content-Length: 3106
Content-Length: 3106

<
* Closing connection 0

Auth via SPNEGO to scm1.org web endpoint (401 only, FAIL)

bash-4.2$ curl -v --negotiate -u : http://scm1.org:9876/
* About to connect() to scm1.org port 9876 (#0)
*   Trying 172.25.0.116...
* Connected to scm1.org (172.25.0.116) port 9876 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: scm1.org:9876
> Accept: */*
>
< HTTP/1.1 401 Authentication required
< Pragma: no-cache
< X-Content-Type-Options: nosniff
< X-FRAME-OPTIONS: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< Pragma: no-cache
< X-Content-Type-Options: nosniff
< X-FRAME-OPTIONS: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
* gss_init_sec_context() failed: : Server krbtgt/[email protected] not found in Kerberos database
< WWW-Authenticate: Negotiate
< Set-Cookie: hadoop.auth=; Path=/; HttpOnly
< Cache-Control: must-revalidate,no-cache,no-store
< Content-Type: text/html;charset=iso-8859-1
< Content-Length: 442
<
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 401 Authentication required</title>
</head>
<body><h2>HTTP ERROR 401 Authentication required</h2>
<table>
<tr><th>URI:</th><td>/</td></tr>
<tr><th>STATUS:</th><td>401</td></tr>
<tr><th>MESSAGE:</th><td>Authentication required</td></tr>
<tr><th>SERVLET:</th><td>org.eclipse.jetty.servlet.DefaultServlet-37986daf</td></tr>
</table>

</body>
</html>
* Connection #0 to host scm1.org left intact

Same response when trying to access http://scm1.org:9876/conf, http://scm1.org:9876/jmx, etc. The root cause is explained in the previous comment.
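For anyone repeating these checks, the manual curl calls above could be wrapped in a small helper along the following lines. This is only a sketch: `probe_spnego` and the `CURL_BIN` override are hypothetical names introduced here, not part of the Ozone compose scripts, and it assumes a valid Kerberos ticket is already present in the container.

```shell
# Probe several HTTP endpoints with SPNEGO and print "<status> <url>" per line.
# CURL_BIN is a hypothetical override hook so the function can be exercised
# without a running cluster; by default the real curl is used.
probe_spnego() {
  local base="$1"
  shift
  local path status
  for path in "$@"; do
    # --negotiate -u : makes curl authenticate with the current Kerberos ticket;
    # -w '%{http_code}' prints only the status code of the final response.
    status="$("${CURL_BIN:-curl}" -s -o /dev/null --negotiate -u : -w '%{http_code}' "${base}${path}")"
    echo "${status} ${base}${path}"
  done
}
```

Calling `probe_spnego http://scm1.org:9876 / /conf /jmx` inside the container would then print one status line per endpoint, which makes it easy to see whether every path still ends in 401.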

@adoroszlai
Contributor

Thanks @smengcl for continuing work on this. Re-tested ozonesecure-ha, looks OK now.

2. ozonesecure-mr test fails with: IOException: Couldn't set up IO streams: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: jhs/[email protected], expecting: jhs/[email protected]. I have no idea how to fix this.

I'm still running into this problem with ozonesecure-mr.

I have posted a patch for Ozone to allow using custom ozone-runner image (not just version, but image name). You could push your custom image to a Docker Hub repo of your own, then test it with full acceptance/kubernetes tests with my Ozone patch in your fork of Ozone.

@smengcl
Contributor Author

smengcl commented Feb 11, 2022

@adoroszlai Thanks for the reply.

I ran compose/ozonesecure-mr/test.sh on the latest master branch (f7e1fa0ec) on both my Intel Mac and my x64 (Arch) Linux box, and can't reproduce the error you are seeing. The test suite passed for me. In fact, I have never seen this Server has invalid Kerberos principal: jhs/[email protected], expecting: jhs/[email protected] error in my testing environments, though it looks like yet another Kerberos hostname canonicalization issue that might be related to the Docker network environment.
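To illustrate what a hostname canonicalization mismatch of this kind can look like, here is a minimal sketch. It assumes the usual Hadoop convention of configuring principals as patterns such as `service/_HOST@REALM`, where `_HOST` is replaced by the (lower-cased) hostname the client resolves for the server. The function names, hostnames and realm are hypothetical and are not taken from the failing run.

```shell
# Expand a Hadoop-style principal pattern by substituting _HOST with the
# lower-cased hostname (assumed convention; names here are illustrative).
expand_principal() {
  local pattern="$1" host
  host="$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
  printf '%s\n' "$pattern" | sed "s/_HOST/${host}/"
}

# Succeed only if the principal the server presents equals the one the
# client derives from the pattern and the hostname it resolved.
principal_matches() {
  local advertised="$1" pattern="$2" host="$3"
  [ "$advertised" = "$(expand_principal "$pattern" "$host")" ]
}
```

Under these assumptions, `principal_matches 'svc/node1@REALM.TEST' 'svc/_HOST@REALM.TEST' 'node1'` succeeds, whereas resolving the same server to a different name (for example one with a Docker network suffix appended) makes the comparison fail, which would surface as an "invalid Kerberos principal" style error.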

  1. Could you apply the testlib.sh patch below (so that the reports are still generated when a test suite fails), run compose/ozonesecure-mr/test.sh, and send me your Docker logs under result/ so I can take a look?
diff --git a/hadoop-ozone/dist/src/main/compose/testlib.sh b/hadoop-ozone/dist/src/main/compose/testlib.sh
index 9c3d6c49c1..abad8c03dd 100755
--- a/hadoop-ozone/dist/src/main/compose/testlib.sh
+++ b/hadoop-ozone/dist/src/main/compose/testlib.sh
@@ -204,6 +204,7 @@ execute_robot_test(){
   set -e
 
   if [[ ${rc} -gt 0 ]]; then
+    generate_report
     stop_docker_env
   fi
  2. Could you confirm that the Server has invalid Kerberos principal: jhs/[email protected], expecting: jhs/[email protected] error doesn't happen when you run compose/ozonesecure-mr/test.sh on the previous centos 7.6 base image?
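The control flow that the testlib.sh diff above introduces can be sketched in isolation as follows. `generate_report` and `stop_docker_env` are stubbed here with hypothetical stand-ins that only print a message, and `run_with_report` is an illustrative wrapper, not a function from testlib.sh.

```shell
# Hypothetical stand-ins for the helpers named in the diff; they only log.
generate_report() { echo "report generated"; }
stop_docker_env() { echo "environment stopped"; }

# Run a test command; on failure, generate the report first and then tear
# the environment down, mirroring the order of calls in the diff.
run_with_report() {
  local rc=0
  "$@" || rc=$?
  if [ "$rc" -gt 0 ]; then
    generate_report
    stop_docker_env
  fi
  return "$rc"
}
```

For example, `run_with_report false` prints both messages and returns a non-zero status, while `run_with_report true` prints nothing.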

@smengcl
Contributor Author

smengcl commented Feb 11, 2022

I have posted a patch for Ozone to allow using custom ozone-runner image (not just version, but image name). You could push your custom image to a Docker Hub repo of your own, then test it with full acceptance/kubernetes tests with my Ozone patch in your fork of Ozone.

Thanks for the work! Triggered here https://github.com/smengcl/hadoop-ozone/commits/HDDS-6293

Contributor

@adoroszlai adoroszlai left a comment


@smengcl
Contributor Author

smengcl commented Feb 11, 2022

Thanks @adoroszlai for the review. I will merge this shortly.

@smengcl smengcl merged commit 517c187 into apache:master Feb 11, 2022
@smengcl smengcl deleted the HDDS-6264 branch February 29, 2024 20:30