
Conversation

@adoroszlai (Contributor)

What changes were proposed in this pull request?

  1. Introduce a new sample docker-compose environment with a test script geared towards running upgrades. Currently it only performs a smoketest: write some keys with the old version, read them with the new one (a rough sketch of the flow is shown after this list).

  2. Add a script for performing the workaround steps for HDDS-3499 during upgrade. This is executed using the ozone-runner docker image, which now comes with ldb.
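
A rough sketch of the flow such a test follows; the image tags, compose service name, and key paths below are illustrative assumptions, not the actual script:

```bash
# Sketch of the upgrade smoketest flow (image tags, service and key names are assumptions).
set -euo pipefail

export OZONE_IMAGE=apache/ozone:0.5.0-beta      # assumed "old" version
docker-compose up -d

# Write a key with the old version.
docker-compose exec -T scm ozone sh volume create /vol1
docker-compose exec -T scm ozone sh bucket create /vol1/bucket1
docker-compose exec -T scm ozone sh key put /vol1/bucket1/key1 /etc/passwd

docker-compose down                             # data survives in bind-mounted ./data dirs

# (the HDDS-3499 workaround script would run here, between the two versions)

export OZONE_IMAGE=apache/ozone:0.6.0           # assumed "new" version
docker-compose up -d

# Read the key back with the new version.
docker-compose exec -T scm ozone sh key get /vol1/bucket1/key1 /tmp/key1
docker-compose down
```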

https://issues.apache.org/jira/browse/HDDS-3855

How was this patch tested?

Executed upgrade acceptance test locally and on GitHub.

https://github.com/adoroszlai/hadoop-ozone/runs/815608054

@adoroszlai self-assigned this Jun 28, 2020
@avijayanhwx (Contributor) left a comment

Thanks @adoroszlai, this looks great! We can use this as a reference for API change tests, finalization etc.

@elek (Member) left a comment

Thanks for working on this, @adoroszlai.

Overall it looks good to me, and it's a really impressive approach. I have a few comments -- none of them is a blocker, but I'd like to discuss some technical details...

  1. Can you please help me understand why you removed -f "${compose_file}"?

  2. The fixed IP / dedicated network in the docker-compose file seems unnecessary in this cluster (IMHO).

  3. It seems to be a big restriction that we can't start multiple datanodes on the same file system without configuring the datanode path. This is why you need the dn1..dn3 directories. I am wondering if we can provide a generic solution to this. Maybe we could support ${env...} notation when setting the datanode directory?

  4. You create external volume directories, but /data is already a volume inside the docker containers. If you use a simple docker-compose stop instead of down, it can be reused. Did you consider using this approach?

Why do you prefer external volumes? (I found two arguments: easier to debug and easier to execute commands when the cluster is down. But I'm interested whether you had any other motivations...)

@adoroszlai (Contributor, Author)

Overall it looks good to me, and it's a really impressive approach. I have a few comments -- none of them is a blocker, but I'd like to discuss some technical details...

Thanks for taking a look. I held off on merging exactly to have this kind of discussion. ;)

  1. Can you please help me understand why you removed -f "${compose_file}"?

Each -f accepts only a single filename, so using the same command with one or more files is easier with the COMPOSE_FILE approach. Initially I used two separate files (including the one from the ozone env), so I needed this fix, but then abandoned that approach. This part of the change could be extracted to a separate issue if you prefer to simplify this one a bit. (It allows you to run ozone/test.sh with monitoring enabled, so I'd rather not drop it completely.)
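
For illustration, the two invocation styles compared (file names here are just examples); with COMPOSE_FILE the command itself never changes, callers only extend the colon-separated list in the environment:

```bash
# Each -f flag takes exactly one file, so an optional overlay means building
# the argument list conditionally:
docker-compose -f docker-compose.yaml -f monitoring.yaml up -d

# With the COMPOSE_FILE environment variable the command stays the same;
# multiple files are joined with COMPOSE_PATH_SEPARATOR (':' by default):
export COMPOSE_FILE=docker-compose.yaml:monitoring.yaml
docker-compose up -d
```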

  2. The fixed IP / dedicated network in the docker-compose file seems unnecessary in this cluster (IMHO).
  4. You create external volume directories, but /data is already a volume inside the docker containers. If you use a simple docker-compose stop instead of down, it can be reused. Did you consider using this approach?

Why do you prefer external volumes? (I found two arguments: easier to debug and easier to execute commands when the cluster is down. But I'm interested whether you had any other motivations...)

After stop/start this is what ozone version prints:

                  //////////////
               ////////////////////
            ////////     ////////////////
           //////      ////////////////
          /////      ////////////////  /
         /////            ////////   ///
         ////           ////////    /////
        /////         ////////////////
        /////       ////////////////   //
         ////     ///////////////   /////
         /////  ///////////////     ////
          /////       //////      /////
           //////   //////       /////
             ///////////     ////////
               //////  ////////////
               ///   //////////
              /    0.5.0-beta(Crater Lake)

Source code repository [email protected]:apache/hadoop-ozone.git -r 9b4f8fd49fa15946994bccc6c6ac50a560cfb0ea
Compiled by dchitlangia on 2020-03-16T00:54Z
Compiled with protoc 2.5.0
From source with checksum 4cde4c7a7aaa250bfbaf58220cb8e2c

Using HDDS 0.5.0-beta
Source code repository [email protected]:apache/hadoop-ozone.git -r 9b4f8fd49fa15946994bccc6c6ac50a560cfb0ea
Compiled by dchitlangia on 2020-03-16T00:53Z
Compiled with protoc 2.5.0
From source with checksum 9df32efd56424ab869a0acd0124e4bf5

So docker-compose down/up is needed because changes to the compose file (docker image, etc.) are not picked up with stop/start, and we need different images before and after the upgrade.

That's the reason for both the volumes and the network settings. I had started out without the network/IP settings, but the containers did not always get the same address after down/up, nor would they reuse volumes.
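
A minimal sketch of the kind of compose fragment this leads to (the subnet, address, and directory names are made up for illustration, not taken from this PR): the bind-mounted host directory is what survives docker-compose down/up, and the fixed address guarantees the container comes back at the same IP after being recreated.

```bash
# Illustrative only: subnet, IP address, and paths are assumptions.
cat > docker-compose.yaml <<'EOF'
services:
  dn1:
    image: ${OZONE_IMAGE}
    volumes:
      - ./data/dn1:/data            # external (bind-mounted) volume, reused across down/up
    networks:
      ozone_net:
        ipv4_address: 172.25.0.101  # fixed address, identical after recreation
networks:
  ozone_net:
    ipam:
      config:
        - subnet: 172.25.0.0/24
EOF
```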

  3. It seems to be a big restriction that we can't start multiple datanodes on the same file system without configuring the datanode path. This is why you need the dn1..dn3 directories. I am wondering if we can provide a generic solution to this. Maybe we could support ${env...} notation when setting the datanode directory?

That would be nice; I think we can explore it later.
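
To make the limitation and the suggestion concrete, a hypothetical sketch (the directory names are illustrative, and the ${env...} expansion shown in the comment does not exist today):

```bash
# Today: each datanode needs its own host directory, mounted at the same
# in-container path, hence the dn1..dn3 directories.
for i in 1 2 3; do
  mkdir -p "data/dn${i}"
done

# Hypothetical future form: if the datanode directory setting supported an
# ${env...} placeholder, e.g.
#   hdds.datanode.dir=/data/${env.HOSTNAME}
# all datanodes could share one compose service definition and be started with
# `docker-compose up --scale datanode=3`.
```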

@elek (Member) commented Jul 17, 2020

I had started out without the network/IP settings, but the containers did not always get the same address after down/up, nor would they reuse volumes.

You mean that I can't start the upgrade cluster with different IP addresses? That seems to be a serious bug which should be fixed. But we can test it with the same approach: a hard-coded network stack and two different docker-compose files with different IP addresses.

This part of the change could be extracted to a separate issue if you prefer to simplify this one a bit

I am fine with including it; it's not a big change. It's just good to have the explanation here.

@elek (Member) commented Jul 17, 2020

Other random thoughts:

I plan to enable acceptance tests for k8s cluster definitions, too.

  1. I would like to be sure that those configs are up-to-date and working.
  2. Kubernetes has better tooling for more complex clusters (e.g. easy SSL certificate management).
  3. While docker-compose is easy to use, it has some strong limitations. K8s definitions have better flexibility (especially together with the flekszible tool).

@elek (Member) left a comment

Thanks for the patch (and the discussion), @adoroszlai.

I am merging it now.

@elek merged commit 9b13ab6 into apache:master Jul 17, 2020
@adoroszlai deleted the HDDS-3855 branch July 17, 2020 12:53
@adoroszlai (Contributor, Author)

Thanks @avijayanhwx for the review, and @elek for reviewing and merging this.

errose28 added a commit to errose28/ozone that referenced this pull request Jul 17, 2020
* master:
  HDDS-3855. Add upgrade smoketest (apache#1142)
  HDDS-3964. Ratis config key mismatch (apache#1204)
  HDDS-3612. Allow mounting bucket under other volume (apache#1104)
  HDDS-3926. OM Token Identifier table should use in-house serialization. (apache#1182)
  HDDS-3824: OM read requests should make SCM#refreshPipeline outside BUCKET_LOCK (apache#1164)
errose28 added a commit to errose28/ozone that referenced this pull request Jul 17, 2020
…erface

* upstream/master:
  HDDS-3855. Add upgrade smoketest (apache#1142)
  HDDS-3964. Ratis config key mismatch (apache#1204)
  HDDS-3612. Allow mounting bucket under other volume (apache#1104)
  HDDS-3926. OM Token Identifier table should use in-house serialization. (apache#1182)
  HDDS-3824: OM read requests should make SCM#refreshPipeline outside BUCKET_LOCK (apache#1164)
  HDDS-3966. Disable flaky TestOMRatisSnapshots
errose28 added a commit to errose28/ozone that referenced this pull request Jul 20, 2020
* master:
  HDDS-3984. Support filter and search the columns in recon UI (apache#1218)
  HDDS-3806. Support recognize aws v2 Authorization header. (apache#1098)
  HDDS-3955. Unable to list intermediate paths on keys created using S3G. (apache#1196)
  HDDS-3741. Reload old OM state if Install Snapshot from Leader fails (apache#1129)
  HDDS-3965. SCM failed to start up for duplicated pipeline detected. (apache#1210)
  HDDS-3855. Add upgrade smoketest (apache#1142)
  HDDS-3964. Ratis config key mismatch (apache#1204)
  HDDS-3612. Allow mounting bucket under other volume (apache#1104)
  HDDS-3926. OM Token Identifier table should use in-house serialization. (apache#1182)
  HDDS-3824: OM read requests should make SCM#refreshPipeline outside BUCKET_LOCK (apache#1164)
  HDDS-3966. Disable flaky TestOMRatisSnapshots
errose28 pushed a commit to errose28/ozone that referenced this pull request Jul 21, 2020
errose28 added a commit to errose28/ozone that referenced this pull request Jul 21, 2020
* add-deleted-block-table: (63 commits)
  Make block iterator tests use deleted blocks table, and remove the now unused #deleted#
  Replace uses of #deleted# key prefix with access to new deleted blocks table
  Add deleted blocks table to base level DB wrappers
  Have block deleting service test look for #deleted# keys in metadata table
  Move block delete to correct table and remove debugging print statement
  Import schema version when importing container data from export
  HDDS-3984. Support filter and search the columns in recon UI (apache#1218)
  HDDS-3806. Support recognize aws v2 Authorization header. (apache#1098)
  HDDS-3955. Unable to list intermediate paths on keys created using S3G. (apache#1196)
  HDDS-3741. Reload old OM state if Install Snapshot from Leader fails (apache#1129)
  Move new key value block iterator implementation and tests to new interface
  Fix checkstyle violations
  HDDS-3965. SCM failed to start up for duplicated pipeline detected. (apache#1210)
  Update comments
  Add comments on added helper method
  Remove seekToLast() from iterator interface, implementation, and tests
  Add more robust unit test with alternating key matches
  All unit tests pass after allowing keys with deleted and deleting prefixes to be made
  HDDS-3855. Add upgrade smoketest (apache#1142)
  HDDS-3964. Ratis config key mismatch (apache#1204)
  ...
ChenSammi pushed a commit that referenced this pull request Jul 22, 2020
timmylicheng pushed a commit that referenced this pull request Aug 6, 2020
rakeshadr pushed a commit to rakeshadr/hadoop-ozone that referenced this pull request Sep 3, 2020