
Conversation

@adoroszlai (Contributor)

What changes were proposed in this pull request?

  1. Introduce a new sample docker-compose environment with a test script geared towards running upgrades. Currently it only performs a smoketest: write some keys with the old version, read them with the new one (a rough sketch of the flow is shown after this list).

  2. Add a script for performing the workaround steps for HDDS-3499 during upgrade. This is executed using the ozone-runner docker image, which now comes with ldb.
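
A rough sketch of the flow such a test follows; the image tags, compose service name, and key paths below are illustrative assumptions, not the actual script:

```bash
# Sketch of the upgrade smoketest flow (image tags, service and key names are assumptions).
set -euo pipefail

export OZONE_IMAGE=apache/ozone:0.5.0-beta      # assumed "old" version
docker-compose up -d

# Write a key with the old version.
docker-compose exec -T scm ozone sh volume create /vol1
docker-compose exec -T scm ozone sh bucket create /vol1/bucket1
docker-compose exec -T scm ozone sh key put /vol1/bucket1/key1 /etc/passwd

docker-compose down                             # data survives in bind-mounted ./data dirs

# (the HDDS-3499 workaround script would run here, between the two versions)

export OZONE_IMAGE=apache/ozone:0.6.0           # assumed "new" version
docker-compose up -d

# Read the key back with the new version.
docker-compose exec -T scm ozone sh key get /vol1/bucket1/key1 /tmp/key1
docker-compose down
```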

https://issues.apache.org/jira/browse/HDDS-3855

How was this patch tested?

Executed upgrade acceptance test locally and on GitHub.

https://github.com/adoroszlai/hadoop-ozone/runs/815608054

@adoroszlai self-assigned this Jun 28, 2020
@avijayanhwx (Contributor) left a comment

Thanks @adoroszlai, this looks great! We can use this as a reference for API change tests, finalization etc.

@elek (Member) left a comment

Thanks for working on this, @adoroszlai.

Overall it looks good to me, and it's a really impressive approach. I have a few comments -- none of them is a blocker, but I'd like to discuss some technical details...

  1. Can you please help me understand why you removed -f "${compose_file}"?

  2. The fixed IP / dedicated network in the docker-compose file seems unnecessary in this cluster (IMHO).

  3. It seems to be a big restriction that we can't start multiple datanodes on the same file system without configuring the datanode path. This is why you need the dn1..dn3 directories. I am wondering if we can provide a generic solution to this. Maybe we could support ${env...} notation when setting the datanode directory?

  4. You create external volume directories, but /data is already a volume inside the docker containers. If you use a simple docker-compose stop instead of down, it can be reused. Did you consider using this approach?

Why do you prefer external volumes? (I found two arguments: easier to debug and easier to execute commands when the cluster is down. But I'm interested whether you had any other motivations...)

@adoroszlai (Contributor, Author)

Overall it looks good to me, and it's a really impressive approach. I have a few comments -- none of them is a blocker, but I'd like to discuss some technical details...

Thanks for taking a look. I held off on merging exactly to have this kind of discussion. ;)

  1. Can you please help me understand why you removed -f "${compose_file}"?

Each -f accepts only a single filename, so using the same command with one or more files is easier with the COMPOSE_FILE approach. Initially I used two separate files (including the one from the ozone env), so I needed this fix, but then abandoned that approach. This part of the change could be extracted to a separate issue if you prefer to simplify this one a bit. (It allows you to run ozone/test.sh with monitoring enabled, so I'd rather not drop it completely.)
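
For illustration, the two invocation styles compared (file names here are just examples); with COMPOSE_FILE the command itself never changes, callers only extend the colon-separated list in the environment:

```bash
# Each -f flag takes exactly one file, so an optional overlay means building
# the argument list conditionally:
docker-compose -f docker-compose.yaml -f monitoring.yaml up -d

# With the COMPOSE_FILE environment variable the command stays the same;
# multiple files are joined with COMPOSE_PATH_SEPARATOR (':' by default):
export COMPOSE_FILE=docker-compose.yaml:monitoring.yaml
docker-compose up -d
```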

  2. The fixed IP / dedicated network in the docker-compose file seems unnecessary in this cluster (IMHO).
  4. You create external volume directories, but /data is already a volume inside the docker containers. If you use a simple docker-compose stop instead of down, it can be reused. Did you consider using this approach?

Why do you prefer external volumes? (I found two arguments: easier to debug and easier to execute commands when the cluster is down. But I'm interested whether you had any other motivations...)

After stop/start this is what ozone version prints:

                  //////////////
               ////////////////////
            ////////     ////////////////
           //////      ////////////////
          /////      ////////////////  /
         /////            ////////   ///
         ////           ////////    /////
        /////         ////////////////
        /////       ////////////////   //
         ////     ///////////////   /////
         /////  ///////////////     ////
          /////       //////      /////
           //////   //////       /////
             ///////////     ////////
               //////  ////////////
               ///   //////////
              /    0.5.0-beta(Crater Lake)

Source code repository [email protected]:apache/hadoop-ozone.git -r 9b4f8fd49fa15946994bccc6c6ac50a560cfb0ea
Compiled by dchitlangia on 2020-03-16T00:54Z
Compiled with protoc 2.5.0
From source with checksum 4cde4c7a7aaa250bfbaf58220cb8e2c

Using HDDS 0.5.0-beta
Source code repository [email protected]:apache/hadoop-ozone.git -r 9b4f8fd49fa15946994bccc6c6ac50a560cfb0ea
Compiled by dchitlangia on 2020-03-16T00:53Z
Compiled with protoc 2.5.0
From source with checksum 9df32efd56424ab869a0acd0124e4bf5

So docker-compose down/up is needed because changes to the compose file (docker image, etc.) are not picked up with stop/start, and we need different images before and after the upgrade.

That's the reason for both the volumes and the network settings. I had started out without the network/IP settings, but the containers did not always get the same address after down/up, nor would they reuse volumes.
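
A minimal sketch of the kind of compose fragment this leads to (the subnet, address, and directory names are made up for illustration, not taken from this PR): the bind-mounted host directory is what survives docker-compose down/up, and the fixed address guarantees the container comes back at the same IP after being recreated.

```bash
# Illustrative only: subnet, IP address, and paths are assumptions.
cat > docker-compose.yaml <<'EOF'
services:
  dn1:
    image: ${OZONE_IMAGE}
    volumes:
      - ./data/dn1:/data            # external (bind-mounted) volume, reused across down/up
    networks:
      ozone_net:
        ipv4_address: 172.25.0.101  # fixed address, identical after recreation
networks:
  ozone_net:
    ipam:
      config:
        - subnet: 172.25.0.0/24
EOF
```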

  3. It seems to be a big restriction that we can't start multiple datanodes on the same file system without configuring the datanode path. This is why you need the dn1..dn3 directories. I am wondering if we can provide a generic solution to this. Maybe we could support ${env...} notation when setting the datanode directory?

That would be nice; I think we can explore it later.
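
To make the limitation and the suggestion concrete, a hypothetical sketch (the directory names are illustrative, and the ${env...} expansion shown in the comment does not exist today):

```bash
# Today: each datanode needs its own host directory, mounted at the same
# in-container path, hence the dn1..dn3 directories.
for i in 1 2 3; do
  mkdir -p "data/dn${i}"
done

# Hypothetical future form: if the datanode directory setting supported an
# ${env...} placeholder, e.g.
#   hdds.datanode.dir=/data/${env.HOSTNAME}
# all datanodes could share one compose service definition and be started with
# `docker-compose up --scale datanode=3`.
```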

@elek (Member) commented Jul 17, 2020

I had started out without the network/IP settings, but the containers did not always get the same address after down/up, nor would they reuse volumes.

You mean that I can't start the upgrade cluster with different IP addresses? That seems to be a serious bug which should be fixed. But we can test it with the same approach: a hard-coded network stack and two different docker-compose files with different IP addresses.

This part of the change could be extracted to a separate issue if you prefer to simplify this one a bit

I am fine with including it; it's not a big change. It's just good to have the explanation here.

@elek (Member) commented Jul 17, 2020

Other random thoughts:

I plan to enable acceptance tests for k8s cluster definitions, too.

  1. I would like to be sure that those configs are up-to-date and working.
  2. Kubernetes has better tooling for more complex clusters (e.g. easy SSL certificate management).
  3. While docker-compose is easy to use, it has some strong limitations. K8s definitions have better flexibility (especially together with the flekszible tool).

@elek (Member) left a comment

Thanks for the patch (and the discussion), @adoroszlai.

I am merging it now.

@elek merged commit 9b13ab6 into apache:master Jul 17, 2020
@adoroszlai deleted the HDDS-3855 branch July 17, 2020 12:53
@adoroszlai (Contributor, Author)

Thanks @avijayanhwx for the review, and @elek for reviewing and merging this.

errose28 added a commit to errose28/ozone that referenced this pull request Jul 17, 2020
* master:
  HDDS-3855. Add upgrade smoketest (apache#1142)
  HDDS-3964. Ratis config key mismatch (apache#1204)
  HDDS-3612. Allow mounting bucket under other volume (apache#1104)
  HDDS-3926. OM Token Identifier table should use in-house serialization. (apache#1182)
  HDDS-3824: OM read requests should make SCM#refreshPipeline outside BUCKET_LOCK (apache#1164)
errose28 added a commit to errose28/ozone that referenced this pull request Jul 17, 2020
…erface

* upstream/master:
  HDDS-3855. Add upgrade smoketest (apache#1142)
  HDDS-3964. Ratis config key mismatch (apache#1204)
  HDDS-3612. Allow mounting bucket under other volume (apache#1104)
  HDDS-3926. OM Token Identifier table should use in-house serialization. (apache#1182)
  HDDS-3824: OM read requests should make SCM#refreshPipeline outside BUCKET_LOCK (apache#1164)
  HDDS-3966. Disable flaky TestOMRatisSnapshots
errose28 added a commit to errose28/ozone that referenced this pull request Jul 20, 2020
* master:
  HDDS-3984. Support filter and search the columns in recon UI (apache#1218)
  HDDS-3806. Support recognize aws v2 Authorization header. (apache#1098)
  HDDS-3955. Unable to list intermediate paths on keys created using S3G. (apache#1196)
  HDDS-3741. Reload old OM state if Install Snapshot from Leader fails (apache#1129)
  HDDS-3965. SCM failed to start up for duplicated pipeline detected. (apache#1210)
  HDDS-3855. Add upgrade smoketest (apache#1142)
  HDDS-3964. Ratis config key mismatch (apache#1204)
  HDDS-3612. Allow mounting bucket under other volume (apache#1104)
  HDDS-3926. OM Token Identifier table should use in-house serialization. (apache#1182)
  HDDS-3824: OM read requests should make SCM#refreshPipeline outside BUCKET_LOCK (apache#1164)
  HDDS-3966. Disable flaky TestOMRatisSnapshots
errose28 pushed a commit to errose28/ozone that referenced this pull request Jul 21, 2020
errose28 added a commit to errose28/ozone that referenced this pull request Jul 21, 2020
* add-deleted-block-table: (63 commits)
  Make block iterator tests use deleted blocks table, and remove the now unused #deleted#
  Replace uses of #deleted# key prefix with access to new deleted blocks table
  Add deleted blocks table to base level DB wrappers
  Have block deleting service test look for #deleted# keys in metadata table
  Move block delete to correct table and remove debugging print statement
  Import schema version when importing container data from export
  HDDS-3984. Support filter and search the columns in recon UI (apache#1218)
  HDDS-3806. Support recognize aws v2 Authorization header. (apache#1098)
  HDDS-3955. Unable to list intermediate paths on keys created using S3G. (apache#1196)
  HDDS-3741. Reload old OM state if Install Snapshot from Leader fails (apache#1129)
  Move new key value block iterator implementation and tests to new interface
  Fix checkstyle violations
  HDDS-3965. SCM failed to start up for duplicated pipeline detected. (apache#1210)
  Update comments
  Add comments on added helper method
  Remove seekToLast() from iterator interface, implementation, and tests
  Add more robust unit test with alternating key matches
  All unit tests pass after allowing keys with deleted and deleting prefixes to be made
  HDDS-3855. Add upgrade smoketest (apache#1142)
  HDDS-3964. Ratis config key mismatch (apache#1204)
  ...
ChenSammi pushed a commit that referenced this pull request Jul 22, 2020
timmylicheng pushed a commit that referenced this pull request Aug 6, 2020
rakeshadr pushed a commit to rakeshadr/hadoop-ozone that referenced this pull request Sep 3, 2020