Skip to content

Conversation

@errose28
Copy link
Contributor

@errose28 errose28 commented Mar 23, 2023

What changes were proposed in this pull request?

Refactor of upgrade/downgrade acceptance tests for easier maintenance. Changes include:

  • Reduced code duplication among upgrades/downgrades between different versions.
    • A new common set of callbacks can be used for any version being tested.
  • Improved directory structure.
  • Decoupling of docker compose cluster from versions being tested.
  • Ability to specify tests that are only run when upgrading to a certain version.
    • Example: Check that EC is blocked in 1.3.0 pre-finalized.
  • Simple changes for release managers to add the new version to the test matrix and test the whole matrix before releasing.
  • Full HA cluster available for testing.
  • Replaced direct MLV checks in VERSION files with finalization status checks so the tests do not need to be updated for each layout feature.
  • Testing 1.30 -> master (1.4.0) upgrade and downgrade in regular CI runs.

To help with reviewing, please see the updated README.md. It may also be useful to pull the changes locally to look at the file and directory structure.

What is the link to the Apache JIRA

HDDS-6633

How was this patch tested?

The full matrix of upgrade/downgrade acceptance tests passed on my fork.

* master: (262 commits)
  HDDS-8153. Integrate ContainerBalancer with MoveManager (apache#4391)
  HDDS-8090. When getBlock from a datanode fails, retry other datanodes. (apache#4357)
  HDDS-8163 Use try-with-resources to ensure close rockdb connection in SstFilteringService (apache#4402)
  HDDS-8065. Provide GNU long options (apache#4394)
  HDDS-7930. [addendum] input stream does not refresh expired block token.
  HDDS-7930. input stream does not refresh expired block token. (apache#4378)
  HDDS-7740. [Snapshot] Implement SnapshotDeletingService (apache#4244)
  HDDS-8076. Use container cache in Key listing API. (apache#4346)
  HDDS-8091. [addendum] Generate list of config tags from ConfigTag enum - Hadoop 3.1 compatibility fix (apache#4374)
  HDDS-8144. TestDefaultCertificateClient#testTimeBeforeExpiryGracePeriod fails as we approach DST. (apache#4382)
  HDDS-8151. Support fine grained lifetime for root CA certificate (apache#4386)
  HDDS-8150. RpcClientTest and ConfigurationSourceTest not run due to naming convention (apache#4388)
  HDDS-8131. Add Configuration for OM Ratis Log Purge Tuning Parameters. (apache#4371)
  HDDS-8133. Create ozone sh key checksum command (apache#4375)
  HDDS-8142. Check if no entries in Block DB for a container on container delete (apache#4379)
  HDDS-8118. Fail container delete on non empty chunks dir (apache#4367)
  HDDS-8028. JNI for RocksDB SST Dump tool (apache#4315)
  HDDS-8129. ContainerStateMachine allows two different tasks with the same container id running in parallel. (apache#4370)
  HDDS-8119. Remove loosely related AutoCloseable from SendContainerOutputStream (apache#4368)
  close db connection (apache#4366)
  ...
* master: (43 commits)
  HDDS-8148. Improve log for Pipeline creation failure (apache#4385)
  HDDS-7853. Add support for RemoveSCM in SCMRatisServer. (apache#4358)
  HDDS-8042. Display certificate issuer in cert list command. (apache#4429)
  HDDS-8189. [Snapshot] renamedKeyTable should only track keys in buckets that has at least one active snapshot. (apache#4436)
  HDDS-8154. Perf: Reuse Mac instances in S3 token validation (apache#4433)
  HDDS-8245. Info log for keyDeletingService when nonzero number of keys are deleted. (apache#4451)
  HDDS-8233. ReplicationManager: Throttle delete container commands from over-replication handlers (apache#4447)
  HDDS-8220. [Ozone-Streaming] Trigger volume check on IOException in StreamDataChannelBase (apache#4428)
  HDDS-8173. Fix to remove enrties from RocksDB after container gets deleted. (apache#4445)
  HDDS-7975. Rebalance acceptance tests (apache#4437)
  HDDS-8152. Reduce S3 acceptance test setup time (apache#4393)
  HDDS-8172. ECUnderReplicationHandler should consider commands already sent when processing the container (apache#4435)
  HDDS-7883. [Snapshot] Accommodate FSO, key renames and implement OMSnapshotPurgeRequest for SnapshotDeletingService (apache#4407)
  HDDS-8168. Make deadlines inside MoveManager for move commands configurable (apache#4415)
  HDDS-7918. EC: ECBlockReconstructedStripeInputStream should check for spare replicas before failing an index (apache#4441)
  HDDS-8222. EndpointBase#getBucket should handle BUCKET_NOT_FOUND (apache#4431)
  HDDS-8068. Fix Exception: JMXJsonServlet, getting attribute RatisRoles of Hadoop:service=OzoneManager. (apache#4352)
  HDDS-8139. Datanodes should not drop block delete transactions based on transaction ID (apache#4384)
  HDDS-8216. EC: OzoneClientConfig is overwritten in ECKeyOutputStream (apache#4425)
  HDDS-8054. Fix NPE in metrics for failed volume (apache#4340)
  ...
Copy link
Contributor

@fapifta fapifta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @errose28 thank you for your great work on the upgrade tests, this is a well thought out change.
I noticed one thing: we might not need the load.sh, please see the proposal in one of the comments and let me know what do you think about it.

Other than that I have found small things in the documentation that sounded strange to my non-native ear, and some versions I think does not fit into the context but I am unsure, so I reserve the right to be completely wrong with those suggestions, but please take a look ;)

- These tests will not catch backwards incompatible changes against commits in between releases.
- Example:
1. After 1.2.0, a change *c1* is made that is backwards compatible with 1.2.0.
2. After *c1*, a new change *c2* is made that is also backwards compatible with 1.0.0 but backwards *incompatible* with *c1*.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. After *c1*, a new change *c2* is made that is also backwards compatible with 1.0.0 but backwards *incompatible* with *c1*.
2. After *c1*, a new change *c2* is made that is also backwards compatible with 1.2.0 but backwards *incompatible* with *c1*.

- This test suite will not raise an error for *c2*, because it only tests against the last release
(1.0.0), and not the last commit (*c1*).
1. Before the release, test the whole matrix of upgrades from the previous version to the version you are releasing.
- This is important manual verification that the release does not break backwards compatibility, and the results can be included in the release vote mailing thread.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- This is important manual verification that the release does not break backwards compatibility, and the results can be included in the release vote mailing thread.
- This is an important manual verification that the release does not break backwards compatibility, and the results can be included in the release vote mailing thread.

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @errose28 for the big batch of improvements.

@kerneltime
Copy link
Contributor

cc @xBis7 @SaketaChalamchala @DaveTeng0 @tanvipenumudy can you take a look?

@adoroszlai adoroszlai requested a review from fapifta March 28, 2023 08:16
@fapifta
Copy link
Contributor

fapifta commented Mar 28, 2023

Thank you @adoroszlai to check on this one, I accept your comments, I have created HDDS-8305 to track the additional suggestion on load.sh.

I am ok committing this one, and let the change as it is as Ethan will not be able to look into this further earlier than next week, and I think it is better to unblock the DataNodeDetails/Port/Upgrade related changes that are dependent on this work.

@errose28
Copy link
Contributor Author

errose28 commented Apr 4, 2023

Thanks for reviewing and merging this large PR @fapifta and @adoroszlai. I will look at the follow-up jira to remove load.sh, and filed HDDS-8375 (#4535) to fix the typos in the readme.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants