@elek
Member

@elek elek commented Nov 3, 2020

What changes were proposed in this pull request?

@jojochuang reported that in a specific case the Datanode tried to download / replicate containers multiple times from the same datanode.

SimpleContainerDownloader has logic to try all the available datanodes; this Jira creates a unit test to make sure that logic works well.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4429

How was this patch tested?

Executed the new unit test.

Contributor

@adoroszlai adoroszlai left a comment

Thanks @elek for adding this unit test.

) throws IOException {

if (datanodes.contains(datanode)) {
throw new IOException("Unavailable datanode");
Contributor

downloadContainer may fail in two ways: IOException may be thrown immediately or the returned CompletableFuture may be completed exceptionally. These hit different code paths in getContainerDataFromReplicas. I think we should cover both cases.
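
To make the two failure modes concrete, here is a minimal, self-contained sketch. It is not the actual getContainerDataFromReplicas code: the `Downloader` interface and the `attempt` and `tryAll` helpers are hypothetical names introduced only for illustration, and datanodes are plain strings. It shows one way a fallback chain can treat an immediately thrown IOException and an exceptionally completed future the same way.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch only; Downloader, attempt and tryAll are hypothetical
// names, not the SimpleContainerDownloader API.
class DownloadFailureModes {

  /** Hypothetical stand-in for a per-datanode download call. */
  interface Downloader {
    CompletableFuture<String> download(String datanode) throws IOException;
  }

  /** Converts an immediately thrown IOException into a failed future. */
  static CompletableFuture<String> attempt(String datanode, Downloader downloader) {
    try {
      return downloader.download(datanode);
    } catch (IOException e) {
      CompletableFuture<String> failed = new CompletableFuture<>();
      failed.completeExceptionally(e);
      return failed;
    }
  }

  /** Tries the datanodes in order; a later node is contacted only if all earlier ones failed. */
  static CompletableFuture<String> tryAll(List<String> datanodes, Downloader downloader) {
    // Start from a failed future so that an empty list yields a failure.
    CompletableFuture<String> result = new CompletableFuture<>();
    result.completeExceptionally(new IOException("No datanodes to try"));
    for (String datanode : datanodes) {
      result = result
          .<CompletableFuture<String>>handle((value, error) -> {
            if (error == null) {
              return CompletableFuture.completedFuture(value); // keep the earlier success
            }
            return attempt(datanode, downloader); // fall back to the next datanode
          })
          .thenCompose(f -> f);
    }
    return result;
  }

  public static void main(String[] args) {
    Downloader downloader = datanode -> {
      if (datanode.equals("dn1")) {
        throw new IOException("Unavailable datanode"); // first mode: thrown immediately
      }
      if (datanode.equals("dn2")) {
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IOException("Transfer failed")); // second mode
        return failed;
      }
      return CompletableFuture.completedFuture("container from " + datanode);
    };
    System.out.println(tryAll(Arrays.asList("dn1", "dn2", "dn3"), downloader).join());
  }
}
```

A test that covers both cases can use a stub like the one in `main`, where one datanode throws and another returns a failed future, and then check that the remaining datanode is the one that serves the container.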

Member Author

Good point. Added a new test case.

(Since the test method is already simplified with helper methods, it seems simpler to create a new method than to introduce a new parameterized JUnit test.)

Contributor

@adoroszlai adoroszlai left a comment

Thanks @elek for updating the patch.

.get(1L, TimeUnit.SECONDS);

//THEN
Assert.assertEquals(datanodes.get(0).getUuidString(), result.toString());
Contributor

Actually, these checks are no longer valid after 5e8aaee since SimpleContainerDownloader shuffles the datanodes.

//There is a chance for the download is successful but import is failed,
//due to data corruption. We need a random selected datanode to have a
//chance to succeed next time.
final ArrayList<DatanodeDetails> shuffledDatanodes =
new ArrayList<>(sourceDatanodes);
Collections.shuffle(shuffledDatanodes);
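
A small sketch of why an assertion on `datanodes.get(0)` stops being reliable once the list is shuffled, and of two order-independent alternatives. The `pickSource` helper and the injected `Random` are assumptions made for illustration; they are not part of SimpleContainerDownloader, and this is not necessarily how the patch resolves the issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; pickSource is a hypothetical helper.
class ShuffledSelectionCheck {

  /** Returns the first element of a shuffled copy; the input list is left untouched. */
  static String pickSource(List<String> sourceDatanodes, Random random) {
    List<String> shuffled = new ArrayList<>(sourceDatanodes);
    Collections.shuffle(shuffled, random);
    return shuffled.get(0);
  }

  public static void main(String[] args) {
    List<String> datanodes = Arrays.asList("dn1", "dn2", "dn3");

    // Option 1: assert membership instead of position.
    String selected = pickSource(datanodes, new Random());
    if (!datanodes.contains(selected)) {
      throw new AssertionError("Unexpected source: " + selected);
    }

    // Option 2: inject a seeded Random so the order is reproducible in a test.
    String a = pickSource(datanodes, new Random(42L));
    String b = pickSource(datanodes, new Random(42L));
    if (!a.equals(b)) {
      throw new AssertionError("Seeded shuffles should agree");
    }
    System.out.println("selected " + selected);
  }
}
```

Either option keeps the random selection in production code while letting the test make a stable assertion.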

Member Author

Looking into it. It seems I stepped into my own trap ;-)

@adoroszlai
Contributor

Thanks @elek for updating the patch.

Can you please also increase the timeout for testRandomSelection? It failed recently (in PRs) quite a few times with:

[INFO] Running org.apache.hadoop.ozone.container.replication.TestSimpleContainerDownloader
Error:  Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1.187 s <<< FAILURE! - in org.apache.hadoop.ozone.container.replication.TestSimpleContainerDownloader
Error:  testRandomSelection(org.apache.hadoop.ozone.container.replication.TestSimpleContainerDownloader)  Time elapsed: 1.012 s  <<< ERROR!
java.lang.Exception: test timed out after 1000 milliseconds
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at org.bouncycastle.jcajce.provider.symmetric.Serpent$Mappings.configure(Unknown Source)
	at org.bouncycastle.jce.provider.BouncyCastleProvider.loadAlgorithms(Unknown Source)
	at org.bouncycastle.jce.provider.BouncyCastleProvider.setup(Unknown Source)
	at org.bouncycastle.jce.provider.BouncyCastleProvider.access$000(Unknown Source)
	at org.bouncycastle.jce.provider.BouncyCastleProvider$1.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at org.bouncycastle.jce.provider.BouncyCastleProvider.<init>(Unknown Source)
	at org.apache.hadoop.hdds.security.x509.SecurityConfig.initSecurityProvider(SecurityConfig.java:369)
	at org.apache.hadoop.hdds.security.x509.SecurityConfig.<init>(SecurityConfig.java:172)
	at org.apache.hadoop.ozone.container.replication.SimpleContainerDownloader.<init>(SimpleContainerDownloader.java:69)
	at org.apache.hadoop.ozone.container.replication.TestSimpleContainerDownloader$1.<init>(TestSimpleContainerDownloader.java:53)
	at org.apache.hadoop.ozone.container.replication.TestSimpleContainerDownloader.testRandomSelection(TestSimpleContainerDownloader.java:52)

Comment on lines +137 to +140
result = grpcReplicationClient.download(containerId)
.thenApply(r -> {
try {
grpcReplicationClient.close();
Contributor

So if replication was slow, the previous version may have closed the client prematurely?
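
The concern can be illustrated with a minimal sketch in which the client is closed from a completion callback, so a slow download cannot lose its client midway. `FakeClient` and `downloadThenClose` are hypothetical names; the previous version of the code is not shown in this thread, so this only demonstrates the ordering, not the real GrpcReplicationClient. It also uses `whenComplete` rather than the `thenApply` in the quoted diff, so that the client is closed on failure as well.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch only; FakeClient stands in for a replication client.
class CloseAfterDownload {

  /** Hypothetical client whose download finishes only when `pending` is completed. */
  static class FakeClient implements AutoCloseable {
    final CompletableFuture<String> pending = new CompletableFuture<>();
    final AtomicBoolean closed = new AtomicBoolean(false);

    CompletableFuture<String> download(long containerId) {
      return pending;
    }

    @Override
    public void close() {
      closed.set(true);
    }
  }

  /** Closes the client only after the download future has finished, on success or failure. */
  static CompletableFuture<String> downloadThenClose(FakeClient client, long containerId) {
    return client.download(containerId)
        .whenComplete((value, error) -> client.close());
  }

  public static void main(String[] args) {
    FakeClient client = new FakeClient();
    CompletableFuture<String> result = downloadThenClose(client, 1L);
    System.out.println("closed while download is pending: " + client.closed.get());
    client.pending.complete("container-1.tar"); // simulate a slow download finishing
    System.out.println("closed after " + result.join() + ": " + client.closed.get());
  }
}
```

Closing the client in a `try`/`finally` right after starting the download would instead run while the future is still pending, which is the premature close the question is about.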

@adoroszlai adoroszlai merged commit 9c6f805 into apache:master Nov 30, 2020
@adoroszlai
Contributor

Thanks @elek for the improvement.

errose28 added a commit to errose28/ozone that referenced this pull request Dec 1, 2020
* HDDS-3698-upgrade:
  HDDS-4429. Create unit test for SimpleContainerDownloader. (apache#1551)
  HDDS-4461. Reuse compiled binaries in acceptance test (apache#1588)
  HDDS-4511: Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower. (apache#1625)
  HDDS-4510. SCM can avoid creating RetriableDatanodeEventWatcher for deletion command ACK (apache#1626)
  HDDS-3363. Intermittent failure in testContainerImportExport (apache#1618)
  HDDS-4370. Datanode deletion service can avoid storing deleted blocks. (apache#1620)
  HDDS-4512. Remove unused netty3 transitive dependency (apache#1627)
  HDDS-4481. With HA OM can send deletion blocks to SCM multiple times. (apache#1608)
  HDDS-4487. SCM can avoid using RETRIABLE_DATANODE_COMMAND for datanode deletion commands. (apache#1621)
  HDDS-4471. GrpcOutputStream length can overflow (apache#1617)
  HDDS-4308. Fix issue with quota update (apache#1489)
  HDDS-4392. [DOC] Add Recon architecture to docs (apache#1602)
  HDDS-4501. Reload OM State fail should terminate OM for any exceptions. (apache#1622)
  HDDS-4492. CLI flag --quota should default to 'spaceQuota' to preserve backward compatibility. (apache#1609)
  HDDS-3689. Add various profiles to MiniOzoneChaosCluster to run different modes. (apache#1420)
  HDDS-4497. Recon File Size Count task throws SQL Exception. (apache#1612)
errose28 added a commit to errose28/ozone that referenced this pull request Jan 5, 2021
* master: (40 commits)
  HDDS-4473. Reduce number of sortDatanodes RPC calls (apache#1610)
  HDDS-4485. [DOC] add the authentication rules of the Ozone Ranger. (apache#1603)
  HDDS-4528. Upgrade slf4j to 1.7.30 (apache#1639)
  HDDS-4424. Update README with information how to report security issues (apache#1548)
  HDDS-4484. Use RaftServerImpl isLeader instead of periodic leader update logic in OM and isLeaderReady for read/write requests (apache#1638)
  HDDS-4429. Create unit test for SimpleContainerDownloader. (apache#1551)
  HDDS-4461. Reuse compiled binaries in acceptance test (apache#1588)
  HDDS-4511: Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower. (apache#1625)
  HDDS-4510. SCM can avoid creating RetriableDatanodeEventWatcher for deletion command ACK (apache#1626)
  HDDS-3363. Intermittent failure in testContainerImportExport (apache#1618)
  HDDS-4370. Datanode deletion service can avoid storing deleted blocks. (apache#1620)
  HDDS-4512. Remove unused netty3 transitive dependency (apache#1627)
  HDDS-4481. With HA OM can send deletion blocks to SCM multiple times. (apache#1608)
  HDDS-4487. SCM can avoid using RETRIABLE_DATANODE_COMMAND for datanode deletion commands. (apache#1621)
  HDDS-4471. GrpcOutputStream length can overflow (apache#1617)
  HDDS-4308. Fix issue with quota update (apache#1489)
  HDDS-4392. [DOC] Add Recon architecture to docs (apache#1602)
  HDDS-4501. Reload OM State fail should terminate OM for any exceptions. (apache#1622)
  HDDS-4492. CLI flag --quota should default to 'spaceQuota' to preserve backward compatibility. (apache#1609)
  HDDS-3689. Add various profiles to MiniOzoneChaosCluster to run different modes. (apache#1420)
  ...