Merged (40 commits)
53e2c11 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 9, 2024)
399b76a HDDS-10316 reducing the initiation, method ordering introduced. (Feb 12, 2024)
23df399 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 12, 2024)
e37e875 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 13, 2024)
d3f021e HDDS-10316 reducing the initiation, method ordering introduced. (Feb 13, 2024)
9b22c3c HDDS-10316 reducing the initiation, method ordering introduced. (Feb 13, 2024)
64e35f9 HDDS-10316 common code is extracted to check the flow (Feb 13, 2024)
e42e398 HDDS-10316 common code is extracted to check the flow (Feb 14, 2024)
3ebe2a1 HDDS-10316 common code is extracted to check the flow (Feb 14, 2024)
833e008 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 14, 2024)
27d7902 Merge branch 'apache:master' into raju-b-hdds-10316 (raju-balpande, Feb 15, 2024)
c1fe332 HDDS-10346 make test independent of ordering. (Feb 26, 2024)
c540681 Merge branch 'raju-b-hdds-10316' of https://github.com/raju-balpande/… (Feb 26, 2024)
9665426 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 26, 2024)
8cc60b4 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 26, 2024)
68a7697 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 26, 2024)
c8a858a HDDS-10316 reducing the initiation, method ordering introduced. (Feb 26, 2024)
cb9d7c2 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 26, 2024)
317dcfa HDDS-10316 reducing the initiation, method ordering introduced. (Feb 26, 2024)
dc25ff1 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 26, 2024)
59c8920 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 27, 2024)
50033f7 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 27, 2024)
46f97dd HDDS-10316 reducing the initiation, method ordering introduced. (Feb 27, 2024)
3c5305e HDDS-10316 reducing the initiation, method ordering introduced. (Feb 28, 2024)
cd8ba26 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 28, 2024)
7725dfa HDDS-10316 reducing the initiation, method ordering introduced. (Feb 28, 2024)
14b4cd9 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 28, 2024)
3b1b51c HDDS-10316 reducing the initiation, method ordering introduced. (Feb 28, 2024)
f54cdbd HDDS-10316 reducing the initiation, method ordering introduced. (Feb 28, 2024)
2844cec HDDS-10316 reducing the initiation, method ordering introduced. (Feb 28, 2024)
922f92d HDDS-10316 reducing the initiation, method ordering introduced. (Feb 28, 2024)
1f4a91e HDDS-10316 reducing the initiation, method ordering introduced. (Feb 28, 2024)
e75f857 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 28, 2024)
eada75b HDDS-10316 reducing the initiation, method ordering introduced. (Feb 28, 2024)
608631b HDDS-10316 reducing the initiation, method ordering introduced. (Feb 28, 2024)
285c084 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 29, 2024)
dc519e1 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 29, 2024)
c566ab6 HDDS-10316 reducing the initiation, method ordering introduced. (Feb 29, 2024)
1fd1b5b Merge branch 'master' into raju-b-hdds-10316 (raju-balpande, Feb 29, 2024)
8a8db39 Merge remote-tracking branch 'origin/master' into raju-b-hdds-10316 (adoroszlai, Apr 1, 2024)
@@ -41,8 +41,12 @@
 import org.apache.ozone.test.LambdaTestUtils;
 import org.hadoop.ozone.recon.schema.ContainerSchemaDefinition;
 import org.hadoop.ozone.recon.schema.tables.pojos.UnhealthyContainers;
-import org.junit.jupiter.api.AfterEach;
-import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.MethodOrderer.OrderAnnotation;
+import org.junit.jupiter.api.Order;
+import org.junit.jupiter.api.TestMethodOrder;
+import org.junit.jupiter.api.TestInstance;
 import org.junit.jupiter.api.Test;
 import org.junit.jupiter.api.Timeout;
 import org.slf4j.event.Level;
@@ -58,12 +62,14 @@
  * Integration Tests for Recon's tasks.
  */
 @Timeout(300)
+@TestInstance(TestInstance.Lifecycle.PER_CLASS)
+@TestMethodOrder(OrderAnnotation.class)
 public class TestReconTasks {
   private MiniOzoneCluster cluster = null;
   private OzoneConfiguration conf;
 
-  @BeforeEach
-  public void init() throws Exception {
+  @BeforeAll
+  void init() throws Exception {
     conf = new OzoneConfiguration();
     conf.set(HDDS_CONTAINER_REPORT_INTERVAL, "5s");
     conf.set(HDDS_PIPELINE_REPORT_INTERVAL, "5s");
@@ -74,21 +80,22 @@ public void init() throws Exception {
 
     conf.set("ozone.scm.stale.node.interval", "6s");
     conf.set("ozone.scm.dead.node.interval", "8s");
-    cluster = MiniOzoneCluster.newBuilder(conf).setNumDatanodes(1)
+    cluster = MiniOzoneCluster.newBuilder(conf).setNumDatanodes(3)
Contributor:
Thanks @raju-balpande for working on this patch. What is the need to increase the number of datanodes in the cluster?

Contributor Author (raju-balpande):
With 1 datanode it was working fine locally but was getting stuck in a wait condition in CI. After multiple such tries I found CI passing when we switched to 3 datanodes.

The performance after the changes can be viewed at https://github.com/raju-balpande/apache_ozone/actions/runs/8091976415/job/22112283975, and previously it was https://github.com/raju-balpande/apache_ozone/actions/runs/7843423779/job/21404234425.

Contributor:

Can you tell in which test case and on which wait condition it was getting stuck when 1 DN was used for the cluster?

Contributor Author (raju-balpande), Mar 12, 2024:
Hi @devmadhuu,
It was getting stuck on the wait condition at TestReconTasks.java:173:

LambdaTestUtils.await(120000, 6000, () -> {
  List<UnhealthyContainers> allMissingContainers =
      reconContainerManager.getContainerSchemaManager()
          .getUnhealthyContainers(
              ContainerSchemaDefinition.UnHealthyContainerStates.MISSING,
              0, 1000);
  return (allMissingContainers.size() == 1);
});

As I see in log https://github.com/raju-balpande/apache_ozone/actions/runs/7916623862/job/21611614999

Error: Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 365.519 s <<< FAILURE! - in org.apache.hadoop.ozone.recon.TestReconTasks
Error: org.apache.hadoop.ozone.recon.TestReconTasks.testMissingContainerDownNode  Time elapsed: 300.006 s <<< ERROR!
java.util.concurrent.TimeoutException: testMissingContainerDownNode() timed out after 300 seconds
  at java.util.ArrayList.forEach(ArrayList.java:1259)
  at java.util.ArrayList.forEach(ArrayList.java:1259)
  Suppressed: java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.ozone.test.LambdaTestUtils.await(LambdaTestUtils.java:133)
    at org.apache.ozone.test.LambdaTestUtils.await(LambdaTestUtils.java:180)
    at org.apache.hadoop.ozone.recon.TestReconTasks.testMissingContainerDownNode(TestReconTasks.java:173)
    at java.lang.reflect.Method.invoke(Method.java:498)
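For reference, the timeout-and-poll pattern that LambdaTestUtils.await implements can be sketched in plain Java. The helper below is an illustrative stand-in (not the actual Hadoop/Ozone utility): it re-evaluates a condition every intervalMillis until it holds or the timeout elapses.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

public final class AwaitSketch {

  // Polls `check` every intervalMillis; returns once it yields true,
  // or throws TimeoutException after timeoutMillis have elapsed.
  public static void await(int timeoutMillis, int intervalMillis,
                           Callable<Boolean> check) throws Exception {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (!check.call()) {
      if (System.currentTimeMillis() >= deadline) {
        throw new TimeoutException(
            "Condition not met within " + timeoutMillis + " ms");
      }
      Thread.sleep(intervalMillis);
    }
  }

  public static void main(String[] args) throws Exception {
    long start = System.currentTimeMillis();
    // Condition becomes true after roughly 100 ms.
    await(5000, 10, () -> System.currentTimeMillis() - start > 100);
    System.out.println("condition met");
  }
}
```

If the condition can never become true (e.g. the missing-container count never reaches 1 because the replica landed on a different node), the call blocks for the full timeout, which is exactly the CI hang described above.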

Contributor:
OK, thanks for the update @raju-balpande. However, I am not sure why the above test condition should time out with just 1 DN in CI yet pass locally, because with 1 DN in the cluster we are shutting down that only DN, so the missing container count should be 1.

       .includeRecon(true).build();
     cluster.waitForClusterToBeReady();
     GenericTestUtils.setLogLevel(SCMDatanodeHeartbeatDispatcher.LOG,
         Level.DEBUG);
   }
 
-  @AfterEach
-  public void shutdown() {
+  @AfterAll
+  void shutdown() {
     if (cluster != null) {
       cluster.shutdown();
     }
   }
 
   @Test
+  @Order(3)
   public void testSyncSCMContainerInfo() throws Exception {
     ReconStorageContainerManagerFacade reconScm =
         (ReconStorageContainerManagerFacade)
@@ -121,6 +128,7 @@ public void testSyncSCMContainerInfo() throws Exception {
   }
 
   @Test
+  @Order(1)
   public void testMissingContainerDownNode() throws Exception {
     ReconStorageContainerManagerFacade reconScm =
         (ReconStorageContainerManagerFacade)
@@ -141,7 +149,7 @@ public void testMissingContainerDownNode() throws Exception {
         (ReconContainerManager) reconScm.getContainerManager();
     ContainerInfo containerInfo =
         scmContainerManager
-            .allocateContainer(RatisReplicationConfig.getInstance(ONE), "test");
+            .allocateContainer(RatisReplicationConfig.getInstance(ONE), "testMissingContainer");
Contributor:
If you are increasing datanodes, then it is better to keep the replication factor as THREE as well.

     long containerID = containerInfo.getContainerID();
 
     try (RDBBatchOperation rdbBatchOperation = new RDBBatchOperation()) {
@@ -181,6 +189,8 @@ public void testMissingContainerDownNode() throws Exception {
               0, 1000);
       return (allMissingContainers.isEmpty());
     });
+    // Cleaning up some data
+    scmContainerManager.deleteContainer(containerInfo.containerID());
     IOUtils.closeQuietly(client);
   }

@@ -202,6 +212,7 @@ public void testMissingContainerDownNode() throws Exception {
    * @throws Exception
    */
   @Test
+  @Order(2)
   public void testEmptyMissingContainerDownNode() throws Exception {
     ReconStorageContainerManagerFacade reconScm =
         (ReconStorageContainerManagerFacade)
@@ -219,9 +230,10 @@ public void testEmptyMissingContainerDownNode() throws Exception {
     ContainerManager scmContainerManager = scm.getContainerManager();
     ReconContainerManager reconContainerManager =
         (ReconContainerManager) reconScm.getContainerManager();
+    int previousContainerCount = reconContainerManager.getContainers().size();
     ContainerInfo containerInfo =
         scmContainerManager
-            .allocateContainer(RatisReplicationConfig.getInstance(ONE), "test");
+            .allocateContainer(RatisReplicationConfig.getInstance(ONE), "testEmptyMissingContainer");
Contributor:
If you are increasing datanodes, then it is better to keep the replication factor as THREE as well.

Contributor Author (raju-balpande):
I tried changing it to HddsProtos.ReplicationFactor.THREE, but it seems to have a problem with the number of pipelines:

java.io.IOException: Could not allocate container. Cannot get any matching pipeline for replicationConfig: RATIS/THREE, State:PipelineState.OPEN
  at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.allocateContainer(ContainerManagerImpl.java:202)
  at org.apache.hadoop.ozone.recon.TestReconTasks.testEmptyMissingContainerDownNode(TestReconTasks.java:236)
  at java.lang.reflect.Method.invoke(Method.java:498)

Contributor:
OK, this might be because the criteria for sufficient healthy nodes are not met, since the default minRatisVolumeSizeBytes is 1 GB and containerSizeBytes is 5 GB. For the test case it is okay, then, to use ReplicationFactor.ONE.

     long containerID = containerInfo.getContainerID();
 
     Pipeline pipeline =
@@ -230,8 +242,8 @@ public void testEmptyMissingContainerDownNode() throws Exception {
     runTestOzoneContainerViaDataNode(containerID, client);
 
     // Make sure Recon got the container report with new container.
-    assertEquals(scmContainerManager.getContainers(),
-        reconContainerManager.getContainers());
+    assertEquals(scmContainerManager.getContainers().size(),
+        reconContainerManager.getContainers().size() - previousContainerCount);
 
     // Bring down the Datanode that had the container replica.
     cluster.shutdownHddsDatanode(pipeline.getFirstNode());
@@ -305,7 +317,8 @@ public void testEmptyMissingContainerDownNode() throws Exception {
               0, 1000);
       return (allMissingContainers.isEmpty());
     });
-
+    // Cleaning up some data
+    reconContainerManager.deleteContainer(containerInfo.containerID());
     IOUtils.closeQuietly(client);
   }
 }
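The core trade-off in this PR is the classic JUnit 5 speed-versus-isolation choice: cluster startup moves from once per test (@BeforeEach) to once per class (@BeforeAll with Lifecycle.PER_CLASS), which in turn forces an explicit test order and per-test cleanup of shared state. A minimal plain-Java sketch of that trade-off, with illustrative names only (not Ozone or JUnit APIs):

```java
import java.util.List;

public final class LifecycleSketch {
  static int setupCalls = 0;

  // Stands in for MiniOzoneCluster startup: expensive, so we want it once.
  static String startCluster() {
    setupCalls++;
    return "cluster";
  }

  public static void main(String[] args) {
    // Per-method lifecycle: setup runs once per test (3 times here).
    setupCalls = 0;
    for (String test : List.of("testA", "testB", "testC")) {
      String cluster = startCluster();
      // ... run `test` against its own fresh `cluster` ...
    }
    System.out.println("per-method setups: " + setupCalls); // prints 3

    // Per-class lifecycle: one shared setup; tests must be order-safe
    // and clean up whatever they create (hence the deleteContainer calls).
    setupCalls = 0;
    String shared = startCluster();
    for (String test : List.of("testA", "testB", "testC")) {
      // ... run `test` against `shared` ...
    }
    System.out.println("per-class setups: " + setupCalls); // prints 1
  }
}
```

The cost of the shared-cluster approach is visible in the diff itself: tests need @Order annotations, unique owner names for allocated containers, and explicit deleteContainer cleanup so that later tests see a predictable container count.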