Skip to content

Conversation

@sreejasahithi
Copy link
Contributor

@sreejasahithi sreejasahithi commented May 29, 2025

What changes were proposed in this pull request?

Updated ozone admin datanode list --json to include only key fields that are relevant for a listing command and removed duplicate fields. The command now returns only minimal and essential information per datanode. Detailed information for each node can be obtained via ozone admin datanode info HDDS-13097.
This makes the list output cleaner and more readable, especially in large clusters.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13101

How was this patch tested?

Sample output for single node:

[ {
  "id" : "ee40eaae-a0a7-4d00-a965-34973d5d7f5f",
  "hostName" : "ozone-datanode-3.ozone_default",
  "ipAddress" : "172.18.0.6",
  "ports" : [ {
    "name" : "HTTP",
    "value" : 9882
  }, {
    "name" : "CLIENT_RPC",
    "value" : 19864
  }, {
    "name" : "REPLICATION",
    "value" : 9886
  }, {
    "name" : "RATIS",
    "value" : 9858
  }, {
    "name" : "RATIS_ADMIN",
    "value" : 9857
  }, {
    "name" : "RATIS_SERVER",
    "value" : 9856
  }, {
    "name" : "RATIS_DATASTREAM",
    "value" : 9855
  }, {
    "name" : "STANDALONE",
    "value" : 9859
  } ],
  "setupTime" : 0,
  "currentVersion" : 2,
  "persistedOpState" : "IN_SERVICE",
  "opState" : "IN_SERVICE",
  "persistedOpStateExpiryEpochSec" : 0,
  "healthState" : "HEALTHY",
  "decommissioned" : false,
  "maintenance" : false,
  "level" : 3,
  "cost" : 0,
  "numOfLeaves" : 1,
  "networkFullPath" : "/default-rack/ee40eaae-a0a7-4d00-a965-34973d5d7f5f",
  "networkLocation" : "/default-rack",
  "networkName" : "ee40eaae-a0a7-4d00-a965-34973d5d7f5f"
} ]

https://github.com/sreejasahithi/ozone/actions/runs/15328473637

Copy link
Contributor

@aryangupta1998 aryangupta1998 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch @sreejasahithi, overall changes look good to me!

@sodonnel
Copy link
Contributor

Which fields were removed? I think we have to be a little bit careful here, as our CLI compatibility contract, is that the JSON output should be compatible, but the "human readable" text output could change.

The reason being, that JSON is by definition not intended for human consumption so readability is not a concern.

If some field has been removed that a used was getting from ozone admin datanode list --json, then now they may have to run that, and then run datanode info for each DN, which could be slow.

@errose28
Copy link
Contributor

If some field has been removed that a used was getting from ozone admin datanode list --json, then now they may have to run that, and then run datanode info for each DN, which could be slow.

Most of the fields I proposed removing in the Jira are either:

  • duplicate information
  • information internal to ozone for testing
  • information that SCM was not populating, which was therefore incorrect.

Although removing these is technically incompatible by our definition, I would consider their inclusion to be bugs in the first place making this a bug fix. We have used similar reasoning to "break compatibility" by adding json lists to list output commands. cc @adoroszlai if you have any thoughts on this as well.

There are two proposed changes that fall outside of this:

  • Grouping the topology information into one object
  • Removing the port information (saved for ozone admin datanode info)

If there are more concerns about that part of the change we can look at them in a separate Jira/PR, but IMO ozone admin datanode list is not very usable with the verbose port list taking up half the object (after removing other duplicate content), and the mixed placement of topology information is difficult to make sense out of. I don't see much use case for aggregating ports across all nodes in one shot either.

@ssulav
Copy link
Contributor

ssulav commented Jun 2, 2025

It's ok to have some changes that may break compatibility, we can adjust our test accordingly, but we must not remove any information from the json output.
We may collate and de-duplicate.
Similar comments may be relevant for container list/info --json with datanode details.

@errose28
Copy link
Contributor

errose28 commented Jun 2, 2025

Maybe we just do this subset of changes in the PR:

  • initialVersion is removed because it is for testing only. Exposing it in the Json was a bug.
  • ID and UUID is duplicated four times in various formats. We should have one ID string field.
  • IP and hostnames are duplicated. We should have one IP and one hostname string field.
  • Ratis and standalone ports are duplicated. Let's keep just the main ports list intact and remove the ratisPort and standalonePort fields.
  • Operational state is duplicated as opState and persistedOpState. We can pick one and remove the other.

We can leave the following changes to a follow up task since they might need more discussion:

  • SCM doesn't populate the value of setupTime, only datanodes set this when they start, so it is always 0.
    • I'm not sure whether we want to fix this so SCM is tracking this info for each node, or remove the attribute.
  • Level and cost should be moved to a dedicated topology section because the terms are ambiguous on their own.
  • Removing port information from list all together and having it only present in the proposed ozone admin datanode info command.

Similar comments may be relevant for container list/info --json with datanode details.

+1 maybe as part of this change we can delegate to a centralized DatanodeDetails serializer to handle this. We use a mix of patterns in the code currently where sometimes we use Jackson's ObjectMapper to automatically build the json from an annotated pojo, and sometimes we build the object from scratch. IMO building from scratch is better because we have to "opt-in" to adding fields, whereas the object mapper is "opt-out". "opt-in" is safer with our compat requirements to not remove fields once they are published.

@Tejaskriya Tejaskriya self-requested a review June 10, 2025 12:31
@sreejasahithi sreejasahithi marked this pull request as draft July 7, 2025 15:11
@sreejasahithi sreejasahithi marked this pull request as ready for review July 14, 2025 14:36
Copy link
Contributor

@sarvekshayr sarvekshayr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the PR @sreejasahithi.
Looks like all the required changes have been addressed.

LGTM.

Copy link
Contributor

@aryangupta1998 aryangupta1998 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, just one minor comment!

@sreejasahithi
Copy link
Contributor Author

TestListInfoSubcommand is failing after merging master into this branch, I am fixing it.

Copy link
Contributor

@Tejaskriya Tejaskriya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @sreejasahithi , LGTM!

@sreejasahithi
Copy link
Contributor Author

sreejasahithi commented Jul 21, 2025

@ssulav , @errose28 , @sodonnel Could you please review the changes in this PR.

Copy link
Contributor

@ssulav ssulav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Tejaskriya Tejaskriya requested a review from errose28 July 22, 2025 11:46
Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few minor comments but looks good otherwise.

@sreejasahithi sreejasahithi requested a review from errose28 July 24, 2025 12:05
@sreejasahithi
Copy link
Contributor Author

sreejasahithi commented Jul 24, 2025

Some of the tests are failing in the CI (testBalancer.robot and datanode.robot) , I am fixing it.

@sreejasahithi
Copy link
Contributor Author

@errose28 , I have addressed your comments could you please review the changes done.

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates. Just a few more comments.

@sreejasahithi sreejasahithi requested a review from errose28 July 31, 2025 10:13
Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates. Also from #8520 the exception handling in getAllNodes with sorting is not correct and it would be good to fix that here. Currently if one of the calls to SCM fails we will get an NPE with no error message. If one of the node's info cannot be obtained we should instead print the message to stderr and continue processing the rest of the nodes.

@sreejasahithi sreejasahithi requested a review from errose28 August 1, 2025 09:08
Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the continued updates. LGTM, just started the CI run.

@errose28 errose28 merged commit 4cbf5ce into apache:master Aug 7, 2025
42 checks passed
errose28 added a commit to errose28/ozone that referenced this pull request Aug 12, 2025
* master: (55 commits)
  HDDS-13525. Rename configuration property to ozone.om.compaction.service.enabled (apache#8928)
  HDDS-13519. Reconciliation should continue if a peer datanode is unreachable (apache#8908)
  HDDS-13566. Fix incorrect authorizer class in ACL documentation (apache#8931)
  HDDS-13084. Trigger on-demand container scan when a container moves from open to unhealthy. (apache#8904)
  HDDS-13432. Accelerating Namespace Usage Calculation in Recon using - Materialised Approach (apache#8797)
  HDDS-13557. Bump jline to 3.30.5 (apache#8920)
  HDDS-13556. Bump assertj-core to 3.27.4 (apache#8919)
  HDDS-13543. [Docs] Design doc for OM bootstrapping process with snapshots. (apache#8900)
  HDDS-13541. Bump sonar-maven-plugin to 5.1.0.4751 (apache#8911)
  HDDS-13101. Remove duplicate information in datanode list output (apache#8523)
  HDDS-13528. Handle null paths when the NSSummary is initializing (apache#8901)
  HDDS-12990. (addendum) Generate tree from metadata when it does not exist during getContainerChecksumInfo call (apache#8881)
  HDDS-13086. Block duplicate reconciliation requests for the same container and datanode within the datanode. (apache#8905)
  HDDS-12990. Generate tree from metadata when it doesn't exist during getContainerChecksumInfo call (apache#8881)
  HDDS-12824. Optimize container checksum read during datanode startup (apache#8604)
  HDDS-13522. Rename axisLabel for No. of delete request received (apache#8879)
  HDDS-12196. Document ozone repair cli (apache#8849)
  HDDS-13514. Intermittent failure in TestNSSummaryMemoryLeak (apache#8889)
  HDDS-13423. Log reason for triggering on-demand container scan (apache#8854)
  HDDS-13466. Disable flaky TestOmSnapshotFsoWithNativeLibWithLinkedBuckets
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants