@sarvekshayr
Contributor

@sarvekshayr sarvekshayr commented Mar 14, 2025

What changes were proposed in this pull request?

Introduced a --metadata flag for the ozone debug replicas verify command that checks for block existence using GetBlock calls to the datanodes. For each key, it iterates through all replicas and verifies that the block is present.
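
As a rough illustration of the per-key check described above (not the code in this patch), the logic can be sketched in Python. The `get_block` callable and the `verify_block` helper are hypothetical stand-ins for the real GetBlock client call, and the status strings are the ones that appear in the test output in this description:

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical signature: get_block(datanode, block_id) returns True if the
# datanode reports the block and False if it does not; it may raise on
# connection problems. This is a stand-in, not the Ozone client API.
GetBlock = Callable[[str, str], bool]


def verify_block(key: str, block_id: str, datanodes: Iterable[str],
                 get_block: GetBlock) -> Dict[str, object]:
    """Check one block on every replica and build a pass/fail record."""
    try:
        # Collect every datanode that does not report the block.
        missing: List[str] = [dn for dn in datanodes
                              if not get_block(dn, block_id)]
    except Exception as exc:  # e.g. a network failure while querying
        return {"key": key, "status": "ERROR",
                "message": str(exc), "pass": False}
    status = "MISSING_REPLICAS" if missing else "BLOCK_EXISTS"
    return {"key": key, "blockID": block_id,
            "status": status, "pass": not missing}
```

In this sketch a block passes only if every replica reports it, which matches the "missing in some or all nodes" wording used for MISSING_REPLICAS below.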

What is the link to the Apache JIRA?

HDDS-12495

How was this patch tested?

Tested the patch on a docker cluster.

  1. If block data is present on all nodes, it will be indicated with "status": "BLOCK_EXISTS".
bash-5.1$ ozone debug replicas verify --metadata / --output-dir /tmp | jq  
{
  "key": "ockrwvolume/ockrwbucket/vnmnn1ltsx/1679091",
  "blockID": "conID: 1 locID: 115816896921600007 bcsId: 25 replicaIndex: null",
  "status": "BLOCK_EXISTS",
  "pass": true
}
{
  "key": "ockrwvolume/ockrwbucket/vnmnn1ltsx/45c48cc",
  "blockID": "conID: 1 locID: 115816896921600010 bcsId: 38 replicaIndex: null",
  "status": "BLOCK_EXISTS",
  "pass": true
}
{
  "key": "ockrwvolume/ockrwbucket/vnmnn1ltsx/8f14e45",
  "blockID": "conID: 1 locID: 115816896921600008 bcsId: 29 replicaIndex: null",
  "status": "BLOCK_EXISTS",
  "pass": true
}

  2. If block data is missing on some or all nodes, it will be indicated with "status": "MISSING_REPLICAS".
bash-5.1$ ozone debug replicas verify --metadata / --output-dir /tmp | jq
{
  "key": "ockrwvolume/ockrwbucket/vnmnn1ltsx/1679091",
  "blockID": "conID: 1 locID: 115816896921600007 bcsId: 25 replicaIndex: null",
  "status": "MISSING_REPLICAS",
  "pass": false
}
{
  "key": "ockrwvolume/ockrwbucket/vnmnn1ltsx/45c48cc",
  "blockID": "conID: 1 locID: 115816896921600010 bcsId: 38 replicaIndex: null",
  "status": "MISSING_REPLICAS",
  "pass": false
}
{
  "key": "ockrwvolume/ockrwbucket/vnmnn1ltsx/8f14e45",
  "blockID": "conID: 1 locID: 115816896921600008 bcsId: 29 replicaIndex: null",
  "status": "MISSING_REPLICAS",
  "pass": false
}

  3. If an error is encountered while fetching details, the output shows "status": "ERROR" along with the error message.
bash-5.1$ ozone debug replicas verify --metadata / --output-dir /tmp | jq
{
  "key": "ockrwvolume/ockrwbucket/vnmnn1ltsx/45c48cc",
  "status": "ERROR",
  "message": "No Route to Host from  10a39f1ddd74/172.31.0.7 to scm:9860 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see:  http://wiki.apache.org/hadoop/NoRouteToHost",
  "pass": false
}
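
For readers who want to tally results like the ones above, here is a minimal Python sketch that reads a stream of concatenated JSON objects (the shape shown in the sample output) and counts them by status. The helper names `iter_json_objects` and `count_by_status` are illustrative only and are not part of the Ozone CLI:

```python
import json
from collections import Counter
from typing import Dict, Iterator


def iter_json_objects(text: str) -> Iterator[Dict]:
    """Yield each top-level JSON value from a concatenated stream."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # Skip whitespace between objects; raw_decode does not accept it.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def count_by_status(text: str) -> Counter:
    """Count records per "status" value, e.g. BLOCK_EXISTS or ERROR."""
    return Counter(obj["status"] for obj in iter_json_objects(text))
```

Passing the captured command output as a string to `count_by_status` gives a quick per-status summary without relying on one object per line.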

@sadanand48
Contributor

This is exactly what ozone debug chunkinfo does, i.e., it gets block info from the OM, performs getBlock on all of its replicas in the pipeline, and prints the result. Why add another tool to do the same thing? If the block info is missing from any node in the debug chunkinfo output, it means MISSING; otherwise BLOCK_EXISTS.

@sarvekshayr
Contributor Author

> This is exactly what ozone debug chunkinfo does, i.e., it gets block info from the OM, performs getBlock on all of its replicas in the pipeline, and prints the result. Why add another tool to do the same thing? If the block info is missing from any node in the debug chunkinfo output, it means MISSING; otherwise BLOCK_EXISTS.

The two commands do have similar functionality.
@errose28 please take a look.

@errose28 errose28 added the tools Tools that helps with debugging label Mar 18, 2025
@errose28
Contributor

Thanks for looking at this, @sadanand48.

Although the two tools use the same backend API, they don't do the same thing. ozone debug replicas verify is designed to be run against large amounts of data and to produce minimal pass/fail output for each key and replica. In the final version, the output can be filtered further to show only failures and to omit failure reasons. See HDDS-12207 for a better idea of what this will look like.

ozone debug replicas chunk-info should give far more verbose output for a single key. That information does not make sense to aggregate at scale the way a list of pass/fail checks does. Having looked at this command, I think it needs some work to improve usability, which is described under HDDS-12645.
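
To make the filtering idea concrete, here is a small Python sketch of post-processing pass/fail records so that only failures remain, with the failure reason optionally dropped. It assumes each record is a dict with a "pass" field and that the reason is carried in a "message" field, as in the sample output in the PR description. The `failures_only` helper is hypothetical and does not represent the planned CLI options in HDDS-12207:

```python
from typing import Dict, Iterable, List


def failures_only(records: Iterable[Dict],
                  include_reason: bool = True) -> List[Dict]:
    """Keep records whose "pass" field is false; optionally drop "message"."""
    kept = []
    for rec in records:
        if rec.get("pass", False):
            continue  # passing records are filtered out
        if not include_reason:
            # Copy the record without the assumed reason field.
            rec = {k: v for k, v in rec.items() if k != "message"}
        kept.append(rec)
    return kept
```

A record with no "pass" field is treated as a failure in this sketch, which errs on the side of surfacing anything unexpected.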

Contributor

@errose28 errose28 left a comment

Thanks for adding this @sarvekshayr. Can we add an integration test for the failure cases, and an acceptance test for the CLI in the regular cases?

@adoroszlai adoroszlai marked this pull request as draft March 26, 2025 08:01
@sarvekshayr sarvekshayr marked this pull request as ready for review March 28, 2025 11:30
Contributor

@errose28 errose28 left a comment

Thanks for adding this! Just some minor comments.

@adoroszlai adoroszlai requested a review from errose28 April 7, 2025 15:55
@errose28
Contributor

errose28 commented Apr 8, 2025

I think this has been rolled into #8248. @sarvekshayr should we close this and the corresponding Jira and continue work there?

@sarvekshayr
Contributor Author

> I think this has been rolled into #8248. @sarvekshayr should we close this and the corresponding Jira and continue work there?

Yes, we can close this PR.

@sarvekshayr sarvekshayr closed this Apr 9, 2025