Skip to content

Conversation

@elek
Copy link
Member

@elek elek commented Jul 29, 2020

What changes were proposed in this pull request?

The goal of this task is to create a new insight point for the datanode container protocol (HddsDispatcher) to be able to debug client<->datanode communication.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-2660

How was this patch tested?

Manually. See the video:

https://www.youtube.com/watch?v=msQgfF95ivc&list=PLCaV-jpCBO8UK5Ged2A_iv3eHuozzMsYv&index=7&t=0s

;-)

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @elek for implementing this (and creating the video, too).

In my experience:

$ ozone insight log datanode.dispatcher -f pipeline=...
$ ozone insight log -v datanode.dispatcher -f pipeline=...

works, but the second command by itself does not. I can see the TRACE level messages in datanode logs, but not in the terminal where insight is run. No idea why.

Comment on lines 78 to 79
addProtocolMessageMetrics(metrics, "hdds_dispatcher",
Type.SCM, ScmBlockLocationProtocolProtos.Type.values());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be leftover from ScmProtocolBlockLocationInsight. I think it should be something like Type.DATANODE, ContainerProtos.Type.values() instead. (BaseInsightPoint needs to be tweaked to allow it.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something which couldn't be easily fixed. Type.DATANODE is not supported for metrics. I can throw and UnsupportedOperationException instead of showing the metrics.


@Override
public boolean filterLog(Map<String, String> filters, String logLine) {
return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain why this is needed? I think this may cause logs from other pipelines to be printed.

For example, I ran ozone insight logs on a 3-node pipeline, then:

$ ozone sh key put -r ONE /vol1/bucket1/passwd /etc/passwd

and the logs from the single datanode to which this block was written (part of the 3-node pipeline, too) appeared in the console.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two parts of the filtering:

  1. where to connect
  2. what to display (based on [...] text from the log

Unfortunately the pipeline id is not available in the HddsDispatcher, therefore we couldn't filter out the other pipelines (and we need this code segment to show all the lines)

@elek
Copy link
Member Author

elek commented Aug 31, 2020

Closing this pr temporary. This is on my list but today, it's pending.

I will improve the insight first to support multiple nodes, or making the datanode insight is to be a one-node insight point...

@elek elek closed this Aug 31, 2020
@elek
Copy link
Member Author

elek commented Sep 3, 2020

Both of the open questions are answered with an improvement: instead of using pipeline as a filter I started to use datanode (host or ip) as a filter. With this approach we can check the metrics and it's more obvious what is printed out.

@elek elek reopened this Sep 3, 2020
@elek
Copy link
Member Author

elek commented Sep 7, 2020

@adoroszlai

I think the new approach addressed all the earlier problems. Can we merge it now?

@adoroszlai
Copy link
Contributor

Thanks @elek for updating the patch. Interestingly log -v stopped working, even if I execute log first (as mentioned previously).

@elek
Copy link
Member Author

elek commented Sep 21, 2020

Thanks @elek for updating the patch. Interestingly log -v stopped working, even if I execute log first (as mentioned previously).

I double-checked and it worked for me. When you use leader, the messages are displayed immediately, when you use follower, the messages will be appeared only after the commit...

@adoroszlai
Copy link
Contributor

Thanks @elek for updating the patch. Interestingly log -v stopped working, even if I execute log first (as mentioned previously).

I double-checked and it worked for me. When you use leader, the messages are displayed immediately, when you use follower, the messages will be appeared only after the commit...

It seems to depend on the content. Plain text files work fine, but it stops working on the first binary, eg. ozone freon ockg -n1 -t1. I guess it's caused by control chars in the random data.

I think we should avoid logging chunk content. ContainerCommandRequestMessage implements related logic to clear data.

@elek
Copy link
Member Author

elek commented Sep 23, 2020

I think we should avoid logging chunk content. ContainerCommandRequestMessage implements related logic to clear data.

I checked but it's not something which can be added easily. The OzoneProtocolMessageDispatcher is very generic. To support this one, it should be modified to accept a very specific function as an argument which transforms the original message just for the tracing.

And would also introduce significant overhead in case of tracing is turned on.

I am not sure if it's worth it, but I can add it if you think so.

@adoroszlai
Copy link
Contributor

adoroszlai commented Sep 23, 2020

I think we should avoid logging chunk content. ContainerCommandRequestMessage implements related logic to clear data.

I checked but it's not something which can be added easily.

Thanks for checking. I'm fine with doing it as a follow-up (created HDDS-4271).

@adoroszlai adoroszlai merged commit 64026dd into apache:master Sep 24, 2020
errose28 added a commit to errose28/ozone that referenced this pull request Sep 28, 2020
* master:
  HDDS-4102. Normalize Keypath for lookupKey. (apache#1328)
  HDDS-4263. ReplicatiomManager shouldn't consider origin node Id for CLOSED containers. (apache#1438)
  HDDS-4282. Improve the emptyDir syntax (apache#1450)
  HDDS-4194. Create a script to check AWS S3 compatibility (apache#1383)
  HDDS-4270. Add more reusable byteman scripts to debug ofs/o3fs performance (apache#1443)
  HDDS-2660. Create insight point for datanode container protocol (apache#1272)
  HDDS-3297. Enable TestOzoneClientKeyGenerator. (apache#1442)
  HDDS-4324. Add important comment to ListVolumes logic (apache#1417)
  HDDS-4236. Move "Om*Codec.java" to new project hadoop-ozone/interface-storage (apache#1424)
  HDDS-4254. Bucket space: add usedBytes and update it when create and delete key. (apache#1431)
  HDDS-2766. security/SecuringDataNodes.md (apache#1175)
  HDDS-4206. Attempt pipeline creation more frequently in acceptance tests (apache#1389)
  HDDS-4233. Interrupted exeception printed out from DatanodeStateMachine (apache#1416)
  HDDS-3947: Sort DNs for client when the key is a file for #getFileStatus #listStatus APIs (apache#1385)
  HDDS-3102. ozone getconf command should use the GenericCli parent class (apache#1410)
  HDDS-3981. Add more debug level log to XceiverClientGrpc for debug purpose (apache#1214)
  HDDS-4255. Remove unused Ant and Jdiff dependency versions (apache#1433)
  HDDS-4247. Fixed log4j usage in some places (apache#1426)
  HDDS-4241. Support HADOOP_TOKEN_FILE_LOCATION for Ozone token CLI. (apache#1422)
errose28 added a commit to errose28/ozone that referenced this pull request Sep 28, 2020
* HDDS-4122-remove-code-consolidation: (21 commits)
  Restore files that had deduplicated code from master
  Revert other delete request/response files back to their original states on master
  HDDS-4102. Normalize Keypath for lookupKey. (apache#1328)
  HDDS-4263. ReplicatiomManager shouldn't consider origin node Id for CLOSED containers. (apache#1438)
  HDDS-4282. Improve the emptyDir syntax (apache#1450)
  HDDS-4194. Create a script to check AWS S3 compatibility (apache#1383)
  HDDS-4270. Add more reusable byteman scripts to debug ofs/o3fs performance (apache#1443)
  HDDS-2660. Create insight point for datanode container protocol (apache#1272)
  HDDS-3297. Enable TestOzoneClientKeyGenerator. (apache#1442)
  HDDS-4324. Add important comment to ListVolumes logic (apache#1417)
  HDDS-4236. Move "Om*Codec.java" to new project hadoop-ozone/interface-storage (apache#1424)
  HDDS-4254. Bucket space: add usedBytes and update it when create and delete key. (apache#1431)
  HDDS-2766. security/SecuringDataNodes.md (apache#1175)
  HDDS-4206. Attempt pipeline creation more frequently in acceptance tests (apache#1389)
  HDDS-4233. Interrupted exeception printed out from DatanodeStateMachine (apache#1416)
  HDDS-3947: Sort DNs for client when the key is a file for #getFileStatus #listStatus APIs (apache#1385)
  HDDS-3102. ozone getconf command should use the GenericCli parent class (apache#1410)
  HDDS-3981. Add more debug level log to XceiverClientGrpc for debug purpose (apache#1214)
  HDDS-4255. Remove unused Ant and Jdiff dependency versions (apache#1433)
  HDDS-4247. Fixed log4j usage in some places (apache#1426)
  ...
@elek
Copy link
Member Author

elek commented Oct 5, 2020

I checked but it's not something which can be added easily.

Thanks for checking. I'm fine with doing it as a follow-up (created HDDS-4271).

You were right, you did it easily ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants