
Conversation

@westonpace
Member

@westonpace commented Jan 18, 2023

This PR does two things. First, it requires that all "tasks" (for the AsyncTaskScheduler, not the executor) have a name. Second, it simplifies and cleans up the way that exec nodes report their tracing using a TracedNode helper class.
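
Roughly, the idea behind the helper is something like the following sketch (simplified and illustrative only; the names and the direct OpenTelemetry calls here are not the actual implementation):

#include <string>
#include <utility>

#include <opentelemetry/trace/provider.h>

namespace trace_api = opentelemetry::trace;

// Simplified sketch of a "traced node" helper: a node owns one of these and each
// InputReceived/Finish override just asks it for a scope.  The helper opens a
// span tagged with the node's label and ends it when the scope is destroyed.
class TracedNodeSketch {
 public:
  explicit TracedNodeSketch(std::string label) : label_(std::move(label)) {}

  // RAII guard: ends the span and pops it from the active context when destroyed.
  struct SpanGuard {
    opentelemetry::nostd::shared_ptr<trace_api::Span> span;
    trace_api::Scope scope;
    ~SpanGuard() { span->End(); }
  };

  SpanGuard TraceInputReceived(int64_t batch_length) const {
    auto tracer = trace_api::Provider::GetTracerProvider()->GetTracer("acero");
    auto span = tracer->StartSpan("InputReceived");
    span->SetAttribute("node.label", label_.c_str());
    span->SetAttribute("batch.length", batch_length);
    return SpanGuard{span, trace_api::Scope(span)};
  }

 private:
  std::string label_;
};

With something like this in place, a node's override collapses to a single line such as auto scope = traced_.TraceInputReceived(batch.length);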


@westonpace
Member Author

Example Plan Trace:

[screenshot of the example plan trace]

Note: I can already see something I find rather interesting in the above trace. The final pipeline calls ScalarAggregateNode::Finish, which finalizes the aggregates and then calls SinkNode::InputReceived. Admittedly, finalizing aggregates is not much work; however, all InputReceived should be doing is pushing the batch onto a queue, so I'm surprised how much longer InputReceived takes to run.

@westonpace requested a review from lidavidm January 18, 2023 07:41
@westonpace
Member Author

CC @mbrobbel please review if you have a chance

@westonpace
Member Author

Looks like I still have a CI issue. Converting to draft while I work that out.

@westonpace marked this pull request as draft January 18, 2023 07:59
Contributor

@joosthooz left a comment


Thanks a lot for this! I still have to try this branch myself, but I left some questions.

span, span_, "InputReceived",
{{"node.label", label()}, {"batch.length", batch.length}});

auto scope = TraceInputReceived(batch);
Contributor

If the sink applies backpressure (e.g. the dataset writer), it does not result in an event being created on the span, as is the case for the normal SinkNode here https://github.com/apache/arrow/pull/33738/files#diff-967cff6ef1964402635ac0769dece741ca0ed58bcefb8d669ecfe3ed8371998eR175
Is there a way to do this? For example, we could compare the value of backpressure_counter_ before and after calling Consume(), but then we wouldn't know whether backpressure was applied or released.
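
Something along these lines is what I had in mind (purely a sketch; backpressure_counter_ and Consume() here stand in for whatever the node actually exposes):

void InputReceived(ExecNode* input, ExecBatch batch) override {
  auto scope = TraceInputReceived(batch);
  int64_t counter_before = backpressure_counter_;
  Consume(std::move(batch));
  if (backpressure_counter_ != counter_before) {
    // We can tell the counter changed but not whether backpressure was
    // applied or released -- that is the limitation described above.
    EVENT(span_, "BackpressureChanged",
          {{"backpressure.counter", backpressure_counter_}});
  }
}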

@westonpace
Member Author

The backpressure is applied in the dataset writer (it has its own queue). However, we can add tracing events in the dataset writer. I'm kind of interested now to see what a dataset writer trace looks like. I will add something.

Contributor

Yes, I think that makes a lot of sense, especially because the dataset writer submits tasks to the IO executor that can run in parallel. All of that would otherwise be hidden behind a single ConsumingSinkNode span.

}

void InputReceived(ExecNode* input, ExecBatch batch) override {
EVENT(span_, "InputReceived", {{"batch.length", batch.length}});
Contributor

Does this need a TraceInputReceived or NoteInputReceived?

@westonpace
Member Author

It gets it from MapNode. This will be more obvious in #15253 because subclasses will no longer implement InputReceived (they will use MapNode::InputReceived and instead just implement ProcessBatch).
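
Very roughly, the shape will be something like this (just a sketch of the idea, not the actual code from that PR):

void MapNode::InputReceived(ExecNode* input, ExecBatch batch) {
  auto scope = TraceInputReceived(batch);  // tracing handled once, in the base class
  ProcessBatch(std::move(batch));          // each subclass only implements this part
}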

}

void InputReceived(ExecNode* input, ExecBatch batch) override {
EVENT(span_, "InputReceived", {{"batch.length", batch.length}});
Contributor

Does this need a TraceInputReceived?

}

void InputReceived(compute::ExecNode* input, compute::ExecBatch batch) override {
EVENT(span_, "InputReceived", {{"batch.length", batch.length}});
Contributor

Does this need a TraceInputReceived?

MapNode::Finish(std::move(finish_st));
return;
}
auto scope = TraceFinish();
Contributor

Why do we do a TraceFinish here and not inside DatasetWriter::Finish? That's the only thing inside this trace, and it seems weird to only trace it if it is inside a TeeNode (e.g. it is missing here https://github.com/apache/arrow/pull/33738/files#diff-2caf4e9bd3f139e05e55dca80725d8a9c436f5ccf65c76a37cebfa6ee9b36a6aL418).

@joosthooz
Contributor

In the figure, why does WaitForFinish(SinkNode:) end earlier than the ScalarAggregate? Can we add a name (and maybe even an id number in case there are multiple) to the names so that we know which node a span refers to? Lastly, there is a SinkNode but it doesn't seem to perform any work

@westonpace
Member Author

In the figure, why does WaitForFinish(SinkNode:) end earlier than the ScalarAggregate?

The code looks roughly like:

// The numbers mark the order in which things happen.  The sink marks itself
// finished (5) before the aggregate (6) and the source (7), which is why the
// sink's WaitForFinish span ends first.
void SinkNode::ReceiveLastBatch(batch) {
  output_queue.Enqueue(batch);       // 4
  finished_.MarkFinished();          // 5
}

void AggregateNode::ReceiveLastBatch(batch) {
  Enqueue(batch);                    // 2
  aggregates = ComputeAggregates();  // 3
  output->ReceiveLastBatch(batch);   // runs 4 and 5 in the sink
  finished_.MarkFinished();          // 6
}

void SourceNode::ReceiveLastBatch(batch) {
  output->ReceiveLastBatch(batch);   // 1
  finished_.MarkFinished();          // 7
}

Can we add a name (and maybe even an id number in case there are multiple) to the names so that we know which node a span refers to?

All of the node-specific spans and events should have the node label as an attribute. I don't think it's displayed here. The node label defaults (I think) to NodeType:NodeCounter, but I'll check.

Lastly, there is a SinkNode but it doesn't seem to perform any work

There are two kinds of sinks. The SinkNode has an external queue. All it does is push batches into the queue. So no, it should not be doing any work. The ConsumingSinkNode assumes the batch is consumed as part of the plan (e.g. dataset write) and has no output but it does do work.

@westonpace
Member Author

Here's a trace from a dataset write. There are still things that could be cleaned up here. Pretty much all of the DatasetWriter:: traces are a mix of active CPU time and idle I/O time, and it isn't clear which is which.

[Jaeger UI screenshot of the dataset write trace]

@joosthooz
Contributor

Nice, I think the WriteAndCheckBackpressure span is important because that's where the backpressure is checked, and it also performs some work combining staged batches (in PopStagedBatch). Maybe that even deserves its own span, because it is not always called (only if enough rows have arrived in the Push function).
Shouldn't there also be a span created in the lambda that gets submitted to the IO executor in WriteNext? That's where the actual writing (and Parquet encoding & compression) is being performed.
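
For example, something along these lines (just a sketch with made-up names, not the real WriteNext code): the submitted task could capture the current span and start a child span around the actual write:

namespace trace_api = opentelemetry::trace;

auto tracer = trace_api::Provider::GetTracerProvider()->GetTracer("arrow.dataset");
auto parent = tracer->GetCurrentSpan();

io_executor->Spawn([tracer, parent, batch = std::move(batch)]() {
  trace_api::StartSpanOptions options;
  options.parent = parent->GetContext();  // keep the write under the span that submitted it
  auto span = tracer->StartSpan("WriteBatch", options);
  trace_api::Scope scope(span);
  // ... Parquet encoding, compression, and the actual write happen here ...
  span->End();
});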

@westonpace
Member Author

@joosthooz I slightly changed things so that the current task will be used as the parent instead of the scheduler. This makes it clearer that WriteAndCheckBackpressure is actually creating some of those following spans.

However, at this point, I think we are veering away from my original goal, which was "remove the dependence on the exec node finished future so I can do away with it, but don't break OT worse than it already was".

I think I'd like to merge this in as it is. Would you be interested in investigating better ways of handling spans in a future PR?

Shouldn't there also be a span created in the lambda that gets submitted to the IO executor in WriteNext? That's where the actual writing (and Parquet encoding & compression) is being performed.

I think I/O of any kind is generally interesting enough to always justify a span.

@westonpace marked this pull request as ready for review January 20, 2023 05:39
@joosthooz
Contributor

joosthooz commented Jan 23, 2023

Hi, I gave this branch a spin (this code reads in a partitioned CSV dataset and writes it to a partitioned Parquet dataset), and it seems that the nesting has become inconsistent:
[screenshot of the trace]
There are 2 ReadBatch spans under InitialTask. One of them has all of the FragmentsToBatches spans as its children (these were nested under the SourceNode before). The other keeps recursively nesting more ReadBatch spans. Each has a ProcessMorsel, which has the filter, project, and sink spans nested under each other. Then the dataset writer also keeps nesting WriteAndCheckBackpressure.
[screenshot of the nested spans]
Is there a way to go back to making most of these spans siblings again?
Do we want to change the organization of the spans in this PR from having 1 span for each node in the graph, each having a span for every chunk of data it processes (how it was before), to having a ProcessMorsel for each chunk of data, each having a span for each node it traverses through?
I think I can help in a follow-up PR, especially for the dataset writer.

@westonpace
Member Author

Is there a way to go back to making most of these spans siblings again?

Yes. The recursion is probably somewhat accurate, but I agree it makes it harder to read. Having them as siblings makes sense too; I will revert back to that arrangement.

Do we want to change the organization of the spans in this PR from having 1 span for each node in the graph, each having a span for every chunk of data it processes (how it was before), to having a ProcessMorsel for each chunk of data, each having a span for each node it traverses through?

Yes, I believe so. A more generic term than ProcessMorsel would be "pipeline" or "plan fragment". Acero is (implicitly) a "plan fragment" engine in that we have one task per batch per fragment. Thinking about it this way, it would be nice if we had something like the conceptual model:

  • ProcessMorsel
    • Filter
    • Project
    • Sink

But today it ends up being (because the last part of each node is to call the downstream node):

  • ProcessMorsel
    • Filter
      • Project
        • Sink

Yet another case where the conceptual/logical understanding is different from the physical understanding. Although, perhaps there is some merit to the physical understanding, as it mirrors reality more closely. Perhaps it depends on the goal of the reader. If someone is trying to improve the threading and execution of the plan itself, they might want the physical model. If someone is trying to focus on the performance of a single node, they might want the logical model. I'll leave that for a follow-up PR.
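
As a rough sketch of how the logical layout could be produced (illustrative OpenTelemetry calls, not anything in this PR): each node span would be started with the morsel span as an explicit parent instead of whatever span happens to be active when the node runs:

namespace trace_api = opentelemetry::trace;

auto tracer = trace_api::Provider::GetTracerProvider()->GetTracer("acero");
auto morsel_span = tracer->StartSpan("ProcessMorsel");

trace_api::StartSpanOptions options;
options.parent = morsel_span->GetContext();  // every node span parents to the morsel

auto filter_span = tracer->StartSpan("Filter", options);
// ... run the filter ...
filter_span->End();

auto project_span = tracer->StartSpan("Project", options);
// ... run the projection ...
project_span->End();

auto sink_span = tracer->StartSpan("Sink", options);
// ... push the batch into the sink ...
sink_span->End();

morsel_span->End();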

I think I can help in a follow-up PR, especially for the dataset writer.

Great. Mentally, when I think of the dataset writer, I think there are two parts. The first part should be the trailing part of the fragment/pipeline that feeds the writer. In this first part we partition the batch, select the appropriate file queues, and deposit the batches into the queues. There is then a separate dedicated thread task to write each batch to the writer.
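
In pseudocode, the mental model is roughly (just a sketch, not actual code):

// Part 1: the tail of the pipeline/fragment that feeds the writer
void OnBatchArrived(batch) {
  for (auto& [partition, slice] : PartitionBatch(batch)) {
    file_queues[partition].Push(slice);  // cheap: partition + enqueue only
  }
}

// Part 2: a separate dedicated task writes each queued batch to the file writer
void WriteQueuedBatch(partition, batch) {
  writers[partition].Write(batch);  // encoding and I/O happen here, off the compute thread
}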

@westonpace
Member Author

I'm going to merge this as-is to unblock the error handling cleanup. We can fine tune in future PRs.

@westonpace merged commit 589b5b2 into apache:master Jan 23, 2023
@lidavidm
Member

I was about to start reviewing it, sorry for the delay 😅 In any case, Joost already took a look here fortunately.

@westonpace
Member Author

I was about to start reviewing it, sorry for the delay 😅 In any case, Joost already took a look here fortunately.

No problem. I might be moving a little fast, but I think the tracing stuff is still pretty experimental at the moment. One thing I forgot to note is that I have upgraded OT from 1.4 to 1.8. I found that 1.4 could not connect directly to a collector for some reason; 1.8 seems to work out of the box. Also, I noticed that 1.8 now has a Jaeger exporter, which could potentially avoid the need to have a collector at all. I tried to enable it but quickly ran into trouble with the bundled build, which seems to be pretty custom.

@lidavidm
Member

The upgrade sounds fine. I think OT itself is also moving fast so that might explain the incompatibility.

As mentioned in the original OT PRs, there's a tension between whether Arrow counts as a library or an application to OT. Really we shouldn't be setting up any exporters at all, letting the application control it all, but that is inconvenient/impossible for Python users at the moment...

@ursabot

ursabot commented Jan 23, 2023

Benchmark runs are scheduled for baseline = b9d1162 and contender = 589b5b2. 589b5b2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed] ursa-i9-9960x
[Failed] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] 589b5b2b ec2-t3-xlarge-us-east-2
[Failed] 589b5b2b test-mac-arm
[Failed] 589b5b2b ursa-i9-9960x
[Failed] 589b5b2b ursa-thinkcentre-m75q
[Failed] b9d11627 ec2-t3-xlarge-us-east-2
[Failed] b9d11627 test-mac-arm
[Failed] b9d11627 ursa-i9-9960x
[Failed] b9d11627 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java


Development

Successfully merging this pull request may close these issues.

[C++] Simplify tracing in exec plan

4 participants