feat: initial implementation of route to pipeline stage#1786
Conversation
route to pipeline stage
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1786 +/- ##
========================================
Coverage 84.16% 84.17%
========================================
Files 482 485 +3
Lines 140170 140617 +447
========================================
+ Hits 117979 118369 +390
- Misses 21657 21714 +57
Partials 534 534
🚀 New features to boost your workflow:
|
lquerel
left a comment
There was a problem hiding this comment.
I made few non-blocking suggestions.
| let mut execution_state = ExecutionState::new(); | ||
| execution_state.set_extension::<RouterExtType>(Box::new(RouterImpl::new())); |
| .find(|p| p.as_ref() == route_name.as_str()) | ||
| .ok_or_else(|| EngineError::ProcessorError { | ||
| processor: effect_handler.processor_id(), | ||
| kind: ProcessorErrorKind::Transport, | ||
| error: "Routing error: ".into(), | ||
| source_detail: format!("out_port name {} not configured", route_name), | ||
| })? | ||
| .clone(); |
There was a problem hiding this comment.
Why not using the method send_message_to and checking the error ?
There was a problem hiding this comment.
send_message_to takes as an arg P: Into<PortName> where type PortName = Cow<'static, &str>
otel-arrow/rust/otap-dataflow/crates/engine/src/local/processor.rs
Lines 175 to 179 in 4e7f6a9
otel-arrow/rust/otap-dataflow/crates/config/src/lib.rs
Lines 64 to 65 in 4e7f6a9
But the route_name here is an Rc<String> that might not have the `static lifetime, so I think to make this into a PortName I'd need to copy it, which I was trying to avoid
…s.rs Co-authored-by: Laurent Quérel <laurent.querel@gmail.com>
| { | ||
| Ok(otap_batch) => { | ||
| self.metrics.msgs_transformed.inc(); | ||
| self.handle_routed_messages(effect_handler).await?; |
There was a problem hiding this comment.
This will drain/send the routed batches only when the pipeline execution succeed. What if the pipeline fails - these routed batches will remain cached, and leak into next successful run ? Should we be clearing the router in Err branch below (line 247) ?
| /// Output data to a sink identified by name. | ||
| /// Currently this contains a static string because it's the only way we handle identifying | ||
| /// where to output the data. In the future we could support dynamic sink identified by a | ||
| /// variable, result of a function call, or other some expression, at which point we can change | ||
| /// this to contain the more general `StaticExpression`. | ||
| NamedSink(StringScalarExpression), |
There was a problem hiding this comment.
@albertlockett I just noticed this. Sorry for the late feedback but here it is...
-
"Sink" What is a sink in the context of AST? Maybe "NamedDestination" would be better. Remember it is an explicit goal this AST\code NOT be coupled to the collector.
-
Current design seems very limited. As far as routing is concerned, I can see different needs. Fan out\send\split, stuff like that.
Consider something like this:
enum RouteDataExpression { MapTo(RouteDestinationExpression), // Send the data and quit SplitTo(RouteDestinationExpression) // Send a copy of the data and continue on } enum RouteDestinationExpression { Named(StringScalarExpression) }
There was a problem hiding this comment.
re point 1: imo "sink" is kind of a synonym for to "destination". I don't see how one is any more coupled to the collector than the other, but I'm fine to change it to destination if you think it's a better name. I'm not personally hung up on the naming one way or the other.
re point 2: the intention of the output expression was simply that the data would be emitted to some destination, and that the variants of the enum would control how the destination is defined. We could change the design to have the expression variants control how data is routed in the pipeline, but that could also be achieved through other types of data expressions. For example, we could imagine a fork with multiple branches as child expressions, which could achieve the "copy the data and continue on" scenario in your example.
There was a problem hiding this comment.
Sounds like we need a design discussion 😄
There was a problem hiding this comment.
Yeah that'd be great! We can discuss during wednesday's SIG meeting
Related to #1784
Adds an operator the columnar query engine that can be used to route an OTAP batch to some destination. The main use case is to have the transform processor capable of sending telemetry batches to different out ports, where the behaviour is defined by the query it is executing.
PipelineStageis implemented calledRouterPipelineStageOutputDataExpressionExample:
Some additional notes on the design:
Routing implementation is pluggable:
Although the main use case is to direct the batches to some out port, I didn't want to couple the implementation of the columnar query engine to the DF pipeline. This means I didn't want code in the query-engine crate referencing things that handle pdata routing like
EffectHandlerormessage::Senderfrom the engine crate.In general, I'm imagine use cases where pipelines driven by OPL could be executed in a variety of contexts, that may need to route data to a variety of destinations.
To make the router routing behaviour customizable, the
pipeline::routermodule exposes aRoutertrait which users of columnar query-engine can implement.Extensions & Execution State:
RouterPipelineStagewill need to be able to find the implementation ofRouter. This PR adds the concept ofExecutionStateand "extensions", which are a map of instances of types that pipeline stages may need during their execution.The benefit of this "extension" pattern is that it helps improve future extensibility. For example, we could imagine users may eventually implement custom
PipelineStages, which have external dependencies that need to be injected at runtime. Having these "extension"s available makes this possible.The concept of "extensions" is similar to datafusion's
SessionConfigextensions, but having our own implementation provides us with some benefits: our pipeline stages execute in a single threaded runtime, so extension's types don't need to beSend+Syncand can be accessed mutably.The
ExecutionStateas a concept also has some auxiliary benefits beyond simply being a repository of extensions. In the future, there may be other mutable state that needs to be updated by pipeline stages such as metrics or state related to stream processing. Introducing this type now is the foundation for these future features.Followups:
RouteToPipelineStageemits an empty batch after the incoming batch is sent elsewhere. It's forced to do this by the trait signature ofPipelineStage. This is OK for now, but in the future we probably want to introduce the concept of a "terminal pipeline stage" as a special type of pipeline stage consumes the batch.