-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GRPC trace API #199
GRPC trace API #199
Conversation
Thanks for this @esnible 👍 I've pinged people to take a look at this, I'll be back with my review as well. |
@matej-g Any feedback? |
Hey @esnible, I believe @squat could also give us qualified feedback here, as he was initially having thoughts about how to implement the gRPC interceptors (#158). |
Thanks for this and make sure to ping us on the #observatorium CNCF channel or through private means next - that got blocked too long 🙈 I would love if we can add gRPC API if Otel gRPC is more efficient and popular. That should fit perfectly in our space. In terms of https://github.com/mwitkow/grpc-proxy feel free to use it. I have somehow written permissions and looks like many wants to jump in and help: mwitkow/grpc-proxy#52. Definitely something running in production in many places. (: In terms of interceptors please use For authorization - this interceptor might be a good start: https://github.com/grpc-ecosystem/go-grpc-middleware/blob/v2/interceptors/auth/auth.go We might want to convert some middleware to interceptors: https://github.com/observatorium/api/blob/main/main.go#L552 Overall all decisions here makes sense to me! 💪🏽 |
Note: This code requires the 2018 version of mwitkow/grpc-proxy. The repo was untouched; I thought it was orphaned, but just a week ago it got an update that breaks this. I have opened an issue mwitkow/grpc-proxy#54 |
7f9346b
to
3d5880f
Compare
What seems to be the issue exactly with using latest version? On a quick look, the changes seem to be more on the surface to modernize the codebase, rather than changing how the library works. My thinking is, if we're already introducing this dependency and it has been just updated, I would not make these changes to require to use the old version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this @esnible! I think the direction is looking good.
It would also be great to have E2E tests for tracing API, but I'm happy to have them after the fact (as long as we don't forget about it 🙂).
api/traces/v1/api.go
Outdated
if fullMethodName == traceRoute { | ||
var err error | ||
|
||
if conn == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know what happens if DialContext
below errors out? Is conn
guaranteed to be nil
in such case? I'm thinking about a case when we dial returns error, but because conn
is not nil, we would not try to connect again on the second method call.
Also, is there any downside to dialing and obtaining conn
outside of the director method? I would find that to be a bit simpler and if there's problem with dialing to collector, we can return error before we even enter director logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matej-g I am now explicitly setting conn
to nil
on failure.
I haven't tested this code yet for performance or reliability. It passes my simple test: Observatorium-api forwards a trace. Even if the OTel collector is restarted, and the Observatorium-api forwards the next trace. More testing is needed for correct keep-alives and for the OTel collector restarting while Observatorium-api is streaming traces.
main.go
Outdated
|
||
//nolint:lll | ||
func gRPCServer(cfg *config, tenantHeader string, tenantIDs map[string]string, pm *authentication.ProviderManager, authorizers map[string]rbac.Authorizer, logger log.Logger) (*grpc.Server, error) { | ||
director := tracesv1.NewHandler(cfg.traces.writeEndpoint, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering whether in our case a simple NewProxy
would not suffice instead of introducing director. The way I see it, we don't need to necessarily intercept the method, as this is something that collector would refuse if e.g. a user tries to use unimplemented method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR only handles incoming traces. However, Jaeger supports both HTTP and gRPC query APIs. There will be different RBAC security for forwarding traces and queries.
authentication/oidc.go
Outdated
token = authorization[1] | ||
} | ||
|
||
idToken, err := a.verifier.Verify(oidc.ClientContext(ctx, a.client), token) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like from this line below, the code is very similar with how HTTP middleware handles this. Do you think we could extract those parts and reuse a single method for both HTTP and gRPC middleware?
@matej-g I have spent two days trying to get the latest version to proxy OpenTelemetry protobufs through gRPC. Sending data either with grpcurl or with the Open Telemetry gRPC trace exporter gives similar problems. I discuss the problem at mwitkow/grpc-proxy#55 I am extremely concerned about requiring a back-level mwitkow/grpc-proxy. I am looking for a work-around.
|
Understood, thanks for putting in the effort @esnible 👍, then let's see if get an answer from upstream and in the meantime we can perhaps settle the remaining questions. |
api/traces/v1/api.go
Outdated
// because the codec we need to register is also deprecated. A better fix, if Google removes | ||
// the deprecated type, is https://github.com/mwitkow/grpc-proxy/pull/48 | ||
grpc.WithCodec(grpcproxy.Codec()), // nolint: staticcheck | ||
grpc.WithInsecure(), // nolint: staticcheck |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grpc.WithInsecure(), // nolint: staticcheck | |
grpc.WithTransportCredentials(insecure.NewCredentials()) |
secondary, but so the linter would give us a green light.
@matej-g I would also like to have tests. I see tests in the test/e2e directory. I did not understand them. Can you point me to some docs on test writing? I have some ideas on how to test the GRPC piece end-to-end with just 2 Open Telemetry agents -- no actual Jaeger or span storage. Would this be desirable? Or should the E2E tests install the full Jaeger, possibly including ElasticSearch, and be as close as possible to a valid deployment?
|
@matej-g I discovered a bug where the Observatorium-api gets an interrupt and restarts every few minutes. I am looking into it; please review but don't merge yet. I also notice that it is checking the tenancy header before checking the |
We are using the https://github.com/efficientgo/e2e framework, you'll find everything documented there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
Overall, functionally makes sense. Let us know if you need help in setting e2e tests. Code looks fine too, just make sure to not handle error twice. If you see something like
foo, err := bar()
if err != nil {
// produce error/warning/debug log line
return err
}
That is generally quite bad. The reason is that the passed error is most likely handled in some way - this some way is usually to log it on server side, put metric, return error to user. If every function would add log line when returning error we have lots of wasted information and effort.
HTTP middleware are tricky as usually indeed the returned error is handled by returning HTTP/gRPC response to client/user. But in order to log this I would rather implement logging middleware/interceptor than spending 100 code lines to add log line to every error handling. WDYT?
api/traces/v1/api.go
Outdated
} | ||
} | ||
|
||
// NewHandler creates the new traces v1 handler. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// NewHandler creates the new traces v1 handler. | |
// NewOTelConnection creates new GRPC connection to Otel handler. |
api/traces/v1/api.go
Outdated
} | ||
|
||
// HandlerOption modifies the handler's configuration. | ||
type HandlerOption func(h *handlerConfiguration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really handler? This looks like NewOtelConnection
looks like client rather?
api/traces/v1/api.go
Outdated
|
||
const TraceRoute = "/opentelemetry.proto.collector.trace.v1.TraceService/Export" | ||
|
||
type handlerConfiguration struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type handlerConfiguration struct { | |
type connOptions struct { |
api/traces/v1/api.go
Outdated
o(c) | ||
} | ||
|
||
level.Info(c.logger).Log("msg", "gRPC dialing OTel collector", "endpoint", write) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this collector client? I would not presume what's the end target here. End target for this connection might be anything that supports Otel gRPC, no?
main.go
Outdated
connOtel, err := tracesv1.NewOTelConnection(cfg.traces.writeEndpoint, | ||
tracesv1.WithLogger(logger), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
connOtel, err := tracesv1.NewOTelConnection(cfg.traces.writeEndpoint, | |
tracesv1.WithLogger(logger), | |
) | |
connOtel, err := tracesv1.NewOTelConnection( | |
cfg.traces.writeEndpoint, | |
tracesv1.WithLogger(logger), | |
) |
main.go
Outdated
} | ||
|
||
//nolint:lll | ||
func gRPCServer(cfg *config, tenantHeader string, tenantIDs map[string]string, pmis authentication.GRPCMiddlewareFunc, authorizers map[string]rbac.Authorizer, logger log.Logger) (*grpc.Server, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func gRPCServer(cfg *config, tenantHeader string, tenantIDs map[string]string, pmis authentication.GRPCMiddlewareFunc, authorizers map[string]rbac.Authorizer, logger log.Logger) (*grpc.Server, error) { | |
func newOtelGRPCServer(cfg *config, tenantHeader string, tenantIDs map[string]string, pmis authentication.GRPCMiddlewareFunc, authorizers map[string]rbac.Authorizer, logger log.Logger) (*grpc.Server, error) { |
main.go
Outdated
|
||
// Currently we only proxy TraceService/Export to OTel collectors. | ||
// In the future we will pass queries to Jaeger, and possibly other | ||
// gRPC methods for logs and metrics to different connections |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// gRPC methods for logs and metrics to different connections | |
// gRPC methods for logs and metrics to different connections. |
main.go
Outdated
// the deprecated type, is https://github.com/mwitkow/grpc-proxy/pull/48 | ||
grpc.CustomCodec(grpcproxy.Codec()), // nolint: staticcheck | ||
grpc.UnknownServiceHandler(grpcproxy.TransparentHandler(director)), | ||
// Authorization (tenant) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's visible what interceptors are, we could remove this comment.
main.go
Outdated
authentication.WithGRPCTenantInterceptors(logger, pmis), | ||
auth.StreamServerInterceptor( | ||
authorization.WithGRPCAuthorizers(authorizers, gRPCRBAC, logger)), | ||
// Log when a trace cannot be forwarded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.. why?
Signed-off-by: Ed Snible <[email protected]>
Signed-off-by: Ed Snible <[email protected]>
Signed-off-by: Ed Snible <[email protected]>
Signed-off-by: Ed Snible <[email protected]>
Signed-off-by: Ed Snible <[email protected]>
Signed-off-by: Ed Snible <[email protected]>
Signed-off-by: Ed Snible <[email protected]>
Signed-off-by: Ed Snible <[email protected]>
Signed-off-by: Ed Snible <[email protected]>
Signed-off-by: Ed Snible <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing my comments @esnible, this is looking nicely 👍, I have few more comments but I would not like to stall this PR any further. I'll be happy to approve once you get a chance to respond to the latest review.
api/traces/v1/api.go
Outdated
// service supporting opentelemetry.proto.collector.trace.v1.TraceService | ||
level.Info(c.logger).Log("msg", "gRPC dialing OTel", "endpoint", write) | ||
|
||
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for the 1 minute time out here? Could we do with shorter time out?
authentication/oidc.go
Outdated
// it lets the caller see messages such as | ||
// "failed to authenticate: oidc: token is expired (Token Expiry: 2022-02-03 14:05:36 -0500 EST)" | ||
// which are super-helpful in troubleshooting. | ||
return ctx, fmt.Sprintf("%s: %v", "failed to authenticate", err), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the HTTP behavior might be by design, see discussion here #235 (comment), so worth considering if the same should be apply here?
Signed-off-by: Ed Snible <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks you @esnible! This is look good to me now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! It looks definitely good enough 👍🏽
Signed-off-by: Ed Snible [email protected]
This PR creates an Open Telemetry gRPC endpoint for the method
/opentelemetry.proto.collector.trace.v1.TraceService/Export
. This proof-of-concept does not (yet) contain RBAC security or tenancy. This version uses https://github.com/mwitkow/grpc-proxy to proxy the gRPC without deserializing it.Previously I created pull 196 which exposed the OTLP Trace HTTP endpoint. That was straightforward using the Chi-router.
The final implementation needs RBAC and tenancy. I haven't yet found anything that is the equivalent to the
chi.Router
for gRPC. Maybe https://github.com/grpc-ecosystem/go-grpc-middleware ?It is too early to review this code, beyond the architectural ideas and the choice of libraries, parameter names, etc.