I'm expecting node aws lambda to export spans properly #647
Comments
Hi @longility - how did you build the lambda layer? You didn't mention it, so I wonder if you didn't check out opentelemetry-lambda: https://github.com/open-telemetry/opentelemetry-lambda. Also, we have a customization to export to X-Ray here if it provides more pointers: https://github.com/aws-observability/aws-otel-lambda
Hey @anuraaga, thanks for getting back and commenting.

In short, the lambda layer I created was built using this as a reference: https://github.com/open-telemetry/opentelemetry-lambda/tree/main/nodejs/packages/layer. I went through many attempts to get to where I am. My latest attempt has the lambda layer export to a container with a base image of

I tested my collector by deploying the container into an ECS cluster and ran a local GraphQL server against that remote collector, and that works as expected: I see the expected spans, such as the GraphQL and aws-sdk (DynamoDB) spans.

Another reason I think this is about spans not being exported, rather than about the connection to the remote collector, is that when I point at a bad OTEL endpoint it logs errors (I enabled logging to see that locally). In the lambda I am not receiving the error logs or the success logs. I also logged the endpoint to double check that the lambda uses the same endpoint I used locally, and it was the same.

AWS ADOT node lambda layer attempt

Followed this: https://aws-otel.github.io/docs/getting-started/lambda/lambda-js

Latest conclusion: I did not see aws-sdk or GraphQL spans, so I think the spans are not being exported through the OTel instrumentation and OTel collector; I think the traces I do see in X-Ray are being exported through the built-in X-Ray instrumentation, not OTel. That is why I decided to go through my latest experiment. Basically, I think the ADOT node lambda layer is not working because the spans are not being exported, per the issue I'm theorizing.

Alternative attempt

I was wondering if X-Ray was the issue, so I made an attempt to export to Honeycomb following this: https://aws-otel.github.io/docs/getting-started/lambda#custom-configuration-for-the-adot-collector-on-lambda

However, in my latest attempt I decoupled the collector to point to both Honeycomb and X-Ray, and both work as I expect, as mentioned earlier. So I have the collector side working, which leads me to conclude the problem is the lambda side of exporting spans.

My next debugging and exploring

On Monday, I think I will copy in the AWS lambda instrumentation so I can do more logging to see what paths it takes and understand how that instrumentation works. I'm new to all this, so I'm slowly having to unravel things to learn and debug. I am under the impression that the AWS lambda instrumentation is responsible for force flushing the spans before the function terminates; that is why I'm going down this path to see where it ends up. Additionally, I may attempt to short circuit a force flush to see if spans reach the collector, to further identify what is and isn't working.

@anuraaga do you have any other pointers? Not sure if you have tried the ADOT node lambda layer recently to have good confidence that it should be working.
If possible we need the long, not the short ;) So it sounds like you are building a layer from scratch, without the collector. And you are using the same exec wrapper technique? What changes have you made to the code? Is the collector running externally on ECS, with the lambda exporting to it?
The instrumentation force flushes so user code doesn't have to call it.
We run canary tests periodically which verify that spans for the AWS SDK are sent to X-Ray, so that is working for the public lambda layers. However, Honeycomb isn't yet supported on the public layer, which might be why you didn't see spans there. But if you saw problems with the X-Ray export, can you share the relevant parts of the lambda function definition?
Yes, I'm using the same wrapper technique. I think I found one of the issues. It is related to this refactor: open-telemetry/opentelemetry-js#2340 (comment). I'm not sure what the fix should be, due to coupling that may not be straightforward for me, such as the coupling to the dependency versions the node lambda layer references.
@longility Are you calling the OpenTelemetry API in your function, or do you just need it auto-instrumented? I think if you just use the same versions as the lambda layer's package.json (notably 0.24.0 of opentelemetry/node) it should work OK. If you use the API in your function, you'd need to make sure it's 1.0.1 to match the layer.
Yes, I'm using the API. I'll double check that tomorrow as I try to get things working. Thanks! I'm still new and challenged by so many packages! The debugging exercise is forcing me to learn.
I'm going to take back my conclusion about a version mismatch. I think what I have in code is fine. I think I got confused jumping between what is latest on GitHub and what is on my local.
@longility one other question - are you enabling the AWS X-Ray propagator as described here: https://github.com/open-telemetry/opentelemetry-js-contrib/blob/main/propagators/opentelemetry-propagator-aws-xray/README.md?
I am not. I just double checked, as I thought that was added by the AWS lambda instrumentation. I was wrong. It has a reference to it but doesn't enable it. Thanks!
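For reference, enabling the propagator typically looks something like the following sketch, based on the README linked above; the provider setup around it is assumed, not taken from this thread:

```ts
import { NodeTracerProvider } from "@opentelemetry/node";
import { AWSXRayPropagator } from "@opentelemetry/propagator-aws-xray";

const provider = new NodeTracerProvider();

// Register the X-Ray propagator so trace context is read from and written to
// the X-Amzn-Trace-Id header format that X-Ray expects.
provider.register({
  propagator: new AWSXRayPropagator(),
});
```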
@anuraaga I found the issue; with a code change, I was able to get it working. It has to do with this line:

Line 270 in ee0debf
What I determined, at least for my case, is that the tracer provider is a ProxyTracerProvider, so the instanceof BasicTracerProvider check on that line fails and the spans are never force flushed. I created a helper method to get the underlying BasicTracerProvider:

```ts
private _getBasicTracerProvider() {
  let realTracerProvider = this._tracerProvider;
  if (this._tracerProvider instanceof ProxyTracerProvider) {
    realTracerProvider = this._tracerProvider.getDelegate();
  }
  return realTracerProvider instanceof BasicTracerProvider
    ? realTracerProvider
    : undefined;
}
```

The caller then looks like this:

```ts
const basicTracerProvider = this._getBasicTracerProvider();
if (basicTracerProvider) {
  basicTracerProvider
    .getActiveSpanProcessor()
    .forceFlush()
    .then(
      () => callback(),
      () => callback()
    );
} else {
  callback();
}
```

If this makes sense, I can put in a PR for it.
Thanks for the dig @longility! @dyladan do you have any suggestion on how to invoke forceFlush from the instrumentation? Is there a better pattern than having to check instanceof for both BasicTracerProvider and ProxyTracerProvider?
I think we should add forceFlush to the API. Even with the above fix you could easily end up in problems if someone implements their own TracerProvider.
Should this problem only be occurring for a subset of use cases? I didn't customize much with regard to the tracer provider, and it uses the
No. I tried to add a force flush to the spec a while ago but it didn't get enough support.
From my quick investigation with limited knowledge, it seems to have been

With that said, I'd recommend getting the fix in even if it isn't ideal. It seems critical. Not sure why it hasn't come up before; I don't have enough context and history to piece this together.
I have 2 suggestions which would not require spec changes:
I know both of these aren't great solutions. I would far prefer to have a forceFlush function in the API, but that would require a spec change.
If the
Can you elaborate on this, as I was not able to follow? I'm not sure whose constructor you are referring to.
I was suggesting you do something very similar to the code you linked above, but instead of using instanceof checks, duck type by checking for the methods you need:

```ts
private _getBasicTracerProvider() {
  let realTracerProvider = this._tracerProvider;
  if (typeof this._tracerProvider.getDelegate === 'function') {
    realTracerProvider = this._tracerProvider.getDelegate();
  }
  if (typeof realTracerProvider.getActiveSpanProcessor === 'function') {
    return realTracerProvider;
  }
  return undefined;
}
```
I was suggesting the aws instrumentation config options be extended to have a flush option; then, when it is configured, you can pass a function which has the tracer provider in closure. Something along the lines of this:

```ts
const tp = new BasicTracerProvider(opts);
tp.addSpanProcessor(spanProcessor);
const awsInstr = new AwsLambdaInstrumentation({
  flush: async () => {
    await tp.getActiveSpanProcessor().forceFlush();
  },
});
```
Honestly, I would much rather have a force flush in the API, as I've already said above.
We could avoid the binding of instrumentations to the SDK by adding the above flush code in
While I like that approach if there are many scenarios and we want to make it explicit and DRY, I'm wondering how many scenarios actually need to force flush. Force flush in itself seems not ideal, and lambda may be the only special case that I can see. So unless there is more need, maybe hide the nastiness in the instrumentation that needs it; in this case, the aws lambda instrumentation.
I guess all FaaS environments would need this, not just AWS Lambda.
@Flarna Thanks for adding details to fill my knowledge gap. If that is the case and if I understand you correctly, then it would look something like this:

```ts
export interface Instrumentation {
  // ...
  // removed for brevity
  forceFlushSpans(): Promise<void>;
}
```
As for the implementation detail, I'm thinking we should force flush if it is available; if not available, then

One can spin their wheels wondering why the spans are not exported, like I did.
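To illustrate the "flush if available, otherwise make the failure visible" idea, here is a rough sketch; the duck-typed provider shape and the diag warning are my assumptions, not the actual instrumentation code:

```ts
import { diag } from "@opentelemetry/api";

// Hypothetical helper: flush spans when the provider exposes a span processor,
// otherwise warn so users aren't left wondering why nothing was exported.
async function flushIfPossible(provider: unknown): Promise<void> {
  const candidate = provider as {
    getActiveSpanProcessor?: () => { forceFlush: () => Promise<void> };
  };
  if (typeof candidate.getActiveSpanProcessor === "function") {
    await candidate.getActiveSpanProcessor().forceFlush();
  } else {
    diag.warn(
      "Tracer provider has no active span processor; spans may be lost when the Lambda freezes."
    );
  }
}
```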
@longility have you taken a crack at a PR here? If not, I might try to implement @dyladan's simpler suggestion to just do type checking on the tracer provider. I've also confirmed that this bug is what's causing tests to fail in #660, so I'd like to get a fix out ASAP and will probably make that PR tomorrow.
@willarmiros I didn't know the direction, so I didn't create a PR. I just copied in the code and made it work for us. I can send a quick PR to get things started.
I just read over this whole conversation because I am currently facing a similar issue: trying to instrument an Apollo GraphQL Server Lambda function and getting no exported spans. Based on what I've read here, it seems like this issue should have been resolved, and I made sure I'm using the most up-to-date packages. I detailed what I've tried, with a link to a repo with a demo, in this issue under Apollo Server's repo (which I realized after the fact is more likely an issue here rather than there): apollographql/apollo-server#5923 Please let me know if I am doing something incorrectly! I can also open a new issue if it seems like this has a different root cause; it just seemed pretty heavily related to this one, so I figured I'd comment here first.
@mkossoris I got your exact scenario working. What do you see in your lambda logs? You may need more logging for me to help.
@longility Interesting, I went ahead and re-deployed the repo to a new AWS account just to make sure I was starting from a clean slate. I am still not seeing the trace data in the logs. Here is some of the output I see when I run the local server:
As you can see, the trace data shows up properly when running it locally. And here is the output I see in the CloudWatch logs when running it on a deployed Lambda:
So none of the trace data seems to be exporting when running within the Lambda. What logs are you seeing when running it on Lambda?
Can you enable otel logging so we have a better idea?

```ts
import { diag, DiagConsoleLogger } from "@opentelemetry/api";
import { getEnv } from "@opentelemetry/core";

diag.setLogger(new DiagConsoleLogger(), getEnv().OTEL_LOG_LEVEL);
```

env var to enable: OTEL_LOG_LEVEL=all
Actually, I forgot: we created our own lambda layer rather than relying on what AWS provided. Not sure if you went that route.
@mkossoris do you have active tracing enabled on your Lambda? If not, have you manually added X-Ray permissions to your function role? Also, are you setting a different endpoint for the OTel collector via env var?
@longility Here is the result of adding the otel logging:
The debug logs are only displayed on startup - further requests only show the START, END, REPORT, and Billed Duration logs from Lambda.

@willarmiros and @longility For this particular test, I'm trying to remove variables by registering the instrumentation myself in the Lambda function and using the ConsoleExporter - not the CollectorExporter, and not using the Lambda extension. I have also tried it with the Lambda extension and had no luck with that.
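For context, the "no collector, console exporter only" debug setup described above would look roughly like this sketch; the package names are the 0.24.x-era ones, and the exact instrumentation list is an assumption based on the repo linked below:

```ts
import { NodeTracerProvider } from "@opentelemetry/node";
import { SimpleSpanProcessor, ConsoleSpanExporter } from "@opentelemetry/tracing";
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { HttpInstrumentation } from "@opentelemetry/instrumentation-http";
import { GraphQLInstrumentation } from "@opentelemetry/instrumentation-graphql";

const provider = new NodeTracerProvider();
// Print finished spans straight to CloudWatch logs instead of exporting to a collector.
provider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter()));
provider.register();

registerInstrumentations({
  instrumentations: [new HttpInstrumentation(), new GraphQLInstrumentation()],
});
```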
Here are some logs from running it just with the Lambda Layer:
With those logs, we can see that the Lambda layer is clearly running, but it is only collecting traces about the Lambda invocation and not the GraphQL spans. And this trace data does show in X-Ray, so it isn't a permissions issue between the Lambda function and X-Ray.
Hi @mkossoris
I think this is a great way to debug! I've personally found it useful to confirm that you can see the traces in the Lambda logs first before you expect them to arrive at the Collector and then to the AWS X-Ray console. Could you share the code you use to initialize OTel JS? Particularly, do you register the instrumentation for GraphQL? I can see from your debug logs above that your OTel JS seems to be instrumenting the
Maybe you could try making a request with the following:

```ts
http.get('http://aws.amazon.com', () => {});
```

Also, not sure if you are aware, but we have documentation for using the published ADOT Lambda for NodeJS Lambda Layer (i.e.
Hey @NathanielRN! I've created a repo with the code I've been using here: https://github.com/mkossoris/apollo-server-lambda-otel-issue. It does register the GraphQL instrumentation here: https://github.com/mkossoris/apollo-server-lambda-otel-issue/blob/master/src/open-telemetry.ts#L20. I'll definitely try adding the http call just to see what happens. As for the ADOT Lambda configuration I tried, you can see the CDK code for that here: https://github.com/mkossoris/apollo-server-lambda-otel-issue/blob/master/cdk/lib/cdk-stack.ts#L10
@mkossoris I couldn't make too much sense of the log. However, looking at your code, know that opentelemetry-js does monkey patching based on actual package names, so if you use esbuild through the NodejsFunction construct, it will minify and bundle the code and not be able to instrument properly. Secondly, I have not actually gotten the OTel bootstrapping to work within the lambda function, only the lambda layer extension. Thirdly, tracing needs to be enabled, as mentioned by @willarmiros, even if you aren't using X-Ray.
@longility I was just coming back here to say I found out it was because of using the NodejsFunction construct, after trying it with a standard Lambda function instead... facepalm. That being said, the extension still did not work entirely by itself as far as pushing traces to X-Ray goes. I was, however, able to get it to work by importing the open-telemetry file from my example and replacing the ConsoleExporter with the CollectorTraceExporter, which pushes the traces to the Lambda extension's Collector port, and that did actually work. Progress at least!
So that leaves me with an important question: is there no way for us to bundle our code and use OpenTelemetry's instrumentations? With Lambda size limits and how large a node_modules dir can be, bundling is going to be pretty essential for us.
I see there's this issue that discusses OpenTelemetry's issues with esbuild: aws-observability/aws-otel-lambda#99. At the end, the creator of the ticket says:
Which sort of implies there was some workaround? Or is there no known workaround for using esbuild or webpack with OTel instrumentations at this time? The only thing I can think of off the top of my head is to mark the OTel instrumentations and their related packages (e.g. @open-telemetry/instrumentation-graphql and graphql) as externals and upload the external packages unbundled in node_modules or in a layer. Haven't tested whether this works yet, though.
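As a sketch of that externals idea using esbuild's build API directly (the package list, entry point, and output paths are illustrative assumptions, not a tested configuration):

```ts
import { build } from "esbuild";

build({
  entryPoints: ["src/server.ts"],
  bundle: true,
  platform: "node",
  outdir: "dist",
  // Keep OTel and the packages it patches out of the bundle so require-time
  // monkey patching still sees the real module names; ship them unbundled
  // in node_modules or via a layer instead.
  external: ["@opentelemetry/*", "graphql"],
}).catch(() => process.exit(1));
```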
As far as I know, there is no workaround for the minify problem with the OTel JS monkey patching at this time. You would have to take the node_modules size hit. Since you are dealing with GraphQL and Lambda, I assume you would compose with other lambda functions with Apollo federation. Are there concerns for you about latency for the end user? I'm asking to think long term about whether Lambda is the solution, as we have been asking these types of questions on our end.
@longility Yep, that's correct, we're intending to follow a similar design to what IMDb is using, which they posted an article about here: https://aws.amazon.com/blogs/compute/building-federated-graphql-on-aws-lambda/. It uses a Lambda to host the gateway and one Lambda per subgraph.
@longility So this is a pretty hacky initial attempt, but it surprisingly works lol:

```ts
const oTelLambdaLayer = lambda.LayerVersion.fromLayerVersionArn(
  this,
  "oTelLambdaLayer",
  "arn:aws:lambda:us-west-2:901920570463:layer:aws-otel-nodejs-ver-1-0-0:1"
);

const server = new NodejsFunction(this, "ServerFunction", {
  entry: "../src/server.ts",
  handler: "handler",
  memorySize: 1500,
  environment: {
    OTEL_LOG_LEVEL: "all",
    AWS_LAMBDA_EXEC_WRAPPER: "/opt/otel-handler",
  },
  layers: [oTelLambdaLayer],
  tracing: lambda.Tracing.ACTIVE,
  bundling: {
    commandHooks: {
      afterBundling(inputDir, outputDir) {
        // Manually copy OpenTelemetry packages and OTel-instrumented packages to output dir
        return [
          `cp ../build/open-telemetry.js ${outputDir}`,
          `mkdir ${outputDir}/node_modules`,
          `cp -r ../node_modules/@opentelemetry ${outputDir}/node_modules/@opentelemetry`,
          `cp -r ../node_modules/http ${outputDir}/node_modules/`,
          `cp -r ../node_modules/graphql ${outputDir}/node_modules/`,
        ];
      },
      beforeBundling() {
        return [];
      },
      beforeInstall() {
        return [];
      },
    },
    keepNames: true,
    // Mark OpenTelemetry package and OTel-instrumented packages as external so they aren't bundled
    externalModules: [
      '@opentelemetry/instrumentation-http',
      '@opentelemetry/instrumentation-express',
      '@opentelemetry/instrumentation',
      '@opentelemetry/instrumentation-graphql',
      '@opentelemetry/instrumentation-aws-lambda',
      '@opentelemetry/api',
      '@opentelemetry/core',
      '@opentelemetry/resources',
      '@opentelemetry/tracing',
      '@opentelemetry/node',
      '@opentelemetry/exporter-collector',
      'http',
      'graphql',
    ],
  },
});
```

Here's the resulting trace in X-Ray:

The resulting Lambda code asset is 6.9 MB vs 13 MB if I uploaded the entire thing, and the difference would likely get quite a bit larger as more dependencies are added. So it might be a little hacky, but it might not be too terrible of a hack to use...
Glad you're making progress.
Just to update, I believe the layer actually did start working once I removed esbuild. The CloudWatch logs were only showing the Lambda invocations from the Collector, so I was under the impression they would show the other trace data as well. But in fact the trace data does appear to be showing up in X-Ray.
Hey all, I found a couple of solutions to the esbuild issues:
Here is how I solved it, within the build script:
What version of OpenTelemetry are you using?
What version of Node are you using?
v14.17.4
What did you do?
Created a lambda layer, starting from this: https://github.com/open-telemetry/opentelemetry-js#instantiate-tracing
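For reference, the tracing setup from that README at the time looked roughly like the following sketch; the collector endpoint is a placeholder and the instrumentation list is an assumption, not the exact layer code:

```ts
import { NodeTracerProvider } from "@opentelemetry/node";
import { BatchSpanProcessor } from "@opentelemetry/tracing";
import { CollectorTraceExporter } from "@opentelemetry/exporter-collector";
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { AwsLambdaInstrumentation } from "@opentelemetry/instrumentation-aws-lambda";

const provider = new NodeTracerProvider();
// Batch and export spans over OTLP/HTTP to the collector (placeholder endpoint).
provider.addSpanProcessor(
  new BatchSpanProcessor(
    new CollectorTraceExporter({ url: "http://<collector-host>:55681/v1/traces" })
  )
);
provider.register();

registerInstrumentations({
  instrumentations: [new AwsLambdaInstrumentation()],
});
```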
What did you expect to see?
I was expecting to see
statusCode: 200
to indicate that it got exported correctly. I saw this in the debug log when I was running a server, not a lambda function. Also, I am not seeing any associated traces. I am testing against the same collector.

What did you see instead?
Here is the actual log
Additional context
Any ideas how to debug or understand what the issue is on why the spans are not exporting?
I was led here from: open-telemetry/opentelemetry-js#1295 (comment)
/cc @willarmiros @anuraaga