-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cold starts double to quadruple when layer is included #228
Comments
Hi @ianwremmel, thanks for raising this issue! We've actually gotten issues on things like this before. A user on the OpenTelemetry CNCF Slack channel did a useful deep dive into investigating cold start impacts of Lambda with the ADOT NodeJS Lambda Layer. Unfortunately it seems that the best solution is to bump up the memory capacity of your Lambda which does improve Cold Start time. If your Lambda is being used a lot this should prove to be a good idea. Let me know if that helps! |
Gotcha, thanks! Could you share that post somehow? I don't seem to be able to access that Slack. |
After a bit of experimenting (and seeing a serendipitous tweet this morning), I've determined that my cold start times without the layer were bad due to running the the With the unexpected overhead of So, unless you've got other suggestions, it sounds like the AWS otel lambda support isn't viable for my project. |
|
Thanks! yea, it looks I'm in kind of the same boat. I didn't see the kind of speedup he did by increasing memory, but even if I had, that would be way too expensive. I don't think there's any way I can use AWS otel, given these numbers, but maybe there's still a chance to use otel instrumentation and ship directly to third-party receivers (e.g., honeycomb). |
One improvement you might try to reduce the cold start time is to use ESBuild instead of CommonJS build for your Node JS app on Lambda. ESBuild is nicer because it does operations like tree-shaking to seriously reduce the size of imported libraries which reduces memory and improves performance. There are discussions upstream on OpenTelemetry JS Contrib to improve support for ESBuild because it requires more careful configuration. You might find the comment thread at open-telemetry/opentelemetry-js-contrib#647 (comment) helpful towards this goal and improve your Node App's memory impact in a way that would make ADOT Lambda layers usable for your purpose 🙂 The solution utilizes esbuild's "external" option and a custom plugin. "external" is an array of packages you tell esbuild to not bundle, and the plugin picks up those external packages and installs them into a raw node_modules directory when building. OpenTelemetry needs to monkey patch methods, so you can't bundle the libraries it needs to instrument (e.g. |
I'm already using rollup to do pretty aggressive treeshaking and I've tried putting all the otel packages on my dependencies layer. Unfortunately, as soon as I include the AWS Open Telemetry layer, cold boot times skyrocket. I've made a few attempts at switch my lambdas over to Note that the ping lambda I've been talking about above is about 40k after being processed by rollup. |
came back to poke at this a little bit more and it seems like most of the overhead comes from adding |
Another interesting detail. This line from the docs site, when used with a remix app, seems to mostly prevent the function from returning. From what I can tell, initialization takes more than 30 seconds. (For reasons not entirely clear to me, I couldn't get the lambda timeout to be more than 30 seconds, but that might be a sleep-deprivation issue on my part).
|
Actually, that's the wrong line. I think it's this line:
Apparently there's some side effect somewhere in |
It's the API gateway timeout that has hard limit at 30 seconds. Looking at the corresponding xray trace, it looks like the lambda takes 1.2 minutes to initialize when both remix and Admittedly, remix is only the most obvious package in this lambda. There are any number of frontend packages that I could imagine might not play well with otel; the bundle file is 3MB. |
After a bunch of refactoring to get |
@ianwremmel why did you need to refactor to get these components to work? They should work out of the box with the Lambda layer instrumentation |
@willarmiros I was having a whole lot of trouble getting the layer to actually collect my spans and send them anywhere. First, the layer doesn't configure a few things that one would otherwise expect it to configure:
a side effect is that following, say, the aws otel javascript tracing guide has no effect, because there's already been a tracer configured by the layer. Since otel doesn't provide a way to get access to the original Second, again because there's no way to get the Finally, there have been a lot of API changes the majority of aws-otel.github.io was written, so they're all slightly incorrect. In any case, after working through those issues, my original coldstart issue and the super high duration on my Remix lambdas went away. This is what my final set of config overrides looks like // Copied from
// https://github.com/open-telemetry/opentelemetry-lambda/blob/0a83149fe2f23a7dab64df6108cfa35f18cc2ae5/nodejs/packages/layer/src/wrapper.ts#L42-L50
declare global {
// in case of downstream configuring span processors etc
function configureTracerProvider(tracerProvider: NodeTracerProvider): void;
function configureTracer(defaultConfig: NodeTracerConfig): NodeTracerConfig;
function configureSdkRegistration(
defaultSdkRegistration: SDKRegistrationConfig
): SDKRegistrationConfig;
function configureLambdaInstrumentation(
config: AwsLambdaInstrumentationConfig
): AwsLambdaInstrumentationConfig;
}
// And this needs to be defined so that we can flush at the end of each
// invocation.
declare global {
// var seems to be the only typescript keyword that works here.
// eslint-disable-next-line no-var
var tracerProvider: NodeTracerProvider;
}
// Remove some noise from the service name
const serviceName = process.env.AWS_LAMBDA_FUNCTION_NAME?.replace(
my-stack-name',
''
)?.split('-')?.[0];
// This produces the config passed to new NodeTracerProvider()
global.configureTracer = ({resource, ...config}) => {
diag.debug('Telemetry: generating tracer config');
if (serviceName) {
const r = new Resource({
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]:
process.env.STAGE_NAME ?? 'development',
[SemanticResourceAttributes.FAAS_INSTANCE]:
process.env.AWS_LAMBDA_LOG_STREAM_NAME ?? '',
[SemanticResourceAttributes.FAAS_MAX_MEMORY]:
process.env.AWS_LAMBDA_FUNCTION_MEMORY_SIZE ?? '',
[SemanticResourceAttributes.SERVICE_NAME]: serviceName,
});
resource = resource ? resource.merge(r) : r;
}
return {
...config,
idGenerator: new AWSXRayIdGenerator(),
resource,
};
};
// This gets applied to the tracerProvider
global.configureTracerProvider = (tracerProvider) => {
diag.debug('Telemetry: configuring tracer provider');
const exporter = new OTLPTraceExporter({
url: 'grpc://localhost:4317',
});
tracerProvider.addSpanProcessor(new BatchSpanProcessor(exporter));
global.tracerProvider = tracerProvider;
};
// This produces the config passed to tracerProvider.register();
global.configureSdkRegistration = (config) => {
diag.debug('Telemetry: generating SDK registration config');
const contextManager = new AsyncHooksContextManager();
contextManager.enable();
return {
...config,
contextManager
};
}; |
So, it wasn't really |
Having similar performance issue. Nodejs Lambda with 1536mb: cold start goes from 500ms with x-ray sdk to 1700ms with this layer and default configuration from aws docs. What makes me crazy is that aws documentation suggests to use this OTEL layer over x-ray while still having such issues. |
Similar situation with Go. I was testing out following the advice to switch to Open Telemetry, spent a few hours on troubleshooting etc., only to find that the Lambda layer adds 200ms(!) to cold start times compared to using the X-Ray SDK. Disappointed to see the advice to look at Open Telemetry, without any warning of the performance penalty. The only reason I did was because I thought I could cut out some of the cold start time (the Go AWS X-Ray has about 6MB of dependency bloat for me). With 1GB of RAM, I'm seeing init duration go from around 100ms, to 300ms. I don't see how this is workable. 🤷🏻 I have a test Lambda function. Used CloudWatch Log Insights to compare before and after.
(The first 50ms coming off the cold start time in the data is due to other optimisations). |
@a-h It might be worth cross-posting that to https://github.com/open-telemetry/opentelemetry-lambda/ (which is the underlying repo) as that means the collector itself is taking that long to boot. |
This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled |
Still a very valid issue. |
In fact, I think it’s gotten worse with the latest releases. |
Any updates? Is this being looked into? |
Achievement unlocked: It's not everyday you stumble upon a Github thread with screenshots of your Slack conversation! I've been looking into this problem again and it does seem as though ADOT performance on cold starts is still rather poor, but my guess is that it's being caused by the underlying repo https://github.com/open-telemetry/opentelemetry-lambda If anyone is interested, I'm putting together some automated benchmarks to find the optimal memory setting and implementation when tracing in lambda: https://github.com/mattfysh/lambda-trace-perf I do find it ironic that we're seeing these bottlenecks. Tracing instrumentations are typically used to measure and monitor application performance, so it does give pause when I see vendors are not fine-tuning their own shared libraries and runtimes required by developers. |
Agreed. Honestly, I think the real solution here is "stop running the collector in a Lambda layer". I haven't looked into it yet, but I've been noodling on how to run the collector in ECS. I did some manual benchmarking a few weeks ago (I don't have the numbers anymore) and just adding the ADOT layer without even setting the env vars to induce instrumentation adds nearly 800ms. Somewhere, I heard that AWS was planning to make xray more OTel native, so hopefully someday it'll be possible to just let the XRay infrastructure do the collection. |
I think it's still worth have an "agent" in a layer to take advantage of things like the Telemetry API and handling batch. That said, that agent doesn't necessarily need to be the Otel collector. Maybe something more purpose built would make more sense like a very thin rust agent to collect logs, traces, and batch and forward otel data? (I promise I'm not mentioning rust cuz it's the cool kid, but rather because real world perf in lambda: https://maxday.github.io/lambda-perf/). Edit: A native, AWS/Lambda level support for Otel would def be the best solution so there's no (user visible) overhead whatsoever. |
yea, there are certainly advantages, but at least for my purposes, the overhead of using any layer seems to be too much of a performance impact to justify using that layer. I've jumped through a lot of goofy hoops to remove multiple layers. From what I can tell, layers only have "good" performance when you hit a scale where coldstarts are an adequately small portion of your traffic that they average their way off of your dashboards. |
This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled |
Bump. As far as I'm aware, this issue hasn't been resolved even with the introduction of the new lambda repo. |
Performance of OTel in Lambda is receiving attention again. If anyone would like to benchmark the current solutions I put together a rudimentary repo here: https://github.com/mattfysh/lambda-trace-perf I've not looked at it in a while (I've since moved away from serverless because of various cold boot issues, including this one) but if anyone wants to work together on it, I'd be happy to jump back into it |
See also: open-telemetry/opentelemetry-lambda#727 |
Any updates? |
I'm going to dig into this a little bit this week to try to understand where the startup time is rooted in ADOT's setup. I'll get a baseline with zero components registered and introduce them one by one. If this isn't resolved by the team, I'm going to resort to running a central collector to which my lambdas export, and avoid a collector as a layer altogether. Until the Lambda runtime makes an OTel collector a first-class citizen like the X-Ray daemon is, I suspect an OTel collector is simply too heavyweight to be a layer. Unless of course, someone decides to write an OTel collector in Rust. Which would be substantially better imho. |
This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled |
This is still a concern as far as I'm aware. |
This is a blocking issue for us to use opentelemetry in lambda |
Any updates on this issue? We had to remove the layer from our lambdas due to these issues. |
I too removed the layer. It was a nonstarter. An external collector makes more sense for Lambda at this time. |
@jnicholls would you be able to share anything about your external collector setup? I've been interested in doing something like that, but the pricing has always looked prohibitive. |
This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled |
This is still a major issue and it's causing me to run new workloads on fly.io instead of lambda. |
any update on this ? |
Is your feature request related to a problem? Please describe.
I'm just getting started with OpenTelemetry and just the act of adding the ADOT NodeJS layer to my functions makes the simplest ping function from 1-2 to 4-5 seconds and my remix app's cold start goes from 8 seconds to more than 30 (I've got the function set to timeout at 30).
Is there any configuration that I can do to improve that or is that just the way it has to be?
Describe the solution you'd like
Minimal impact on cold start times.
Describe alternatives you've considered
The text was updated successfully, but these errors were encountered: