improve cli startup time #36237

celesteking · 2024-12-18T16:41:13Z

Terraform Version

Terraform v1.9.5-dev
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v5.80.0

Proposal

Something needs to be done in regards to terraform cli startup time. I'm using aws provider which currently weighs 621MB. Even a simple terraform validate on an almost empty project takes 3 seconds over here.
As discussed in Slack, it seems that significant amount of that is taken up by the checksum calculation. I can't even imagine what happens in big projects where several providers are used, with, I guess, would amount to several GBs of monster mono-files.

Looking at strace over here, terraform spends 2.5 seconds between open on that provider and close, between which there are reads of 32K size. I assume that's the hash calculation.
You really really should stop doing the work of the filesystem and the underlying drives to try to safeguard against file corruption. Why is the CRC needed on startup? If you want to protect against accidental provider editing, just check against mtime, but even then, why would you need that? Are you protecting against the version misuse? Name the provider files properly, or pre/suf-fix them with a hash. I don't think there's ambiguity in linux_amd64/terraform-provider-aws_v5.80.0_x5, so why is all this stuff needed at all?

The text was updated successfully, but these errors were encountered:

jbardin · 2024-12-18T19:05:24Z

Hi @celesteking,

What you're seeing is probably verifying the binary for the dependency lock file. This isn't to try and prevent filesystem corruption, it's to verify the provider binary is the one expected to ensure the security of the entire system.

There's definitely some improvement to be had regarding optimizing the provider binary sizes.

celesteking · 2024-12-18T19:23:01Z

That's security through obscurity. Nothing stops malicious actor from modifying the lock file to align with a compromised provider. I don't mind verifying downloaded/"pulled" provider on first use against some sort of super-trusted hash table (or x509 sig or whatever), but I mind when it's done on each invocation. At least provide me the option to opt out.

jbardin · 2024-12-18T19:52:53Z

The lock file is checked into VCS and stored outside of the provider cache, and providers are often fetched on demand. There is no obscurity to the system as it is well-defined, and there are customers with security needs which are satisfied by the lock file system, so the utility of the system is a separate concern from the startup time taken.

It would be useful to verify the timings here though. On my system much more time is spent executing the binary and making the RPC calls than generating the hash of the provider. I wouldn't be surprised that removing the hash generation doesn't show a significant change since the binary would still need to be loaded into the page cache anyway during execution.

Testing that theory with the AWS provider,

% time tf validate
Success! The configuration is valid.

terraform validate  1.39s user 0.21s system 153% cpu 1.038 total

and the hash check took 263ms.

After removing the hash check:

% time tf validate
Success! The configuration is valid.

terraform validate  1.33s user 0.18s system 205% cpu 0.735 total

This is obviously faster storage than may always be available, but it shows that generating the hash is not the entire story.

celesteking · 2024-12-18T21:55:59Z

That doesn't look like a Linux system and you'll need to find one to test that. I wouldn't be surprised if this drama isn't prominent on OSX somehow.
strace -fx -tt -e trace=openat,close,execve -e s=none terraform validate will highlight the key events there.
1.9s between provider open and close over here. A bit later there's an execve which you have referred to as "executing the binary".
I'm running off a tmpfs, but even if I were to run from an FS, it would've indicated zero I/O as everything's been pagecached anyway.

just tried on aws smallest arm instance, it's even worse: t4g.nano.txt

celesteking · 2024-12-19T00:12:26Z

Alright, so the story is:

--- internal/providercache/cached_provider.go.orig      2024-12-19 01:08:54.254959255 +0000
+++ internal/providercache/cached_provider.go   2024-12-19 01:08:37.901876303 +0000
@@ -68,6 +68,7 @@
 // Unlike the singular MatchesHash, MatchesAnyHash considers unsupported hash
 // formats as successfully non-matching, rather than returning an error.
 func (cp *CachedProvider) MatchesAnyHash(allowed []getproviders.Hash) (bool, error) {
+       return true, nil
        return getproviders.PackageMatchesAnyHash(cp.PackageLocation(), allowed)
 }

And the results are:

      flat  flat%   sum%        cum   cum%
    1570ms 56.07% 56.07%     1570ms 56.07%  crypto/sha256.block
     200ms  7.14% 63.21%      450ms 16.07%  runtime.scanobject
     100ms  3.57% 66.79%      100ms  3.57%  internal/runtime/syscall.Syscall6

vs

      flat  flat%   sum%        cum   cum%
     160ms 13.79% 13.79%      430ms 37.07%  runtime.scanobject
      70ms  6.03% 19.83%      210ms 18.10%  runtime.mallocgc
      50ms  4.31% 24.14%       50ms  4.31%  runtime.memmove
      50ms  4.31% 28.45%       50ms  4.31%  runtime.typePointers.next

~3sec vs ~1.3sec.

jbardin · 2024-12-19T14:33:16Z

Thanks @celesteking, that gives us a good example of what is probably the worst real-world case for startup time.

jen20 · 2024-12-21T20:41:13Z

It feels like it would be enough to add a "fast path" behind a flag to avoid verifying checksums - similar to how provider dev overrides work.

I fundamentally disagree that this is "security through obscurity" though - and even if it was, obscurity is a valid layer of defense.

celesteking added enhancement new new issue not yet triaged labels Dec 18, 2024

jbardin added cli performance and removed new new issue not yet triaged labels Dec 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

improve cli startup time #36237

improve cli startup time #36237

celesteking commented Dec 18, 2024 •

edited

Loading

jbardin commented Dec 18, 2024

celesteking commented Dec 18, 2024

jbardin commented Dec 18, 2024 •

edited

Loading

celesteking commented Dec 18, 2024 •

edited

Loading

celesteking commented Dec 19, 2024 •

edited

Loading

jbardin commented Dec 19, 2024

jen20 commented Dec 21, 2024

improve cli startup time #36237

improve cli startup time #36237

Comments

celesteking commented Dec 18, 2024 • edited Loading

Terraform Version

Proposal

jbardin commented Dec 18, 2024

celesteking commented Dec 18, 2024

jbardin commented Dec 18, 2024 • edited Loading

celesteking commented Dec 18, 2024 • edited Loading

celesteking commented Dec 19, 2024 • edited Loading

jbardin commented Dec 19, 2024

jen20 commented Dec 21, 2024

celesteking commented Dec 18, 2024 •

edited

Loading

jbardin commented Dec 18, 2024 •

edited

Loading

celesteking commented Dec 18, 2024 •

edited

Loading

celesteking commented Dec 19, 2024 •

edited

Loading