Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve cli startup time #36237

Open
celesteking opened this issue Dec 18, 2024 · 7 comments
Open

improve cli startup time #36237

celesteking opened this issue Dec 18, 2024 · 7 comments

Comments

@celesteking
Copy link

celesteking commented Dec 18, 2024

Terraform Version

Terraform v1.9.5-dev
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v5.80.0

Proposal

Something needs to be done in regards to terraform cli startup time. I'm using aws provider which currently weighs 621MB. Even a simple terraform validate on an almost empty project takes 3 seconds over here.
As discussed in Slack, it seems that significant amount of that is taken up by the checksum calculation. I can't even imagine what happens in big projects where several providers are used, with, I guess, would amount to several GBs of monster mono-files.

Looking at strace over here, terraform spends 2.5 seconds between open on that provider and close, between which there are reads of 32K size. I assume that's the hash calculation.
You really really should stop doing the work of the filesystem and the underlying drives to try to safeguard against file corruption. Why is the CRC needed on startup? If you want to protect against accidental provider editing, just check against mtime, but even then, why would you need that? Are you protecting against the version misuse? Name the provider files properly, or pre/suf-fix them with a hash. I don't think there's ambiguity in linux_amd64/terraform-provider-aws_v5.80.0_x5, so why is all this stuff needed at all?

@celesteking celesteking added enhancement new new issue not yet triaged labels Dec 18, 2024
@jbardin
Copy link
Member

jbardin commented Dec 18, 2024

Hi @celesteking,

What you're seeing is probably verifying the binary for the dependency lock file. This isn't to try and prevent filesystem corruption, it's to verify the provider binary is the one expected to ensure the security of the entire system.

There's definitely some improvement to be had regarding optimizing the provider binary sizes.

@celesteking
Copy link
Author

That's security through obscurity. Nothing stops malicious actor from modifying the lock file to align with a compromised provider. I don't mind verifying downloaded/"pulled" provider on first use against some sort of super-trusted hash table (or x509 sig or whatever), but I mind when it's done on each invocation. At least provide me the option to opt out.

@jbardin
Copy link
Member

jbardin commented Dec 18, 2024

The lock file is checked into VCS and stored outside of the provider cache, and providers are often fetched on demand. There is no obscurity to the system as it is well-defined, and there are customers with security needs which are satisfied by the lock file system, so the utility of the system is a separate concern from the startup time taken.

It would be useful to verify the timings here though. On my system much more time is spent executing the binary and making the RPC calls than generating the hash of the provider. I wouldn't be surprised that removing the hash generation doesn't show a significant change since the binary would still need to be loaded into the page cache anyway during execution.

Testing that theory with the AWS provider,

% time tf validate
Success! The configuration is valid.

terraform validate  1.39s user 0.21s system 153% cpu 1.038 total

and the hash check took 263ms.

After removing the hash check:

% time tf validate
Success! The configuration is valid.

terraform validate  1.33s user 0.18s system 205% cpu 0.735 total

This is obviously faster storage than may always be available, but it shows that generating the hash is not the entire story.

@celesteking
Copy link
Author

celesteking commented Dec 18, 2024

That doesn't look like a Linux system and you'll need to find one to test that. I wouldn't be surprised if this drama isn't prominent on OSX somehow.
strace -fx -tt -e trace=openat,close,execve -e s=none terraform validate will highlight the key events there.
1.9s between provider open and close over here. A bit later there's an execve which you have referred to as "executing the binary".
I'm running off a tmpfs, but even if I were to run from an FS, it would've indicated zero I/O as everything's been pagecached anyway.

just tried on aws smallest arm instance, it's even worse: t4g.nano.txt

@celesteking
Copy link
Author

celesteking commented Dec 19, 2024

Alright, so the story is:

--- internal/providercache/cached_provider.go.orig      2024-12-19 01:08:54.254959255 +0000
+++ internal/providercache/cached_provider.go   2024-12-19 01:08:37.901876303 +0000
@@ -68,6 +68,7 @@
 // Unlike the singular MatchesHash, MatchesAnyHash considers unsupported hash
 // formats as successfully non-matching, rather than returning an error.
 func (cp *CachedProvider) MatchesAnyHash(allowed []getproviders.Hash) (bool, error) {
+       return true, nil
        return getproviders.PackageMatchesAnyHash(cp.PackageLocation(), allowed)
 }

And the results are:

      flat  flat%   sum%        cum   cum%
    1570ms 56.07% 56.07%     1570ms 56.07%  crypto/sha256.block
     200ms  7.14% 63.21%      450ms 16.07%  runtime.scanobject
     100ms  3.57% 66.79%      100ms  3.57%  internal/runtime/syscall.Syscall6

vs

      flat  flat%   sum%        cum   cum%
     160ms 13.79% 13.79%      430ms 37.07%  runtime.scanobject
      70ms  6.03% 19.83%      210ms 18.10%  runtime.mallocgc
      50ms  4.31% 24.14%       50ms  4.31%  runtime.memmove
      50ms  4.31% 28.45%       50ms  4.31%  runtime.typePointers.next

~3sec vs ~1.3sec.

@jbardin
Copy link
Member

jbardin commented Dec 19, 2024

Thanks @celesteking, that gives us a good example of what is probably the worst real-world case for startup time.

@jbardin jbardin added cli performance and removed new new issue not yet triaged labels Dec 19, 2024
@jen20
Copy link
Contributor

jen20 commented Dec 21, 2024

It feels like it would be enough to add a "fast path" behind a flag to avoid verifying checksums - similar to how provider dev overrides work.

I fundamentally disagree that this is "security through obscurity" though - and even if it was, obscurity is a valid layer of defense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants