improve cli startup time #36237
Comments
Hi @celesteking, What you're seeing is probably verifying the binary for the dependency lock file. This isn't to try and prevent filesystem corruption, it's to verify the provider binary is the one expected to ensure the security of the entire system. There's definitely some improvement to be had regarding optimizing the provider binary sizes. |
That's security through obscurity. Nothing stops a malicious actor from modifying the lock file to align with a compromised provider. I don't mind verifying a downloaded/"pulled" provider on first use against some sort of super-trusted hash table (or x509 sig or whatever), but I mind when it's done on each invocation. At least provide me the option to opt out. |
The lock file is checked into VCS and stored outside of the provider cache, and providers are often fetched on demand. There is no obscurity to the system as it is well-defined, and there are customers with security needs which are satisfied by the lock file system, so the utility of the system is a separate concern from the startup time taken. It would be useful to verify the timings here though. On my system much more time is spent executing the binary and making the RPC calls than generating the hash of the provider. I wouldn't be surprised if removing the hash generation didn't show a significant change, since the binary would still need to be loaded into the page cache anyway during execution. Testing that theory with the AWS provider,
and the hash check took 263ms. After removing the hash check:
This is obviously faster storage than may always be available, but it shows that generating the hash is not the entire story. |
That doesn't look like a Linux system and you'll need to find one to test that. I wouldn't be surprised if this drama isn't prominent on OSX somehow. Just tried on the smallest AWS ARM instance, it's even worse: t4g.nano.txt |
Alright, so the story is:
--- internal/providercache/cached_provider.go.orig 2024-12-19 01:08:54.254959255 +0000
+++ internal/providercache/cached_provider.go 2024-12-19 01:08:37.901876303 +0000
@@ -68,6 +68,7 @@
 // Unlike the singular MatchesHash, MatchesAnyHash considers unsupported hash
 // formats as successfully non-matching, rather than returning an error.
 func (cp *CachedProvider) MatchesAnyHash(allowed []getproviders.Hash) (bool, error) {
+	return true, nil
 	return getproviders.PackageMatchesAnyHash(cp.PackageLocation(), allowed)
 }
And the results are:
vs
~3sec vs ~1.3sec. |
Thanks @celesteking, that gives us a good example of what is probably the worst real-world case for startup time. |
It feels like it would be enough to add a "fast path" behind a flag to avoid verifying checksums - similar to how provider dev overrides work. I fundamentally disagree that this is "security through obscurity" though - and even if it was, obscurity is a valid layer of defense. |
Terraform Version
Proposal
Something needs to be done about terraform CLI startup time. I'm using the aws provider, which currently weighs 621MB. Even a simple terraform validate on an almost empty project takes 3 seconds over here.
As discussed in Slack, it seems that a significant amount of that is taken up by the checksum calculation. I can't even imagine what happens in big projects where several providers are used, which, I guess, would amount to several GBs of monster mono-files.
Looking at strace over here, terraform spends 2.5 seconds between open on that provider and close, between which there are reads of 32K size. I assume that's the hash calculation.
You really really should stop doing the work of the filesystem and the underlying drives to try to safeguard against file corruption. Why is the CRC needed on startup? If you want to protect against accidental provider editing, just check against mtime, but even then, why would you need that? Are you protecting against version misuse? Name the provider files properly, or pre/suf-fix them with a hash. I don't think there's ambiguity in linux_amd64/terraform-provider-aws_v5.80.0_x5, so why is all this stuff needed at all?