-
-
Notifications
You must be signed in to change notification settings - Fork 644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix concurrency issue of Terraform provider cache #21805
base: main
Are you sure you want to change the base?
Fix concurrency issue of Terraform provider cache #21805
Conversation
9d22f89
to
e975123
Compare
The worst concurrency bug with the Terraform provider cache is when nonatomic moves result in incorrect hashes in the lockfile which gets pulled into the Pants cache and then poisons it. So we just don't use the provider cache in that case. I think there's still the possibility that the nonatomic move triggers a problem with a module with a lockfile when the cache is being populated, but at least that won't enter the cache. I think that risk is lower since the files are fetched during `init` but used later (ex during `validate`), by which time the copy operation has completed.
Terraform downloads providers, while locking providers. We therefore skip the cache while generating lockfiles
it's not ideal but it works. In the worst case we're the same as a .terraform folder
This reverts commit 18e96b5.
This will cause it to override anything set in the .terraformrc file
Terraform expects the directory to already exist and will not create it.
This reverts commit eca01cc.
e975123
to
d97a1be
Compare
If there's any chance this could be cherry-picked that would be great, it does fix a fairly tricky-to-debug issue |
It is straightforward enough to mark this PR for the CI cherry picking automation. That said, seems more of a question for @lilatomic about how far back this can be back-ported. |
@@ -450,6 +452,17 @@ async def setup_terraform_process( | |||
def prepend_paths(paths: Tuple[str, ...]) -> Tuple[str, ...]: | |||
return tuple((Path(request.chdir) / path).as_posix() for path in paths) | |||
|
|||
# Initialise the Terraform provider cache, since Terraform expects the directory to already exist. | |||
await Get( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate invocations of mkdir
and terraform
would not work in remote execution because the mkdir
and terraform
invocations could potentially end up on separate workers. Moreover, it can even be in an issue in local execution because the mkdir
could be cached and the terraform
invocation uncached. Thus, the side effect of making the cache directory would not occur because Pants would see the mkdir
process as cached.
They should be a singe process invocation running the shell with a custom script (doing the necessary replacements):
script = f"""mkdir -p {tf_plugin_cache_dir} && exec __terraform/terraform -chdir={shlex.quote(request.chdir)} {shlex.join(request.args)}"""
process = Process(argv=(bash.path, "-c", script), ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: bash.path
comes from a bash: BashBinary
instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, that makes a lot of sense. I've fixed it, thank you for the direction on how.
This ensures that they're executed on the same machine. As separate processes, a remote execution might execute these on different machines. As well, it's possible that locally the mkdir is cached even if it needs to be run again
ddfe4ea
to
d26e69e
Compare
I've marked it for porting back to 2.23 . I think it will merge cleanly. |
The Terraform provider cache is not concurrency safe.
The worst concurrency bug with the Terraform provider cache is a race condition when nonatomic moves result in another process capturing incorrect hashes in the lockfile which gets pulled into the Pants cache and then poisons it.
After lots of debugging and help from the community, we've determined that the best way forward is to isolate each module into its own Terraform cache. This prevents concurrent access to the Terraform cache.
One possible solution would have been to not cache during cache-unsafe operations. This is harder than it seems, because envvars can override user cache settings in Terraform RC files but they cannot unset it.
fixes #21804