Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix concurrency issue of Terraform provider cache #21805

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

lilatomic
Copy link
Contributor

@lilatomic lilatomic commented Jan 2, 2025

The Terraform provider cache is not concurrency safe.
The worst concurrency bug with the Terraform provider cache is a race condition when nonatomic moves result in another process capturing incorrect hashes in the lockfile which gets pulled into the Pants cache and then poisons it.

After lots of debugging and help from the community, we've determined that the best way forward is to isolate each module into its own Terraform cache. This prevents concurrent access to the Terraform cache.

One possible solution would have been to not cache during cache-unsafe operations. This is harder than it seems, because envvars can override user cache settings in Terraform RC files but they cannot unset it.

fixes #21804

@lilatomic lilatomic added category:bugfix Bug fixes for released features backend: Terraform Terraform backend-related issues labels Jan 2, 2025
@lilatomic lilatomic force-pushed the terraform/fix/cache-concurrency branch from 9d22f89 to e975123 Compare January 19, 2025 04:45
The worst concurrency bug with the Terraform provider cache
is when nonatomic moves result in incorrect hashes in the lockfile
which gets pulled into the Pants cache and then poisons it.
So we just don't use the provider cache in that case.

I think there's still the possibility that the nonatomic move triggers a problem with a module with a lockfile when the cache is being populated,
but at least that won't enter the cache.
I think that risk is lower since the files are fetched during `init` but used later (ex during `validate`),
by which time the copy operation has completed.
Terraform downloads providers, while locking providers.
We therefore skip the cache while generating lockfiles
it's not ideal but it works. In the worst case we're the same as a .terraform folder
This will cause it to override anything set in the .terraformrc file
Terraform expects the directory to already exist and will not create it.
@lilatomic lilatomic force-pushed the terraform/fix/cache-concurrency branch from e975123 to d97a1be Compare January 22, 2025 23:40
@lilatomic lilatomic added this to the 2.23.x milestone Jan 22, 2025
@lilatomic lilatomic requested a review from tdyas January 22, 2025 23:58
@jgranstrom
Copy link
Contributor

If there's any chance this could be cherry-picked that would be great, it does fix a fairly tricky-to-debug issue

@tdyas
Copy link
Contributor

tdyas commented Jan 23, 2025

If there's any chance this could be cherry-picked that would be great, it does fix a fairly tricky-to-debug issue

It is straightforward enough to mark this PR for the CI cherry picking automation. That said, seems more of a question for @lilatomic about how far back this can be back-ported.

@@ -450,6 +452,17 @@ async def setup_terraform_process(
def prepend_paths(paths: Tuple[str, ...]) -> Tuple[str, ...]:
return tuple((Path(request.chdir) / path).as_posix() for path in paths)

# Initialise the Terraform provider cache, since Terraform expects the directory to already exist.
await Get(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate invocations of mkdir and terraform would not work in remote execution because the mkdir and terraform invocations could potentially end up on separate workers. Moreover, it can even be in an issue in local execution because the mkdir could be cached and the terraform invocation uncached. Thus, the side effect of making the cache directory would not occur because Pants would see the mkdir process as cached.

They should be a singe process invocation running the shell with a custom script (doing the necessary replacements):

script = f"""mkdir -p {tf_plugin_cache_dir} && exec  __terraform/terraform  -chdir={shlex.quote(request.chdir)} {shlex.join(request.args)}"""

process = Process(argv=(bash.path, "-c", script), ...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: bash.path comes from a bash: BashBinary instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, that makes a lot of sense. I've fixed it, thank you for the direction on how.

This ensures that they're executed on the same machine.
As separate processes, a remote execution might execute these on different machines.
As well, it's possible that locally the mkdir is cached even  if it needs to be run again
@lilatomic lilatomic force-pushed the terraform/fix/cache-concurrency branch from ddfe4ea to d26e69e Compare January 26, 2025 16:11
@lilatomic
Copy link
Contributor Author

If there's any chance this could be cherry-picked that would be great, it does fix a fairly tricky-to-debug issue

It is straightforward enough to mark this PR for the CI cherry picking automation. That said, seems more of a question for @lilatomic about how far back this can be back-ported.

I've marked it for porting back to 2.23 . I think it will merge cleanly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend: Terraform Terraform backend-related issues category:bugfix Bug fixes for released features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Terraform cache is not concurrency safe
3 participants