Transient errors during parallel plan from .terraform.lock: plugins are not consistent #2412

Comments
Here's an example one. Note:
is this still happening with …?
Hard to know since it's transient, but I would assume so. The more I think about it, the more I am convinced it's due to the plugin cache thing and the init happening in parallel. I can think of some possible fixes, but they're beyond what I can implement. High level, though:
Got this again just now:
I just got it with v0.19.8 and terraform 1.2.9. What I saw in the Atlantis terraform output is that it did not fetch all the providers.
We also have these errors. Is there a way to un-close this issue?
The error shown by @wyardley seems like it could be resolved by stomping over the .terraform.lock file prior to … Is that the same issue you're hitting, @vmdude?
.lock files are not versioned (through git, I mean) on our side, and we're using the same cache dir for all parallel plans (same as the issue creator).
I'm curious if you get the error if you run …
Let me check and try (all parallel plans run in a few hours) and I'll get back to you with the output.
> .terraform.lock: plugins are not consistent

We get it during the first plan after version updates, and it goes away on the second plan. So pretty confident this is because of terraform's known issues with init with a shared cache not being parallel safe. Having a small amount of splay would be one fix that would probably help, though not sure if someone would be willing to implement that. BTW, we don't use a checked-in lockfile, and do have …
@wyardley have you tried removing the …? Do you folks get stack traces in your logs?
@nitrocode we don't use or check in the lockfile. But see the linked issue - I believe this has everything to do with tf not working well with parallel init. Once the new version is in the provider cache, the failure will not happen.
@nitrocode We have not yet been able to reproduce these errors (and get stack information) as they don't appear every time. We'll keep you posted when they do.
Should be reproducible if you clear the plugin cache or update a version of a provider that exists in multiple states that are being planned in parallel. Once the provider cache is populated, the issue should not come up. There are some changes coming in 1.4 that might make this worse in the case that the lockfile is not checked in.
It seems that when parallel planning/applying is enabled, each run's terraform init impacts the others. Some options for contributors:
Please let me know if I missed any options. As always, we welcome PRs.
For (1), here is the current code:

atlantis/server/events/project_command_pool_executor.go, lines 13 to 40 in 890a5f7

Here is a possible change:

```diff
 	}
+	time.Sleep(1 * time.Second)
 	go execute()
 }
```

That should at least start each job with a second in between. Or perhaps the first job can start, then pause for a couple of seconds to ensure the init stage has passed, and then the subsequent jobs can start all at once?
@nitrocode yeah, agree. Some kind of configurable (e.g., …)
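As an illustrative aside (not from the thread): below is a minimal Go sketch of what a configurable splay between parallel job starts could look like. The projectCmd type, the runParallelWithSplay function, and the run callback are made up for the example and are not the actual Atlantis executor code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// projectCmd stands in for Atlantis' per-project command context; the real
// type used by project_command_pool_executor.go is different.
type projectCmd struct{ name string }

// runParallelWithSplay sketches the staggered-start idea: jobs still run
// concurrently (bounded by poolSize), but each goroutine is launched one
// splay interval after the previous one, so the terraform init phases are
// less likely to race on a shared plugin cache.
func runParallelWithSplay(cmds []projectCmd, poolSize int, splay time.Duration, run func(projectCmd)) {
	sem := make(chan struct{}, poolSize) // limits how many jobs run at once
	var wg sync.WaitGroup
	for _, cmd := range cmds {
		cmd := cmd
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{} // acquire a pool slot
			defer func() { <-sem }()
			run(cmd)
		}()
		time.Sleep(splay) // the configurable splay between job starts
	}
	wg.Wait()
}

func main() {
	cmds := []projectCmd{{"project-a"}, {"project-b"}, {"project-c"}}
	runParallelWithSplay(cmds, 3, time.Second, func(c projectCmd) {
		fmt.Println("planning", c.name) // terraform init + plan would run here
	})
}
```

A real implementation would presumably read the splay value from a server flag or environment variable and apply it inside the existing pool executor loop.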
A couple of issues with option 2 (staggering the runs) are that …
option 3 - init retry script

It might be something we could solve in the workflows:

```yaml
workflows:
  default:
    init:
      steps:
        - run: /usr/local/bin/terraform_init_retry
```

```sh
#!/usr/bin/env sh
attempt=0
max_attempts=10
# retry init until it succeeds or we hit max_attempts
until terraform$ATLANTIS_TERRAFORM_VERSION init -no-color; do
  attempt=$((attempt + 1))
  if [ "$attempt" -le "$max_attempts" ]; then
    echo "$attempt / $max_attempts: Error thrown. Rerunning init"
  else
    echo "$attempt / $max_attempts: giving up"
    exit 1
  fi
done
```

option 4 - set a unique provider cache

This can be set to be inside the working directory, unique to the workspace. This would ensure that the provider cache is isolated per run.

```yaml
workflows:
  default:
    init:
      steps:
        - env:
            name: TF_PLUGIN_CACHE_DIR
            command: 'echo "$(pwd)/.terraform-cache-$WORKSPACE"'
    plan:
      steps:
        - env:
            name: TF_PLUGIN_CACHE_DIR
            command: 'echo "$(pwd)/.terraform-cache-$WORKSPACE"'
    apply:
      steps:
        - env:
            name: TF_PLUGIN_CACHE_DIR
            command: 'echo "$(pwd)/.terraform-cache-$WORKSPACE"'
```

What's odd about option 4 is that it's the default behavior to cache the providers in the …
Regarding terraform 1.4 and the new locking behavior, it seems that hashicorp has added a new flag that needs to be set to retain the 1.3.x-and-below behavior in 1.4.x+.
If it were a cache per state, you wouldn't need the new flag - it just helps avoid redownloading when there's no lockfile and the cache is already populated. I would guess users that both don't check in a lockfile and set …
I agree with you that it's odd that this issue comes up at all, if Atlantis doesn't already do something to encourage terraform to share a cache directory, and especially since I think the …
#82 - looks like Atlantis may set it
For those who are still having this issue: there is now a setting that disables the plugin cache (thanks to #3720!), so option 4 in #2412 (comment) is not needed anymore (but thank you for the solution!). https://www.runatlantis.io/docs/server-configuration.html#use-tf-plugin-cache
Community Note
Overview of the Issue
I'm occasionally getting some transient errors when running atlantis plan; currently, I have ATLANTIS_PARALLEL_POOL_SIZE set to 3. It's most typically on states that have a lot of providers, giving me two possible theories: … (I see "Using integrations/foo vN.N.N from the shared cache directory" in the output; see the link below for why this is not guaranteed to be safe).

I'm not able to reproduce it right at the moment, and don't have the exact error handy, so I'll update here next time the issue comes up.
Reproduction Steps
atlantis plan
Note: this is not consistently reproducible
Logs
n/a
Environment details
Atlantis server-side config file:
All config is from env vars / flags (with some kube secret references / other irrelevant stuff omitted)
Repo atlantis.yaml file:
Any other information you can provide about the environment/deployment.
Additional Context