
Transient errors during parallel plan from .terraform.lock: plugins are not consistent #2412

Open · wyardley opened this issue Jul 22, 2022 · 23 comments
Labels: bug (Something isn't working), help wanted (Good feature for contributors)

@wyardley (Contributor) commented Jul 22, 2022


Overview of the Issue

I'm occasionally getting transient errors when running atlantis plan; currently, I have ATLANTIS_PARALLEL_POOL_SIZE set to 3. It happens most often on states that have a lot of providers, which gives me two possible theories:

  • all the states planning at once are causing the Terraform registry to rate-limit us, but the client is giving a confusing error message
  • OR, it's a race condition due to parallel plans possibly using the same cache (Using integrations/foo vN.N.N from the shared cache directory in the output; see link below for why this is not guaranteed to be safe)

I'm not able to reproduce it right at the moment and don't have the exact error handy, so I'll update here next time the issue comes up.
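(For context, the shared cache referenced in that output comes from TF_PLUGIN_CACHE_DIR pointing at a directory common to all runs; whether this deployment sets it explicitly or Atlantis sets it is not shown here, so the snippet below is purely illustrative.)

# Illustrative only: a shared provider cache means TF_PLUGIN_CACHE_DIR
# points at a directory shared by all terraform runs; Terraform documents
# that this cache is not safe for concurrent `terraform init`.
export TF_PLUGIN_CACHE_DIR="/home/atlantis/.terraform.d/plugin-cache"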

Reproduction Steps

atlantis plan

Note: this is not consistently reproducible

Logs

n/a

Environment details

  • Atlantis version: v0.19.7
  • Atlantis flags: see below

Atlantis server-side config file:
All config is from env vars / flags (with some kube secret references / other irrelevant stuff omitted)

            - name: ATLANTIS_ATLANTIS_URL
              value: https://xxx/
            - name: ATLANTIS_DEFAULT_TF_VERSION
              value: "1.2.5"
            - name: ATLANTIS_HIDE_PREV_PLAN_COMMENTS
              value: "true"
            - name: ATLANTIS_PARALLEL_POOL_SIZE
              value: "3"
            - name: ATLANTIS_PORT
              value: "4141"
            - name: ATLANTIS_REPO_ALLOWLIST
              value: github.com/orgname/*
            - name: TF_CLI_ARGS_apply
              value: "-compact-warnings"
            - name: TF_CLI_ARGS_init
              value: "-compact-warnings"
            - name: TF_CLI_ARGS_plan
              value: "-compact-warnings"

Repo atlantis.yaml file:

---
version: 3
parallel_plan: true

projects:
# [...]


@wyardley added the bug label on Jul 22, 2022
@wyardley (Contributor Author)

Here's an example.

Note:

  • We do not check in our lockfiles, so any lockfile here is one that's local
  • Re-running the plan typically resolves the issue
Warnings:

- Incomplete lock file information for providers

To see the full warning notes, run Terraform without -compact-warnings.

Terraform has been successfully initialized!
╷
│ Error: Required plugins are not installed
│ 
│ The installed provider plugins are not consistent with the packages
│ selected in the dependency lock file:
│   - registry.terraform.io/hashicorp/kubernetes: the cached package for registry.terraform.io/hashicorp/kubernetes 2.12.1 (in .terraform/providers) does not match any of the checksums recorded in the dependency lock file
│ 
│ Terraform uses external plugins to integrate with a variety of different
│ infrastructure services. You must install the required plugins before
│ running Terraform operations.
╵

@jamengual (Contributor)

is this still happening with v0.19.8?

@jamengual added the waiting-on-response label on Aug 26, 2022
@wyardley (Contributor Author)

Hard to know since it's transient, but I would assume so. The more I think about it, the more I am convinced it's due to the plugin cache thing and the init happening in parallel.

I can think of some possible fixes, but they're beyond what I can implement. At a high level, though:

  • having a way to run init in serial but plan in parallel (this might slow things down, but seems the simplest?)
  • a separate plugin cache dir per state
  • some way of pre-downloading all the providers (e.g., using terraform providers mirror; rough sketch below)
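For the last item, a rough sketch of what that pre-download step could look like (untested; the mirror path is just an example, and Terraform would also need a provider_installation / filesystem_mirror block in its CLI config to actually install from the mirror):

# One-off warm-up step (sketch): mirror every provider the configuration
# needs into a local directory before any parallel plans start, so the
# parallel inits never have to download (or race on) anything.
terraform providers mirror /opt/terraform-providers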

@wyardley (Contributor Author) commented Aug 26, 2022

Got this again just now:

Initializing the backend...

Successfully configured the backend "gcs"! Terraform will automatically
use this backend unless the backend configuration changes.

Initializing provider plugins...
- Finding gavinbunney/kubectl versions matching "1.14.0"...
- Finding integrations/github versions matching "4.29.0"...
- Finding hashicorp/google versions matching ">= 4.18.0, 4.33.0"...
- Finding hashicorp/kubernetes versions matching "~> 2.10, 2.13.0"...
- Finding fluxcd/flux versions matching "0.16.0"...
- Finding hashicorp/google-beta versions matching ">= 4.29.0, < 5.0.0"...
- Finding latest version of hashicorp/random...
- Using hashicorp/google-beta v4.33.0 from the shared cache directory
- Using hashicorp/random v3.3.2 from the shared cache directory
- Using gavinbunney/kubectl v1.14.0 from the shared cache directory
- Using integrations/github v4.29.0 from the shared cache directory
- Using hashicorp/google v4.33.0 from the shared cache directory
- Installing hashicorp/kubernetes v2.13.0...
- Installed hashicorp/kubernetes v2.13.0 (signed by HashiCorp)
- Using fluxcd/flux v0.16.0 from the shared cache directory

Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.


Warnings:

- Incomplete lock file information for providers

To see the full warning notes, run Terraform without -compact-warnings.

Terraform has been successfully initialized!
╷
│ Error: Required plugins are not installed
│ 
│ The installed provider plugins are not consistent with the packages
│ selected in the dependency lock file:
│   - registry.terraform.io/hashicorp/kubernetes: the cached package for registry.terraform.io/hashicorp/kubernetes 2.13.0 (in .terraform/providers) does not match any of the checksums recorded in the dependency lock file
│ 
│ Terraform uses external plugins to integrate with a variety of different
│ infrastructure services. You must install the required plugins before
│ running Terraform operations.
╵

@jamengual added the help wanted label on Aug 26, 2022
@nokernel commented Sep 23, 2022

@jamengual

is this still happening with v0.19.8?

I just got it with v0.19.8 and terraform 1.2.9.

What I saw in the Atlantis terraform output is that it did not fetch all the providers.

github-actions bot added the Stale label on Oct 24, 2022
github-actions bot closed this as not planned on Oct 30, 2022
@vmdude commented Feb 1, 2023

We also have these "Error: Required plugins are not installed" issues, and it's still happening in v0.22.2.

Is there a way to un-close this issue?

@nitrocode reopened this on Feb 1, 2023
@nitrocode (Member)

The error shown by @wyardley

│ Error: Required plugins are not installed

│ The installed provider plugins are not consistent with the packages
│ selected in the dependency lock file:
│ - registry.terraform.io/hashicorp/kubernetes: the cached package for registry.terraform.io/hashicorp/kubernetes 2.13.0 (in .terraform/providers) does not match any of the checksums recorded in the dependency lock file

Seems like that could be resolved by stomping over the .terraform.lock.hcl file prior to terraform init, or by doing terraform init -upgrade.
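For example, something along these lines in a custom workflow (untested sketch; assumes custom workflows are enabled, and -upgrade would regenerate the lock file on every run rather than trusting a stale one):

workflows:
  default:
    plan:
      steps:
        - init:
            extra_args: ["-upgrade"]
        - plan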

Is that the same issue you're hitting @vmdude ?

@vmdude commented Feb 1, 2023

.lock files are not versioned (through git, I mean) on our side, and we're using the same cache dir for all parallel plans (same as the issue creator).
Couldn't removing the .lock file in the pre_workflow_hooks cause another race condition, where we remove a lock file that's in use by another parallel plan?

@nitrocode (Member)

I'm curious whether you still get the error if you run terraform init with the extra arg -upgrade to stomp over the lock files on every run.

@nitrocode removed the Stale and waiting-on-response labels on Feb 1, 2023
@vmdude commented Feb 1, 2023

Let me check and try (all the parallel plans run in a few hours) and I'll get back to you with the output.

@nitrocode changed the title from "transient errors during parallel plan" to "Transient errors during parallel plan from .terraform.lock: plugins are not consistent" on Feb 1, 2023
@wyardley (Contributor Author) commented Feb 1, 2023

We get it during the first plan after version updates, and it goes away on the second plan, so I'm pretty confident this is because of Terraform's known issue with init not being parallel-safe when a shared cache is used.

Having a small amount of splay would be one fix that would probably help, though I'm not sure if someone would be willing to implement that.

btw, we don't use a checked-in lockfile, and we do have -upgrade in the init args.

@nitrocode (Member)

@wyardley have you tried removing the .terraform.lock.hcl file and running terraform init -upgrade?

Do you folks get stack traces in your logs?

@wyardley (Contributor Author) commented Feb 7, 2023

@nitrocode we don't use or check in the lockfile. But see the linked issue - I believe this has everything to do with tf not working well with parallel init. Once the new version is in the provider cache, the failure does not happen.

@vmdude commented Feb 8, 2023

@nitrocode We have not yet been able to reproduce these errors (and get stack information), as they don't appear every time. We'll keep you posted when they do.

@wyardley (Contributor Author) commented Feb 8, 2023

It should be reproducible if you clear the plugin cache, or update the version of a provider that exists in multiple states that are being planned in parallel.

Once the provider cache is populated, the issue should not come up.

There are some changes coming in 1.4 that might make this worse in the case that the lockfile is not checked in.

hashicorp/terraform#32205

@nitrocode (Member)

It seems that when parallel planning/applying is enabled, each project's terraform init can impact the others.

Some options for contributors

  1. Stagger the parallel runs so they do not conflict with each other
    • seems easy to implement by adding an arbitrary wait
  2. Separate the runner from the server and give the runner its own isolation
    • arguably the correct way to resolve this
    • would have additional benefits
    • large change which would require a lot of work

Please let me know if I missed any options. As always, we welcome PRs.

@nitrocode (Member)

For (1)

Here is the current code

func runProjectCmdsParallel(
	cmds []command.ProjectContext,
	runnerFunc prjCmdRunnerFunc,
	poolSize int,
) command.Result {
	var results []command.ProjectResult
	mux := &sync.Mutex{}
	wg := sizedwaitgroup.New(poolSize)
	for _, pCmd := range cmds {
		pCmd := pCmd
		var execute func()
		wg.Add()
		execute = func() {
			defer wg.Done()
			res := runnerFunc(pCmd)
			mux.Lock()
			results = append(results, res)
			mux.Unlock()
		}
		go execute()
	}
	wg.Wait()
	return command.Result{ProjectResults: results}
}

Here is a possible change

		}

+		time.Sleep(1 * time.Second)

		go execute()
	}

That should at least start each job with a second in between.

Or perhaps the first job could start, pause for a couple of seconds to ensure the init stage has passed, and then the subsequent jobs could start all at once?

@wyardley (Contributor Author) commented Mar 2, 2023

@nitrocode yeah, agree. Some kind of configurable (e.g., ATLANTIS_PARALLEL_PLAN_SPLAY or ATLANTIS_PARALLEL_PLAN_SLEEP) or non-configurable (or even random) sleep could help a lot. Similarly, when parallel-planning some big states, we sometimes see the pod Atlantis is running on crash from resource exhaustion.
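A rough sketch of what a random splay could look like, as a variant of the runProjectCmdsParallel function quoted above (this is not existing Atlantis code; maxSplay and the flag names mentioned are hypothetical, and the sketch assumes the same imports as the original file plus math/rand and time):

// Sketch only: like runProjectCmdsParallel, but each project sleeps a
// random duration (up to maxSplay) before starting, so the parallel
// `terraform init`s are less likely to race on the shared plugin cache.
// maxSplay would come from a hypothetical new setting such as
// --parallel-plan-splay / ATLANTIS_PARALLEL_PLAN_SPLAY.
func runProjectCmdsParallelWithSplay(
	cmds []command.ProjectContext,
	runnerFunc prjCmdRunnerFunc,
	poolSize int,
	maxSplay time.Duration,
) command.Result {
	var results []command.ProjectResult
	mux := &sync.Mutex{}
	wg := sizedwaitgroup.New(poolSize)
	for _, pCmd := range cmds {
		pCmd := pCmd
		// Pick the delay up front; rand.Int63n(n) needs n > 0, hence the +1.
		delay := time.Duration(rand.Int63n(int64(maxSplay) + 1))
		wg.Add()
		go func() {
			defer wg.Done()
			time.Sleep(delay)
			res := runnerFunc(pCmd)
			mux.Lock()
			results = append(results, res)
			mux.Unlock()
		}()
	}
	wg.Wait()
	return command.Result{ProjectResults: results}
}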

@nitrocode (Member)

A couple of issues with option 1 (staggering the runs) are that

  • if you plan multiple terraform root dirs, it's possible they are all using different versions of the same providers and some may or may not be cached.
  • an arbitrary wait would slow down all runs

option 3 - init retry script

It might be something we could solve in the init step. Below is untested code.

workflows:
  default:
    plan:
      steps:
        - run: /usr/local/bin/terraform_init_retry
        - plan

/usr/local/bin/terraform_init_retry

#!/usr/bin/env sh

attempt=0
max_attempts=10

# retry terraform init until it succeeds or we run out of attempts
until terraform$ATLANTIS_TERRAFORM_VERSION init -no-color; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
        echo "$attempt / $max_attempts: giving up"
        exit 1
    fi
    echo "$attempt / $max_attempts: init failed, rerunning init"
done

option 4 - set a unique provider cache

This can be set to a directory inside the working directory, unique to the workspace, which would ensure that the provider cache is isolated per run.

workflows:
  default:
    plan:
      steps:
        - env:
            name: TF_PLUGIN_CACHE_DIR
            command: 'echo "$(pwd)/.terraform-cache-$WORKSPACE"'
        - init
        - plan
    apply:
      steps:
        - env:
            name: TF_PLUGIN_CACHE_DIR
            command: 'echo "$(pwd)/.terraform-cache-$WORKSPACE"'
        - apply

What's odd about option 4 is that when TF_PLUGIN_CACHE_DIR is unset, Terraform's default behavior is already to install providers into the project's own .terraform directory.

@nitrocode (Member)

Regarding Terraform 1.4 and the new locking behavior, it seems that HashiCorp has added a new flag that needs to be set to retain the 1.3.x-and-earlier behavior on 1.4.x+:

TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=true
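In a Kubernetes deployment like the env-var config at the top of this issue, that could look something like this (sketch; only relevant on Terraform 1.4+ when a shared plugin cache is in use and the lockfile is not committed):

            - name: TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE
              value: "true"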

@wyardley (Contributor Author) commented Mar 5, 2023

If it were a cache per state, you wouldn’t need the new flag - it just helps avoid redownloading when there’s no lockfile and the cache is already populated.

I would guess users that both don't check in a lockfile and set TF_PLUGIN_CACHE_DIR would want to set that flag once they upgrade to 1.4.x.

I agree with you that it’s odd that this issue comes up at all, if Atlantis doesn’t already do something to encourage terraform to share a cache directory, and especially since I think the .terraform directory would also be new / unique per PR, per state.

@wyardley (Contributor Author) commented Mar 5, 2023

#82 - looks like Atlantis may set it

@c-ameron

For those who are still having this issue: there is now a setting that disables the plugin cache (thanks to #3720!), so option 4 in #2412 (comment) is no longer needed (but thank you for the solution!).

https://www.runatlantis.io/docs/server-configuration.html#use-tf-plugin-cache
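In a deployment like the one at the top of this issue, that would be along the lines of the following (the env-var form of the flag is an assumption based on Atlantis' usual flag-to-env mapping; see the linked docs for the exact name and default):

            - name: ATLANTIS_USE_TF_PLUGIN_CACHE
              value: "false"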
