Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nat_gateway_id network_interface_id variable defaults conflict #36

Merged
merged 3 commits into from
Apr 19, 2021

Conversation

3h4x
Copy link

@3h4x 3h4x commented Apr 18, 2021

what

  • Convert input values of "" for eni_id, igw_id, and ngw_id to null before passing them to aws_route
  • Correct the documentation to properly reflect that it is eni_id and not igw_id that conficts with ngw_id
  • Add smoke test for creating private subnets with NAT gateway

why

  • For aws_route specifying both network_interface_id and nat_gateway_id is not valid. Prior to Terraform AWS Provider 3.34.0, a value of "" was treated as "not present" so, for example, specifying a valid nat_gateway_id and a network_interface_id = "" was accepted. With 3.34.0, however, "" is interpreted as a value by and triggers the error:
     "vpc_endpoint_id": only one of `egress_only_gateway_id,gateway_id,instance_id,local_gateway_id,nat_gateway_id,network_interface_id,transit_gateway_id,vpc_endpoint_id,vpc_peering_connection_id`
      can be specified, but `nat_gateway_id,network_interface_id` were specified.
  • In this module igw_id is only used in the route table for the public subnets, so it does not conflict with eni_id origw_id which are only used in the route table for the private subnets.
  • Catch this problem if it recurs

references

@3h4x 3h4x requested review from a team as code owners April 18, 2021 06:32
@3h4x 3h4x requested a review from a team as a code owner April 18, 2021 06:33
@3h4x 3h4x requested review from jhosteny and RothAndrew and removed request for a team April 18, 2021 06:33
@3h4x
Copy link
Author

3h4x commented Apr 18, 2021

/test all

@Nuru Nuru self-requested a review April 18, 2021 08:20
Copy link
Contributor

@Nuru Nuru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for trying, but please think of another solution.

I have both general comments and specific comments.

In general, there are some problems with having defaults of null:

It is not always possible to find a satisfactory resolution, but for strings we have adopted the standard that we use empty strings ("") to indicate omission and defaults where you might be inclined to use null if it were not for these problems. Inside the module, values of null and empty strings should be detected and treated equally. I like to do it this way for compactness if not necessarily readability:

  input = length(compact([var.optional])) > 0 ? var.optional : null

@osterman @aknysh @Gowiem do you have anything to add?

Specific comments:

  1. This module seems pretty tedious to use, and makes me wonder if we should deprecate this module in favor of https://github.com/cloudposse/terraform-aws-dynamic-subnets and/or terraform-aws-multi-az-subnets, enhancing one or both of those to the extent this module provides working and useful features the other 2 do not.

  2. This bug is not our fault. This module has, for 4 years (see Add option to set ENI as a default route for private subnets #7, 2017-11-16), set both

  network_interface_id = ""
  nat_gateway_id       = ""

If this now suddenly stops working, which appears it may have due to hashicorp/terraform-provider-aws#16930 and AWS Provider release 3.34.0, this should be reported as an issue to the hashicorp/terraform-provider-aws project and the issue linked to this PR and any issue in this repo that refers to it (e.g. #35).

Also, I know this impractical, but for bug fixes, I would like to see the examples/complete test enhanced so that it catches the bug being fixed. That means that

  • The current release would fail the test
  • The PR fixes pass the test (of course, PR is required to pass all tests)

This will both verify that the bug being fixed is a real bug and that the PR fixes it. The tests are not that helpful if they are insensitive to the changes being made in the PR.

@Nuru Nuru requested review from maximmi and aknysh April 18, 2021 10:34
@osterman
Copy link
Member

@Nuru how is our usage of null in all of our other modules different?

We should no longer use "" as the default anywhere and should use null. We use null all over the place and should use null over "".

Also, in this case, it seems like 2 variables are mutually exclusive so we should probably add a validator. I know from first-hand experience setting values to null correctly fixes certain errors in HCL2, while in HCL1 the empty string was required. In HCL2, passing null is frequently desirable as it behaves as though it was not passed.

@Nuru In this PR specifically, what is the specific problem of changing "" to null?

@osterman
Copy link
Member

osterman commented Apr 19, 2021

Inside the module, values of null and empty strings should be detected and treated equally

I disagree. I don't think we should continue to maintain backward compatibility with empty strings and null. If we did, then we should do the same for booleans, so we preserve the behavior or "false" and false, etc.

@Nuru
Copy link
Contributor

Nuru commented Apr 19, 2021

@osterman wrote

I know from first-hand experience setting values to null correctly fixes certain errors in HCL2, while in HCL1 the empty string was required. In HCL2, passing null is frequently desirable as it behaves as though it was not passed.

This is true when passing null to a resource or data source (and is, in fact, the solution here to the newly introduced problem with aws_route), but it is unfortunately and emphatically not true when passing null to a module. Until that gets sorted out, we are better off continuing to accept "empty" objects as module inputs and convert them to null before passing them to resources or data sources.

@Nuru In this PR specifically, what is the specific problem of changing "" to null?

There are several problems with changing the default value of the input from "" to null:

  1. var.eni_id and var.ngw_id are mutually exclusive, but if both default to null, Hashicorp will list them both as required. This is confusing at best.
  2. Validation rules can refer only to the variable that the condition applies to, so we can not validate a mutual exclusion.
  3. The semantics of using null as an input to a module are currently broken, meaning that people are (or should be) uncomfortable about using it as a module input. Many people are hopeful the semantics will change, which will make things better in the long run but even more confusing in the short run.
  4. We do not gain enough here by making null the default to be worth the hassle of causing users to change the inputs if they are doing conditional inputs in their root module, especially since they would be changing them to null, which is problematic. Since "" is not otherwise a valid value and we have been using it forever to mean "no choice", I think we should continue to do so. This solves the case for people who want to use null and people who do not, without taking anything away from anyone else.
  5. We have a better option, which is to leave the defaults as they are and conditionally pass null to the aws_route resource.

With this solution, old code like this will still work:

  eni_id = local.have_eni ? var.eni_id : ""
  ngw_id = local.have_eni ? "" : var.ngw_id

Newer code written like this will work under the present behavior of passing null to modules

  eni_id = local.have_eni ? var.eni_id : null
  ngw_id = local.have_eni ? null : var.ngw_id

and it will continue to work when Terraform starts correctly treating null inputs as meaning to select the defaults.

We also avoid having Hashicorp unhelpfully describe 2 conflicting inputs as being both required.

@3h4x 3h4x requested a review from a team as a code owner April 19, 2021 04:47
@Nuru
Copy link
Contributor

Nuru commented Apr 19, 2021

/test all

@Nuru Nuru added the bug 🐛 An issue with the system label Apr 19, 2021
@Nuru Nuru merged commit 7a796c6 into cloudposse:master Apr 19, 2021
@3h4x
Copy link
Author

3h4x commented Apr 19, 2021

Thanks for great discussion here! I think I have digested all bullet points and can formulate my own opinion about null being an issue.

  • I usually use github as source of truth rather than terraform registry so I didn't even know that this bug exist. Either way I think that crippling CloudPosse modules by not using null values but empty string "" is not worth it. It's a terraform registry bug and should be treated as such, CloudPosse modules shouldn't adjust to that bug by making modules harder to implement, understand or by making inputs opaque.
  • I'd love to add validation but it's not possible at this stage of terraform and I believe adding local exec hacks validation is not a way to go.
  • I'm honestly not sure if the null is broken. I think the misconception is trying to pass null value from shell to terraform which is at this stage not possible due to string encapsulation. I actually never had to pass TF_VAR with null or flag -var with null but BTW who does that in enterprise environment? :)

We do not gain enough here by making null the default to be worth the hassle

I disagree with that. I tried to use that module and with default values it is broken, because it tries to pass wrong values to aws_route. If we don't want to pass a value to resource creation then null should be our first choice.
By changes proposed in this PR newcomers to the module won't be stopped with error that one need time to understand and eventually pass obsolete eni_id = null. In my opinion, and I think we all agree here, this should be done by the module itself.

I think going null route is a better option but if it would be CloudPosse policy I could change it to proposed:

  eni_id = local.have_eni ? var.eni_id : null
  ngw_id = local.have_eni ? null : var.ngw_id

Either way the PR is merged already so thank you! ❤️ 🦄 🚀

@Nuru
Copy link
Contributor

Nuru commented May 8, 2021

@3h4x wrote:

...I think that crippling CloudPosse modules by not using null values but empty string "" is not worth it. It's a terraform registry bug and should be treated as such, CloudPosse modules shouldn't adjust to that bug by making modules harder to implement, understand or by making inputs opaque.

I don't understand what you mean by "crippling" modules. The current module, after the changes in this PR, accepts either null or "" as input to mean "no setting". Users of this module can use whichever they prefer.

I'm honestly not sure if the null is broken.

See these bugs, both still open/unresolved as of this writing:

We do not gain enough here by making null the default to be worth the hassle

I disagree with that. I tried to use that module and with default values it is broken, because it tries to pass wrong values to aws_route. If we don't want to pass a value to resource creation then null should be our first choice.

I agree that, because of changes in the AWS provider, this module was broken, and this module (all our modules) should work with default values when given reasonable values of required variables. That does not mean we need to make null the default input value, it just means we need to make the default input value work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐛 An issue with the system
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Possible incompatibility with latest AWS provider 3.34.0
4 participants