Ensure api_fqdn Is Resolvable Within Milliseconds or log #75

Open
sean-horn opened this issue Jan 30, 2015 · 3 comments
Labels
Status: To be prioritized (indicates that product needs to prioritize this issue). Triage: Confirmed (indicates an issue has been confirmed as described). Type: Enhancement (adds new functionality).

Comments

@sean-horn
Contributor

Zendesk 2718, Zendesk 2393

As a customer and employee of Chef, I would like Chef Server and Enterprise Chef server to warn me when name resolution is compromised with delay or failure.

  1. Basic infrastructure like DNS should be reliable and performant, yes.
  2. Chef Server should behave gracefully in the event of DNS misbehavior, or at the very least warn of the problem.

Currently, EC11.2.6 and presumably later versions are unable to commit sandboxes in the presence of multi-second delays in node['api_fqdn'] resolution. The following log output is the only indication of the problem:

      2015-01-29_21:28:28.69569 [error] Checking presence of checksum: <<"59b2fe13f9c6a776b9f69eb00ac2b49f">> for org <<"c8d751a6b6d1445aa5cdc6c7552e4dee">>  from bucket "bookshelf" has taken longer than 5000 ms

      2015-01-29_21:28:28.72001 [error] {<<"method=PUT; path=/organizations/pedant-testorg-12311/sandboxes/c6c7552e4deef662ca00efba97d6f846; status=500; ">>, {error,{throw,{checksum_check_error,1},[{chef_wm_named_sandbox,validate_checksums_uploaded,2,[{file,"src/chef_wm_named_sandbox.erl"},{line,144}]}, {chef_wm_named_sandbox,from_json,2,[{file,"src/chef_wm_named_sandbox.erl"},{line,99}]},{webmachine_resource,resource_call,3,[{file,"src/webmachine_resource.erl"},{line,186}]}, {webmachine_resource,do,3,[{file,"src/webmachine_resource.erl"},{line,142}]},{webmachine_decision_core,resource_call,1,[{file,"src/webmachine_decision_core.erl"},{line,48}]}, {webmachine_decision_core,accept_helper,1,[{file,"src/webmachine_decision_core.erl"},{line,612}]},{webmachine_decision_core,decision,1,[{file,"src/webmachine_decision_core.erl"}, {line,517}]},{webmachine_decision_core,handle_request,2,[{file,"src/webmachine_decision_core.erl"},{line,33}]}]}}}

We should monitor the time required for a gethostbyname, getaddress, or whichever call forces a name resolution. If it creeps above a default of 1000 ms, we should begin to warn periodically in the erchef logfile.

The failure above would be much easier to diagnose and work around if the erchef log contained something like this:

2015-01-29_21:28:28.72001 [error] {<<"Heyo, I am seeing an average of 1800 ms delays for resolving chef-server1.something.local">>, {error,{throw,{name_resolution_check_error,1}
@sean-horn sean-horn added the bug label Jan 30, 2015
@sean-horn
Contributor Author

A simple test in Ruby might be

time /opt/opscode/embedded/bin/ruby -e "require 'resolv'; p Resolv.getaddress('YOUR_API_FQDN')"
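The one-liner above can be extended into a small check that times the lookup itself and flags slow or failed resolution. This is only a rough sketch, not what erchef does: the helper name `check_resolution`, the `threshold_ms` and `resolver` parameters, and the returned hash are all made up for illustration. The 1000 ms default comes from the proposal in this issue.

```ruby
require 'resolv'

# Hypothetical helper (not part of Chef Server): time a single name lookup and
# report whether it failed or exceeded the threshold. The resolver is
# injectable so the check can be exercised without touching real DNS.
def check_resolution(fqdn, threshold_ms: 1000,
                     resolver: ->(name) { Resolv.getaddress(name) })
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  address = nil
  error = nil
  begin
    address = resolver.call(fqdn)
  rescue Resolv::ResolvError => e
    # Resolv.getaddress raises Resolv::ResolvError when no address is found.
    error = e.message
  end
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  elapsed_ms = (elapsed * 1000).round

  {
    fqdn: fqdn,
    address: address,
    elapsed_ms: elapsed_ms,
    error: error,
    slow: elapsed_ms > threshold_ms
  }
end
```

Calling `check_resolution('YOUR_API_FQDN')` with the embedded Ruby returns a hash; a wrapper could log a warning whenever `:slow` is true or `:error` is set.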

@sean-horn sean-horn changed the title from "Ensure api_fqdn Is Resolvable Within Milliseconds" to "Ensure api_fqdn Is Resolvable Within Milliseconds or log" Jan 30, 2015
@jeremiahsnapp
Contributor

Zendesk 3319

@stevendanna stevendanna added this to the accepted-minor milestone Jun 17, 2015
@tas50 tas50 added Type: Bug Does not work as expected. and removed bug labels Jan 4, 2019
@PrajaktaPurohit PrajaktaPurohit added Status: To be prioritized Indicates that product needs to prioritize this issue. Triage: Confirmed Indicates an issue has been confirmed as described. Type: Enhancement Adds new functionality. and removed Type: Bug Does not work as expected. labels Jul 24, 2020
@markan
Contributor

markan commented Jul 24, 2020

This might be a natural thing to put in the status endpoint or Prometheus endpoint. Something that resolves the FQDN, pings all the critical nodes we talk to, and logs the status would simplify a lot of debugging efforts.
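As a rough illustration of that idea, the sketch below builds a status-style summary of name resolution for a list of hosts. Everything here is an assumption made for the example: the `resolution_status` name, the `ok`/`slow`/`fail` labels, and the payload shape are hypothetical and are not the format of any existing Chef Server endpoint. Reachability checks (the "ping" part) are left out.

```ruby
require 'json'
require 'resolv'

# Hypothetical status summary (not an existing Chef Server API): resolve each
# host, record how long it took, and roll the results up into one payload.
def resolution_status(hosts, threshold_ms: 1000,
                      resolver: ->(name) { Resolv.getaddress(name) })
  checks = hosts.map do |host|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    outcome = begin
      resolver.call(host)
      'ok'
    rescue Resolv::ResolvError
      'fail'
    end
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    elapsed_ms = (elapsed * 1000).round
    # A successful but slow lookup is reported separately from a failure.
    outcome = 'slow' if outcome == 'ok' && elapsed_ms > threshold_ms
    { host: host, status: outcome, elapsed_ms: elapsed_ms }
  end

  overall = checks.all? { |c| c[:status] == 'ok' } ? 'ok' : 'degraded'
  { status: overall, dns: checks }
end
```

A status handler could serialize the result with `JSON.generate(resolution_status([...]))`, and the per-host `elapsed_ms` values could equally be exported as metrics.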

@stevendanna stevendanna removed this from the accepted-minor milestone Sep 29, 2020