Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What's the purpose of the response_check parameter in HttpOperator? #45237

Closed
2 tasks done
dabla opened this issue Dec 27, 2024 · 11 comments
Closed
2 tasks done

What's the purpose of the response_check parameter in HttpOperator? #45237

dabla opened this issue Dec 27, 2024 · 11 comments
Labels
area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet provider:http

Comments

@dabla
Copy link
Contributor

dabla commented Dec 27, 2024

Apache Airflow version

2.10.4

If "Other Airflow 2 version" selected, which one?

No response

What happened?

I wanted to raise an AirflowSkipException if a HTTP 404 is returned by a REST endpoint if the requested resource isn't found

What you think should happen instead?

I would have expected the custom response_check code being invoked, but it isn't, as the response status_code is already checked within the check_response method of the HttpHook. There, when the status_code isn't within the 2x or 3x range, an AirflowException is being raised, thus making the possibility to check the response yourself through a custom response_check obsolete, as the code will never be invoked. A possible solution would be to pass the custom response_check defined int the operator to the HttpHook, that way the same logic would be applied.

How to reproduce

Just define a custom response_check which ignores a 404 and run the HttpOperator for a non existing resource which will return a 404, you'll see that the custom response_check will never be invoked and the task will fail.

Operating System

RedHat

Versions of Apache Airflow Providers

apache-airflow-providers-http 5.0.0

Deployment

Docker-Compose

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@dabla dabla added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Dec 27, 2024
@raphaelauv
Copy link
Contributor

set

extra_options = {'check_response': False}

@dabla
Copy link
Contributor Author

dabla commented Dec 28, 2024

set

extra_options = {'check_response': False}

Thanks didn’t know that. Wouldn’t it be more logical that when you pass a custom_response check the hook doesn’t chech the response? Now this is confusing imho

@dabla
Copy link
Contributor Author

dabla commented Dec 29, 2024

{'check_response': False}

Just tested it and it fails with following error as it considers the check_response passed in the extra_options as a HTTP header:

requests.exceptions.InvalidHeader: Header part (False) from ('check_response', False) must be of type str or bytes, not <class 'bool'>

Will do a PR to fix this.

@topherinternational
Copy link
Contributor

Just dropping some thoughts -

I agree with the OP that this is unclear and confusing.

It looks like response_check is designed to be a callback that examines the text/bytes from a successful (2xx/3xx) response, and under normal use, the hook will raise an exception when a 4xx/5xx status is received from the server. This has an unfortunate name collision with HttpHook's check_response, which passes through to the Response object's raise_for_status method and then massages a raised exception for 4xx/5xx responses (checking only status code and not the response content).

There's a lot of leaky API between the HttpOperator which wraps the HttpHook which wraps the requests library, and the operator user/DAG writer has to be aware of all of them to really understand how the operator is going to behave. (And keep aware if those wrapped components change behavior.)

One simple way around this particular issue might be to have the HttpOperator take an optional argument like raise_for_status whose behavior is to pass {'check_response': False} to the HttpHook, and document the parameter and the response_check parameter better so it's clear how to use them if you want to do what the OP does - interrogate the response code yourself.

A more fluent alternative might be to have the Operator take a check_status parameter that is a callback function to examine the response code, and if it is set, the operator code will (a) set the Hook to not do its own status check, and (b) get called after the hook is run and before any response_check function is called. That way response_check can preserve its semantics of only getting called for a successful response, and DAG writer can either use the hook's status check or substitute their own function (but never both).

@dabla
Copy link
Contributor Author

dabla commented Dec 29, 2024

Just dropping some thoughts -

I agree with the OP that this is unclear and confusing.

It looks like response_check is designed to be a callback that examines the text/bytes from a successful (2xx/3xx) response, and under normal use, the hook will raise an exception when a 4xx/5xx status is received from the server. This has an unfortunate name collision with HttpHook's check_response, which passes through to the Response object's raise_for_status method and then massages a raised exception for 4xx/5xx responses (checking only status code and not the response content).

There's a lot of leaky API between the HttpOperator which wraps the HttpHook which wraps the requests library, and the operator user/DAG writer has to be aware of all of them to really understand how the operator is going to behave. (And keep aware if those wrapped components change behavior.)

One simple way around this particular issue might be to have the HttpOperator take an optional argument like raise_for_status whose behavior is to pass {'check_response': False} to the HttpHook, and document the parameter and the response_check parameter better so it's clear how to use them if you want to do what the OP does - interrogate the response code yourself.

A more fluent alternative might be to have the Operator take a check_status parameter that is a callback function to examine the response code, and if it is set, the operator code will (a) set the Hook to not do its own status check, and (b) get called after the hook is run and before any response_check function is called. That way response_check can preserve its semantics of only getting called for a successful response, and DAG writer can either use the hook's status check or substitute their own function (but never both).

Completely agree with this detailed elaboration.

I would suggest that when a custom check_response callback is specified within the operator, it would then in turn be passed through the HttpHook run method so that it doesn’t use the default check_status. Also specifying the check_response in the extra options didn’t work and only makes the useage overly complicated for nothing.

@topherinternational
Copy link
Contributor

topherinternational commented Dec 29, 2024

{'check_response': False}

Just tested it and it fails with following error as it considers the check_response passed in the extra_options as a HTTP header:

requests.exceptions.InvalidHeader: Header part (False) from ('check_response', False) must be of type str or bytes, not <class 'bool'>

What exactly was your code here? Did you put the check_response pair in the Operator constructor's extra_options? Or in the Connection object?

Edit: I ask bc I played with the tests and the only way I could get that InvalidHeader error was to pass the check_response param in the operator's headers arg instead of the extra_options, or to put check_response=false in the Connection's extra table.

@dabla
Copy link
Contributor Author

dabla commented Jan 6, 2025

{'check_response': False}

Just tested it and it fails with following error as it considers the check_response passed in the extra_options as a HTTP header:
requests.exceptions.InvalidHeader: Header part (False) from ('check_response', False) must be of type str or bytes, not <class 'bool'>

What exactly was your code here? Did you put the check_response pair in the Operator constructor's extra_options? Or in the Connection object?

Edit: I ask bc I played with the tests and the only way I could get that InvalidHeader error was to pass the check_response param in the operator's headers arg instead of the extra_options, or to put check_response=false in the Connection's extra table.

I've defined those in the extra_options of the connection, tested on the latest Airflow version.

@dabla
Copy link
Contributor Author

dabla commented Jan 6, 2025

The issue is in this method, it pops all non related extra_options parameters and keeps the rest to pass it as headers, an extra pop should be needed on the check_response key to avoid the issue, I'll open a PR to fix that:

    def _configure_session_from_extra(
            self, session: requests.Session, connection: Connection
        ) -> requests.Session:
            extra = connection.extra_dejson
            extra.pop("timeout", None)
            extra.pop("allow_redirects", None)
            extra.pop("check_response", None)  # this should be added
            session.proxies = extra.pop("proxies", extra.pop("proxy", {}))
            session.stream = extra.pop("stream", False)
            session.verify = extra.pop("verify", extra.pop("verify_ssl", True))
            session.cert = extra.pop("cert", None)
            session.max_redirects = extra.pop("max_redirects", DEFAULT_REDIRECT_LIMIT)
            session.trust_env = extra.pop("trust_env", True)
            try:
                session.headers.update(extra)
            except TypeError:
                self.log.warning("Connection to %s has invalid extra field.", connection.host)
            return session

@topherinternational
Copy link
Contributor

{'check_response': False}

Just tested it and it fails with following error as it considers the check_response passed in the extra_options as a HTTP header:

requests.exceptions.InvalidHeader: Header part (False) from ('check_response', False) must be of type str or bytes, not <class 'bool'>

What exactly was your code here? Did you put the check_response pair in the Operator constructor's extra_options? Or in the Connection object?

Edit: I ask bc I played with the tests and the only way I could get that InvalidHeader error was to pass the check_response param in the operator's headers arg instead of the extra_options, or to put check_response=false in the Connection's extra table.

I've defined those in the extra_options of the connection, tested on the latest Airflow version.

Ah, I think what the other commentor meant was to pass extra_options to the HttpOperator, not the Connection. When I pass it to the operator the setting is passed to the hook as expected and the response_check function is called.

This should solve your immediate problem, checking status and response content with your callback instead of the hook raising a preemptive exception.

@topherinternational
Copy link
Contributor

The issue is in this method, it pops all non related extra_options parameters and keeps the rest to pass it as headers, an extra pop should be needed on the check_response key to avoid the issue, I'll open a PR to fix that:


    def _configure_session_from_extra(

            self, session: requests.Session, connection: Connection

        ) -> requests.Session:

            extra = connection.extra_dejson

            extra.pop("timeout", None)

            extra.pop("allow_redirects", None)

            extra.pop("check_response", None)  # this should be added

            session.proxies = extra.pop("proxies", extra.pop("proxy", {}))

            session.stream = extra.pop("stream", False)

            session.verify = extra.pop("verify", extra.pop("verify_ssl", True))

            session.cert = extra.pop("cert", None)

            session.max_redirects = extra.pop("max_redirects", DEFAULT_REDIRECT_LIMIT)

            session.trust_env = extra.pop("trust_env", True)

            try:

                session.headers.update(extra)

            except TypeError:

                self.log.warning("Connection to %s has invalid extra field.", connection.host)

            return session

This is the call of someone in the core contributor club, so I'm just giving my opinion -

This does look like a bug in the HTTP hook, but it seems like a separate issue to this issue. I see the two breaking down like this:

  • When I use the HttpOperator, I want to tell it to skip any status checking and let me check status through the existing response_check callback argument
  • When I declare a connection to use with the HttpHook, I want to specify that the hook should not do a status check on any call made with this connection

Using the hook-connection contract to configure the control flow of the operator perpetuates the API leaking problem that spawned this issue in the first place.

It's a separate issue/question whether the hook should check it status based on an option in the connection object, or if that behavior should always be configured by the caller (args to the constructor or run()). I lean towards yes but that can be discussed in a new issue filing.

@dabla
Copy link
Contributor Author

dabla commented Jan 7, 2025

I've created a PR which solves the issue of the check_response being interpreted as a HTTP header when defined under the extras of the HTTP connection.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet provider:http
Projects
None yet
Development

No branches or pull requests

3 participants