Skip to content

Conversation

@mhenc
Copy link
Contributor

@mhenc mhenc commented Nov 2, 2022


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added area:API Airflow's REST/HTTP API area:CLI area:Scheduler including HA (high availability) scheduler labels Nov 2, 2022
@bolkedebruin
Copy link
Contributor

This needs an AIP.

And for what it matters please no JSON-RPC imho, but I would be much more in favor of something like grpc. Json is nice for interop but nothing else.

@mhenc
Copy link
Contributor Author

mhenc commented Nov 8, 2022

There is AIP for this:
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API
GRPC was initial idea, but JSON-RPC is a result of discussion :
https://lists.apache.org/thread/dybm8mkxpllghg820swb9pcdbs3hsto8
and TBH JSON RPC looks better from developer perspective - compare this to
#25094

@potiuk
Copy link
Member

potiuk commented Nov 8, 2022

Yep. AIP is even approved - conditionally re- on-the-wire transport mechanism (i.e remote method execution choice and serialization choice) which was deferred to "before founding PR".

And while I iniitially advocated for GRPC (which answer both questions), there are some drawbacks of using GRPC (debuggability and threading model used) - this has been pointed by @ashb and @andrewgodwin mostly.

There are some benefitst of using more "web standard" approach - JSON-RPC allows us to plug-in better into already existing mechanisms we use and deos not require from us to learn and operate any new technology. The tests performed (initally by me then later by Mateusz) have shown that the performance overhead from using binary RPC vs. JSON are not really meaningful in our context, so they are not decisive factor.

I think the last thing that we need to see if whether we can replace Pickling with our own JSON serialization we alredy use in airflow - it seems like bad idea initially (because of the overhead) but again, it's likely negigible and the benefit of using more "standard" for us approach (we serialize internal objects using JSON Serializer a lot in various cases) and being able to debug it "on-the-wire" without extra tools are I think important benefits that should not be under-rated. I think here our own JSON Serializer trumps Pickle, because it already transparently handles a number of structures we will have to serialize including some K8S configuraitons (which require custom serialization when using pickle.

The "easiness of development" and maintainability were for me important but as @mhenc mentioned - we did both POCs - GRPC here (I did it): #25094 and this one is using JSON-RPC and for me, it looks like the approach proposed by Mateusz is even easier to develop and maintain - and as long as our serialization approch works transparently for all the object that we need to send/retrieve. So i got convinced that what Ash and Andrew proposed is a good idea. It does not change the principle of the AIP - it's just a matter of implementation detail. From the development perspective, this will be mostly decorating of the methods that we want to turn into "remote" ones.

@bolkedebruin - I think if you read through the discussions - maybe you will have other specifc comments about those choices - if they have anything to add on top of the above POCs. Tests and conclusions we have based on those. We deliberately took it slower with Mateusz and gave both of us (and the community) time to comment and raise their concerns, the AIP general approach is already approved, but if there are any specific concerns on those technology choices - we are still open to hear a bit more comments before we make the first foundng "real PR" rather than POC - as we promised when we called for voting on the AIP.

Do you have some specific concerns or other criteria that we should consider?

Just to summarize those that we took into account:

  • easines of development/maintenance
  • debuggability
  • familiarlity of the community with underlying technologies
  • speed/performance of serialisation and transport

They are now listed in order of preference - mostly because we found out that speed/performance of serialization and transport is literally orders of magnitude less important than speed/performance of the operations we are going to serialize and there is very little difference of the overhead.

@bolkedebruin
Copy link
Contributor

I stand corrected, the AIP was there and I missed the voting. So any argument I make below is mood.

The issue I find with JSON-RPC and with JSON in general is its lack of strong typing (yes I know of JSON Schema, but that is like XML and XSL) and that when thing become complex it starts becoming unreadable and you need a seperate IDE/parser for it (ohhh I forgot a ","). It is great for public APIs and interop but for internal ones not so much as you control both ends. This means lack of maintainability ("ohhh you forgot to update the schema") for me which trumps imho the readability over the wire part (as you control both ends and can be enabled for debugging purposes wth grpc/protobuf). Too much stuff happens at runtime rather than at compile time.

I have never been happy with any choice for JSON anywhere except public facing APIs. Now I am gonna cry in a corner somewhere.

@potiuk
Copy link
Member

potiuk commented Nov 14, 2022

I stand corrected, the AIP was there and I missed the voting. So any argument I make below is mood.

The JSON-RPC is actually up for discussion still. Until the founding PR (or even after - I think the way we will implement int the way that we will be able to replace the implementation in the future if we find this is not good. In the implementation proposal of @mhenc this is literally an implementation detail - and vast majority of changes will be to make sure we implement tests amd make sure all the calls are mapped.

This is because we are using "standard" HTTP stack rather than Grpc-like solution, because GRPC gets a bit more of a "tie-in" about the processing model for threading etc. And after a lot of reading and talking to some of my friends, and analysing issues - I tend to agree with @ash and @andrewgodwin that (among other issues), the threading model of GRPC is a big, unknown risk (Python Threads and GIL issues are still unsolved issue, and using proven set of backend fori/process-driven/gevent etc. approach is much more reilable, debuggable and scaleable). And JSON RPC is basically piggy-backing on that while providing a simple and standard semantics of mapping input/output parameters of methods.

I'd say - let's give it a try and see from the final tests, if JSON-RPC is as good solution as we think it is, and we can revisit it later. Eventually the whole test harness and making sure that there are no DB methods left and that some of the components of ours are truly db-less will let us move forward with it. Once we get it implemented, Replacing the internals of communication between the components will be far less of a problem.

@mhenc mhenc closed this Nov 30, 2022
@mhenc mhenc deleted the poc_44 branch November 30, 2022 13:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:API Airflow's REST/HTTP API area:CLI area:Scheduler including HA (high availability) scheduler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants