Octavia CLI has option to use existing IDs for sources, destinations, and connections #13203

Closed
evantahler opened this issue May 25, 2022 · 5 comments
Labels
autoteam team/tse Technical Support Engineers

Comments

@evantahler
Contributor

evantahler commented May 25, 2022

The Problem

When using Octavia as a tool to back up an Airbyte project, or to promote the same project from CI -> Staging -> Production, we should be able to set the IDs of our sources, destinations, and connections deterministically. The IDs are important because that is how other tools (e.g. the API) will interact with the connection.

This allows easier use of the Airbyte API, because the URL to start a sync relies on the connection ID, which will be well-known. Today, running a specific connection that the Octavia CLI created on an Airbyte server involves looping through all the connections and finding the ID of the one you want by name:

From the Airflow Airbyte Operator:

# https://github.com/airbytehq/airflow-summit-airbyte-2022/blob/main/airflow/dags/dag_airbyte_airflow_dbt.py
import json

import requests


def get_ab_conn_id(ds=None, **kwargs):
    ab_url = "http://airbyte-server:8001/api/v1"
    headers = {"Accept": "application/json", "Content-Type": "application/json"}
    # Use the first workspace returned by the server.
    workspace_id = requests.post(f"{ab_url}/workspaces/list", headers=headers).json().get("workspaces")[0].get("workspaceId")
    payload = json.dumps({"workspaceId": workspace_id})
    connections = requests.post(f"{ab_url}/connections/list", headers=headers, data=payload).json().get("connections")
    # Scan every connection in the workspace for the one with the expected name.
    for c in connections:
        if c.get("name") == "demo_connection":
            return c.get("connectionId")

# Then you can...

    airbyte_conn_id = PythonOperator(
        task_id="get_ab_conn_id",
        python_callable=get_ab_conn_id,
    )

    sync_source_destination = AirbyteTriggerSyncOperator(
        task_id='airbyte_sync_source_dest_example',
        connection_id=airbyte_conn_id.output,
        trigger_rule="none_failed",
    )
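
The lookup above returns None silently when no connection has the expected name, and it picks the first match if several share that name. A minimal sketch of a stricter lookup follows. `find_connection_id` is a hypothetical helper, not part of the linked DAG or of the Airbyte API, and it assumes the list has the same shape as the `connections` array returned by `/connections/list` in the snippet above.

```python
def find_connection_id(connections, name):
    """Return the connectionId of the single connection called `name`.

    Hypothetical helper: `connections` is assumed to be a list of dicts
    with "name" and "connectionId" keys, as in the snippet above.
    """
    matches = [c["connectionId"] for c in connections if c.get("name") == name]
    if not matches:
        raise LookupError(f"no connection named {name!r}")
    if len(matches) > 1:
        # A name that matches several connections cannot identify one of them.
        raise LookupError(f"{len(matches)} connections named {name!r}; name is ambiguous")
    return matches[0]
```

Failing loudly here keeps a missing or duplicated name from turning into a sync triggered with `connection_id=None`.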

A similar problem arises when creating a new connection from a source and destination. If the IDs of the source and destination cannot be known ahead of time, building the connection is not possible without first re-reading the YML files of source and destination. Even though the source and destination YML files are checked into git, Octavia re-generates the IDs of those connectors on every run. This means that (1) the code in git does not match reality and (2) making the connection requires an extra programmatic step; see https://github.com/airbytehq/airflow-summit-airbyte-2022/blob/main/tools/change_resource_id.py
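
The extra step described above amounts to copying the freshly generated source and destination IDs into the connection's YML before applying it. A rough sketch of that idea, using only the standard library: the top-level `source_id` and `destination_id` keys are assumptions about the connection file's layout, and `patch_connection_ids` is a hypothetical helper, not the linked change_resource_id.py script.

```python
import re


def patch_connection_ids(yaml_text, source_id, destination_id):
    """Rewrite the top-level source_id/destination_id lines of a YAML document.

    Assumption: each key appears exactly once, at column 0. A real tool
    would use a YAML parser; a regex keeps this sketch dependency-free.
    """
    replacements = {"source_id": source_id, "destination_id": destination_id}
    for key, value in replacements.items():
        pattern = re.compile(rf"^{key}:.*$", flags=re.MULTILINE)
        # A callable replacement avoids any escape-sequence handling in `value`.
        yaml_text, count = pattern.subn(lambda _m, k=key, v=value: f"{k}: {v}", yaml_text)
        if count != 1:
            raise ValueError(f"expected exactly one top-level {key!r} key, found {count}")
    return yaml_text
```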

Desired solution

A flag for the Octavia CLI to use the IDs already present in the YML files:

octavia apply -f sources/fake_users/configuration.yaml --use-id
octavia apply -f destinations/postgres_destination/configuration.yaml --use-id
octavia apply -f connections/demo_connection/configuration.yaml --use-id

This requires the Airbyte API to accept user-provided IDs and not always auto-generate a new UUID. These UUIDs/IDs should be validated for uniqueness on creation.
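
The behavior requested here (accept a caller-supplied ID, fall back to a generated UUID, reject duplicates) can be illustrated with a small in-memory model. This is only a sketch of the idea under discussion; `ResourceRegistry`, `DuplicateIdError`, and `create` are hypothetical names and do not reflect how the Airbyte API is implemented.

```python
import uuid


class DuplicateIdError(ValueError):
    """Raised when a caller-supplied ID is already in use."""


class ResourceRegistry:
    """Hypothetical stand-in for a server-side table of resources."""

    def __init__(self):
        self._resources = {}

    def create(self, config, resource_id=None):
        if resource_id is None:
            # Current behavior described in the issue: always generate an ID.
            new_id = uuid.uuid4()
        else:
            # Proposed behavior: accept the caller's ID if it is a valid UUID.
            new_id = uuid.UUID(str(resource_id))  # ValueError if malformed
        if new_id in self._resources:
            raise DuplicateIdError(f"resource ID {new_id} already exists")
        self._resources[new_id] = config
        return new_id
```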

I would argue that --use-id should be the default behavior, as the expectation is that the server's state should match the YML files exactly.

Possible Side Effects

Airbyte assumes that the connection ID is unique in our metrics. If the same connection ID appears in multiple servers, things might get weird.

@evantahler
Contributor Author

cc @ChristopheDuong to comment on what bad things can happen if we allow users to set the ID of their connections.

@alafanechere
Contributor

Hey @evantahler,
Thank you for the suggestion.
I took a different approach in this PR which is a work in progress.
I suggest storing the backend generated IDs in the state file and keeping one state per workspace.
With this approach, users do not have to manage their own IDs, and they will be able to deploy the same connection configuration on multiple workspaces.
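
One way to picture the approach described here: the configuration stays identical across environments, while a separate state record per workspace maps each configuration file to the ID the backend generated for it. The sketch below is illustrative only; the `WorkspaceState` class, its file naming, and its JSON layout are assumptions, not the format used by the PR.

```python
import json
from pathlib import Path


class WorkspaceState:
    """Hypothetical per-workspace map of configuration path -> backend ID."""

    def __init__(self, state_dir, workspace_id):
        self._path = Path(state_dir) / f"state_{workspace_id}.json"
        # Load earlier state for this workspace if a file already exists.
        self._ids = json.loads(self._path.read_text()) if self._path.exists() else {}

    def record(self, configuration_path, resource_id):
        """Remember the ID the backend returned for one configuration file."""
        self._ids[configuration_path] = resource_id
        self._path.parent.mkdir(parents=True, exist_ok=True)
        self._path.write_text(json.dumps(self._ids, indent=2, sort_keys=True))

    def resource_id_for(self, configuration_path):
        """Return the stored ID, or None if the resource was never applied."""
        return self._ids.get(configuration_path)
```

Because each workspace gets its own file, the same configuration path can resolve to a different ID in staging and in production.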

@evantahler
Contributor Author

evantahler commented May 31, 2022

I don't think this will solve the problem.

We need to know the ID of the connection before we use the CLI. If I'm trying to bootstrap an environment, I need to know that the connection ID /will be/ abc123 ahead of time so I can orchestrate it and check on it with other tools. Learning the ID after creation doesn't really solve the problem. It also means that the code (state files) will not be the same between environments, which sort of defeats the purpose of sharing code between dev/staging/prod...
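
For illustration, one way an ID could be known before anything is created is to derive it from a stable name with a name-based (version 5) UUID. This is not something proposed in the thread, and it would only help if the server accepted caller-supplied IDs, which is exactly what this issue asks for. `PROJECT_NAMESPACE` and `planned_resource_id` are hypothetical names.

```python
import uuid

# Hypothetical project-wide namespace. Any fixed UUID works, as long as every
# environment and every tool derives IDs from the same one.
PROJECT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "airbyte.example.com")


def planned_resource_id(configuration_path):
    """Derive a stable UUID from a resource's configuration path.

    The same path always yields the same ID, so other tools can compute
    the ID without asking the server.
    """
    return uuid.uuid5(PROJECT_NAMESPACE, configuration_path)
```

An orchestrator could call `planned_resource_id("connections/demo_connection/configuration.yaml")` and get the same value on every machine.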

@alafanechere
Contributor

alafanechere commented Jun 3, 2022

In your use case, do you also want to use Airflow to orchestrate the creation of the Airbyte resources from Octavia?
In my opinion, the provisioning (octavia apply) of an Airbyte instance with Octavia should be done outside of Airflow because it's not a repeating task and #13070 will allow you to easily deploy the same configuration.yaml files on multiple Airbyte environments.

I think we could add an octavia get connection <name> command (which is currently being implemented here) which will output JSON metadata about the connection (including the connection ID). Then, for your get_ab_conn_id task, you will be able to use a DockerOperator to run this command from your Airflow DAG. Does it make sense?
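
If the proposed command prints JSON as described, an orchestration task could extract the ID from its output. The sketch below covers only the parsing step. The `connectionId` field name is borrowed from the API response used earlier in this issue and is an assumption about the command's output, which the thread does not specify; `extract_connection_id` is a hypothetical helper.

```python
import json


def extract_connection_id(cli_output):
    """Pull the connection ID out of JSON text printed by a CLI command.

    Assumption: the output is a single JSON object with a "connectionId" key.
    """
    metadata = json.loads(cli_output)
    if not isinstance(metadata, dict):
        raise ValueError("expected a JSON object describing one connection")
    connection_id = metadata.get("connectionId")
    if not connection_id:
        raise KeyError("connectionId missing from command output")
    return connection_id
```

In an Airflow task, `cli_output` would be whatever the command wrote to standard output.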

@evantahler
Contributor Author

evantahler commented Jun 3, 2022

Ooh - I think the get commands are the missing piece of the puzzle! That would allow us to spin up the cluster, set up our connections, and then get the IDs to pass to whatever other tool we need. Awesome! I'll close this in favor of #13254

cc @marcosmarxm
