diff --git a/docs/getting_started/index.rst b/docs/getting_started/index.rst index eeaa155232..ccad2a68f2 100644 --- a/docs/getting_started/index.rst +++ b/docs/getting_started/index.rst @@ -36,7 +36,7 @@ The recommended way to install and run Cosmos depends on how you run Airflow. Fo - `Getting Started on MWAA `__ - `Getting Started on GCC `__ -You might require a different setup depending on your particular configuration. See :ref:`exec-methods`. +You might require a different setup depending on your particular configuration. See :ref:`execution-modes`. Example Demo: Jaffle Shop Project __________________________________ @@ -76,24 +76,3 @@ as ``max_active_tasks``, ``max_active_runs``, and ``default_args``. With Cosmos, transitioning from a dbt workflow to an Airflow Dag is seamless, giving you the best of both tools for managing and scaling your data workflows. - -.. _exec-methods: - -Execution Methods ------------------ - -For more customization, check out the different execution modes that Cosmos supports on the `Execution Modes `__ page. - -For specific guides, see the following: - -- `Executing dbt DAGs with DockerOperators <../../guides/run_dbt/container/docker.html>`__ -- `Executing dbt DAGs with KubernetesPodOperators <../../guides/run_dbt/container/kubernetes.html>`__ -- `Executing dbt DAGs with Watcher Kubernetes Mode <../../guides/run_dbt/container/watcher-kubernetes-execution-mode.html>`__ -- `Executing dbt DAGs with AzureContainerInstancesOperators <../../guides/run_dbt/container/azure-container-instance.html>`__ -- `Executing dbt DAGs with GcpCloudRunExecuteJobOperators <../../guides/run_dbt/container/gcp-cloud-run-job.html>`__ - - -Concepts Overview ------------------ - -How do dbt and Airflow concepts map to each other? Learn more `in this link `__. 
diff --git a/docs/guides/run_dbt/airflow-worker/execution-modes-local-conflicts.rst b/docs/guides/dbt_setup/execution-modes-local-conflicts.rst similarity index 89% rename from docs/guides/run_dbt/airflow-worker/execution-modes-local-conflicts.rst rename to docs/guides/dbt_setup/execution-modes-local-conflicts.rst index 0f9120127c..60e28679fd 100644 --- a/docs/guides/run_dbt/airflow-worker/execution-modes-local-conflicts.rst +++ b/docs/guides/dbt_setup/execution-modes-local-conflicts.rst @@ -1,19 +1,17 @@ -:orphan: - .. _execution-modes-local-conflicts: Airflow and dbt dependencies conflicts ====================================== -When using the `Local Execution Mode `__, users may face dependency conflicts between -`Apache Airflow® `_ and dbt. The conflicts may increase depending on the Airflow providers and dbt adapters being used. +When using the :ref:`local-execution` without defining a custom ``ExecutionConfig.dbt_executable_path``, you might have dependency conflicts between +`Apache Airflow® `_ and dbt. The number of conflicts can increase depending on the Airflow providers and dbt adapters you use. If you find errors, we recommend users isolating the installation of dbt from the Airflow installation. -With the `Local Execution Mode `__, this can be accomplished by installing dbt in a separate -Python virtualenv and setting the `ExecutionConfig.dbt_executable_path <../guides/execution-config.html>`_ and -`RenderConfig.dbt_executable_path <../guides/render-config.html>`_ parameters. +With the ``local`` execution mode, this can be accomplished by installing dbt in a separate +Python virtualenv and setting the `ExecutionConfig.dbt_executable_path <../../reference/configs/execution-config.html>`_ and +`RenderConfig.dbt_executable_path <../../guides/translate_dbt_to_airflow/render-config.html>`_ parameters. -The page `execution modes `__ describes many other methods that support isolating dbt from Airflow. 
+The :ref:`execution-modes` page describes many other methods that support isolating dbt from Airflow. In the following table, ``x`` represents combinations that lead to conflicts (vanilla ``apache-airflow`` and ``dbt-core`` packages): diff --git a/docs/guides/index.rst b/docs/guides/index.rst index b16685da0c..abd65a0deb 100644 --- a/docs/guides/index.rst +++ b/docs/guides/index.rst @@ -3,14 +3,33 @@ Guides ====== -Cosmos offers a number of configuration options to customize its behavior. For more info, check out the links on the left or the table of contents below. +.. toctree:: + :maxdepth: 0 + :hidden: + + self + +Cosmos offers a number of configuration options to customize how Airflow Dags and dbt commands run. + +To set up a project, you follow the same general set of steps. + + +Set up dbt with Airflow +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Make your dbt projects available to Airflow and install dbt into the environment where your dbt code runs. .. toctree:: :maxdepth: 1 - :hidden: :caption: Set up dbt with Airflow dbt_setup/dbt-fusion + dbt_setup/execution-modes-local-conflicts + +Connect to your dbt database +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Configure your Cosmos project to allow Airflow Dags to initiate dbt commands, and make data transformations and updates in your data warehouses. You can create these connections with the ``profiles.yml`` file in your dbt project, by using profile mappings, or by customizing ``ProfileConfig`` per dbt configuration. .. toctree:: :maxdepth: 1 @@ -22,6 +41,12 @@ Cosmos offers a number of configuration options to customize its behavior. For m connect_database/use-profile-mapping connect_database/profile-customise-per-node + +Translate your dbt code into Airflow Dags +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can customize how Cosmos parses your dbt workflows into Airflow Dags. 
Choosing how you want your dbt nodes to map to Airflow tasks within Dags can affect the time required for Cosmos to parse the dbt workflows and for Airflow to execute the resulting Dags. + .. toctree:: :maxdepth: 1 :hidden: @@ -34,9 +59,14 @@ Cosmos offers a number of configuration options to customize its behavior. For m translate_dbt_to_airflow/render-config Customize node conversion + +Run dbt +~~~~~~~~~~~~~ + +Specify more details about how Cosmos runs both dbt commands and Airflow Dags. This includes choosing one of the :ref:`execution-modes`, either one that runs dbt on an Airflow worker node or one that runs it in a container. You can customize additional aspects of how your dbt code runs, like using particular operators that correspond to dbt commands. You can also use Airflow's scheduling capabilities in your Cosmos Dags. + .. toctree:: - :maxdepth: 3 - :hidden: + :maxdepth: 1 :caption: How Cosmos runs dbt run_dbt/execution-modes @@ -46,24 +76,37 @@ Cosmos offers a number of configuration options to customize its behavior. For m run_dbt/operators/operators run_dbt/customization/index +Multi-project Setups +~~~~~~~~~~~~~~~~~~~~ + +If you have a multi-project architecture in which multiple dbt projects reference each other's models, you can set up ``dbt-loom`` with Cosmos to handle cross-project references. + .. toctree:: :maxdepth: 1 - :hidden: :caption: Multi-project Setups Handle cross-project references +Add your dbt documentation +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Cosmos supports dbt's documentation capabilities. + .. toctree:: :maxdepth: 1 - :hidden: :caption: dbt Documentation dbt_docs/generating-docs dbt_docs/hosting-docs + +Cosmos DevEx +~~~~~~~~~~~~ + +You can configure Cosmos to improve your development experience. + .. 
toctree:: :maxdepth: 1 - :hidden: :caption: Cosmos DevEx cosmos_devex/lineage diff --git a/docs/guides/run_dbt/airflow-worker/async-execution-mode.rst b/docs/guides/run_dbt/airflow-worker/async-execution-mode.rst index 55d6778abc..0b7784a082 100644 --- a/docs/guides/run_dbt/airflow-worker/async-execution-mode.rst +++ b/docs/guides/run_dbt/airflow-worker/async-execution-mode.rst @@ -1,20 +1,23 @@ .. _async-execution-mode: -Airflow Async Execution Mode +Airflow async execution mode ============================ -This execution mode can reduce the runtime by 35% in comparison to Cosmos LOCAL execution mode, but is currently only available for BigQuery. While this mode was introduced in Cosmos 1.9, we strongly encourage users to use Cosmos 1.11, which has significant performance improvements. +This execution mode can reduce the runtime by 35% in comparison to Cosmos ``LOCAL`` execution mode, but is currently only available for BigQuery. While this mode was introduced in Cosmos 1.9, we strongly encourage users to use the latest version of Cosmos, which has significant performance improvements. -It can be particularly useful for long-running transformations, since it leverages Airflow's `deferrable operators `__. +The ``airflow_async`` execution mode is a way to run the dbt resources from your dbt project using Apache Airflow's +`Deferrable operators `__. +This execution mode is well-suited for when you have long-running resources and you want to run them asynchronously by +leveraging Airflow's deferrable operators. With deferrable operators, you can potentially observe higher throughput of tasks +because more dbt nodes run in parallel, since they won't be blocking Airflow's worker slots. -In this mode, there is a ``SetupAsyncOperator`` that will pre-generate the SQL files for the dbt project and upload them to Airflow XCom or a remote location. 
A remote location will only be used if users set ``AIRFLOW__COSMOS__REMOTE_TARGET_PATH`` and ``AIRFLOW__COSMOS__REMOTE_TARGET_PATH_CONN_ID``. This operator is run before the remaining pipeline. -All the pipeline dbt model transformations will be run using ``DbtRunAirflowAsyncOperator`` which, instead of running the ``dbt run`` command for each model. They will download the SQL files from the Airflow XCom or remote location and execute them directly leveraging the Airflow ``BigQueryInsertJobOperator``. - -Users can leverage other existing ``BigQueryInsertJobOperator`` features, such as the UI controls to link to the job in the BigQuery UI. +In this mode, there is a ``SetupAsyncOperator`` that pre-generates the SQL files for the dbt project and uploads them to Airflow XCom or a remote location. Airflow only uses a remote location if you set ``AIRFLOW__COSMOS__REMOTE_TARGET_PATH`` and ``AIRFLOW__COSMOS__REMOTE_TARGET_PATH_CONN_ID``. This operator runs before the remaining pipeline. +All the pipeline dbt model transformations run using ``DbtRunAirflowAsyncOperator`` instead of running the ``dbt run`` command for each model. They download the SQL files from the Airflow XCom or remote location, and then execute them directly using the Airflow ``BigQueryInsertJobOperator``. +You can also use other existing ``BigQueryInsertJobOperator`` features, such as the UI controls to link to the job in the BigQuery UI. Advantages of Airflow Async Mode -++++++++++++++++++++++++++++++++ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - **Improved Task Throughput:** Async tasks free up Airflow workers by leveraging the Airflow Trigger framework. While long-running SQL transformations are executing in the data warehouse, the worker is released and can handle other tasks, increasing overall task throughput. - **Better Resource Utilization:** By minimizing idle time on Airflow workers, async tasks allow more efficient use of compute resources. 
Workers aren't blocked waiting for external systems and can be reused for other work while waiting on async operations. @@ -34,18 +37,18 @@ We have `observed `_ Getting Started with Airflow Async Mode -+++++++++++++++++++++++++++++++++++++++ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This guide walks you through setting up an Astro CLI project and running a Cosmos-based DAG with a deferrable operator, enabling asynchronous task execution in Apache Airflow. Prerequisites -+++++++++++++ +------------- - `Astro CLI `_ - Airflow>=2.9 1. Create Astro-CLI Project -+++++++++++++++++++++++++++ +--------------------------- Run the following command in your terminal: @@ -70,7 +73,7 @@ This will create an Astro project with the following structure: 2. Update Dockerfile -++++++++++++++++++++ +-------------------- Edit your Dockerfile to ensure all necessary requirements are included. @@ -80,7 +83,7 @@ Edit your Dockerfile to ensure all necessary requirements are included. 3. Add astronomer-cosmos Dependency -+++++++++++++++++++++++++++++++++++ +----------------------------------- In your ``requirements.txt``, add: @@ -90,7 +93,7 @@ In your ``requirements.txt``, add: 4. Create Airflow DAG -+++++++++++++++++++++ +--------------------- 1. Create a new DAG file: ``dags/cosmos_async_dag.py`` @@ -152,8 +155,8 @@ In your ``requirements.txt``, add: - Add a valid dbt project inside your Airflow project under ``dags/dbt/``. -5. Start the Project -++++++++++++++++++++ +5. Start the project +-------------------- Launch the Airflow project locally: @@ -166,8 +169,8 @@ This will: - Spin up the scheduler, webserver, and triggerer (needed for deferrable operators) - Expose Airflow UI at http://localhost:8080 -6. Create Airflow Connection -++++++++++++++++++++++++++++ +6. Create Airflow connection +---------------------------- Create an Airflow connection with following configurations @@ -196,7 +199,7 @@ Create an Airflow connection with following configurations 7. 
Execute the DAG -++++++++++++++++++ +------------------ 1. Visit the Airflow UI at ``http://localhost:8080`` 2. Enable the DAG: ``cosmos_async_dag`` @@ -209,8 +212,8 @@ Create an Airflow connection with following configurations The ``run`` tasks will run asynchronously via the deferrable operator, freeing up worker slots while waiting on I/O or long-running tasks. -Control of where to upload the SQL files -++++++++++++++++++++++++++++++++++++++++ +Control where to upload the SQL files +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For optimal performance we encourage to keep Cosmos standard behaviour (introduced in 1.11), which is to upload the SQL files to XCom, instead of a remote object location. @@ -225,7 +228,7 @@ However, if you want to upload the SQL files to a remote object location instead Limitations -+++++++++++ +~~~~~~~~~~~ 1. **Limited to dbt models**: Only dbt resource type models are run asynchronously using Airflow deferrable operators. Other resource types are executed synchronously, similar to the local execution mode. diff --git a/docs/guides/run_dbt/airflow-worker/cosmos-managed-venv.rst b/docs/guides/run_dbt/airflow-worker/cosmos-managed-venv.rst new file mode 100644 index 0000000000..437f9637ba --- /dev/null +++ b/docs/guides/run_dbt/airflow-worker/cosmos-managed-venv.rst @@ -0,0 +1,30 @@ +.. _cosmos-managed-venv: + +Cosmos-managed virtual environment execution mode +======================================================== + +The ``virtualenv`` mode runs dbt commands from Python virtual environments created and managed by Cosmos. This mode removes the need to create a virtual environment at build time, unlike ``ExecutionMode.LOCAL``, while avoiding package conflicts. It is intended for cases where: + +- You can't install dbt directly in the Airflow environment, either in the same environment or a dedicated one. +- Multiple dbt installations are required, and you prefer Cosmos to manage them without modifying the Airflow deployment. 
+- Speed is not a concern, and you can afford for Cosmos to create and update the Python virtual environment during the execution of each dbt node. + +In most cases, the local execution mode with ``ExecutionConfig.dbt_executable_path`` is the preferred option, as it allows you to manage the dbt environment during the Airflow deployment process, instead of per-dbt node execution. + +When you use ``virtualenv`` mode, you are responsible for declaring which version of ``dbt`` to use by passing the ``py_requirements`` argument. Set this argument directly in operator instances or when you instantiate ``DbtDag`` and ``DbtTaskGroup`` as part of ``operator_args``. + +Similar to the ``local`` execution mode, Cosmos converts Airflow Connections into a format that ``dbt`` understands by creating a ``dbt`` profile file (``profiles.yml``). +Also similar to the ``local`` execution mode, Cosmos will by default attempt to use a ``partial_parse.msgpack`` if one exists to speed up parsing. + +Some drawbacks of the ``virtualenv`` approach: + +- It is slower than ``local`` because it may create and update a new Python virtual environment for each Cosmos dbt task run, depending on the Airflow executor and whether you set the ``ExecutionConfig.virtualenv_dir`` configuration. +- If dbt is unavailable in the Airflow scheduler, the default ``LoadMode.DBT_LS`` will not work. In this scenario, you must use a :ref:`parsing-methods` that does not rely on dbt, such as ``LoadMode.MANIFEST``. +- Only ``InvocationMode.SUBPROCESS`` is currently supported. Attempting to use ``InvocationMode.DBT_RUNNER`` raises an error. + +Example of how to use: + +.. 
literalinclude:: ../../../../dev/dags/example_virtualenv.py + :language: python + :start-after: [START virtualenv_example] + :end-before: [END virtualenv_example] diff --git a/docs/guides/run_dbt/airflow-worker/index.rst b/docs/guides/run_dbt/airflow-worker/index.rst index eaa89c2d9f..ef3edda3e4 100644 --- a/docs/guides/run_dbt/airflow-worker/index.rst +++ b/docs/guides/run_dbt/airflow-worker/index.rst @@ -5,5 +5,7 @@ Run dbt in an Airflow worker :maxdepth: 1 :caption: Run dbt in an Airflow worker - async-execution-mode + local-execution-mode watcher-execution-mode + cosmos-managed-venv + async-execution-mode diff --git a/docs/guides/run_dbt/airflow-worker/local-execution-mode.rst b/docs/guides/run_dbt/airflow-worker/local-execution-mode.rst new file mode 100644 index 0000000000..ad97e31f3c --- /dev/null +++ b/docs/guides/run_dbt/airflow-worker/local-execution-mode.rst @@ -0,0 +1,23 @@ +.. _local-execution: + +Local execution mode +==================== + +By default, Cosmos uses the ``local`` execution mode. It is the fastest way to run Cosmos operators, since it runs dbt either as a library or as a local subprocess. +If dbt and Airflow dependencies conflict (see :ref:`execution-modes-local-conflicts`), you can usually pre-install dbt in an isolated Python virtual environment, either as part of the container image or as part of a pre-start script. + +The ``local`` execution mode assumes that the Airflow worker node can access a ``dbt`` binary. If ``dbt`` was not installed alongside Cosmos, you can create a dedicated virtual environment and define a custom path to ``dbt`` by setting the ``ExecutionConfig.dbt_executable_path`` argument. + +.. note:: + Starting in version 1.4, Cosmos tries to use dbt partial parsing (``partial_parse.msgpack``) to speed up task execution. + This feature is subject to `dbt partial parsing limitations `_. + Learn more: :ref:`partial-parsing`. 
+ +When using the ``local`` execution mode, Cosmos converts Airflow Connections into a native ``dbt`` profiles file (``profiles.yml``). + +Example of how to use, for instance, when ``dbt`` was installed together with Cosmos: + +.. literalinclude:: ../../../../dev/dags/basic_cosmos_dag.py + :language: python + :start-after: [START local_example] + :end-before: [END local_example] diff --git a/docs/guides/run_dbt/airflow-worker/watcher-execution-mode.rst b/docs/guides/run_dbt/airflow-worker/watcher-execution-mode.rst index 05bb21c7f7..355cef891a 100644 --- a/docs/guides/run_dbt/airflow-worker/watcher-execution-mode.rst +++ b/docs/guides/run_dbt/airflow-worker/watcher-execution-mode.rst @@ -1,7 +1,7 @@ .. _watcher-execution-mode: -Introducing ``ExecutionMode.WATCHER``: Experimental High-Performance dbt Execution in Cosmos -============================================================================================ +Watcher execution mode (Experimental) +====================================== With the release of **Cosmos 1.11.0**, we are introducing a powerful new experimental execution mode — ``ExecutionMode.WATCHER`` — designed to drastically reduce dbt pipeline run times in Airflow. @@ -149,7 +149,7 @@ This approach is best when your Airflow DAG is fully dedicated to a dbt project. :start-after: [START example_watcher] :end-before: [END example_watcher] -As it can be observed, the only difference with the default ``ExecutionMode.LOCAL`` is the addition of the ``execution_config`` parameter with the ``execution_mode`` set to ``ExecutionMode.WATCHER``. The ``ExecutionMode`` enum can be imported from ``cosmos.constants``. For more information on the ``ExecutionMode.LOCAL``, please, check the `dedicated page `__ +As it can be observed, the only difference with the default ``ExecutionMode.LOCAL`` is the addition of the ``execution_config`` parameter with the ``execution_mode`` set to ``ExecutionMode.WATCHER``. 
The ``ExecutionMode`` enum can be imported from ``cosmos.constants``. For more information on ``ExecutionMode.LOCAL``, see the :ref:`local-execution` documentation. **How it works:** diff --git a/docs/guides/run_dbt/container/aws-container-run-job.rst b/docs/guides/run_dbt/container/aws-container-run-job.rst index 4321c8f346..40cc705fd6 100644 --- a/docs/guides/run_dbt/container/aws-container-run-job.rst +++ b/docs/guides/run_dbt/container/aws-container-run-job.rst @@ -1,9 +1,24 @@ .. _aws-container-run-job: -Getting Started with Astronomer Cosmos on AWS ECS -================================================== +AWS ECS execution mode +====================== + +.. versionadded:: 1.9.0 + +Astronomer Cosmos provides a unified way to run containerized workloads across multiple cloud providers. Using ``AWS Elastic Container Service (ECS)`` as the execution mode provides an isolated and scalable way to run ``dbt`` tasks within an AWS ECS service. This execution mode ensures that each ``dbt`` run is performed inside a dedicated container running in an ECS task. + +Performance and maintenance considerations +++++++++++++++++++++++++++++++++++++++++++ + +This execution mode requires you to have an AWS environment configured to run ECS tasks (see :ref:`aws-ecs` for more details on the exact requirements). Similar to the ``Docker`` and ``Kubernetes`` execution modes, a Docker container should be available, containing the up-to-date ``dbt`` pipelines and profiles. + +Each task creates a new ECS task execution, providing full isolation. However, this separation introduces some overhead in execution time due to container startup and provisioning. If you require faster execution times, configuring appropriate ECS task definitions and cluster optimizations can help mitigate these delays. + +Setup ++++++ + +In this guide, you’ll learn how to deploy and run a Cosmos job on AWS Elastic Container Service (ECS) using Fargate. 
-Astronomer Cosmos provides a unified way to run containerized workloads across multiple cloud providers. In this guide, you’ll learn how to deploy and run a Cosmos job on AWS Elastic Container Service (ECS) using Fargate. Schematically, the guide will walk you through the steps required to build the following architecture: .. figure:: https://github.com/astronomer/astronomer-cosmos/raw/main/docs/_static/cosmos_aws_ecs_schematic.png diff --git a/docs/guides/run_dbt/container/aws-eks.rst b/docs/guides/run_dbt/container/aws-eks.rst new file mode 100644 index 0000000000..8ac9d2e802 --- /dev/null +++ b/docs/guides/run_dbt/container/aws-eks.rst @@ -0,0 +1,34 @@ +.. _aws-eks: + +AWS EKS execution mode +======================= + +The Amazon Elastic Kubernetes Service (AWS EKS), ``aws_eks``, approach is very similar to the ``kubernetes`` approach, but it is specifically designed to run on AWS EKS clusters. +It uses the `EKSPodOperator `_ +to run the dbt commands. You need to provide the ``cluster_name`` in your operator_args to connect to the AWS EKS cluster. + + +Example DAG + +.. code-block:: python + + postgres_password_secret = Secret( + deploy_type="env", + deploy_target="POSTGRES_PASSWORD", + secret="postgres-secrets", + key="password", + ) + + docker_cosmos_dag = DbtDag( + # ... + execution_config=ExecutionConfig( + execution_mode=ExecutionMode.AWS_EKS, + ), + operator_args={ + "image": "dbt-jaffle-shop:1.0.0", + "cluster_name": CLUSTER_NAME, + "get_logs": True, + "is_delete_operator_pod": False, + "secrets": [postgres_password_secret], + }, + ) diff --git a/docs/guides/run_dbt/container/azure-container-instance.rst b/docs/guides/run_dbt/container/azure-container-instance.rst index 86ce3ab9ef..be57ac50d1 100644 --- a/docs/guides/run_dbt/container/azure-container-instance.rst +++ b/docs/guides/run_dbt/container/azure-container-instance.rst @@ -1,16 +1,30 @@ .. 
_azure-container-instance: -Azure Container Instance Execution Mode + +Azure Container Instance execution mode ======================================= .. versionadded:: 1.4 -This tutorial will guide you through the steps required to use Azure Container Instance as the Execution Mode for your dbt code with Astronomer Cosmos. Schematically, the guide will walk you through the steps required to build the following architecture: +Using ``Azure Container Instances`` as the execution mode provides an isolated way of running ``dbt``, since the ``dbt`` run itself occurs within a container running in an Azure Container Instance. + +Performance and maintenance considerations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This execution mode requires you to have an Azure environment that can be used to run Azure Container Groups. Similar to the ``Docker`` and ``Kubernetes`` execution modes, a Docker container should be available that contains up-to-date ``dbt`` pipelines and profiles. + +Each task creates a new container on Azure, giving full isolation. This, however, comes at the cost of speed, as this separation of tasks introduces some overhead. + +Setup +~~~~~ + +This tutorial guides you through the steps required to use Azure Container Instance as the execution mode for your dbt code with Astronomer Cosmos. Schematically, the guide demonstrates how to build the following architecture: .. figure:: https://github.com/astronomer/astronomer-cosmos/raw/main/docs/_static/cosmos_aci_schematic.png :width: 800 Prerequisites -+++++++++++++ +~~~~~~~~~~~~~ + 1. Docker with docker daemon (Docker Desktop on MacOS). Follow the `Docker installation guide `_. 2. Airflow 3. Azure CLI (install guide here: `Azure CLI `_) @@ -28,7 +42,7 @@ More information on how to achieve 2-6 is detailed below. 
Note that the steps below will walk you through an example, for which the code can be found HERE Step-by-step guide -++++++++++++++++++ +~~~~~~~~~~~~~~~~~~ **Install Airflow and Cosmos** @@ -103,7 +117,7 @@ Take a read of the Dockerfile to understand what it does so that you could use i - The dags directory containing the `dbt project jaffle_shop `_ is added to the image - The dbt_project.yml is replaced with `postgres_profile_dbt_project.yml `_ which contains the profile key pointing to postgres_profile as profile creation is not handled at the moment for K8s operators like in local mode. -**Setup Airflow Connections** +**Set up Airflow Connections** Now you have the required Azure infrastructure, you still need to add configuration to Airflow to ensure the infrastructure can be used. You'll need 3 connections: 1. ``aci_db``: a Postgres connection to your Azure Postgres instance. @@ -112,7 +126,7 @@ Now you have the required Azure infrastructure, you still need to add configurat Check out the ``airflow-settings.yml`` file `here `_ for an example. If you are using Astro CLI, filling in the right values here will be enough for this to work. -**Setup and Trigger the DAG with Airflow** +**Set up and trigger the Dag with Airflow** Copy the dags directory from cosmos-example repo to your Airflow home diff --git a/docs/guides/run_dbt/container/docker.rst b/docs/guides/run_dbt/container/docker.rst index 0005914886..78686f6c76 100644 --- a/docs/guides/run_dbt/container/docker.rst +++ b/docs/guides/run_dbt/container/docker.rst @@ -1,28 +1,44 @@ .. _docker: -Docker Execution Mode -======================================== +Docker execution mode +===================== -The following tutorial illustrates how to run the Cosmos dbt Docker Operators and the required setup for them. +The ``docker`` approach assumes that you previously created a Docker image that contains all the ``dbt`` pipelines and a ``profiles.yml`` that you manage. 
+ +Performance and maintenance considerations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can have better environment isolation with ``docker`` than when using ``local`` or ``virtualenv`` modes, but this mode also requires more maintenance and has some performance tradeoffs, depending on your project configurations. + +Using ``docker`` requires that you ensure the Docker container you use has up-to-date files and you might potentially need to manage secrets in multiple places. Another challenge of working with ``docker`` occurs when the Airflow worker is already running in Docker, which can cause problems related to running `Docker in Docker `__. + +Also, the Docker execution mode approach can be significantly slower than ``virtualenv``, since it might require building the ``Docker`` container before executing dbt commands, which is slower than creating a Virtualenv with ``dbt-core``. + +Finally, if you run Airflow in a container - such as in an Astro deployment - you may encounter challenges when attempting to use Cosmos ``ExecutionMode.DOCKER``. Unless you have a strong reason to pursue this setup, we generally advise against running a container inside another container, as discussed in the article `Do not use Docker-in-Docker for CI `_. + + +Set up Docker execution mode +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following procedure illustrates how to run the Cosmos dbt Docker Operators and the required setup for them. Requirements -++++++++++++ +------------ -1. Docker with docker daemon (Docker Desktop on MacOS). Follow the `Docker installation guide `_. -2. Airflow -3. Astronomer-cosmos package containing the dbt Docker operators -4. Postgres docker container -5. Docker image built with required dbt project and dbt DAG -6. dbt DAG with dbt docker operators in the Airflow DAGs directory to run in Airflow +- Docker with Docker Desktop (Docker Desktop on MacOS) or equivalent (such as `Orbstack `__). Follow the `Docker installation guide `_. 
+ -More information on how to achieve 2-6 is detailed below. +The example steps below set up the following: -Step-by-step instructions -+++++++++++++++++++++++++ +- Airflow +- Astronomer-cosmos package containing the dbt Docker operators +- Postgres Docker container +- Docker image built with required dbt project and dbt DAG +- dbt DAG with dbt Docker operators in the Airflow DAGs directory to run in Airflow -**Install Airflow and Cosmos** +1. Install Airflow and Cosmos +----------------------------- -Create a python virtualenv, activate it, upgrade pip to the latest version and install `Apache Airflow® `_ & astronomer-postgres +Create a Python virtualenv, activate it, upgrade pip to the latest version, and install `Apache Airflow® `_ and ``astronomer-cosmos`` with the ``dbt-postgres`` extra: .. code-block:: bash @@ -32,26 +48,28 @@ Create a python virtualenv, activate it, upgrade pip to the latest version and i pip install apache-airflow pip install "astronomer-cosmos[dbt-postgres]" -**Setup Postgres database** +2. Set up Postgres database +--------------------------- -You will need a postgres database running to be used as the database for the dbt project. Run the following command to run and expose a postgres database +You will need a PostgreSQL database running to use as the database for the dbt project. Run the following command to run and expose a PostgreSQL database: .. code-block:: bash docker run --name some-postgres -e POSTGRES_PASSWORD="" -e POSTGRES_USER=postgres -e POSTGRES_DB=postgres -p5432:5432 -d postgres -**Build the dbt Docker image** +3. Build the dbt Docker image +----------------------------- -For the Docker operators to work, you need to create a docker image that will be supplied as image parameter to the dbt docker operators used in the DAG. +For the Docker operators to work, you need to create a Docker image that will be supplied as the ``image`` parameter to the dbt Docker operators used in the DAG. -Clone the `cosmos-example `_ repo +1. 
Clone the `cosmos-example `_ repo .. code-block:: bash git clone https://github.com/astronomer/cosmos-example.git cd cosmos-example -Create a docker image containing the dbt project files and dbt profile by using the `Dockerfile `_, which will be supplied to the Docker operators. +2. Create a Docker image containing the dbt project files and dbt profile by using the `Dockerfile `_, which will be supplied to the Docker operators. .. code-block:: bash @@ -59,7 +77,7 @@ Create a docker image containing the dbt project files and dbt profile by using .. note:: - If running on M1, you may need to set the following envvars for running the docker build command in case it fails + If the Docker build command fails on an Apple M1 machine, you may need to set the following environment variables. .. code-block:: bash @@ -67,21 +85,25 @@ Create a docker image containing the dbt project files and dbt profile by using export COMPOSE_DOCKER_CLI_BUILD=0 export DOCKER_DEFAULT_PLATFORM=linux/amd64 -Take a read of the Dockerfile to understand what it does so that you could use it as a reference in your project. +Read the example Dockerfile to understand what it does so that you can use it as a reference in your project: + +- The `dbt profile `_ file is added to the image. +- The ``dags`` directory containing the `dbt project jaffle_shop `_ is added to the image. +- The ``dbt_project.yml`` is replaced with `postgres_profile_dbt_project.yml `_, which contains the profile key pointing to ``postgres_profile`` because, unlike in local mode, profile creation is not handled for K8s operators. - - The `dbt profile `_ file is added to the image - - The dags directory containing the `dbt project jaffle_shop `_ is added to the image - - The dbt_project.yml is replaced with `postgres_profile_dbt_project.yml `_ which contains the profile key pointing to postgres_profile as profile creation is not handled at the moment for K8s operators like in local mode.
-**Setup and Trigger the DAG with Airflow** +4. Set up and trigger the Dag with Airflow +------------------------------------------ -Copy the dags directory from cosmos-example repo to your Airflow home +1. Copy the ``dags`` directory from the ``cosmos-example`` repo to your Airflow home .. code-block:: bash cp -r dags $AIRFLOW_HOME/ -Run Airflow +This directory contains a Docker-specific example Dag. + +2. Run Airflow .. code-block:: bash @@ -89,18 +111,17 @@ Run Airflow .. note:: - You might need to run airflow standalone with ``sudo`` if your Airflow user is not able to access the docker socket URL or pull the images in the Kind cluster. + You might need to run airflow standalone with ``sudo`` if your Airflow user is not able to access the Docker socket URL or pull the images in the ``Kind`` cluster. -Log in to Airflow through a web browser ``http://localhost:8080/``, using the user ``airflow`` and the password described in the ``standalone_admin_password.txt`` file. +3. Log in to Airflow through a web browser, ``http://localhost:8080/``, using the user ``airflow`` and the password described in the ``standalone_admin_password.txt`` file. -Enable and trigger a run of the `jaffle_shop_docker `_ DAG. You will be able to see the following successful DAG run. +4. Enable and trigger a run of the `jaffle_shop_docker `_ Dag. You can see the following successful Dag run example: .. figure:: https://github.com/astronomer/astronomer-cosmos/raw/main/docs/_static/jaffle_shop_docker_dag_run.png :width: 800 - Specifying ProfileConfig -+++++++++++++++++++++++++ +~~~~~~~~~~~~~~~~~~~~~~~~ Starting with Cosmos 1.8.0, you can use the ``profile_config`` argument in your Dbt DAG Docker operators to reference profiles for your dbt project defined in a profiles.yml file. To do so, provide the file’s path via the @@ -109,3 +130,8 @@ profiles for your dbt project defined in a profiles.yml file. 
To do so, provide Note that in ``ExecutionMode.DOCKER``, the ``profile_config`` is only compatible with the ``profiles_yml_path`` approach. The ``profile_mapping`` method will not work because the required Airflow connections cannot be accessed within the Docker container to map them to the dbt profile. + +Troubleshooting +~~~~~~~~~~~~~~~ + +If dbt is unavailable in the Airflow scheduler, the default ``LoadMode.DBT_LS`` will not work. In this scenario, you must use a :ref:`parsing-methods` that does not rely on dbt, such as ``LoadMode.MANIFEST``. diff --git a/docs/guides/run_dbt/container/gcp-cloud-run-job.rst b/docs/guides/run_dbt/container/gcp-cloud-run-job.rst index fa4d0c60c4..090f9aa395 100644 --- a/docs/guides/run_dbt/container/gcp-cloud-run-job.rst +++ b/docs/guides/run_dbt/container/gcp-cloud-run-job.rst @@ -1,9 +1,24 @@ .. _gcp-cloud-run-job: -GCP Cloud Run Job Execution Mode -======================================= +GCP Cloud Run Job execution mode +================================= .. versionadded:: 1.7 +The ``gcp_cloud_run_job`` execution mode is particularly useful if you prefer to run your ``dbt`` commands on Google Cloud infrastructure, taking advantage of Cloud Run's scalability, isolation, and managed service capabilities. + +Performance and maintenance considerations +++++++++++++++++++++++++++++++++++++++++++ + +For the ``gcp_cloud_run_job`` execution mode to work, a Cloud Run Job instance must first be created using a previously built Docker container. This container should include the latest ``dbt`` pipelines and profiles. You can find more details in the `Cloud Run Job creation guide `__. + +This execution mode allows you to run ``dbt`` core CLI commands in a Google Cloud Run Job instance. This mode leverages the ``CloudRunExecuteJobOperator`` from the Google Cloud Airflow provider to execute commands within a Cloud Run Job instance, where ``dbt`` is already installed.
Similarly to the ``Docker`` and ``Kubernetes`` execution modes, a Docker container should be available, containing the up-to-date ``dbt`` pipelines and profiles. + +Each task will create a new Cloud Run Job execution, giving full isolation. The separation of tasks adds extra overhead; however, that can be mitigated by using the ``concurrency`` parameter in ``DbtDag``, which will result in parallelized execution of ``dbt`` models. + + +Setup ++++++ + This tutorial will guide you through the steps required to use Cloud Run Job instance as the Execution Mode for your dbt code with Astronomer Cosmos. This guide will walk you through the steps required to build the following architecture: .. figure:: https://github.com/astronomer/astronomer-cosmos/raw/main/docs/_static/cosmos_gcp_crj_schematic.png diff --git a/docs/guides/run_dbt/container/index.rst b/docs/guides/run_dbt/container/index.rst index 8e1051dc35..a407c996b3 100644 --- a/docs/guides/run_dbt/container/index.rst +++ b/docs/guides/run_dbt/container/index.rst @@ -9,5 +9,6 @@ Run dbt in a container kubernetes watcher-kubernetes-execution-mode aws-container-run-job + aws-eks azure-container-instance gcp-cloud-run-job diff --git a/docs/guides/run_dbt/container/kubernetes.rst b/docs/guides/run_dbt/container/kubernetes.rst index 4ff074ec41..2c6f0c7c5d 100644 --- a/docs/guides/run_dbt/container/kubernetes.rst +++ b/docs/guides/run_dbt/container/kubernetes.rst @@ -1,7 +1,21 @@ .. _kubernetes: -Kubernetes Execution Mode -============================================== + +Kubernetes execution mode +========================== + +The ``kubernetes`` execution mode provides a very isolated method to run ``dbt`` from within a Kubernetes Pod, usually in a separate host. + +Performance and maintenance considerations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This execution mode assumes you have a Kubernetes cluster. 
It also expects you to ensure the Docker container has up-to-date ``dbt`` pipelines and profiles, potentially leading you to declare secrets in two places: Airflow and the Docker container. + +The ``Kubernetes`` deployment might be slower than ``Docker`` and ``Virtualenv``, assuming that the container image is built (which is slower than creating a Python ``virtualenv`` and installing ``dbt-core``) and the Airflow task needs to spin up a new ``Pod`` in Kubernetes. + + +Set up Kubernetes execution mode +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following tutorial illustrates how to run the Cosmos dbt Kubernetes Operator using a local Kubernetes (K8s) cluster. It assumes the following: @@ -9,7 +23,7 @@ The following tutorial illustrates how to run the Cosmos dbt Kubernetes Operator - Airflow is run locally, and it triggers a K8s Pod which runs dbt Requirements -++++++++++++ +~~~~~~~~~~~~ To test the DbtKubernetesOperators locally, we encourage you to install the following: @@ -34,7 +48,7 @@ For instance, :end-before: [END kubernetes_tg_example] Step-by-step instructions -+++++++++++++++++++++++++ +~~~~~~~~~~~~~~~~~~~~~~~~~~ Using installed `Kind `_, you can setup a local kubernetes cluster @@ -153,7 +167,7 @@ Enable and trigger a run of the `jaffle_shop_k8s `__ -10. **watcher**: (experimental since Cosmos 1.11.0) Run a single ``dbt build`` command from a producer task and have sensor tasks to watch the progress of the producer, with improved DAG run time while maintaining the tasks lineage in the Airflow UI, and ability to retry failed tasks. Check the :ref:`watcher-execution-mode` for more details. -11. **watcher_kubernetes**: (experimental since Cosmos 1.13.0) Combines the speed of the watcher execution mode with the isolation of Kubernetes. Check the :ref:`watcher-kubernetes-execution-mode` for more details. +There are two categories of execution modes: -The choice of the ``execution mode`` can vary based on each user's needs and concerns.
For more details, check each execution mode described below. +1. **Execute dbt commands on the Airflow worker.** These execution modes offer faster execution times, since no extra container needs to be spun up. However, they offer little or no environment isolation. There are four options for this type of execution mode: ``watcher``, ``local``, ``virtualenv``, and ``airflow_async``. ``airflow_async`` is available for BigQuery as of Cosmos 1.9 and ``watcher`` is available as of Cosmos 1.11. + +2. **Execute dbt commands in a container.** This type of execution mode offers high levels of environment isolation and also allows you to run dbt from either containers or external jobs, in both on-premises environments and various cloud services. There are multiple options for this type of execution mode: ``docker``, ``kubernetes``, ``watcher_kubernetes``, ``aws_ecs``, ``aws_eks``, ``azure_container_instance``, and ``gcp_cloud_run_job``. + + +On the Airflow worker +~~~~~~~~~~~~~~~~~~~~~~ + +These execution modes offer faster execution times, since you don't need to spin up any extra containers. You can also use Airflow connections via the ``ProfileConfig``. However, these execution modes offer little or no environment isolation. + +There are four execution mode options that run on the Airflow worker: + +- :ref:`local `: Default execution mode in Cosmos 1.x. In this mode, a dbt command is executed for each dbt node (one dbt command per Airflow task). The dbt project is reparsed during every task execution. By default, each task runs a single dbt node and relies on a user-preinstalled dbt. This mode can operate in two ways: + + - **No isolation**: dbt is installed in the same Python virtual environment as Airflow. In this case, Cosmos can invoke dbt commands as a library rather than as a subprocess, leading to performance gains.
+ - **Partial isolation**: Create a dedicated Python virtual environment in the Airflow deployment, install dbt there, and configure Cosmos to use it by setting ``ExecutionConfig.dbt_executable_path``. This provides a good solution for dependency conflicts. + +- :ref:`watcher `: (Experimental since Cosmos 1.11.0) Optimized for execution speed. Run a single ``dbt build`` command from a producer task and have sensor tasks to watch the progress of the producer, with improved DAG run time while maintaining the tasks lineage in the Airflow UI, and ability to retry failed tasks. +- :ref:`virtualenv `: Runs dbt commands from Python virtual environments created and managed by Cosmos. This mode removes the need to create a virtual environment at build time, unlike ``ExecutionMode.LOCAL``, while avoiding package conflicts. It is intended for cases where: + + - **Can't install dbt directly**: If you can't install dbt in the Airflow environment (either in the same environment or a dedicated one). + - **Multiple dbt installations are required**: If you require multiple dbt installations and you prefer Cosmos to manage them without modifying the Airflow deployment. + + In most cases, the local execution mode with ``ExecutionConfig.dbt_executable_path`` is the preferred option instead of ``virtualenv``, because local mode with ``ExecutionConfig.dbt_executable_path`` allows you to manage the dbt environment while keeping the Airflow deployment simpler. + +- :ref:`airflow_async `: (Stable since Cosmos 1.9.0) Optimized for worker efficiency if you have long-running dbt commands. Currently only works with BigQuery. Pre-compile the SQL transformations with dbt in a setup task and execute them asynchronously using Apache Airflow's `Deferrable operators `__. + +In a container +~~~~~~~~~~~~~~ + +You can also execute dbt commands in a container. Choosing these kinds of execution modes provides a high degree of isolation. 
However, they come with limitations: you can only define connections through the dbt ``profiles.yml`` file, and run times are slower because of container provisioning. They all also require a pre-existing Docker image. + +- :ref:`docker `: Run ``dbt`` commands via Docker containers inside the Airflow worker node. +- :ref:`kubernetes `: Run ``dbt`` commands within Kubernetes Pods managed by Cosmos. +- :ref:`watcher_kubernetes `: (Experimental since Cosmos 1.13.0) Combines the speed of the watcher execution mode with the isolation of Kubernetes. +- :ref:`aws_ecs `: Run ``dbt`` commands in containers via AWS ECS. +- :ref:`aws_eks `: Run ``dbt`` commands via Kubernetes Pods in AWS EKS. +- :ref:`azure_container_instance `: Run ``dbt`` commands in Azure Container Instances. +- :ref:`gcp_cloud_run_job `: Run ``dbt`` commands via a container managed by GCP Cloud Run Job. .. _execution-modes-comparison: +Execution modes comparison +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The type of execution mode that you choose directly affects how fast your Cosmos Dag runs. + ..
list-table:: Execution Modes Comparison :widths: 25 25 25 25 :header-rows: 1 @@ -31,12 +64,20 @@ The choice of the ``execution mode`` can vary based on each user's needs and con - Cosmos Profile Management * - Local - Fast - - None + - None/Lightweight + - Yes + * - Watcher + - Very Fast + - None/Lightweight - Yes * - Virtualenv - Medium - Lightweight - Yes + * - Airflow Async + - Very Fast + - Lightweight/Medium + - Yes * - Docker - Slow - Medium @@ -45,330 +86,23 @@ The choice of the ``execution mode`` can vary based on each user's needs and con - Slow - High - No - * - AWS_EKS - - Slow + * - Watcher Kubernetes + - Fast - High - No - * - Azure Container Instance + * - AWS ECS - Slow - High - No - * - GCP Cloud Run Job Instance + * - AWS_EKS - Slow - High - No - * - AWS ECS + * - Azure Container Instance - Slow - High - No - * - Airflow Async - - Very Fast - - Medium - - Yes - * - Watcher - - Very Fast - - None - - Yes - * - Watcher Kubernetes - - Fast + * - GCP Cloud Run Job Instance + - Slow - High - No - -Local ------ - -By default, Cosmos uses the ``local`` execution mode. - -The ``local`` execution mode is the fastest way to run Cosmos operators since they don't install ``dbt`` nor build docker containers. However, it may not be an option for users using managed Airflow services such as -Google Cloud Composer, since Airflow and ``dbt`` dependencies can conflict (:ref:`execution-modes-local-conflicts`), the user may not be able to install ``dbt`` in a custom path. - -The ``local`` execution mode assumes a ``dbt`` binary is reachable within the Airflow worker node. - -If ``dbt`` was not installed as part of the Cosmos packages, -users can define a custom path to ``dbt`` by declaring the argument ``dbt_executable_path``. - -.. note:: - Starting in the 1.4 version, Cosmos tries to leverage the dbt partial parsing (``partial_parse.msgpack``) to speed up task execution. - This feature is bound to `dbt partial parsing limitations `_. 
- Learn more: :ref:`partial-parsing`. - -When using the ``local`` execution mode, Cosmos converts Airflow Connections into a native ``dbt`` profiles file (``profiles.yml``). - -Example of how to use, for instance, when ``dbt`` was installed together with Cosmos: - -.. literalinclude:: ../../../dev/dags/basic_cosmos_dag.py - :language: python - :start-after: [START local_example] - :end-before: [END local_example] - - -Virtualenv ----------- - -If you're using managed Airflow on GCP (Cloud Composer), for instance, we recommend you use the ``virtualenv`` execution mode. - -The ``virtualenv`` mode isolates the Airflow worker dependencies from ``dbt`` by managing a Python virtual environment created during task execution and deleted afterwards. - -In this case, users are responsible for declaring which version of ``dbt`` they want to use by giving the argument ``py_requirements``. This argument can be set directly in operator instances or when instantiating ``DbtDag`` and ``DbtTaskGroup`` as part of ``operator_args``. - -Similar to the ``local`` execution mode, Cosmos converts Airflow Connections into a way ``dbt`` understands them by creating a ``dbt`` profile file (``profiles.yml``). -Also similar to the ``local`` execution mode, Cosmos will by default attempt to use a ``partial_parse.msgpack`` if one exists to speed up parsing. - -Some drawbacks of this approach: - -- It is slower than ``local`` because it creates a new Python virtual environment for each Cosmos dbt task run. -- If dbt is unavailable in the Airflow scheduler, the default ``LoadMode.DBT_LS`` will not work. In this scenario, users must use a :ref:`parsing-methods` that does not rely on dbt, such as ``LoadMode.MANIFEST``. -- Only ``InvocationMode.SUBPROCESS`` is supported currently, attempt to use ``InvocationMode.DBT_RUNNER`` will raise error. - -Example of how to use: - -.. 
literalinclude:: ../../../dev/dags/example_virtualenv.py - :language: python - :start-after: [START virtualenv_example] - :end-before: [END virtualenv_example] - -Docker ------- - -The ``docker`` approach assumes users have a previously created Docker image, which should contain all the ``dbt`` pipelines and a ``profiles.yml``, managed by the user. - -The user has better environment isolation than when using ``local`` or ``virtualenv`` modes, but also more responsibility (ensuring the Docker container used has up-to-date files and managing secrets potentially in multiple places). - -The other challenge with the ``docker`` approach is if the Airflow worker is already running in Docker, which sometimes can lead to challenges running `Docker in Docker `__. - -This approach can be significantly slower than ``virtualenv`` since it may have to build the ``Docker`` container, which is slower than creating a Virtualenv with ``dbt-core``. -If dbt is unavailable in the Airflow scheduler, the default ``LoadMode.DBT_LS`` will not work. In this scenario, users must use a :ref:`parsing-methods` that does not rely on dbt, such as ``LoadMode.MANIFEST``. - -Check the step-by-step guide on using the ``docker`` execution mode at :ref:`docker`. - -Example DAG: - -.. code-block:: python - - docker_cosmos_dag = DbtDag( - # ... - execution_config=ExecutionConfig( - execution_mode=ExecutionMode.DOCKER, - ), - operator_args={ - "image": "dbt-jaffle-shop:1.0.0", - "network_mode": "bridge", - }, - ) - - -Kubernetes ----------- - -The ``kubernetes`` approach is a very isolated way of running ``dbt`` since the ``dbt`` run commands from within a Kubernetes Pod, usually in a separate host. - -It assumes the user has a Kubernetes cluster. It also expects the user to ensure the Docker container has up-to-date ``dbt`` pipelines and profiles, potentially leading the user to declare secrets in two places (Airflow and Docker container). 
- -The ``Kubernetes`` deployment may be slower than ``Docker`` and ``Virtualenv`` assuming that the container image is built (which is slower than creating a Python ``virtualenv`` and installing ``dbt-core``) and the Airflow task needs to spin up a new ``Pod`` in Kubernetes. - -Check the step-by-step guide on using the ``kubernetes`` execution mode at :ref:`kubernetes`. - -Example DAG: - -.. literalinclude:: ../../../dev/dags/jaffle_shop_kubernetes.py - :language: python - :start-after: [START kubernetes_seed_example] - :end-before: [END kubernetes_seed_example] - -AWS_EKS ----------- - -The ``aws_eks`` approach is very similar to the ``kubernetes`` approach, but it is specifically designed to run on AWS EKS clusters. -It uses the `EKSPodOperator `_ -to run the dbt commands. You need to provide the ``cluster_name`` in your operator_args to connect to the AWS EKS cluster. - - -Example DAG: - -.. code-block:: python - - postgres_password_secret = Secret( - deploy_type="env", - deploy_target="POSTGRES_PASSWORD", - secret="postgres-secrets", - key="password", - ) - - docker_cosmos_dag = DbtDag( - # ... - execution_config=ExecutionConfig( - execution_mode=ExecutionMode.AWS_EKS, - ), - operator_args={ - "image": "dbt-jaffle-shop:1.0.0", - "cluster_name": CLUSTER_NAME, - "get_logs": True, - "is_delete_operator_pod": False, - "secrets": [postgres_password_secret], - }, - ) - -Azure Container Instance ------------------------- -.. versionadded:: 1.4 - -Similar to the ``kubernetes`` approach, using ``Azure Container Instances`` as the execution mode gives a very isolated way of running ``dbt``, since the ``dbt`` run itself is run within a container running in an Azure Container Instance. - -This execution mode requires the user has an Azure environment that can be used to run Azure Container Groups in (see :ref:`azure-container-instance` for more details on the exact requirements). 
Similarly to the ``Docker`` and ``Kubernetes`` execution modes, a Docker container should be available, containing the up-to-date ``dbt`` pipelines and profiles. - -Each task will create a new container on Azure, giving full isolation. This, however, comes at the cost of speed, as this separation of tasks introduces some overhead. Please checkout the step-by-step guide for using Azure Container Instance as the execution mode - - -.. code-block:: python - - docker_cosmos_dag = DbtDag( - # ... - execution_config=ExecutionConfig( - execution_mode=ExecutionMode.AZURE_CONTAINER_INSTANCE - ), - operator_args={ - "ci_conn_id": "aci", - "registry_conn_id": "acr", - "resource_group": "my-rg", - "name": "my-aci-{{ ti.task_id.replace('.','-').replace('_','-') }}", - "region": "West Europe", - "image": "dbt-jaffle-shop:1.0.0", - }, - ) - -GCP Cloud Run Job ------------------------- -.. versionadded:: 1.7 - -The ``gcp_cloud_run_job`` execution mode is particularly useful for users who prefer to run their ``dbt`` commands on Google Cloud infrastructure, taking advantage of Cloud Run's scalability, isolation, and managed service capabilities. - -For the ``gcp_cloud_run_job`` execution mode to work, a Cloud Run Job instance must first be created using a previously built Docker container. This container should include the latest ``dbt`` pipelines and profiles. You can find more details in the `Cloud Run Job creation guide `__ . - -This execution mode allows users to run ``dbt`` core CLI commands in a Google Cloud Run Job instance. This mode leverages the ``CloudRunExecuteJobOperator`` from the Google Cloud Airflow provider to execute commands within a Cloud Run Job instance, where ``dbt`` is already installed. Similarly to the ``Docker`` and ``Kubernetes`` execution modes, a Docker container should be available, containing the up-to-date ``dbt`` pipelines and profiles. - -Each task will create a new Cloud Run Job execution, giving full isolation. 
The separation of tasks adds extra overhead; however, that can be mitigated by using the ``concurrency`` parameter in ``DbtDag``, which will result in parallelized execution of ``dbt`` models. - - -.. code-block:: python - - gcp_cloud_run_job_cosmos_dag = DbtDag( - # ... - execution_config=ExecutionConfig(execution_mode=ExecutionMode.GCP_CLOUD_RUN_JOB), - operator_args={ - "project_id": "my-gcp-project-id", - "region": "europe-west1", - "job_name": "my-crj-{{ ti.task_id.replace('.','-').replace('_','-') }}", - }, - ) - - -AWS ECS ---------- -.. versionadded:: 1.9.0 - -Using ``AWS Elastic Container Service (ECS)`` as the execution mode provides an isolated and scalable way to run ``dbt`` tasks within an AWS ECS service. This execution mode ensures that each ``dbt`` run is performed inside a dedicated container running in an ECS task. - -This execution mode requires the user to have an AWS environment configured to run ECS tasks (see :ref:``aws-ecs`` for more details on the exact requirements). Similar to the ``Docker`` and ``Kubernetes`` execution modes, a Docker container should be available, containing the up-to-date ``dbt`` pipelines and profiles. - -Each task will create a new ECS task execution, providing full isolation. However, this separation introduces some overhead in execution time due to container startup and provisioning. For users who require faster execution times, configuring appropriate ECS task definitions and cluster optimizations can help mitigate these delays. - -Please refer to the step-by-step guide for using AWS ECS as the execution mode. - -.. code-block:: python - - aws_ecs_cosmos_dag = DbtDag( - # ... 
- execution_config=ExecutionConfig(execution_mode=ExecutionMode.AWS_ECS), - operator_args={ - "aws_conn_id": "aws_default", - "cluster": "my-ecs-cluster", - "task_definition": "my-dbt-task", - "container_name": "dbt-container", - "launch_type": "FARGATE", - "deferrable": True, - "network_configuration": { - "awsvpcConfiguration": { - "subnets": ["<<>>"], - "assignPublicIp": "ENABLED", - }, - }, - "environment_variables": {"DBT_PROFILE_NAME": "default"}, - }, - ) - -.. _airflow-async-execution-mode: - -Airflow Async -------------- - -.. versionadded:: 1.9.0 - -Although this execution mode was introduced in Cosmos 1.9, we strongly encourage users to use Cosmos 1.11, which has significant performance improvements. -In comparison to the ``local``, the ``airflow_async`` execution mode can reduce the execution time of a dbt project by up to 36%. - -The ``airflow_async`` execution mode is a way to run the dbt resources from your dbt project using Apache Airflow's -`Deferrable operators `__. -This execution mode could be preferred when you've long running resources and you want to run them asynchronously by -leveraging Airflow's deferrable operators. With that, you would be able to potentially observe higher throughput of tasks -as more dbt nodes will be run in parallel since they won't be blocking Airflow's worker slots. - -Example DAG: - -.. literalinclude:: ../../../dev/dags/simple_dag_async.py - :language: python - :start-after: [START airflow_async_execution_mode_example] - :end-before: [END airflow_async_execution_mode_example] - -For a full step-by-step guide and limitations, check the :ref:`async-execution-mode` page. - - -Watcher Execution Mode (Experimental) -------------------------------------- - -.. versionadded:: 1.11.0 - -The ``watcher`` execution mode is an experimental execution mode that runs a single ``dbt build`` command from a producer task and has sensor tasks to watch the progress of the producer. 
-It is designed to improve DAG run time while maintaining the tasks lineage in the Airflow UI, and ability to retry failed tasks. - -Check the :ref:`watcher-execution-mode` for more details. - - -Watcher Kubernetes Execution Mode (Experimental) ------------------------------------------------- - -.. versionadded:: 1.13.0 - -The ``watcher_kubernetes`` execution mode combines the speed of the ``watcher`` execution mode with the isolation of the ``kubernetes`` execution mode. It runs a single ``dbt build`` command from a producer task inside a Kubernetes pod and has sensor tasks to watch the progress of the producer. - -Check the :ref:`watcher-kubernetes-execution-mode` for more details. - - -.. _invocation_modes: - -Invocation Modes -================ -.. versionadded:: 1.4 - -For ``ExecutionMode.LOCAL`` execution mode, Cosmos supports two invocation modes for running dbt: - -1. ``InvocationMode.SUBPROCESS``: In this mode, Cosmos runs dbt cli commands using the Python ``subprocess`` module and parses the output to capture logs and to raise exceptions. - -2. ``InvocationMode.DBT_RUNNER``: In this mode, Cosmos uses the ``dbtRunner`` available for `dbt programmatic invocations `__ to run dbt commands. \ - In order to use this mode, dbt must be installed in the same local environment. This mode does not have the overhead of spawning new subprocesses or parsing the output of dbt commands and is faster than ``InvocationMode.SUBPROCESS``. \ - This mode requires dbt version 1.5.0 or higher. It is up to the user to resolve :ref:`execution-modes-local-conflicts` when using this mode. - -The invocation mode can be set in the ``ExecutionConfig`` as shown below: - -.. code-block:: python - - from cosmos.constants import InvocationMode - - dag = DbtDag( - # ... 
- execution_config=ExecutionConfig( - execution_mode=ExecutionMode.LOCAL, - invocation_mode=InvocationMode.DBT_RUNNER, - ), - ) - -If the invocation mode is not set, Cosmos will attempt to use ``InvocationMode.DBT_RUNNER`` if dbt is installed in the same environment as the worker, otherwise it will fall back to ``InvocationMode.SUBPROCESS``. diff --git a/docs/optimize_performance/index.rst b/docs/optimize_performance/index.rst index 6ce9e2da20..97270b4631 100644 --- a/docs/optimize_performance/index.rst +++ b/docs/optimize_performance/index.rst @@ -9,3 +9,4 @@ Optimize the performance of your Cosmos Dags memory_optimization caching + invocation_mode diff --git a/docs/optimize_performance/invocation_mode.rst b/docs/optimize_performance/invocation_mode.rst new file mode 100644 index 0000000000..f32c7a32ac --- /dev/null +++ b/docs/optimize_performance/invocation_mode.rst @@ -0,0 +1,29 @@ +.. _invocation-mode: + +Invocation Modes +================ +.. versionadded:: 1.4 + +For ``ExecutionMode.LOCAL`` execution mode, Cosmos supports two invocation modes for running dbt: + +1. ``InvocationMode.SUBPROCESS``: In this mode, Cosmos runs dbt cli commands using the Python ``subprocess`` module and parses the output to capture logs and to raise exceptions. + +2. ``InvocationMode.DBT_RUNNER``: In this mode, Cosmos uses the ``dbtRunner`` available for `dbt programmatic invocations `__ to run dbt commands. \ + In order to use this mode, dbt must be installed in the same local environment. This mode does not have the overhead of spawning new subprocesses or parsing the output of dbt commands and is faster than ``InvocationMode.SUBPROCESS``. \ + This mode requires dbt version 1.5.0 or higher. It is up to the user to resolve :ref:`execution-modes-local-conflicts` when using this mode. + +The invocation mode can be set in the ``ExecutionConfig`` as shown below: + +.. code-block:: python + + from cosmos.constants import InvocationMode + + dag = DbtDag( + # ... 
+ execution_config=ExecutionConfig( + execution_mode=ExecutionMode.LOCAL, + invocation_mode=InvocationMode.DBT_RUNNER, + ), + ) + +If the invocation mode is not set, Cosmos will attempt to use ``InvocationMode.DBT_RUNNER`` if dbt is installed in the same environment as the worker, otherwise it will fall back to ``InvocationMode.SUBPROCESS``. diff --git a/docs/reference/configs/execution-config.rst b/docs/reference/configs/execution-config.rst index 0ee5199fd9..cdaf3abcb1 100644 --- a/docs/reference/configs/execution-config.rst +++ b/docs/reference/configs/execution-config.rst @@ -6,8 +6,8 @@ It does this by exposing a ``cosmos.config.ExecutionConfig`` class that you can The ``ExecutionConfig`` class takes the following arguments: -- ``execution_mode``: The way dbt is run when executing within airflow. For more information, see the `execution modes <../getting_started/execution-modes.html>`_ page. -- ``invocation_mode`` (new in v1.4): The way dbt is invoked within the execution mode. This is only configurable for ``ExecutionMode.LOCAL``. For more information, see `invocation modes <../getting_started/execution-modes.html#invocation-modes>`_. +- ``execution_mode``: The way dbt is run when executing within airflow. For more information, see the :ref:`execution-modes` page. +- ``invocation_mode`` (new in v1.4): The way dbt is invoked within the execution mode. This is only configurable for ``ExecutionMode.LOCAL``. For more information, see :ref:`invocation-mode`. - ``test_indirect_selection``: The mode to configure the test behavior when performing indirect selection. - ``dbt_executable_path``: The path to the dbt executable for dag generation. Defaults to dbt if available on the path. - ``dbt_project_path``: Configures the dbt project location accessible at runtime for dag execution. This is the project path in a docker container for ``ExecutionMode.DOCKER`` or ``ExecutionMode.KUBERNETES``. Mutually exclusive with ``ProjectConfig.dbt_project_path``.
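The fallback rule stated above for ``invocation_mode`` (use ``InvocationMode.DBT_RUNNER`` when dbt is installed in the same environment as the worker, otherwise ``InvocationMode.SUBPROCESS``) can be sketched in plain Python. This is only an illustration under assumptions, not the Cosmos implementation: the ``InvocationMode`` enum below is a local stand-in for ``cosmos.constants.InvocationMode``, and ``pick_invocation_mode`` and its ``module_name`` parameter are hypothetical names introduced for this example.

```python
import importlib.util
from enum import Enum


class InvocationMode(Enum):
    """Local stand-in for cosmos.constants.InvocationMode (illustrative only)."""

    SUBPROCESS = "subprocess"
    DBT_RUNNER = "dbt_runner"


def pick_invocation_mode(explicit=None, module_name="dbt"):
    """Hypothetical helper: return the explicit mode if one is given,
    otherwise choose based on whether `module_name` is importable."""
    if explicit is not None:
        # An explicitly configured invocation mode always wins.
        return explicit
    try:
        importable = importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for broken or partially initialised modules.
        importable = False
    # Importable dbt -> call it as a library; otherwise shell out.
    return InvocationMode.DBT_RUNNER if importable else InvocationMode.SUBPROCESS


if __name__ == "__main__":
    print(pick_invocation_mode())
```

Run in the Airflow worker's environment, a check like this indicates which branch of the fallback applies there; setting ``invocation_mode`` explicitly in ``ExecutionConfig`` removes the need for the guess.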