diff --git a/docs/_static/execution_mode_map.png b/docs/_static/execution_mode_map.png new file mode 100644 index 0000000000..97bb1f546d Binary files /dev/null and b/docs/_static/execution_mode_map.png differ diff --git a/docs/conf.py b/docs/conf.py index 1f20af6190..07d7aba416 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -53,3 +53,25 @@ } generate_mapping_docs() + +# -- Begin docs redirect section +# -- To test redirects in a local build, paste the redirect source, and append .html to the end. +# -- For example, "airflow3_compatibility/index" redirect must be tested using "airflow3_compatibility/index.html" +# -- https://documatt.com/sphinx-reredirects/usage/ +redirects = { + "configuration/caching": "../optimize_performance/caching.html", + "configuration/memory_optimization": "../optimize_performance/memory_optimization.html", + "configuration/partial-parsing": "../optimize_performance/partial-parsing.html", + "configuration/selecting-excluding": "../optimize_performance/selecting-excluding.html", + "getting_started/async-execution-mode": "../guides/run_dbt/airflow-worker/async-execution-mode.html", + "getting_started/aws-container-run-job": "../guides/run_dbt/container/aws-container-run-job.html", + "getting_started/azure-container-instance": "../guides/run_dbt/container/azure-container-instance.html", + "getting_started/custom-airflow-properties": "../guides/run_dbt/airflow-worker/custom-airflow-properties.html", + "getting_started/docker": "../guides/run_dbt/container/docker.html", + "getting_started/execution-modes-local-conflicts": "../reference/troubleshooting/execution-modes-local-conflicts.html", + "getting_started/gcp-cloud-run-job": "../guides/run_dbt/container/gcp-cloud-run-job.html", + "getting_started/kubernetes": "../guides/run_dbt/container/kubernetes.html", + "getting_started/operators": "../guides/run_dbt/operators/operators.html", + "getting_started/watcher-execution-mode": "../guides/run_dbt/airflow-worker/watcher-execution-mode.html", + 
"getting_started/watcher-kubernetes-execution-mode": "../guides/run_dbt/container/watcher-kubernetes-execution-mode.html", +} diff --git a/docs/configuration/index.rst b/docs/configuration/index.rst deleted file mode 100644 index a6042327b0..0000000000 --- a/docs/configuration/index.rst +++ /dev/null @@ -1,36 +0,0 @@ -.. _configuration: - -Configuration -============= - -Cosmos offers a number of configuration options to customize its behavior. For more info, check out the links on the left or the table of contents below. - -.. toctree:: - :caption: Contents: - - dbt Fusion - Multi-Project Setups - - Project Config - Profile Config - Execution Config - Render Config - - Parsing Methods - Configuring in Airflow - Configuring Lineage - Generating Docs - Hosting Docs - Scheduling - Testing Behavior - Selecting & Excluding - Partial Parsing - Source Nodes Rendering - Post-rendering DAG customization - Operator Args - Compiled SQL - Logging - Caching - Task display name - Callbacks - Memory Optimization diff --git a/docs/contributing.rst b/docs/contributing.rst index 006149faac..d50c120398 100644 --- a/docs/contributing.rst +++ b/docs/contributing.rst @@ -155,17 +155,19 @@ To run the checks manually, run: Writing Docs ____________ -`Hatch `_ is a unified command-line tool for managing dependencies and environment isolation for Python developers. In Cosmos, we use a Hatchto declare the dependencies required for the project itself, as well as for tests and documentation builds. +`Hatch `_ is a unified command-line tool for managing dependencies and environment isolation for Python developers. In Cosmos, we use a Hatch to declare the dependencies required for the project itself, as well as for tests and documentation builds. If you don’t already have Hatch installed, please `install it `_ before proceeding. As an example, on macOS, you can do so with: .. code-block:: bash + brew install hatch You can run the docs locally by running the following: .. 
code-block:: bash + hatch run docs:serve diff --git a/docs/getting_started/astro.rst b/docs/getting_started/astro.rst index b590575f2e..56e9fa0d53 100644 --- a/docs/getting_started/astro.rst +++ b/docs/getting_started/astro.rst @@ -1,7 +1,7 @@ .. _astro: -Getting Started on Astro -======================== +Getting Started with Cosmos on Astro +==================================== While it is possible to use Cosmos on Astro with all :ref:`Execution Modes `, we recommend using the ``local`` execution mode. It's the simplest to set up and use. diff --git a/docs/getting_started/dbt-airflow-concepts.rst b/docs/getting_started/dbt-airflow-concepts.rst index 70c4feae8d..ee55abe694 100644 --- a/docs/getting_started/dbt-airflow-concepts.rst +++ b/docs/getting_started/dbt-airflow-concepts.rst @@ -1,7 +1,7 @@ .. _dbt-airflow-concepts: -Similar dbt & Airflow concepts -============================== +Similar dbt and Airflow concepts +================================ While dbt is an open source tool for data transformations and analysis, using SQL, Airflow focuses on being a platform for the development, scheduling and monitoring of batch-oriented workflows, using Python. Although both tools have many diff --git a/docs/getting_started/docker.rst b/docs/getting_started/docker.rst deleted file mode 100644 index 0005914886..0000000000 --- a/docs/getting_started/docker.rst +++ /dev/null @@ -1,111 +0,0 @@ -.. _docker: - -Docker Execution Mode -======================================== - -The following tutorial illustrates how to run the Cosmos dbt Docker Operators and the required setup for them. - -Requirements -++++++++++++ - -1. Docker with docker daemon (Docker Desktop on MacOS). Follow the `Docker installation guide `_. -2. Airflow -3. Astronomer-cosmos package containing the dbt Docker operators -4. Postgres docker container -5. Docker image built with required dbt project and dbt DAG -6. 
dbt DAG with dbt docker operators in the Airflow DAGs directory to run in Airflow - -More information on how to achieve 2-6 is detailed below. - -Step-by-step instructions -+++++++++++++++++++++++++ - -**Install Airflow and Cosmos** - -Create a python virtualenv, activate it, upgrade pip to the latest version and install `Apache Airflow® `_ & astronomer-postgres - -.. code-block:: bash - - python -m venv venv - source venv/bin/activate - pip install --upgrade pip - pip install apache-airflow - pip install "astronomer-cosmos[dbt-postgres]" - -**Setup Postgres database** - -You will need a postgres database running to be used as the database for the dbt project. Run the following command to run and expose a postgres database - -.. code-block:: bash - - docker run --name some-postgres -e POSTGRES_PASSWORD="" -e POSTGRES_USER=postgres -e POSTGRES_DB=postgres -p5432:5432 -d postgres - -**Build the dbt Docker image** - -For the Docker operators to work, you need to create a docker image that will be supplied as image parameter to the dbt docker operators used in the DAG. - -Clone the `cosmos-example `_ repo - -.. code-block:: bash - - git clone https://github.com/astronomer/cosmos-example.git - cd cosmos-example - -Create a docker image containing the dbt project files and dbt profile by using the `Dockerfile `_, which will be supplied to the Docker operators. - -.. code-block:: bash - - docker build -t dbt-jaffle-shop:1.0.0 -f Dockerfile.postgres_profile_docker_k8s . - -.. note:: - - If running on M1, you may need to set the following envvars for running the docker build command in case it fails - - .. code-block:: bash - - export DOCKER_BUILDKIT=0 - export COMPOSE_DOCKER_CLI_BUILD=0 - export DOCKER_DEFAULT_PLATFORM=linux/amd64 - -Take a read of the Dockerfile to understand what it does so that you could use it as a reference in your project. 
- - - The `dbt profile `_ file is added to the image - - The dags directory containing the `dbt project jaffle_shop `_ is added to the image - - The dbt_project.yml is replaced with `postgres_profile_dbt_project.yml `_ which contains the profile key pointing to postgres_profile as profile creation is not handled at the moment for K8s operators like in local mode. - -**Setup and Trigger the DAG with Airflow** - -Copy the dags directory from cosmos-example repo to your Airflow home - -.. code-block:: bash - - cp -r dags $AIRFLOW_HOME/ - -Run Airflow - -.. code-block:: bash - - airflow standalone - -.. note:: - - You might need to run airflow standalone with ``sudo`` if your Airflow user is not able to access the docker socket URL or pull the images in the Kind cluster. - -Log in to Airflow through a web browser ``http://localhost:8080/``, using the user ``airflow`` and the password described in the ``standalone_admin_password.txt`` file. - -Enable and trigger a run of the `jaffle_shop_docker `_ DAG. You will be able to see the following successful DAG run. - -.. figure:: https://github.com/astronomer/astronomer-cosmos/raw/main/docs/_static/jaffle_shop_docker_dag_run.png - :width: 800 - - -Specifying ProfileConfig -+++++++++++++++++++++++++ - -Starting with Cosmos 1.8.0, you can use the ``profile_config`` argument in your Dbt DAG Docker operators to reference -profiles for your dbt project defined in a profiles.yml file. To do so, provide the file’s path via the -``profiles_yml_path`` parameter in ``profile_config``. - -Note that in ``ExecutionMode.DOCKER``, the ``profile_config`` is only compatible with the ``profiles_yml_path`` -approach. The ``profile_mapping`` method will not work because the required Airflow connections cannot be accessed -within the Docker container to map them to the dbt profile. 
diff --git a/docs/getting_started/execution-modes.rst b/docs/getting_started/execution-modes.rst index ea6a03f283..580a250cbf 100644 --- a/docs/getting_started/execution-modes.rst +++ b/docs/getting_started/execution-modes.rst @@ -1,26 +1,56 @@ .. _execution-modes: -Execution Modes -=============== +Choose an execution mode +======================== -Cosmos can run ``dbt`` commands using several different approaches, called ``execution modes``: +The ``ExecutionConfig`` defines your execution mode, which determines where and how dbt commands run in Cosmos. -1. **local**: Run ``dbt`` commands using a local ``dbt`` installation (default) -2. **virtualenv**: Run ``dbt`` commands from Python virtual environments managed by Cosmos -3. **docker**: Run ``dbt`` commands from Docker containers managed by Cosmos (requires a pre-existing Docker image) -4. **kubernetes**: Run ``dbt`` commands from Kubernetes Pods managed by Cosmos (requires a pre-existing Docker image) -5. **aws_eks**: Run ``dbt`` commands from AWS EKS Pods managed by Cosmos (requires a pre-existing Docker image) -6. **azure_container_instance**: Run ``dbt`` commands from Azure Container Instances managed by Cosmos (requires a pre-existing Docker image) -7. **gcp_cloud_run_job**: Run ``dbt`` commands from GCP Cloud Run Job instances managed by Cosmos (requires a pre-existing Docker image) -8. **aws_ecs**: Run ``dbt`` commands from AWS ECS instances managed by Cosmos (requires a pre-existing Docker image) -9. **airflow_async**: (stable since Cosmos 1.9.0) Run the dbt resources from your dbt project asynchronously, by submitting the corresponding compiled SQLs to Apache Airflow's `Deferrable operators `__ -10. **watcher**: (experimental since Cosmos 1.11.0) Run a single ``dbt build`` command from a producer task and have sensor tasks to watch the progress of the producer, with improved DAG run time while maintaining the tasks lineage in the Airflow UI, and ability to retry failed tasks. 
Check the :ref:`watcher-execution-mode` for more details. -11. **watcher_kubernetes**: (experimental since Cosmos 1.13.0) Combines the speed of the watcher execution mode with the isolation of Kubernetes. Check the :ref:`watcher-kubernetes-execution-mode` for more details. +There are two types of execution modes: -The choice of the ``execution mode`` can vary based on each user's needs and concerns. For more details, check each execution mode described below. +1. **Execute dbt commands on the Airflow worker or triggerer.** These execution modes offer faster +execution times, since no extra container needs to be spun up, but provide little or no environment isolation. +There are four options for this type of execution mode: ``watcher``, ``local``, ``virtualenv``, and ``airflow_async``. +``airflow_async`` is available for BigQuery as of Cosmos 1.9 and ``watcher`` is available as of Cosmos 1.11. + +2. **Execute dbt commands in a container outside of your Airflow environment.** This type of execution mode offers high levels of environment isolation, and also allows you to spin up the Docker or +Kubernetes container in various cloud or on-premises environments. + +The following diagram shows a decision tree to help you select the right execution mode for your project needs. + +.. image:: ../_static/execution_mode_map.png + :alt: A diagram illustrating the details about each execution mode in the two categories, "On the Airflow worker or triggerer" and "In a container". + + +On the Airflow worker or triggerer +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +These execution modes offer faster execution times, since you don't need to spin up any extra containers. You can also use Airflow connections via the ``ProfileConfig``. However, these execution modes offer little or no environment isolation. 
There are four execution mode options that run on the Airflow worker: + +- `local <../guides/run_dbt/airflow-worker/local-execution-mode.html>`_: Default execution mode, but provides no environment isolation. Run ``dbt`` commands using a local ``dbt`` installation. +- `watcher <../guides/run_dbt/airflow-worker/watcher-execution-mode.html>`_: (Experimental since Cosmos 1.11.0) Optimized for execution speed. Run a single ``dbt build`` command from a producer task and use sensor tasks to watch the progress of the producer. This improves DAG run time while maintaining task lineage in the Airflow UI and the ability to retry failed tasks. +- `virtualenv <../guides/run_dbt/airflow-worker/cosmos-managed-venv.html>`_: Helps when you have package conflicts or cannot create a venv at build time. Run ``dbt`` commands from Python virtual environments managed by Cosmos. +- `airflow_async <../guides/run_dbt/airflow-worker/async-execution-mode.html>`_: (Stable since Cosmos 1.9.0) Optimized for worker efficiency if you have long-running dbt commands. Run the dbt resources from your dbt project asynchronously, by submitting the corresponding compiled SQL to Apache Airflow's `Deferrable operators `__. + +In a container +~~~~~~~~~~~~~~ + +You can also execute dbt commands in a container outside of the Airflow environment. These execution modes provide a high degree of isolation, but you can only define connections with the dbt ``profiles.yml`` file, you need a pre-existing Docker image, and run times are slower because of container provisioning. 
+ +- `docker <../guides/run_dbt/container/docker.html>`_: Run ``dbt`` commands from Docker containers managed by Cosmos (requires a pre-existing Docker image). +- `kubernetes <../guides/run_dbt/container/kubernetes.html>`_: Run ``dbt`` commands from Kubernetes Pods managed by Cosmos (requires a pre-existing Docker image). +- `watcher_kubernetes <../guides/run_dbt/container/watcher-kubernetes-execution-mode.html>`_: (Experimental since Cosmos 1.13.0) Combines the speed of the watcher execution mode with the isolation of Kubernetes. Check the :ref:`watcher-kubernetes-execution-mode` for more details. +- `azure_container_instance <../guides/run_dbt/container/azure-container-instance.html>`_: Run ``dbt`` commands from Azure Container Instances managed by Cosmos (requires a pre-existing Docker image). +- `aws_ecs <../guides/run_dbt/container/aws-container-run-job.html>`_: Run ``dbt`` commands from AWS ECS instances managed by Cosmos (requires a pre-existing Docker image). +- `aws_eks <../guides/run_dbt/container/aws-eks.html>`_: Run ``dbt`` commands from AWS EKS Pods managed by Cosmos (requires a pre-existing Docker image). +- `gcp_cloud_run_job <../guides/run_dbt/container/gcp-cloud-run-job.html>`_: Run ``dbt`` commands from GCP Cloud Run Job instances managed by Cosmos (requires a pre-existing Docker image). .. _execution-modes-comparison: +Execution modes comparison +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The type of execution mode that you choose directly affects how fast your Cosmos Dag runs. + .. 
list-table:: Execution Modes Comparison :widths: 25 25 25 25 :header-rows: 1 @@ -33,10 +63,18 @@ The choice of the ``execution mode`` can vary based on each user's needs and con - Fast - None - Yes + * - Watcher + - Very Fast + - None + - Yes * - Virtualenv - Medium - Lightweight - Yes + * - Airflow Async + - Very Fast + - Medium + - Yes * - Docker - Slow - Medium @@ -61,314 +99,8 @@ The choice of the ``execution mode`` can vary based on each user's needs and con - Slow - High - No - * - Airflow Async - - Very Fast - - Medium - - Yes - * - Watcher - - Very Fast - - None - - Yes * - Watcher Kubernetes - Fast - High - No -Local ------ - -By default, Cosmos uses the ``local`` execution mode. - -The ``local`` execution mode is the fastest way to run Cosmos operators since they don't install ``dbt`` nor build docker containers. However, it may not be an option for users using managed Airflow services such as -Google Cloud Composer, since Airflow and ``dbt`` dependencies can conflict (:ref:`execution-modes-local-conflicts`), the user may not be able to install ``dbt`` in a custom path. - -The ``local`` execution mode assumes a ``dbt`` binary is reachable within the Airflow worker node. - -If ``dbt`` was not installed as part of the Cosmos packages, -users can define a custom path to ``dbt`` by declaring the argument ``dbt_executable_path``. - -.. note:: - Starting in the 1.4 version, Cosmos tries to leverage the dbt partial parsing (``partial_parse.msgpack``) to speed up task execution. - This feature is bound to `dbt partial parsing limitations `_. - Learn more: :ref:`partial-parsing`. - -When using the ``local`` execution mode, Cosmos converts Airflow Connections into a native ``dbt`` profiles file (``profiles.yml``). - -Example of how to use, for instance, when ``dbt`` was installed together with Cosmos: - -.. 
literalinclude:: ../../dev/dags/basic_cosmos_dag.py - :language: python - :start-after: [START local_example] - :end-before: [END local_example] - - -Virtualenv ----------- - -If you're using managed Airflow on GCP (Cloud Composer), for instance, we recommend you use the ``virtualenv`` execution mode. - -The ``virtualenv`` mode isolates the Airflow worker dependencies from ``dbt`` by managing a Python virtual environment created during task execution and deleted afterwards. - -In this case, users are responsible for declaring which version of ``dbt`` they want to use by giving the argument ``py_requirements``. This argument can be set directly in operator instances or when instantiating ``DbtDag`` and ``DbtTaskGroup`` as part of ``operator_args``. - -Similar to the ``local`` execution mode, Cosmos converts Airflow Connections into a way ``dbt`` understands them by creating a ``dbt`` profile file (``profiles.yml``). -Also similar to the ``local`` execution mode, Cosmos will by default attempt to use a ``partial_parse.msgpack`` if one exists to speed up parsing. - -Some drawbacks of this approach: - -- It is slower than ``local`` because it creates a new Python virtual environment for each Cosmos dbt task run. -- If dbt is unavailable in the Airflow scheduler, the default ``LoadMode.DBT_LS`` will not work. In this scenario, users must use a :ref:`parsing-methods` that does not rely on dbt, such as ``LoadMode.MANIFEST``. -- Only ``InvocationMode.SUBPROCESS`` is supported currently, attempt to use ``InvocationMode.DBT_RUNNER`` will raise error. - -Example of how to use: - -.. literalinclude:: ../../dev/dags/example_virtualenv.py - :language: python - :start-after: [START virtualenv_example] - :end-before: [END virtualenv_example] - -Docker ------- - -The ``docker`` approach assumes users have a previously created Docker image, which should contain all the ``dbt`` pipelines and a ``profiles.yml``, managed by the user. 
- -The user has better environment isolation than when using ``local`` or ``virtualenv`` modes, but also more responsibility (ensuring the Docker container used has up-to-date files and managing secrets potentially in multiple places). - -The other challenge with the ``docker`` approach is if the Airflow worker is already running in Docker, which sometimes can lead to challenges running `Docker in Docker `__. - -This approach can be significantly slower than ``virtualenv`` since it may have to build the ``Docker`` container, which is slower than creating a Virtualenv with ``dbt-core``. -If dbt is unavailable in the Airflow scheduler, the default ``LoadMode.DBT_LS`` will not work. In this scenario, users must use a :ref:`parsing-methods` that does not rely on dbt, such as ``LoadMode.MANIFEST``. - -Check the step-by-step guide on using the ``docker`` execution mode at :ref:`docker`. - -Example DAG: - -.. code-block:: python - - docker_cosmos_dag = DbtDag( - # ... - execution_config=ExecutionConfig( - execution_mode=ExecutionMode.DOCKER, - ), - operator_args={ - "image": "dbt-jaffle-shop:1.0.0", - "network_mode": "bridge", - }, - ) - - -Kubernetes ----------- - -The ``kubernetes`` approach is a very isolated way of running ``dbt`` since the ``dbt`` run commands from within a Kubernetes Pod, usually in a separate host. - -It assumes the user has a Kubernetes cluster. It also expects the user to ensure the Docker container has up-to-date ``dbt`` pipelines and profiles, potentially leading the user to declare secrets in two places (Airflow and Docker container). - -The ``Kubernetes`` deployment may be slower than ``Docker`` and ``Virtualenv`` assuming that the container image is built (which is slower than creating a Python ``virtualenv`` and installing ``dbt-core``) and the Airflow task needs to spin up a new ``Pod`` in Kubernetes. - -Check the step-by-step guide on using the ``kubernetes`` execution mode at :ref:`kubernetes`. - -Example DAG: - -.. 
literalinclude:: ../../dev/dags/jaffle_shop_kubernetes.py - :language: python - :start-after: [START kubernetes_seed_example] - :end-before: [END kubernetes_seed_example] - -AWS_EKS ----------- - -The ``aws_eks`` approach is very similar to the ``kubernetes`` approach, but it is specifically designed to run on AWS EKS clusters. -It uses the `EKSPodOperator `_ -to run the dbt commands. You need to provide the ``cluster_name`` in your operator_args to connect to the AWS EKS cluster. - - -Example DAG: - -.. code-block:: python - - postgres_password_secret = Secret( - deploy_type="env", - deploy_target="POSTGRES_PASSWORD", - secret="postgres-secrets", - key="password", - ) - - docker_cosmos_dag = DbtDag( - # ... - execution_config=ExecutionConfig( - execution_mode=ExecutionMode.AWS_EKS, - ), - operator_args={ - "image": "dbt-jaffle-shop:1.0.0", - "cluster_name": CLUSTER_NAME, - "get_logs": True, - "is_delete_operator_pod": False, - "secrets": [postgres_password_secret], - }, - ) - -Azure Container Instance ------------------------- -.. versionadded:: 1.4 - -Similar to the ``kubernetes`` approach, using ``Azure Container Instances`` as the execution mode gives a very isolated way of running ``dbt``, since the ``dbt`` run itself is run within a container running in an Azure Container Instance. - -This execution mode requires the user has an Azure environment that can be used to run Azure Container Groups in (see :ref:`azure-container-instance` for more details on the exact requirements). Similarly to the ``Docker`` and ``Kubernetes`` execution modes, a Docker container should be available, containing the up-to-date ``dbt`` pipelines and profiles. - -Each task will create a new container on Azure, giving full isolation. This, however, comes at the cost of speed, as this separation of tasks introduces some overhead. Please checkout the step-by-step guide for using Azure Container Instance as the execution mode - - -.. 
code-block:: python - - docker_cosmos_dag = DbtDag( - # ... - execution_config=ExecutionConfig( - execution_mode=ExecutionMode.AZURE_CONTAINER_INSTANCE - ), - operator_args={ - "ci_conn_id": "aci", - "registry_conn_id": "acr", - "resource_group": "my-rg", - "name": "my-aci-{{ ti.task_id.replace('.','-').replace('_','-') }}", - "region": "West Europe", - "image": "dbt-jaffle-shop:1.0.0", - }, - ) - -GCP Cloud Run Job ------------------------- -.. versionadded:: 1.7 - -The ``gcp_cloud_run_job`` execution mode is particularly useful for users who prefer to run their ``dbt`` commands on Google Cloud infrastructure, taking advantage of Cloud Run's scalability, isolation, and managed service capabilities. - -For the ``gcp_cloud_run_job`` execution mode to work, a Cloud Run Job instance must first be created using a previously built Docker container. This container should include the latest ``dbt`` pipelines and profiles. You can find more details in the `Cloud Run Job creation guide `__ . - -This execution mode allows users to run ``dbt`` core CLI commands in a Google Cloud Run Job instance. This mode leverages the ``CloudRunExecuteJobOperator`` from the Google Cloud Airflow provider to execute commands within a Cloud Run Job instance, where ``dbt`` is already installed. Similarly to the ``Docker`` and ``Kubernetes`` execution modes, a Docker container should be available, containing the up-to-date ``dbt`` pipelines and profiles. - -Each task will create a new Cloud Run Job execution, giving full isolation. The separation of tasks adds extra overhead; however, that can be mitigated by using the ``concurrency`` parameter in ``DbtDag``, which will result in parallelized execution of ``dbt`` models. - - -.. code-block:: python - - gcp_cloud_run_job_cosmos_dag = DbtDag( - # ... 
- execution_config=ExecutionConfig(execution_mode=ExecutionMode.GCP_CLOUD_RUN_JOB), - operator_args={ - "project_id": "my-gcp-project-id", - "region": "europe-west1", - "job_name": "my-crj-{{ ti.task_id.replace('.','-').replace('_','-') }}", - }, - ) - - -AWS ECS ---------- -.. versionadded:: 1.9.0 - -Using ``AWS Elastic Container Service (ECS)`` as the execution mode provides an isolated and scalable way to run ``dbt`` tasks within an AWS ECS service. This execution mode ensures that each ``dbt`` run is performed inside a dedicated container running in an ECS task. - -This execution mode requires the user to have an AWS environment configured to run ECS tasks (see :ref:``aws-ecs`` for more details on the exact requirements). Similar to the ``Docker`` and ``Kubernetes`` execution modes, a Docker container should be available, containing the up-to-date ``dbt`` pipelines and profiles. - -Each task will create a new ECS task execution, providing full isolation. However, this separation introduces some overhead in execution time due to container startup and provisioning. For users who require faster execution times, configuring appropriate ECS task definitions and cluster optimizations can help mitigate these delays. - -Please refer to the step-by-step guide for using AWS ECS as the execution mode. - -.. code-block:: python - - aws_ecs_cosmos_dag = DbtDag( - # ... - execution_config=ExecutionConfig(execution_mode=ExecutionMode.AWS_ECS), - operator_args={ - "aws_conn_id": "aws_default", - "cluster": "my-ecs-cluster", - "task_definition": "my-dbt-task", - "container_name": "dbt-container", - "launch_type": "FARGATE", - "deferrable": True, - "network_configuration": { - "awsvpcConfiguration": { - "subnets": ["<<>>"], - "assignPublicIp": "ENABLED", - }, - }, - "environment_variables": {"DBT_PROFILE_NAME": "default"}, - }, - ) - -.. _airflow-async-execution-mode: - -Airflow Async -------------- - -.. 
versionadded:: 1.9.0 - -Although this execution mode was introduced in Cosmos 1.9, we strongly encourage users to use Cosmos 1.11, which has significant performance improvements. -In comparison to the ``local``, the ``airflow_async`` execution mode can reduce the execution time of a dbt project by up to 36%. - -The ``airflow_async`` execution mode is a way to run the dbt resources from your dbt project using Apache Airflow's -`Deferrable operators `__. -This execution mode could be preferred when you've long running resources and you want to run them asynchronously by -leveraging Airflow's deferrable operators. With that, you would be able to potentially observe higher throughput of tasks -as more dbt nodes will be run in parallel since they won't be blocking Airflow's worker slots. - -Example DAG: - -.. literalinclude:: ../../dev/dags/simple_dag_async.py - :language: python - :start-after: [START airflow_async_execution_mode_example] - :end-before: [END airflow_async_execution_mode_example] - -For a full step-by-step guide and limitations, check the :ref:`async-execution-mode` page. - - -Watcher Execution Mode (Experimental) -------------------------------------- - -.. versionadded:: 1.11.0 - -The ``watcher`` execution mode is an experimental execution mode that runs a single ``dbt build`` command from a producer task and has sensor tasks to watch the progress of the producer. -It is designed to improve DAG run time while maintaining the tasks lineage in the Airflow UI, and ability to retry failed tasks. - -Check the :ref:`watcher-execution-mode` for more details. - - -Watcher Kubernetes Execution Mode (Experimental) ------------------------------------------------- - -.. versionadded:: 1.13.0 - -The ``watcher_kubernetes`` execution mode combines the speed of the ``watcher`` execution mode with the isolation of the ``kubernetes`` execution mode. 
It runs a single ``dbt build`` command from a producer task inside a Kubernetes pod and has sensor tasks to watch the progress of the producer. - -Check the :ref:`watcher-kubernetes-execution-mode` for more details. - - -.. _invocation_modes: - -Invocation Modes -================ -.. versionadded:: 1.4 - -For ``ExecutionMode.LOCAL`` execution mode, Cosmos supports two invocation modes for running dbt: - -1. ``InvocationMode.SUBPROCESS``: In this mode, Cosmos runs dbt cli commands using the Python ``subprocess`` module and parses the output to capture logs and to raise exceptions. - -2. ``InvocationMode.DBT_RUNNER``: In this mode, Cosmos uses the ``dbtRunner`` available for `dbt programmatic invocations `__ to run dbt commands. \ - In order to use this mode, dbt must be installed in the same local environment. This mode does not have the overhead of spawning new subprocesses or parsing the output of dbt commands and is faster than ``InvocationMode.SUBPROCESS``. \ - This mode requires dbt version 1.5.0 or higher. It is up to the user to resolve :ref:`execution-modes-local-conflicts` when using this mode. - -The invocation mode can be set in the ``ExecutionConfig`` as shown below: - -.. code-block:: python - - from cosmos.constants import InvocationMode - - dag = DbtDag( - # ... - execution_config=ExecutionConfig( - execution_mode=ExecutionMode.LOCAL, - invocation_mode=InvocationMode.DBT_RUNNER, - ), - ) - -If the invocation mode is not set, Cosmos will attempt to use ``InvocationMode.DBT_RUNNER`` if dbt is installed in the same environment as the worker, otherwise it will fall back to ``InvocationMode.SUBPROCESS``. diff --git a/docs/getting_started/index.rst b/docs/getting_started/index.rst index eb71d10221..586f2e1dd2 100644 --- a/docs/getting_started/index.rst +++ b/docs/getting_started/index.rst @@ -1,27 +1,29 @@ .. _getting-started: .. 
toctree:: + :maxdepth: 1 :hidden: - :caption: Contents: + :caption: Cosmos Fundamentals + + Choosing an execution mode + Similar dbt and Airflow concepts + +.. toctree:: + :maxdepth: 1 + :hidden: + :caption: Quickstart Astro CLI quickstart - Astro - MWAA - GCC - Open-Source - Execution Modes - Docker Execution Mode - Kubernetes Execution Mode - Azure Container Instance Execution Mode - AWS Container Run Job Execution Mode - GCP Cloud Run Job Execution Mode - Airflow Async Execution Mode - Watcher Execution Mode - Watcher Kubernetes Execution Mode - dbt and Airflow Similar Concepts - Operators - Custom Airflow Properties +.. toctree:: + :maxdepth: 1 + :hidden: + :caption: Get started with Cosmos + + Open-source Airflow + Astro + Google Cloud Composer (GCC) + Amazon Managed Workflows for Apache Airflow (MWAA) Getting Started =============== @@ -44,15 +46,6 @@ Execution Methods For more customization, check out the different execution modes that Cosmos supports on the `Execution Modes `__ page. -For specific guides, see the following: - -- `Executing dbt DAGs with Docker Operators `__ -- `Executing dbt DAGs with KubernetesPodOperators `__ -- `Executing dbt DAGs with Watcher Kubernetes Mode `__ -- `Executing dbt DAGs with AzureContainerInstancesOperators `__ -- `Executing dbt DAGs with GcpCloudRunExecuteJobOperators `__ - - Concepts Overview ----------------- diff --git a/docs/getting_started/mwaa.rst b/docs/getting_started/mwaa.rst index 5b7c41bde5..5b1da23439 100644 --- a/docs/getting_started/mwaa.rst +++ b/docs/getting_started/mwaa.rst @@ -1,7 +1,7 @@ .. _mwaa: -Getting Started on MWAA -======================= +Getting Started with Cosmos on Amazon Managed Workflows +======================================================= Users can face Python dependency issues when trying to use the Cosmos `Local Execution Mode `_ in Amazon Managed Workflows for `Apache Airflow® `_ (MWAA). 
diff --git a/docs/getting_started/open-source.rst b/docs/getting_started/open-source.rst index ba9bbdb15c..f5d1db832b 100644 --- a/docs/getting_started/open-source.rst +++ b/docs/getting_started/open-source.rst @@ -1,7 +1,7 @@ .. _open-source: -Getting Started on Open Source Airflow -====================================== +Getting Started with Cosmos on Open-source Airflow +================================================== When running open-source Airflow, your setup may vary. This guide assumes you have access to edit the underlying image. diff --git a/docs/configuration/compiled-sql.rst b/docs/guides/cosmos_devex/compiled-sql.rst similarity index 100% rename from docs/configuration/compiled-sql.rst rename to docs/guides/cosmos_devex/compiled-sql.rst diff --git a/docs/guides/cosmos_devex/index.rst b/docs/guides/cosmos_devex/index.rst new file mode 100644 index 0000000000..2ad3dff71b --- /dev/null +++ b/docs/guides/cosmos_devex/index.rst @@ -0,0 +1,14 @@ +.. _cosmos_devex: + + +Cosmos DevEx +============ + +.. 
toctree:: + :maxdepth: 1 + :caption: Cosmos DevEx + + lineage + compiled-sql + logging + task-display-name diff --git a/docs/configuration/lineage.rst b/docs/guides/cosmos_devex/lineage.rst similarity index 100% rename from docs/configuration/lineage.rst rename to docs/guides/cosmos_devex/lineage.rst diff --git a/docs/configuration/logging.rst b/docs/guides/cosmos_devex/logging.rst similarity index 100% rename from docs/configuration/logging.rst rename to docs/guides/cosmos_devex/logging.rst diff --git a/docs/configuration/task-display-name.rst b/docs/guides/cosmos_devex/task-display-name.rst similarity index 100% rename from docs/configuration/task-display-name.rst rename to docs/guides/cosmos_devex/task-display-name.rst diff --git a/docs/configuration/generating-docs.rst b/docs/guides/dbt_docs/generating-docs.rst similarity index 100% rename from docs/configuration/generating-docs.rst rename to docs/guides/dbt_docs/generating-docs.rst diff --git a/docs/configuration/hosting-docs.rst b/docs/guides/dbt_docs/hosting-docs.rst similarity index 100% rename from docs/configuration/hosting-docs.rst rename to docs/guides/dbt_docs/hosting-docs.rst diff --git a/docs/configuration/dbt-fusion.rst b/docs/guides/dbt_setup/dbt-fusion.rst similarity index 100% rename from docs/configuration/dbt-fusion.rst rename to docs/guides/dbt_setup/dbt-fusion.rst diff --git a/docs/guides/index.rst b/docs/guides/index.rst new file mode 100644 index 0000000000..fe4a6744bd --- /dev/null +++ b/docs/guides/index.rst @@ -0,0 +1,93 @@ +.. _guides: + +Guides +====== + +.. toctree:: + :maxdepth: 0 + :hidden: + + self + +Cosmos offers a number of configuration options to customize how Airflow dags and dbt commands run. + +To set up a project, you follow the same general set of steps. + + +1. Set up dbt with Airflow +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You must make your dbt projects available to Airflow and install dbt into the environment where your dbt code runs. + +.. 
toctree:: + :maxdepth: 1 + :caption: Set up dbt with Airflow + + dbt_setup/dbt-fusion + +2. Connect to your dbt database +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Configure your Cosmos project to allow Airflow Dags to initiate dbt commands, and make data transformations and updates in your data warehouses. You can create these connections with your ``profiles.yml`` file in the dbt project, using profile mappings, or customizing ``ProfileConfig`` per dbt configuration. + +3. Translate your dbt code into Airflow Dags +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can customize how Cosmos parses your dbt workflows into Airflow Dags. Choosing how you want your dbt nodes to map to Airflow tasks within Dags can affect the time required for Cosmos to parse the dbt workflows and for Airflow to execute the resulting Dags. + +.. toctree:: + :maxdepth: 2 + :caption: Translating dbt into Airflow + + translate_dbt_to_airflow/index + + +4. Run dbt +~~~~~~~~~~~~~ + +You can specify more details about how Cosmos runs both dbt commands and Airflow Dags. This includes `choosing an execution mode <../getting_started/execution-modes.html>`_, either one that runs dbt on an Airflow worker node or one that runs in a container. You can customize additional aspects of how your dbt code runs, like using particular operators that correspond to dbt commands. And, you can leverage Airflow's scheduling capabilities in your Cosmos Dags. + +.. toctree:: + :maxdepth: 2 + :caption: How Cosmos runs dbt + + run_dbt/airflow-worker/index + run_dbt/container/index + run_dbt/callbacks/callbacks + run_dbt/operators/operators + run_dbt/customization/index + +Multi-project Setups +~~~~~~~~~~~~~~~~~~~~ + +If you have a multi-project architecture where you have multiple dbt projects that reference each other's models, you can set up ``dbt-loom`` with Cosmos to handle cross-project references. + +.. 
toctree:: + :maxdepth: 1 + :caption: Multi-project Setups + + multi_project/multi-project + +Add your dbt documentation +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Cosmos supports dbt's documentation capabilities. + +.. toctree:: + :maxdepth: 1 + :caption: Documentation + + dbt_docs/generating-docs + dbt_docs/hosting-docs + +Cosmos DevEx +~~~~~~~~~~~~ + +You can configure Cosmos to improve your development experience. + +.. toctree:: + :maxdepth: 1 + :caption: Cosmos DevEx + + cosmos_devex/index + diff --git a/docs/configuration/multi-project.rst b/docs/guides/multi_project/multi-project.rst similarity index 99% rename from docs/configuration/multi-project.rst rename to docs/guides/multi_project/multi-project.rst index 70283bc410..5dfd79eea0 100644 --- a/docs/configuration/multi-project.rst +++ b/docs/guides/multi_project/multi-project.rst @@ -169,7 +169,7 @@ You can use either separate DAGs or a combined DAG with task groups. **Option 1: Combined DAG with Task Groups using dbt ls Load Mode (Recommended)** -.. literalinclude:: ../../dev/dags/cross_project_dbt_ls_dag.py +.. literalinclude:: ../../../dev/dags/cross_project_dbt_ls_dag.py :language: python :start-after: [START cross_project_dbt_ls_dag] :end-before: [END cross_project_dbt_ls_dag] @@ -178,7 +178,7 @@ You can use either separate DAGs or a combined DAG with task groups. This option uses pre-generated ``manifest.json`` files for faster DAG parsing (no ``dbt ls`` execution required). -.. literalinclude:: ../../dev/dags/cross_project_manifest_dag.py +.. 
literalinclude:: ../../../dev/dags/cross_project_manifest_dag.py :language: python :start-after: [START cross_project_manifest_dag] :end-before: [END cross_project_manifest_dag] diff --git a/docs/getting_started/async-execution-mode.rst b/docs/guides/run_dbt/airflow-worker/async-execution-mode.rst similarity index 82% rename from docs/getting_started/async-execution-mode.rst rename to docs/guides/run_dbt/airflow-worker/async-execution-mode.rst index 6d61bcf22b..0b7784a082 100644 --- a/docs/getting_started/async-execution-mode.rst +++ b/docs/guides/run_dbt/airflow-worker/async-execution-mode.rst @@ -1,22 +1,23 @@ .. _async-execution-mode: -.. title:: Getting Started with Deferrable Operator - -Airflow Async Execution Mode +Airflow async execution mode ============================ -This execution mode can reduce the runtime by 35% in comparison to Cosmos LOCAL execution mode, but is currently only available for BigQuery. While this mode was introduced in Cosmos 1.9, we strongly encourage users to use Cosmos 1.11, which has significant performance improvements. - -It can be particularly useful for long-running transformations, since it leverages Airflow's `deferrable operators `__. +This execution mode can reduce the runtime by 35% in comparison to Cosmos ``LOCAL`` execution mode, but is currently only available for BigQuery. While this mode was introduced in Cosmos 1.9, we strongly encourage users to use the latest version of Cosmos, which has significant performance improvements. -In this mode, there is a ``SetupAsyncOperator`` that will pre-generate the SQL files for the dbt project and upload them to Airflow XCom or a remote location. A remote location will only be used if users set ``AIRFLOW__COSMOS__REMOTE_TARGET_PATH`` and ``AIRFLOW__COSMOS__REMOTE_TARGET_PATH_CONN_ID``. This operator is run before the remaining pipeline. 
-All the pipeline dbt model transformations will be run using ``DbtRunAirflowAsyncOperator`` which, instead of running the ``dbt run`` command for each model. They will download the SQL files from the Airflow XCom or remote location and execute them directly leveraging the Airflow ``BigQueryInsertJobOperator``. +The ``airflow_async`` execution mode is a way to run the dbt resources from your dbt project using Apache Airflow's +`Deferrable operators `__. +This execution mode is well-suited for when you have long-running resources and you want to run them asynchronously by +leveraging Airflow's deferrable operators. With deferrable operators, you can potentially observe higher throughput of tasks +because more dbt nodes run in parallel, since they won't be blocking Airflow's worker slots. -Users can leverage other existing ``BigQueryInsertJobOperator`` features, such as the UI controls to link to the job in the BigQuery UI. +In this mode, there is a ``SetupAsyncOperator`` that pre-generates the SQL files for the dbt project and uploads them to Airflow XCom or a remote location. Airflow only uses a remote location if you set ``AIRFLOW__COSMOS__REMOTE_TARGET_PATH`` and ``AIRFLOW__COSMOS__REMOTE_TARGET_PATH_CONN_ID``. This operator runs before the remaining pipeline. +All the pipeline dbt model transformations run using ``DbtRunAirflowAsyncOperator`` instead of running the ``dbt run`` command for each model. They download the SQL files from the Airflow XCom or remote location, and then execute them directly using the Airflow ``BigQueryInsertJobOperator``. +You can also use other existing ``BigQueryInsertJobOperator`` features, such as the UI controls to link to the job in the BigQuery UI. Advantages of Airflow Async Mode -++++++++++++++++++++++++++++++++ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - **Improved Task Throughput:** Async tasks free up Airflow workers by leveraging the Airflow Trigger framework. 
While long-running SQL transformations are executing in the data warehouse, the worker is released and can handle other tasks, increasing overall task throughput. - **Better Resource Utilization:** By minimizing idle time on Airflow workers, async tasks allow more efficient use of compute resources. Workers aren't blocked waiting for external systems and can be reused for other work while waiting on async operations. @@ -36,18 +37,18 @@ We have `observed `_ Getting Started with Airflow Async Mode -+++++++++++++++++++++++++++++++++++++++ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This guide walks you through setting up an Astro CLI project and running a Cosmos-based DAG with a deferrable operator, enabling asynchronous task execution in Apache Airflow. Prerequisites -+++++++++++++ +------------- - `Astro CLI `_ - Airflow>=2.9 1. Create Astro-CLI Project -+++++++++++++++++++++++++++ +--------------------------- Run the following command in your terminal: @@ -72,7 +73,7 @@ This will create an Astro project with the following structure: 2. Update Dockerfile -++++++++++++++++++++ +-------------------- Edit your Dockerfile to ensure all necessary requirements are included. @@ -82,7 +83,7 @@ Edit your Dockerfile to ensure all necessary requirements are included. 3. Add astronomer-cosmos Dependency -+++++++++++++++++++++++++++++++++++ +----------------------------------- In your ``requirements.txt``, add: @@ -92,7 +93,7 @@ In your ``requirements.txt``, add: 4. Create Airflow DAG -+++++++++++++++++++++ +--------------------- 1. Create a new DAG file: ``dags/cosmos_async_dag.py`` @@ -154,8 +155,8 @@ In your ``requirements.txt``, add: - Add a valid dbt project inside your Airflow project under ``dags/dbt/``. -5. Start the Project -++++++++++++++++++++ +5. 
Start the project +-------------------- Launch the Airflow project locally: @@ -168,8 +169,8 @@ This will: - Spin up the scheduler, webserver, and triggerer (needed for deferrable operators) - Expose Airflow UI at http://localhost:8080 -6. Create Airflow Connection -++++++++++++++++++++++++++++ +6. Create Airflow connection +---------------------------- Create an Airflow connection with following configurations @@ -198,7 +199,7 @@ Create an Airflow connection with following configurations 7. Execute the DAG -++++++++++++++++++ +------------------ 1. Visit the Airflow UI at ``http://localhost:8080`` 2. Enable the DAG: ``cosmos_async_dag`` @@ -211,8 +212,8 @@ Create an Airflow connection with following configurations The ``run`` tasks will run asynchronously via the deferrable operator, freeing up worker slots while waiting on I/O or long-running tasks. -Control of where to upload the SQL files -++++++++++++++++++++++++++++++++++++++++ +Control where to upload the SQL files +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For optimal performance we encourage to keep Cosmos standard behaviour (introduced in 1.11), which is to upload the SQL files to XCom, instead of a remote object location. @@ -227,7 +228,7 @@ However, if you want to upload the SQL files to a remote object location instead Limitations -+++++++++++ +~~~~~~~~~~~ 1. **Limited to dbt models**: Only dbt resource type models are run asynchronously using Airflow deferrable operators. Other resource types are executed synchronously, similar to the local execution mode. diff --git a/docs/guides/run_dbt/airflow-worker/cosmos-managed-venv.rst b/docs/guides/run_dbt/airflow-worker/cosmos-managed-venv.rst new file mode 100644 index 0000000000..7a4863b9b1 --- /dev/null +++ b/docs/guides/run_dbt/airflow-worker/cosmos-managed-venv.rst @@ -0,0 +1,26 @@ +.. 
_cosmos-managed-venv: + +Cosmos-managed virtual environment execution mode +======================================================== + +If you're using managed Airflow, we recommend you use the ``virtualenv`` execution mode. + +The ``virtualenv`` mode isolates the Airflow worker dependencies from ``dbt`` by managing a Python virtual environment created during task execution and deleted afterwards. + +In this case, you are responsible for declaring which version of ``dbt`` to use by giving the argument ``py_requirements``. Set this argument directly in operator instances or when you instantiate ``DbtDag`` and ``DbtTaskGroup`` as part of ``operator_args``. + +Similar to the ``local`` execution mode, Cosmos converts Airflow Connections into a format ``dbt`` understands by creating a ``dbt`` profile file (``profiles.yml``). +Also similar to the ``local`` execution mode, Cosmos will by default attempt to use a ``partial_parse.msgpack`` if one exists to speed up parsing. + +Some drawbacks of this approach: + +- It is slower than ``local`` because it creates a new Python virtual environment for each Cosmos dbt task run. +- If dbt is unavailable in the Airflow scheduler, the default ``LoadMode.DBT_LS`` will not work. In this scenario, you must use a :ref:`parsing-methods` that does not rely on dbt, such as ``LoadMode.MANIFEST``. +- Only ``InvocationMode.SUBPROCESS`` is currently supported; attempting to use ``InvocationMode.DBT_RUNNER`` raises an error. + +Example of how to use: + +.. literalinclude:: ../../../../dev/dags/example_virtualenv.py + :language: python + :start-after: [START virtualenv_example] + :end-before: [END virtualenv_example] \ No newline at end of file diff --git a/docs/guides/run_dbt/airflow-worker/index.rst b/docs/guides/run_dbt/airflow-worker/index.rst new file mode 100644 index 0000000000..ef3edda3e4 --- /dev/null +++ b/docs/guides/run_dbt/airflow-worker/index.rst @@ -0,0 +1,11 @@ +Run dbt in an Airflow worker +============================ + +.. 
toctree:: + :maxdepth: 1 + :caption: Run dbt in an Airflow worker + + local-execution-mode + watcher-execution-mode + cosmos-managed-venv + async-execution-mode diff --git a/docs/guides/run_dbt/airflow-worker/local-execution-mode.rst b/docs/guides/run_dbt/airflow-worker/local-execution-mode.rst new file mode 100644 index 0000000000..a5b1e71ad9 --- /dev/null +++ b/docs/guides/run_dbt/airflow-worker/local-execution-mode.rst @@ -0,0 +1,27 @@ +.. _local-execution: + +Local execution mode +==================== + +By default, Cosmos uses the ``local`` execution mode. + +The ``local`` execution mode is the fastest way to run Cosmos operators since they neither install ``dbt`` nor build Docker containers. If you use managed Airflow services such as +Google Cloud Composer, you might want to use a different execution mode, since Airflow and ``dbt`` dependencies can conflict (:ref:`execution-modes-local-conflicts`). On a managed Airflow service, you might not be able to install ``dbt`` in a custom path. + +The ``local`` execution mode assumes that the Airflow worker node can access a ``dbt`` binary. + +If ``dbt`` was not installed as part of the Cosmos packages, you can define a custom path to ``dbt`` by declaring the argument ``dbt_executable_path``. + +.. note:: + Starting in the 1.4 version, Cosmos tries to leverage the dbt partial parsing (``partial_parse.msgpack``) to speed up task execution. + This feature is bound to `dbt partial parsing limitations `_. + Learn more: :ref:`partial-parsing`. + +When using the ``local`` execution mode, Cosmos converts Airflow Connections into a native ``dbt`` profiles file (``profiles.yml``). + +Example of how to use, for instance, when ``dbt`` was installed together with Cosmos: + +.. 
literalinclude:: ../../../../dev/dags/basic_cosmos_dag.py + :language: python + :start-after: [START local_example] + :end-before: [END local_example] \ No newline at end of file diff --git a/docs/getting_started/watcher-execution-mode.rst b/docs/guides/run_dbt/airflow-worker/watcher-execution-mode.rst similarity index 98% rename from docs/getting_started/watcher-execution-mode.rst rename to docs/guides/run_dbt/airflow-worker/watcher-execution-mode.rst index af7589650c..85e67bd36c 100644 --- a/docs/getting_started/watcher-execution-mode.rst +++ b/docs/guides/run_dbt/airflow-worker/watcher-execution-mode.rst @@ -1,7 +1,7 @@ .. _watcher-execution-mode: -Introducing ``ExecutionMode.WATCHER``: Experimental High-Performance dbt Execution in Cosmos -============================================================================================ +Watcher execution mode (Experimental) +====================================== With the release of **Cosmos 1.11.0**, we are introducing a powerful new experimental execution mode — ``ExecutionMode.WATCHER`` — designed to drastically reduce dbt pipeline run times in Airflow. @@ -144,7 +144,7 @@ Example 1 — Using ``DbtDag`` with ``ExecutionMode.WATCHER`` You can enable WATCHER mode directly in your ``DbtDag`` configuration. This approach is best when your Airflow DAG is fully dedicated to a dbt project. -.. literalinclude:: ../../dev/dags/example_watcher.py +.. literalinclude:: ../../../../dev/dags/example_watcher.py :language: python :start-after: [START example_watcher] :end-before: [END example_watcher] @@ -370,7 +370,7 @@ Source freshness nodes Since Cosmos 1.6, it `supports the rendering of source nodes `_. -We noticed some Cosmos users use this feature alongside `overriding Cosmos source nodes `_ as sensors or another operator that allows them to skip the following branch of the DAG if the source is not fresh. 
+We noticed some Cosmos users use this feature alongside `overriding Cosmos source nodes `_ as sensors or another operator that allows them to skip the following branch of the DAG if the source is not fresh. This use case is not currently supported by the ``ExecutionMode.WATCHER``, since the ``dbt build`` command does not run `source freshness checks `_. @@ -451,7 +451,7 @@ Asynchronous sensor execution To disable asynchronous execution, set the ``deferrable`` flag to ``False`` in the ``operator_args``. -.. literalinclude:: ../../dev/dags/example_watcher.py +.. literalinclude:: ../../../../dev/dags/example_watcher.py :language: python :start-after: [START example_watcher_synchronous] :end-before: [END example_watcher_synchronous] diff --git a/docs/configuration/callbacks.rst b/docs/guides/run_dbt/callbacks/callbacks.rst similarity index 98% rename from docs/configuration/callbacks.rst rename to docs/guides/run_dbt/callbacks/callbacks.rst index c754245525..4b602ece3f 100644 --- a/docs/configuration/callbacks.rst +++ b/docs/guides/run_dbt/callbacks/callbacks.rst @@ -34,7 +34,7 @@ Example: Using Callbacks with a Single Operator To demonstrate how to specify a callback function for uploading files from the target directory, here’s an example using a single operator in an Airflow DAG: -.. literalinclude:: ../../dev/dags/example_operators.py +.. literalinclude:: ../../../../dev/dags/example_operators.py :language: python :start-after: [START single_operator_callback] :end-before: [END single_operator_callback] @@ -46,7 +46,7 @@ You can leverage the :ref:`remote_target_path` configuration to upload files from the target directory to a remote storage. Below is an example of how to define a callback helper function in your ``DbtDag`` that utilizes this configuration: -.. literalinclude:: ../../dev/dags/cosmos_callback_dag.py +.. 
literalinclude:: ../../../../dev/dags/cosmos_callback_dag.py :language: python :start-after: [START cosmos_callback_example] :end-before: [END cosmos_callback_example] diff --git a/docs/getting_started/aws-container-run-job.rst b/docs/guides/run_dbt/container/aws-container-run-job.rst similarity index 84% rename from docs/getting_started/aws-container-run-job.rst rename to docs/guides/run_dbt/container/aws-container-run-job.rst index db00fc8c3c..40cc705fd6 100644 --- a/docs/getting_started/aws-container-run-job.rst +++ b/docs/guides/run_dbt/container/aws-container-run-job.rst @@ -1,11 +1,24 @@ .. _aws-container-run-job: -.. title:: Getting Started with Astronomer Cosmos on AWS ECS +AWS ECS execution mode +====================== -Getting Started with Astronomer Cosmos on AWS ECS -================================================== +.. versionadded:: 1.9.0 + +Astronomer Cosmos provides a unified way to run containerized workloads across multiple cloud providers. Using ``AWS Elastic Container Service (ECS)`` as the execution mode provides an isolated and scalable way to run ``dbt`` tasks within an AWS ECS service. This execution mode ensures that each ``dbt`` run is performed inside a dedicated container running in an ECS task. + +Performance and maintenance considerations +++++++++++++++++++++++++++++++++++++++++++ + +This execution mode requires you to have an AWS environment configured to run ECS tasks (see :ref:`aws-ecs` for more details on the exact requirements). Similar to the ``Docker`` and ``Kubernetes`` execution modes, a Docker container should be available, containing the up-to-date ``dbt`` pipelines and profiles. + +Each task creates a new ECS task execution, providing full isolation. However, this separation introduces some overhead in execution time due to container startup and provisioning. If you require faster execution times, configuring appropriate ECS task definitions and cluster optimizations can help mitigate these delays. 
+ +Setup ++++++ + +In this guide, you’ll learn how to deploy and run a Cosmos job on AWS Elastic Container Service (ECS) using Fargate. -Astronomer Cosmos provides a unified way to run containerized workloads across multiple cloud providers. In this guide, you’ll learn how to deploy and run a Cosmos job on AWS Elastic Container Service (ECS) using Fargate. Schematically, the guide will walk you through the steps required to build the following architecture: .. figure:: https://github.com/astronomer/astronomer-cosmos/raw/main/docs/_static/cosmos_aws_ecs_schematic.png diff --git a/docs/guides/run_dbt/container/aws-eks.rst b/docs/guides/run_dbt/container/aws-eks.rst new file mode 100644 index 0000000000..9894089d43 --- /dev/null +++ b/docs/guides/run_dbt/container/aws-eks.rst @@ -0,0 +1,34 @@ +.. _aws-eks: + +AWS EKS execution mode +======================= + +The Amazon Elastic Kubernetes Service (AWS EKS), ``aws_eks``, approach is very similar to the ``kubernetes`` approach, but it is specifically designed to run on AWS EKS clusters. +It uses the `EKSPodOperator `_ +to run the dbt commands. You need to provide the ``cluster_name`` in your operator_args to connect to the AWS EKS cluster. + + +Example DAG + +.. code-block:: python + + postgres_password_secret = Secret( + deploy_type="env", + deploy_target="POSTGRES_PASSWORD", + secret="postgres-secrets", + key="password", + ) + + docker_cosmos_dag = DbtDag( + # ... 
+ execution_config=ExecutionConfig( + execution_mode=ExecutionMode.AWS_EKS, + ), + operator_args={ + "image": "dbt-jaffle-shop:1.0.0", + "cluster_name": CLUSTER_NAME, + "get_logs": True, + "is_delete_operator_pod": False, + "secrets": [postgres_password_secret], + }, + ) \ No newline at end of file diff --git a/docs/getting_started/azure-container-instance.rst b/docs/guides/run_dbt/container/azure-container-instance.rst similarity index 84% rename from docs/getting_started/azure-container-instance.rst rename to docs/guides/run_dbt/container/azure-container-instance.rst index 86ce3ab9ef..834a87595f 100644 --- a/docs/getting_started/azure-container-instance.rst +++ b/docs/guides/run_dbt/container/azure-container-instance.rst @@ -1,16 +1,30 @@ .. _azure-container-instance: -Azure Container Instance Execution Mode + +Azure Container Instance execution mode ======================================= .. versionadded:: 1.4 -This tutorial will guide you through the steps required to use Azure Container Instance as the Execution Mode for your dbt code with Astronomer Cosmos. Schematically, the guide will walk you through the steps required to build the following architecture: +Using ``Azure Container Instances`` as the execution mode provides an isolated way of running ``dbt``, since the ``dbt`` run itself occurs within a container running in an Azure Container Instance. + +Performance and maintenance considerations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This execution mode requires an Azure environment in which you can run Azure Container Groups (see :ref:`azure-container-instance` for more details on the exact requirements). Similar to the ``Docker`` and ``Kubernetes`` execution modes, a Docker container should be available, containing the up-to-date ``dbt`` pipelines and profiles. + +Each task creates a new container on Azure, giving full isolation. This, however, comes at the cost of speed, as this separation of tasks introduces some overhead. 
+ +Setup +~~~~~ + +This tutorial guides you through the steps required to use Azure Container Instance as the execution mode for your dbt code with Astronomer Cosmos. Schematically, the guide demonstrates how to build the following architecture: .. figure:: https://github.com/astronomer/astronomer-cosmos/raw/main/docs/_static/cosmos_aci_schematic.png :width: 800 Prerequisites -+++++++++++++ +~~~~~~~~~~~~~ + 1. Docker with docker daemon (Docker Desktop on MacOS). Follow the `Docker installation guide `_. 2. Airflow 3. Azure CLI (install guide here: `Azure CLI `_) @@ -28,7 +42,7 @@ More information on how to achieve 2-6 is detailed below. Note that the steps below will walk you through an example, for which the code can be found HERE Step-by-step guide -++++++++++++++++++ +~~~~~~~~~~~~~~~~~~ **Install Airflow and Cosmos** @@ -103,7 +117,7 @@ Take a read of the Dockerfile to understand what it does so that you could use i - The dags directory containing the `dbt project jaffle_shop `_ is added to the image - The dbt_project.yml is replaced with `postgres_profile_dbt_project.yml `_ which contains the profile key pointing to postgres_profile as profile creation is not handled at the moment for K8s operators like in local mode. -**Setup Airflow Connections** +**Set up Airflow Connections** Now you have the required Azure infrastructure, you still need to add configuration to Airflow to ensure the infrastructure can be used. You'll need 3 connections: 1. ``aci_db``: a Postgres connection to your Azure Postgres instance. @@ -112,7 +126,7 @@ Now you have the required Azure infrastructure, you still need to add configurat Check out the ``airflow-settings.yml`` file `here `_ for an example. If you are using Astro CLI, filling in the right values here will be enough for this to work. 
-**Setup and Trigger the DAG with Airflow** +**Set up and trigger the Dag with Airflow** Copy the dags directory from cosmos-example repo to your Airflow home diff --git a/docs/guides/run_dbt/container/docker.rst b/docs/guides/run_dbt/container/docker.rst new file mode 100644 index 0000000000..143e05f549 --- /dev/null +++ b/docs/guides/run_dbt/container/docker.rst @@ -0,0 +1,135 @@ +.. _docker: + +Docker execution mode +===================== + +The ``docker`` approach assumes you previously created a Docker image, which contains all the ``dbt`` pipelines and a ``profiles.yml`` that you manage. + +Performance and maintenance considerations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can have better environment isolation with ``docker`` than when using ``local`` or ``virtualenv`` modes, but this mode also requires more maintenance and has some performance tradeoffs, depending on your project configurations. + +Using ``docker`` requires that you ensure the Docker container you use has up-to-date files, and you might need to manage secrets in multiple places. Another challenge of working with ``docker`` occurs when the Airflow worker is already running in Docker, which can cause problems related to running `Docker in Docker `__. + +Also, the Docker execution mode approach can be significantly slower than ``virtualenv``, since it might require building the ``Docker`` container before executing dbt commands, which is slower than creating a Virtualenv with ``dbt-core``. + + +Set up Docker execution mode +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following procedure illustrates how to run the Cosmos dbt Docker Operators and the required setup for them. + +Requirements +------------ + +- Docker with docker daemon (Docker Desktop on MacOS). Follow the `Docker installation guide `_. 
+ +The example setup includes the following: + +- Airflow +- Astronomer-cosmos package containing the dbt Docker operators +- Postgres docker container +- Docker image built with required dbt project and dbt DAG +- dbt DAG with dbt docker operators in the Airflow DAGs directory to run in Airflow + +1. Install Airflow and Cosmos +----------------------------- + +Create a python virtualenv, activate it, upgrade pip to the latest version, and install `Apache Airflow® `_ and ``astronomer-cosmos`` with the ``dbt-postgres`` extra: + +.. code-block:: bash + + python -m venv venv + source venv/bin/activate + pip install --upgrade pip + pip install apache-airflow + pip install "astronomer-cosmos[dbt-postgres]" + +2. Set up Postgres database +--------------------------- + +You will need a postgres database running to use as the database for the dbt project. Run the following command to run and expose a postgres database: + +.. code-block:: bash + + docker run --name some-postgres -e POSTGRES_PASSWORD="" -e POSTGRES_USER=postgres -e POSTGRES_DB=postgres -p5432:5432 -d postgres + +3. Build the dbt Docker image +----------------------------- + +For the Docker operators to work, you need to create a docker image that will be supplied as the ``image`` parameter to the dbt docker operators used in the DAG. + +1. Clone the `cosmos-example `_ repo + +.. code-block:: bash + + git clone https://github.com/astronomer/cosmos-example.git + cd cosmos-example + +2. Create a docker image containing the dbt project files and dbt profile by using the `Dockerfile `_, which will be supplied to the Docker operators. + +.. code-block:: bash + + docker build -t dbt-jaffle-shop:1.0.0 -f Dockerfile.postgres_profile_docker_k8s . + +.. note:: + + If running on M1, you may need to set the following environment variables for running the docker build command, in case it fails. + + .. 
code-block:: bash + + export DOCKER_BUILDKIT=0 + export COMPOSE_DOCKER_CLI_BUILD=0 + export DOCKER_DEFAULT_PLATFORM=linux/amd64 + +Read the example Dockerfile to understand what it does so that you can use it as a project reference. + +- The `dbt profile `_ file is added to the image. +- The ``dags`` directory containing the `dbt project jaffle_shop `_ is added to the image. +- The ``dbt_project.yml`` is replaced with `postgres_profile_dbt_project.yml `_, which contains the profile key pointing to ``postgres_profile`` because, unlike in local mode, profile creation is not handled for K8s operators. + + +4. Set up and trigger the Dag with Airflow +------------------------------------------ + +1. Copy the ``dags`` directory from the ``cosmos-example`` repo to your Airflow home + +.. code-block:: bash + + cp -r dags $AIRFLOW_HOME/ + +This directory contains a Docker-specific example Dag. + +2. Run Airflow + +.. code-block:: bash + + airflow standalone + +.. note:: + + You might need to run ``airflow standalone`` with ``sudo`` if your Airflow user is not able to access the docker socket URL or pull the images in the ``Kind`` cluster. + +3. Log in to Airflow through a web browser, ``http://localhost:8080/``, using the user ``airflow`` and the password stored in the ``standalone_admin_password.txt`` file. + +4. Enable and trigger a run of the `jaffle_shop_docker `_ Dag. The following figure shows an example of a successful Dag run: + +.. figure:: https://github.com/astronomer/astronomer-cosmos/raw/main/docs/_static/jaffle_shop_docker_dag_run.png + :width: 800 + +Specifying ProfileConfig +~~~~~~~~~~~~~~~~~~~~~~~~ + +Starting with Cosmos 1.8.0, you can use the ``profile_config`` argument in your dbt Docker operators to reference +profiles for your dbt project defined in a ``profiles.yml`` file. To do so, provide the file’s path via the +``profiles_yml_path`` parameter in ``profile_config``. 
+ +Note that in ``ExecutionMode.DOCKER``, the ``profile_config`` is only compatible with the ``profiles_yml_path`` +approach. The ``profile_mapping`` method will not work because the required Airflow connections cannot be accessed +within the Docker container to map them to the dbt profile. + +Troubleshooting +~~~~~~~~~~~~~~~ + +If dbt is unavailable in the Airflow scheduler, the default ``LoadMode.DBT_LS`` will not work. In this scenario, you must use a :ref:`parsing-methods` that does not rely on dbt, such as ``LoadMode.MANIFEST``. diff --git a/docs/getting_started/gcp-cloud-run-job.rst b/docs/guides/run_dbt/container/gcp-cloud-run-job.rst similarity index 87% rename from docs/getting_started/gcp-cloud-run-job.rst rename to docs/guides/run_dbt/container/gcp-cloud-run-job.rst index fa4d0c60c4..090f9aa395 100644 --- a/docs/getting_started/gcp-cloud-run-job.rst +++ b/docs/guides/run_dbt/container/gcp-cloud-run-job.rst @@ -1,9 +1,24 @@ .. _gcp-cloud-run-job: -GCP Cloud Run Job Execution Mode -======================================= +GCP Cloud Run Job execution mode +================================= .. versionadded:: 1.7 +The ``gcp_cloud_run_job`` execution mode is particularly useful if you prefer to run your ``dbt`` commands on Google Cloud infrastructure, taking advantage of Cloud Run's scalability, isolation, and managed service capabilities. + +Performance and maintenance considerations +++++++++++++++++++++++++++++++++++++++++++ + +For the ``gcp_cloud_run_job`` execution mode to work, a Cloud Run Job instance must first be created using a previously built Docker container. This container should include the latest ``dbt`` pipelines and profiles. You can find more details in the `Cloud Run Job creation guide `__. + +This execution mode allows you to run ``dbt`` core CLI commands in a Google Cloud Run Job instance. 
This mode leverages the ``CloudRunExecuteJobOperator`` from the Google Cloud Airflow provider to execute commands within a Cloud Run Job instance, where ``dbt`` is already installed. Similarly to the ``Docker`` and ``Kubernetes`` execution modes, a Docker container should be available, containing the up-to-date ``dbt`` pipelines and profiles. + +Each task will create a new Cloud Run Job execution, giving full isolation. The separation of tasks adds extra overhead; however, that can be mitigated by using the ``concurrency`` parameter in ``DbtDag``, which will result in parallelized execution of ``dbt`` models. + + +Setup ++++++ + This tutorial will guide you through the steps required to use Cloud Run Job instance as the Execution Mode for your dbt code with Astronomer Cosmos. This guide will walk you through the steps required to build the following architecture: .. figure:: https://github.com/astronomer/astronomer-cosmos/raw/main/docs/_static/cosmos_gcp_crj_schematic.png diff --git a/docs/guides/run_dbt/container/index.rst b/docs/guides/run_dbt/container/index.rst new file mode 100644 index 0000000000..ba31890637 --- /dev/null +++ b/docs/guides/run_dbt/container/index.rst @@ -0,0 +1,15 @@ +Run dbt in a container +====================== + +.. toctree:: + :maxdepth: 1 + :caption: Run dbt in a container + + docker + kubernetes + watcher-kubernetes-execution-mode + azure-container-instance + aws-container-run-job + aws-eks + gcp-cloud-run-job + diff --git a/docs/getting_started/kubernetes.rst b/docs/guides/run_dbt/container/kubernetes.rst similarity index 82% rename from docs/getting_started/kubernetes.rst rename to docs/guides/run_dbt/container/kubernetes.rst index 607ba07bd7..6e215ca252 100644 --- a/docs/getting_started/kubernetes.rst +++ b/docs/guides/run_dbt/container/kubernetes.rst @@ -1,7 +1,21 @@ .. 
_kubernetes: -Kubernetes Execution Mode -============================================== + +Kubernetes execution mode +========================== + +The ``kubernetes`` execution mode provides a highly isolated way to run ``dbt`` from within a Kubernetes Pod, usually on a separate host. + +Performance and maintenance considerations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This execution mode assumes you have a Kubernetes cluster. It also expects you to ensure the Docker container has up-to-date ``dbt`` pipelines and profiles, potentially leading you to declare secrets in two places: Airflow and the Docker container. + +The ``Kubernetes`` deployment might be slower than ``Docker`` and ``Virtualenv``, assuming that the container image is built (which is slower than creating a Python ``virtualenv`` and installing ``dbt-core``) and the Airflow task needs to spin up a new ``Pod`` in Kubernetes. + + +Set up Kubernetes execution mode +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following tutorial illustrates how to run the Cosmos dbt Kubernetes Operator using a local Kubernetes (K8s) cluster. It assumes the following: @@ -9,7 +23,7 @@ The following tutorial illustrates how to run the Cosmos dbt Kubernetes Operator - Airflow is run locally, and it triggers a K8s Pod which runs dbt Requirements -++++++++++++ +~~~~~~~~~~~~ To test the DbtKubernetesOperators locally, we encourage you to install the following: @@ -28,13 +42,13 @@ Additional KubernetesPodOperator parameters can be added to the ``operator_args` For instance, -.. literalinclude:: ../../dev/dags/jaffle_shop_kubernetes.py +.. 
literalinclude:: ../../../../dev/dags/jaffle_shop_kubernetes.py :language: python :start-after: [START kubernetes_tg_example] :end-before: [END kubernetes_tg_example] Step-by-step instructions -+++++++++++++++++++++++++ +~~~~~~~~~~~~~~~~~~~~~~~~~~ Using installed `Kind `_, you can setup a local kubernetes cluster @@ -153,7 +167,7 @@ Enable and trigger a run of the `jaffle_shop_k8s `__ to address this) - Does not handle installing dbt deps for users (there is an `open ticket #679 `__ to address this) - Does not support `ProfileMapping `_ (there is an `open ticket #749 `__ to address this) -- Does not support `Callbacks `_ (there is an `open ticket #1575 `__ to address this) -- Does not expose Compiled SQL as a `templated field `_ -- Does not benefit from `Cosmos caching mechanisms `_ -- Does not support `generating dbt docs & uploading to an object store `_ (there is a `PR `_ to solve this for S3) +- Does not support `Callbacks `_ (there is an `open ticket #1575 `__ to address this) +- Does not expose Compiled SQL as a `templated field `_ +- Does not benefit from `Cosmos caching mechanisms `_ +- Does not support `generating dbt docs & uploading to an object store `_ (there is a `PR `_ to solve this for S3) diff --git a/docs/getting_started/watcher-kubernetes-execution-mode.rst b/docs/guides/run_dbt/container/watcher-kubernetes-execution-mode.rst similarity index 97% rename from docs/getting_started/watcher-kubernetes-execution-mode.rst rename to docs/guides/run_dbt/container/watcher-kubernetes-execution-mode.rst index 16dbbffd0a..8c8066d47e 100644 --- a/docs/getting_started/watcher-kubernetes-execution-mode.rst +++ b/docs/guides/run_dbt/container/watcher-kubernetes-execution-mode.rst @@ -1,7 +1,8 @@ .. 
_watcher-kubernetes-execution-mode: -``ExecutionMode.WATCHER_KUBERNETES``: High-Performance dbt Execution in Kubernetes -=================================================================================== + +Watcher Kubernetes execution mode (Experimental) +================================================ .. versionadded:: 1.13.0 @@ -183,7 +184,7 @@ Example DAG Below is a complete example of a DAG using ``ExecutionMode.WATCHER_KUBERNETES``: -.. literalinclude:: ../../dev/dags/jaffle_shop_watcher_kubernetes.py +.. literalinclude:: ../../../../dev/dags/jaffle_shop_watcher_kubernetes.py :language: python ------------------------------------------------------------------------------- diff --git a/docs/guides/run_dbt/customization/index.rst b/docs/guides/run_dbt/customization/index.rst new file mode 100644 index 0000000000..44021154dc --- /dev/null +++ b/docs/guides/run_dbt/customization/index.rst @@ -0,0 +1,9 @@ +Additional Customization +======================== + +.. toctree:: + :maxdepth: 1 + :caption: Additional Customization + + operator-args + scheduling diff --git a/docs/configuration/operator-args.rst b/docs/guides/run_dbt/customization/operator-args.rst similarity index 100% rename from docs/configuration/operator-args.rst rename to docs/guides/run_dbt/customization/operator-args.rst diff --git a/docs/configuration/scheduling.rst b/docs/guides/run_dbt/customization/scheduling.rst similarity index 99% rename from docs/configuration/scheduling.rst rename to docs/guides/run_dbt/customization/scheduling.rst index 2d4e729c5b..0040135d37 100644 --- a/docs/configuration/scheduling.rst +++ b/docs/guides/run_dbt/customization/scheduling.rst @@ -77,7 +77,7 @@ This example DAG: .. The following renders in Sphinx but not Github: -.. literalinclude:: ../../dev/dags/basic_cosmos_dag.py +.. 
literalinclude:: ../../../../dev/dags/basic_cosmos_dag.py :language: python :start-after: [START local_example] :end-before: [END local_example] diff --git a/docs/getting_started/operators.rst b/docs/guides/run_dbt/operators/operators.rst similarity index 88% rename from docs/getting_started/operators.rst rename to docs/guides/run_dbt/operators/operators.rst index 9f6658b6b1..448e037e77 100644 --- a/docs/getting_started/operators.rst +++ b/docs/guides/run_dbt/operators/operators.rst @@ -18,7 +18,7 @@ The ``DbtCloneLocalOperator`` implement `dbt clone = 1.5 and cosmos >= 1.6.0. @@ -70,7 +70,7 @@ The ``on_warning_callback`` is a callback parameter available on the ``DbtSource Example: -.. literalinclude:: ../../dev/dags/example_source_rendering.py/ +.. literalinclude:: ../../../dev/dags/example_source_rendering.py/ :language: python :start-after: [START cosmos_source_node_example] :end-before: [END cosmos_source_node_example] diff --git a/docs/configuration/parsing-methods.rst b/docs/guides/translate_dbt_to_airflow/parsing-methods.rst similarity index 96% rename from docs/configuration/parsing-methods.rst rename to docs/guides/translate_dbt_to_airflow/parsing-methods.rst index 9eb654d04f..567fc4c137 100644 --- a/docs/configuration/parsing-methods.rst +++ b/docs/guides/translate_dbt_to_airflow/parsing-methods.rst @@ -56,7 +56,7 @@ Examples of how to supply ``manifest.json`` using ``manifest_path`` argument: - Local path: -.. literalinclude:: ../../dev/dags/cosmos_manifest_example.py +.. literalinclude:: ../../../dev/dags/cosmos_manifest_example.py :language: python :start-after: [START local_example] :end-before: [END local_example] @@ -66,7 +66,7 @@ Examples of how to supply ``manifest.json`` using ``manifest_path`` argument: Ensure that you have the required dependencies installed to use the S3 URL. You can install the required dependencies using the following command: ``pip install "astronomer-cosmos[amazon]"`` -.. 
literalinclude:: ../../dev/dags/cosmos_manifest_example.py +.. literalinclude:: ../../../dev/dags/cosmos_manifest_example.py :language: python :start-after: [START aws_s3_example] :end-before: [END aws_s3_example] @@ -76,7 +76,7 @@ using the following command: ``pip install "astronomer-cosmos[amazon]"`` Ensure that you have the required dependencies installed to use the GCS URL. You can install the required dependencies using the following command: ``pip install "astronomer-cosmos[google]"`` -.. literalinclude:: ../../dev/dags/cosmos_manifest_example.py +.. literalinclude:: ../../../dev/dags/cosmos_manifest_example.py :language: python :start-after: [START gcp_gs_example] :end-before: [END gcp_gs_example] @@ -86,7 +86,7 @@ using the following command: ``pip install "astronomer-cosmos[google]"`` Ensure that you have the required dependencies installed to use the Azure blob URL. You can install the required dependencies using the following command: ``pip install "astronomer-cosmos[microsoft]"`` -.. literalinclude:: ../../dev/dags/cosmos_manifest_example.py +.. literalinclude:: ../../../dev/dags/cosmos_manifest_example.py :language: python :start-after: [START azure_abfs_example] :end-before: [END azure_abfs_example] diff --git a/docs/configuration/render-config.rst b/docs/guides/translate_dbt_to_airflow/render-config.rst similarity index 99% rename from docs/configuration/render-config.rst rename to docs/guides/translate_dbt_to_airflow/render-config.rst index f153d3c3d1..425e106124 100644 --- a/docs/configuration/render-config.rst +++ b/docs/guides/translate_dbt_to_airflow/render-config.rst @@ -63,7 +63,7 @@ Your pipeline may even have specific node types not part of the standard dbt def The following example illustrates how it is possible to tell Cosmos how to convert two different types of nodes (``source`` and ``exposure``) into Airflow: -.. literalinclude:: ../../dev/dags/example_cosmos_sources.py +.. 
literalinclude:: ../../../dev/dags/example_cosmos_sources.py :language: python :start-after: [START custom_dbt_nodes] :end-before: [END custom_dbt_nodes] diff --git a/docs/index.rst b/docs/index.rst index beee4f40bb..2a4d00de6f 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -2,13 +2,15 @@ .. toctree:: :hidden: - :maxdepth: 2 + :maxdepth: 0 :caption: Contents: Home Getting Started - Configuration + Guides + Optimize Performance Profiles + Reference Contributing Airflow 3 compatibility Compatibility Policy @@ -110,7 +112,7 @@ for managing and scaling your data workflows. Getting Started with Airflow Async Execution Mode ------------------------------------------------- -See our :doc:`Getting Started with Airflow Async Execution Mode ` for details. +See our :doc:`Getting Started with Airflow Async Execution Mode ` for details. Airflow 3 compatibility diff --git a/docs/configuration/caching.rst b/docs/optimize_performance/caching.rst similarity index 100% rename from docs/configuration/caching.rst rename to docs/optimize_performance/caching.rst diff --git a/docs/optimize_performance/index.rst b/docs/optimize_performance/index.rst new file mode 100644 index 0000000000..89f28cc168 --- /dev/null +++ b/docs/optimize_performance/index.rst @@ -0,0 +1,14 @@ +.. _optimize-performance: + +Optimize your Cosmos Performance +================================ + +.. toctree:: + :maxdepth: 1 + :caption: Optimize Performance + + partial-parsing + memory_optimization + selecting-excluding + invocation_modes + caching diff --git a/docs/optimize_performance/invocation_modes.rst b/docs/optimize_performance/invocation_modes.rst new file mode 100644 index 0000000000..97b85ce66e --- /dev/null +++ b/docs/optimize_performance/invocation_modes.rst @@ -0,0 +1,30 @@ +.. _invocation_modes: + +Invocation Modes +================ + +.. versionadded:: 1.4 + +For ``ExecutionMode.LOCAL`` and ``ExecutionMode.WATCHER`` execution mode, Cosmos supports two invocation modes for running dbt: + +1. 
``InvocationMode.SUBPROCESS``: In this mode, Cosmos runs dbt CLI commands using the Python ``subprocess`` module and parses the output to capture logs and to raise exceptions. + +2. ``InvocationMode.DBT_RUNNER``: In this mode, Cosmos uses the ``dbtRunner`` available for `dbt programmatic invocations `__ to run dbt commands. \ + To use this mode, dbt must be installed in the same local environment. This mode does not have the overhead of spawning new subprocesses or parsing the output of dbt commands and is faster than ``InvocationMode.SUBPROCESS``. \ + This mode requires dbt version 1.5.0 or higher. It is up to the user to resolve :ref:`execution-modes-local-conflicts` when using this mode. + +The invocation mode can be set in the ``ExecutionConfig`` as shown below: + +.. code-block:: python + + from cosmos import DbtDag, ExecutionConfig + from cosmos.constants import ExecutionMode, InvocationMode + + dag = DbtDag( + # ... + execution_config=ExecutionConfig( + execution_mode=ExecutionMode.LOCAL, + invocation_mode=InvocationMode.DBT_RUNNER, + ), + ) + +If the invocation mode is not set, Cosmos will attempt to use ``InvocationMode.DBT_RUNNER`` if dbt is installed in the same environment as the worker, otherwise it will fall back to ``InvocationMode.SUBPROCESS``. 
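The fallback described above can be pictured with a small standalone sketch. This is not Cosmos's actual implementation: the ``InvocationMode`` enum below is a local stand-in for ``cosmos.constants.InvocationMode``, and ``pick_invocation_mode`` and its ``module_name`` parameter are hypothetical names used only for illustration. It assumes that "dbt is installed in the same environment" can be approximated by checking whether the top-level ``dbt`` package is importable.

```python
import importlib.util
from enum import Enum
from typing import Optional


class InvocationMode(Enum):
    """Local stand-in for cosmos.constants.InvocationMode (illustrative only)."""

    SUBPROCESS = "subprocess"
    DBT_RUNNER = "dbt_runner"


def pick_invocation_mode(
    explicit: Optional[InvocationMode] = None,
    module_name: str = "dbt",
) -> InvocationMode:
    """Hypothetical helper mirroring the documented default behaviour.

    An explicitly configured mode always wins. Otherwise, prefer DBT_RUNNER
    when the given top-level module can be found in the current environment,
    and fall back to SUBPROCESS when it cannot.
    """
    if explicit is not None:
        return explicit
    # find_spec returns None for a missing top-level module without importing it.
    importable = importlib.util.find_spec(module_name) is not None
    return InvocationMode.DBT_RUNNER if importable else InvocationMode.SUBPROCESS


if __name__ == "__main__":
    # The result depends on whether dbt is installed where this script runs.
    print(pick_invocation_mode())
```

Setting ``invocation_mode`` explicitly in ``ExecutionConfig`` avoids relying on this detection, which can be useful when the scheduler and worker environments differ.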
diff --git a/docs/configuration/memory_optimization.rst b/docs/optimize_performance/memory_optimization.rst similarity index 100% rename from docs/configuration/memory_optimization.rst rename to docs/optimize_performance/memory_optimization.rst diff --git a/docs/configuration/partial-parsing.rst b/docs/optimize_performance/partial-parsing.rst similarity index 100% rename from docs/configuration/partial-parsing.rst rename to docs/optimize_performance/partial-parsing.rst diff --git a/docs/configuration/selecting-excluding.rst b/docs/optimize_performance/selecting-excluding.rst similarity index 100% rename from docs/configuration/selecting-excluding.rst rename to docs/optimize_performance/selecting-excluding.rst diff --git a/docs/configuration/cosmos-conf.rst b/docs/reference/configs/cosmos-conf.rst similarity index 98% rename from docs/configuration/cosmos-conf.rst rename to docs/reference/configs/cosmos-conf.rst index cc68c3b71f..a8928c3840 100644 --- a/docs/configuration/cosmos-conf.rst +++ b/docs/reference/configs/cosmos-conf.rst @@ -253,14 +253,14 @@ This page lists all available Airflow configurations that affect ``astronomer-co As an example, when this option is enabled, the following is an example of specifying the imports with full module paths: - .. literalinclude:: ../../dev/dags/basic_cosmos_dag_full_module_path_imports.py + .. literalinclude:: ../../../dev/dags/basic_cosmos_dag_full_module_path_imports.py :language: python :start-after: [START cosmos_explicit_imports] :end-before: [END cosmos_explicit_imports] as opposed to the following approach you might have when this option is disabled (default): - .. literalinclude:: ../../dev/dags/basic_cosmos_dag.py + .. 
literalinclude:: ../../../dev/dags/basic_cosmos_dag.py :language: python :start-after: [START cosmos_init_imports] :end-before: [END cosmos_init_imports] diff --git a/docs/configuration/execution-config.rst b/docs/reference/configs/execution-config.rst similarity index 100% rename from docs/configuration/execution-config.rst rename to docs/reference/configs/execution-config.rst diff --git a/docs/reference/configs/index.rst b/docs/reference/configs/index.rst new file mode 100644 index 0000000000..48a94c8c6b --- /dev/null +++ b/docs/reference/configs/index.rst @@ -0,0 +1,13 @@ + +Configurations +============== + +.. toctree:: + :maxdepth: 1 + :hidden: + :caption: Configurations + + cosmos-conf + execution-config + profile-config + project-config diff --git a/docs/configuration/profile-config.rst b/docs/reference/configs/profile-config.rst similarity index 100% rename from docs/configuration/profile-config.rst rename to docs/reference/configs/profile-config.rst diff --git a/docs/configuration/project-config.rst b/docs/reference/configs/project-config.rst similarity index 100% rename from docs/configuration/project-config.rst rename to docs/reference/configs/project-config.rst diff --git a/docs/reference/index.rst b/docs/reference/index.rst new file mode 100644 index 0000000000..3fe39513f7 --- /dev/null +++ b/docs/reference/index.rst @@ -0,0 +1,16 @@ +Reference +========= + +.. toctree:: + :maxdepth: 1 + :hidden: + :caption: Configurations + + configs/index + +.. 
toctree:: + :maxdepth: 1 + :hidden: + :caption: Troubleshooting + + troubleshooting/index diff --git a/docs/getting_started/execution-modes-local-conflicts.rst b/docs/reference/troubleshooting/execution-modes-local-conflicts.rst similarity index 93% rename from docs/getting_started/execution-modes-local-conflicts.rst rename to docs/reference/troubleshooting/execution-modes-local-conflicts.rst index 9fec173751..52acf85394 100644 --- a/docs/getting_started/execution-modes-local-conflicts.rst +++ b/docs/reference/troubleshooting/execution-modes-local-conflicts.rst @@ -1,5 +1,3 @@ -:orphan: - .. _execution-modes-local-conflicts: Airflow and dbt dependencies conflicts @@ -10,8 +8,8 @@ When using the `Local Execution Mode `__, users may If you find errors, we recommend users isolating the installation of dbt from the Airflow installation. With the `Local Execution Mode `__, this can be accomplished by installing dbt in a separate -Python virtualenv and setting the `ExecutionConfig.dbt_executable_path <../configuration/execution-config.html>`_ and -`RenderConfig.dbt_executable_path <../configuration/render-config.html>`_ parameters. +Python virtualenv and setting the `ExecutionConfig.dbt_executable_path <../guides/execution-config.html>`_ and +`RenderConfig.dbt_executable_path <../guides/render-config.html>`_ parameters. The page `execution modes `__ describes many other methods that support isolating dbt from Airflow. @@ -92,11 +90,11 @@ Examples of errors .. code-block:: bash -ERROR: Cannot install apache-airflow and dbt-core==1.10.0 because these package versions have conflicting dependencies. + ERROR: Cannot install apache-airflow and dbt-core==1.10.0 because these package versions have conflicting dependencies. 
-The conflict is caused by: - dbt-core 1.10.0 depends on pydantic<2 - apache-airflow-core 3.0.0 depends on pydantic>=2.11.0 + The conflict is caused by: + dbt-core 1.10.0 depends on pydantic<2 + apache-airflow-core 3.0.0 depends on pydantic>=2.11.0 diff --git a/docs/reference/troubleshooting/index.rst b/docs/reference/troubleshooting/index.rst new file mode 100644 index 0000000000..3090a90bc2 --- /dev/null +++ b/docs/reference/troubleshooting/index.rst @@ -0,0 +1,9 @@ +Troubleshooting +=============== + +.. toctree:: + :maxdepth: 1 + :hidden: + :caption: Troubleshoot Cosmos + + execution-modes-local-conflicts \ No newline at end of file