From f1a745f055a0d133c10ec38c3debb830758ad1f8 Mon Sep 17 00:00:00 2001 From: Tatiana Al-Chueyr Date: Thu, 21 Mar 2024 11:34:48 +0000 Subject: [PATCH 1/3] Improve partial parsing docs --- docs/configuration/project-config.rst | 2 +- docs/getting_started/execution-modes.rst | 3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/configuration/project-config.rst b/docs/configuration/project-config.rst index 3bf524ac82..2882ee9cc3 100644 --- a/docs/configuration/project-config.rst +++ b/docs/configuration/project-config.rst @@ -25,7 +25,7 @@ variables that should be used for rendering and execution. It takes the followin env vars is only supported when using ``RenderConfig.LoadMode.DBT_LS`` load mode. - ``partial_parse``: (new in v1.4) If True, then attempt to use the ``partial_parse.msgpack`` if it exists. This is only used for the ``LoadMode.DBT_LS`` load mode, and for the ``ExecutionMode.LOCAL`` and ``ExecutionMode.VIRTUALENV`` - execution modes. + execution modes. Due to the way that dbt `partial parsing works `_, it does not work with Cosmos profile mapping classes. To benefit from this feature, users have to set the ``profiles_yml_filepath`` argument in ``ProfileConfig``. Project Config Example ---------------------- diff --git a/docs/getting_started/execution-modes.rst b/docs/getting_started/execution-modes.rst index 8f70135722..6b611b777d 100644 --- a/docs/getting_started/execution-modes.rst +++ b/docs/getting_started/execution-modes.rst @@ -56,8 +56,7 @@ The ``local`` execution mode assumes a ``dbt`` binary is reachable within the Ai If ``dbt`` was not installed as part of the Cosmos packages, users can define a custom path to ``dbt`` by declaring the argument ``dbt_executable_path``. -By default, if Cosmos sees a ``partial_parse.msgpack`` in the target directory of the dbt project directory when using ``local`` execution, it will use this for partial parsing to speed up task execution. 
-This can be turned off by setting ``partial_parse=False`` in the ``ProjectConfig``. +By default, if Cosmos sees a ``partial_parse.msgpack`` in the target directory of the dbt project directory when using ``local`` execution, it will use this for partial parsing to speed up task execution. Due to the way that dbt `partial parsing works `_, it does not work with Cosmos profile mapping classes. To benefit from this feature, users have to set the ``profiles_yml_filepath`` argument in ``ProfileConfig``. It is possible to turned off partial parsing in Cosmos by setting ``partial_parse=False`` in the ``ProjectConfig``. When using the ``local`` execution mode, Cosmos converts Airflow Connections into a native ``dbt`` profiles file (``profiles.yml``). From 7cc53c7d54ac0916555a7efaa0dda586a2d2ee4d Mon Sep 17 00:00:00 2001 From: Tatiana Al-Chueyr Date: Wed, 24 Apr 2024 22:53:52 +0100 Subject: [PATCH 2/3] Update partial parsing docs based on the changes introduced in #800 --- docs/configuration/index.rst | 1 + docs/configuration/partial-parsing.rst | 55 ++++++++++++++++++++++++ docs/getting_started/execution-modes.rst | 5 ++- 3 files changed, 60 insertions(+), 1 deletion(-) create mode 100644 docs/configuration/partial-parsing.rst diff --git a/docs/configuration/index.rst b/docs/configuration/index.rst index 919ed9b1e5..ec69c1f528 100644 --- a/docs/configuration/index.rst +++ b/docs/configuration/index.rst @@ -20,6 +20,7 @@ Cosmos offers a number of configuration options to customize its behavior. For m Scheduling Testing Behavior Selecting & Excluding + Partial Parsing Operator Args Compiled SQL Logging diff --git a/docs/configuration/partial-parsing.rst b/docs/configuration/partial-parsing.rst new file mode 100644 index 0000000000..3b59149c06 --- /dev/null +++ b/docs/configuration/partial-parsing.rst @@ -0,0 +1,55 @@ +.. 
_partial-parsing: + +Partial parsing +=============== + +Starting in version 1.4, Cosmos tries to leverage dbt's partial parsing (``partial_parse.msgpack``) to speed up both task execution and DAG parsing (if using ``LoadMode.DBT_LS``). + +This feature is bound to `dbt partial parsing limitations `_. +As an example, ``dbt`` requires the same ``--vars``, ``--target``, ``--profile``, and the same environment variables referenced in ``profiles.yml`` (via the ``env_var()`` macro) across dbt command runs; otherwise, it reparses the project from scratch. + +Profile configuration +--------------------- + +To respect the dbt requirement of having the same profile to benefit from partial parsing, Cosmos users should either: +* If using Cosmos profile mapping (``ProfileConfig(profile_mapping=...)``), disable using mocked profile mappings by setting ``render_config=RenderConfig(enable_mock_profile=False)`` +* Declare their own ``profiles.yml`` file, via ``ProfileConfig(profiles_yml_filepath=...)`` + +If users don't follow these guidelines, Cosmos will use different profiles to parse the dbt project and to run tasks, and the user won't leverage dbt partial parsing. +Their logs will contain multiple ``INFO`` messages similar to the following, meaning that Cosmos are is not using partial parsing: + +.. code-block:: + + 13:33:16 Unable to do partial parsing because profile has changed + 13:33:16 Unable to do partial parsing because env vars used in profiles.yml have changed + +dbt vars +-------- + +If the Airflow scheduler and worker processes run on the same node, users must ensure the dbt ``--vars`` flag is the same in the ``RenderConfig`` and ``ExecutionConfig``. + +Otherwise, users may see messages similar to the following in their logs: + +.. 
code-block:: + + [2024-03-14, 17:04:57 GMT] {{subprocess.py:94}} INFO - Unable to do partial parsing because config vars, config profile, or config target have changed + + +Caching +------- + +If the dbt project ``target`` directory has a ``partial_parse.msgpack``, Cosmos will attempt to use it. + +There is a chance, however, that the file is stale or was generated in a way that is different to how Cosmos runs the dbt commands. + +Therefore, Cosmos also caches the most up-to-date ``partial_parse.msgpack`` file after running a dbt command in the `system temporary directory `_. +With this, unless there are code changes, each Airflow node should only run the dbt command with a full dbt project parse once, and benefit from partial parsing from then onwards. + +It is possible to override the directory that Cosmos uses caching with the Airflow configuration ``[cosmos][cache_dir]`` or environment variable ``AIRFLOW__COSMOS__CACHE_DIR``. + +To turn off caching, set the Airflow configuration ``[cosmos][enable_cache]`` or the environment variable ``AIRFLOW__COSMOS__ENABLE_CACHE=0``. + +Disabling +--------- + +To switch off partial parsing in Cosmos, use the argument ``partial_parse=False`` in the ``ProjectConfig``. diff --git a/docs/getting_started/execution-modes.rst b/docs/getting_started/execution-modes.rst index 6b611b777d..1765144d99 100644 --- a/docs/getting_started/execution-modes.rst +++ b/docs/getting_started/execution-modes.rst @@ -56,7 +56,10 @@ The ``local`` execution mode assumes a ``dbt`` binary is reachable within the Ai If ``dbt`` was not installed as part of the Cosmos packages, users can define a custom path to ``dbt`` by declaring the argument ``dbt_executable_path``. -By default, if Cosmos sees a ``partial_parse.msgpack`` in the target directory of the dbt project directory when using ``local`` execution, it will use this for partial parsing to speed up task execution. 
Due to the way that dbt `partial parsing works `_, it does not work with Cosmos profile mapping classes. To benefit from this feature, users have to set the ``profiles_yml_filepath`` argument in ``ProfileConfig``. It is possible to turned off partial parsing in Cosmos by setting ``partial_parse=False`` in the ``ProjectConfig``. +.. note:: + Starting in version 1.4, Cosmos tries to leverage dbt partial parsing (``partial_parse.msgpack``) to speed up task execution. + This feature is bound to `dbt partial parsing limitations `_. + Learn more: :ref:`partial-parsing`. When using the ``local`` execution mode, Cosmos converts Airflow Connections into a native ``dbt`` profiles file (``profiles.yml``). From 081beccd1f25de45529972067d1e1a7d72d6fc73 Mon Sep 17 00:00:00 2001 From: Tatiana Al-Chueyr Date: Thu, 25 Apr 2024 14:47:38 +0100 Subject: [PATCH 3/3] Address PR feedback --- docs/configuration/partial-parsing.rst | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/docs/configuration/partial-parsing.rst b/docs/configuration/partial-parsing.rst index 3b59149c06..911e828b3f 100644 --- a/docs/configuration/partial-parsing.rst +++ b/docs/configuration/partial-parsing.rst @@ -16,7 +16,7 @@ To respect the dbt requirement of having the same profile to benefit from partia * Declare their own ``profiles.yml`` file, via ``ProfileConfig(profiles_yml_filepath=...)`` If users don't follow these guidelines, Cosmos will use different profiles to parse the dbt project and to run tasks, and the user won't leverage dbt partial parsing. -Their logs will contain multiple ``INFO`` messages similar to the following, meaning that Cosmos are is not using partial parsing: +Their logs will contain multiple ``INFO`` messages similar to the following, meaning that Cosmos is not using partial parsing: .. 
code-block:: @@ -45,9 +45,22 @@ There is a chance, however, that the file is stale or was generated in a way tha Therefore, Cosmos also caches the most up-to-date ``partial_parse.msgpack`` file after running a dbt command in the `system temporary directory `_. With this, unless there are code changes, each Airflow node should only run the dbt command with a full dbt project parse once, and benefit from partial parsing from then onwards. -It is possible to override the directory that Cosmos uses caching with the Airflow configuration ``[cosmos][cache_dir]`` or environment variable ``AIRFLOW__COSMOS__CACHE_DIR``. -To turn off caching, set the Airflow configuration ``[cosmos][enable_cache]`` or the environment variable ``AIRFLOW__COSMOS__ENABLE_CACHE=0``. +Caching is enabled by default. +It is possible to disable caching or override the directory that Cosmos uses for caching with the Airflow configuration: + +.. code-block:: cfg + + [cosmos] + cache_dir = path/to/docs/here # to override default caching directory (by default, uses the system temporary directory) + enable_cache = False # to disable caching (enabled by default) + +Or with environment variables: + +.. code-block:: cfg + + AIRFLOW__COSMOS__CACHE_DIR="path/to/docs/here" # to override default caching directory (by default, uses the system temporary directory) + AIRFLOW__COSMOS__ENABLE_CACHE="False" # to disable caching (enabled by default) Disabling ---------
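The caching defaults documented in the hunk above (caching enabled unless switched off, system temporary directory unless overridden) can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not Cosmos's implementation: ``resolve_cache_settings``, ``CacheSettings`` and the set of values treated as "off" are hypothetical names and assumptions introduced here. Only the two environment variable names come from the patch.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, NamedTuple


class CacheSettings(NamedTuple):
    """Hypothetical container for the two documented cache options."""

    enabled: bool
    cache_dir: Path


# Strings treated as "caching off". This set is an assumption of the sketch;
# the patch itself only shows "0" and "False" as examples.
_FALSY = {"0", "false", "f", "no", "off"}


def resolve_cache_settings(env: Mapping[str, str] = os.environ) -> CacheSettings:
    """Resolve cache settings from environment variables (illustrative only)."""
    # Caching is enabled by default, so a missing variable means "on".
    raw_enabled = env.get("AIRFLOW__COSMOS__ENABLE_CACHE", "True")
    enabled = raw_enabled.strip().lower() not in _FALSY

    # Fall back to the system temporary directory when no override is set.
    raw_dir = env.get("AIRFLOW__COSMOS__CACHE_DIR")
    cache_dir = Path(raw_dir) if raw_dir else Path(tempfile.gettempdir())

    return CacheSettings(enabled=enabled, cache_dir=cache_dir)
```

Passing the environment as a mapping keeps the sketch easy to exercise without touching the real process environment, for example ``resolve_cache_settings({"AIRFLOW__COSMOS__ENABLE_CACHE": "False"})``.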