Dc jc/update mesh guide #5600

Merged
merged 20 commits into from
Jun 17, 2024
Commits
8956744
first pass at who is mesh for
dave-connors-3 May 28, 2024
e5ba69f
updates
dave-connors-3 Jun 4, 2024
f0fb522
Merge branch 'current' into dc-jc/update-mesh-guide
dave-connors-3 Jun 4, 2024
5c94883
update links
dave-connors-3 Jun 4, 2024
592fa87
Merge branch 'dc-jc/update-mesh-guide' of github.com:dbt-labs/docs.ge…
dave-connors-3 Jun 4, 2024
26fc479
Apply suggestions from code review
matthewshaver Jun 6, 2024
7ce23a6
Update website/docs/best-practices/how-we-mesh/mesh-2-who-is-dbt-mesh…
matthewshaver Jun 6, 2024
ada383d
Update mesh-2-who-is-dbt-mesh-for.md
matthewshaver Jun 6, 2024
7d17500
Apply suggestions from code review
matthewshaver Jun 6, 2024
4d4e5f3
Apply suggestions from code review
matthewshaver Jun 6, 2024
bf9830d
Apply suggestions from code review
matthewshaver Jun 6, 2024
0bdd4ce
Apply suggestions from code review
matthewshaver Jun 6, 2024
81edb57
Merge branch 'current' into dc-jc/update-mesh-guide
matthewshaver Jun 6, 2024
204a82f
Update website/docs/best-practices/how-we-mesh/mesh-4-implementation.md
matthewshaver Jun 6, 2024
5eca926
Update website/docs/best-practices/how-we-mesh/mesh-2-who-is-dbt-mesh…
dave-connors-3 Jun 10, 2024
bbf3176
updates
dave-connors-3 Jun 10, 2024
e6374bb
typo
dave-connors-3 Jun 10, 2024
fc657f5
Update website/docs/best-practices/how-we-mesh/mesh-2-who-is-dbt-mesh…
jtcohen6 Jun 12, 2024
8ab039d
Merge branch 'current' into dc-jc/update-mesh-guide
mirnawong1 Jun 13, 2024
12b6ef1
Merge branch 'current' into dc-jc/update-mesh-guide
dave-connors-3 Jun 17, 2024
5 changes: 3 additions & 2 deletions website/docs/best-practices/how-we-mesh/mesh-1-intro.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,19 +22,20 @@ This guide will walk you through the concepts and implementation details needed
- **[Model Versions](/docs/collaborate/govern/model-versions)** - when coordinating across projects and teams, we recommend treating your data models as stable APIs. Model versioning is the mechanism to allow graceful adoption and deprecation of models as they evolve.
- **[Model Contracts](/docs/collaborate/govern/model-contracts)** - data contracts set explicit expectations on the shape of the data to ensure data changes upstream of dbt or within a project's logic don't break downstream consumers' data products.

## Who is dbt Mesh for?
## When is the right time to use dbt Mesh?

The multi-project architecture helps organizations with mature, complex transformation workflows in dbt increase the flexibility and performance of their dbt projects. If you're already using dbt and your project has started to experience any of the following, you're likely ready to start exploring this paradigm:

- The **number of models** in your project is degrading performance and slowing down development.
- Teams have developed **separate workflows** and need to decouple development from each other.
- Teams are experiencing **communication challenges** and the reliability of some of your data products has started to deteriorate.
- **Security and governance** requirements are increasing and would benefit from increased isolation.

dbt Cloud is designed to coordinate the features above and reduce the complexity of solving these problems.

If you're just starting your dbt journey, don't worry about building a multi-project architecture right away. You can _incrementally_ adopt the features in this guide as you scale. Each feature also works effectively as an independent tool. Familiarizing yourself with the tooling and features that make up a multi-project architecture, and how they can apply to your organization, will help you make better decisions as you grow.

For additional information, refer to the [dbt Mesh FAQs](/best-practices/how-we-mesh/mesh-4-faqs).
For additional information, refer to the [dbt Mesh FAQs](/best-practices/how-we-mesh/mesh-5-faqs).

## Learning goals

@@ -0,0 +1,52 @@
---
title: Who is dbt Mesh for?
description: Understanding if dbt Mesh is the right fit for your team
hoverSnippet: Learn how to get started with dbt Mesh
---

Before embarking on a dbt Mesh implementation, it's important to understand if dbt Mesh is the right fit for your team. Below, we outline three common organizational structures to help you identify whether dbt Mesh might fit your organization's needs.

### 1. The Enterprise Data Mesh

Some data teams operate on a global scale. By definition, the team needs to manage, deploy, and distribute data products across an incredibly large number of teams. Central IT may own some data products, or simply own the platform upon which data products are built. Often, these organizations have “Architects” who can advise line-of-business teams on their work while keeping track of what’s happening globally (in terms of both tooling and the substance of the work). This is a lot like how software organizations work beyond a certain scale.

This is a true data mesh: many teams publish many models for each other's consumption. The headcount ratio here is roughly ≥10:1. For each member of the central platform team, there might be dozens of members of domain-aligned data teams.

**Is dbt Mesh a good fit? If so, why?** Absolutely! There simply is no other way to share data products at scale! One dbt project would not keep up with the global demands of an organization like this.

**Tips and tricks**

- **Managing shared macros**: Teams operating at this scale will benefit from a separate repository containing a dbt package of reusable utility macros that all other projects install. This is different from public models, which provide data-as-a-service (a set of “API endpoints”); the macro package is distributed as a **library**. It can also standardize imports of other third-party packages and provide wrappers or shims for their macros. This package should have a dedicated team of maintainers, probably the central platform team or a set of “superusers” from domain-aligned data modeling teams.

**Adoption challenges**

- Onboarding hundreds of people and dozens of projects is full of friction! The challenges of a scaled, global organization are not to be underestimated.
- Bi-directional project dependencies. If projects are aligned to domain teams, they need the ability to have “chatty” APIs; otherwise, they need to split projects beyond the 1:1 mapping with team boundaries. *Stay tuned for more on this in the future.*

If your organization sounds like this, it's almost certain that dbt Mesh is the architecture you should pursue. ✅

### 2. Hub and Spoke

Some slightly smaller organizations still operate with a central data team serving several business-aligned analytics teams, in a ~5:1 headcount ratio. These central teams look less like a central IT function, and more like a modern data platform team of analytics engineers. This team provides the majority of the data products to the rest of the org, as well as the infrastructure for downstream analytics teams to spin up their own spoke projects, to ensure the quality and maintenance of the core platform.

**Is dbt Mesh a good fit? If so, why?** Almost certainly! If your central data team starts to bottleneck analysts’ work, you need a way for those teams to operate relatively independently while still ensuring the quality of the most used data products. dbt Mesh is designed to solve this exact problem.

**Tips and tricks**

- **Data products by some, for all.** The Spoke teams shouldn’t produce public models. By contrast, development in the Hub project should be slower and more careful, and should focus on producing foundational public models that are shared across domains. We’d recommend giving Hub team members access (at least read-only) to downstream projects, which will help with more granular impact analysis within dbt Explorer. If a public model, or a specific column in that model, isn’t being used in any downstream project, the Hub team can feel better about removing it, but they should still use dbt governance features like `deprecation_date` and `version` as appropriate to set expectations.
- **Sources:** Spokes should be allowed/encouraged to define and use ***domain-specific*** data sources. The platform team should not need to worry about, say, `Thinkific` data when building out core data marts, but the Training project may need to! **No two sources anywhere in a dbt mesh should point to the same relation object.** If a spoke feels like they need to use a source that the hub already uses, the interfaces should change so that the spoke can get what they need from the platform project.
- **Project quality.** More analyst-focused teams will have different skill levels and quality bars. Owning their data means they own the consequences as well. Rather than being accountable for the end-to-end delivery of data assets, the Hub team is an enablement team: their role is to provide guardrails and quality checks, but not to fix all the issues exactly to their liking (and thereby remain a bottleneck).
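
The rule above that no two sources in a mesh should point to the same relation object can be checked mechanically. The following is a minimal sketch, not a dbt feature: it assumes you have already collected each project's sources into a dictionary of `(database, schema, identifier)` tuples. The function name `find_shared_relations`, the inventory format, and the case-insensitive comparison are all assumptions made for illustration; some warehouses treat identifiers as case-sensitive.

```python
from collections import defaultdict


def find_shared_relations(sources_by_project):
    """Map each relation declared as a source in more than one project
    to the sorted list of projects that declare it.

    `sources_by_project` maps a project name to an iterable of
    (database, schema, identifier) tuples. This inventory format is
    hypothetical, not something dbt produces directly.
    """
    owners = defaultdict(set)
    for project, relations in sources_by_project.items():
        for database, schema, identifier in relations:
            # Assumption: compare names case-insensitively.
            key = (database.lower(), schema.lower(), identifier.lower())
            owners[key].add(project)
    # Keep only relations that more than one project declares.
    return {rel: sorted(projs) for rel, projs in owners.items() if len(projs) > 1}


# Hypothetical inventory: the Training spoke re-declares a hub source.
inventory = {
    "platform": [("raw", "shop", "orders"), ("raw", "shop", "customers")],
    "training": [("raw", "thinkific", "courses"), ("RAW", "shop", "orders")],
}
print(find_shared_relations(inventory))
```

Any relation this reports is a candidate for the interface change described above: the spoke would consume a public model from the platform project instead of declaring its own source.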

**What are the adoption challenges?**

There are trade-offs to using this architecture, especially for the Hub team managing and maintaining public models. There is intentional friction in this workflow to reduce the chances of unintentional model changes that break unspoken data contracts. These assurances may come with some sacrifices, such as faster onboarding or more flexible development workflows.

If your organization sounds like this, it's very likely that dbt Mesh is a good fit for you. ✅

### 3. Single Team Monolith

Other organizations operate on an even smaller scale. If your data org is a single small team that controls the end-to-end process of building and maintaining all data products at the organization, dbt Mesh may not be required. The complexity in projects comes from having a wide variety of data sources and stakeholders, but given the team's size, operating on a single codebase may be the most efficient way to manage data products. Generally, if a team of this size and scope is looking to implement dbt Mesh, it's likely that they are looking for better interface design and/or performance improvements for certain parts of their dbt DAG, and not because they necessarily have an organizational pain point to solve.

**Is dbt Mesh a good fit? If so, why?** Maybe! There may be reason to separate out parts of a large monolithic project into several to better orchestrate and manage the models. However, if the same people are managing each project, they may find that the overhead of managing multiple projects is not worth the benefits.

If this sounds like your organization, it's worth considering whether dbt Mesh is a good fit for you.
Expand Up @@ -20,7 +20,11 @@ At a high level, you’ll need to decide:

### Cycle detection

Like resource dependencies, project dependencies are acyclic, meaning they only move in one direction. This prevents `ref` cycles (or loops), which lead to issues with your data workflows. For example, if project B depends on project A, a new model in project A could not import and use a public model from project B. Refer to [Project dependencies](/docs/collaborate/govern/project-dependencies#how-to-write-cross-project-ref) for more information.
Like resource dependencies, project dependencies are acyclic, meaning they only move in one direction. This prevents `ref` cycles (or loops), which lead to issues with your data workflows. For example, if project B depends on project A, a new model in project A could not import and use a public model from project B. Refer to [Project dependencies](/docs/collaborate/govern/project-dependencies#how-to-write-cross-project-ref) for more information.

:::note
This requirement is likely to change in the future, so stay tuned for updates!
Collaborator
🚀

:::
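
To make the acyclic requirement concrete, here is a minimal sketch of a cycle check over project-level dependencies. It is not how dbt implements its own check; it uses Python's standard-library `graphlib` module, and the project names and the `find_project_cycle` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def find_project_cycle(dependencies):
    """Return the projects forming a cycle, or None if the graph is acyclic.

    `dependencies` maps each project name to the set of upstream projects
    it depends on (a hypothetical structure, for illustration only).
    """
    try:
        TopologicalSorter(dependencies).prepare()
    except CycleError as err:
        # graphlib reports the detected cycle as the second element of args.
        return err.args[1]
    return None


# Project B depends on project A: allowed.
acyclic = {"project_a": set(), "project_b": {"project_a"}}
# Project A now also refs a public model from project B: a cycle.
cyclic = {"project_a": {"project_b"}, "project_b": {"project_a"}}

print(find_project_cycle(acyclic))
print(find_project_cycle(cyclic))
```

The first graph has no cycle, so the helper returns `None`. The second mirrors the example above, where a model in project A tries to use a public model from project B, and the helper returns the list of projects involved.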

## Define your project interfaces by splitting your DAG

Expand All @@ -34,6 +38,8 @@ Vertical splits separate out layers of transformation in DAG order. Let's look a
- **Isolating earlier models for security and governance requirements** to separate out and mask PII data so that downstream consumers can't access it is a common use case for a vertical split.
- **Protecting complex or expensive data** to isolate large or complex models that are expensive to run so that they are safe from accidental selection, independently deployable, and easier to debug when they have issues.

<Lightbox src="/img/best-practices/how-we-mesh/vertical_split.png" title="A simplified dbt DAG with a dotted line representing a vertical split." />

### Horizontal splits

Horizontal splits separate your DAG based on source or domain. These splits are often based around the shape and size of the data and how it's used. Let's consider some possibilities for horizontal splitting.
Expand All @@ -42,15 +48,37 @@ Horizontal splits separate your DAG based on source or domain. These splits are
- **Data from different sources.** For example, clickstream event data and transactional ecommerce data may need to be modeled independently of each other.
- **Team workflows.** For example, if two embedded groups operate at different paces, you may want to split the projects up so they can move independently.


<Lightbox src="/img/best-practices/how-we-mesh/horizontal_split.png" title="A simplified dbt DAG with a dotted line representing a horizontal split." />

### Combining these strategies

- **These are not either/or techniques**. You should consider both types of splits, and combine them in any way that makes sense for your organization.
- **Pick one type of split and focus on that first**. If you have a hub-and-spoke team topology for example, handle breaking out the central platform project before you split the remainder into domains. Then if you need to break those domains up horizontally you can focus on that after the fact.
- **DRY applies to underlying data, not just code.** Regardless of your strategy, you should not be sourcing the same rows and columns into multiple nodes. When working within a mesh pattern it becomes increasingly important that we don't duplicate logic or data.


<Lightbox src="/img/best-practices/how-we-mesh/combined_splits.png" title="A simplified dbt DAG with two dotted lines representing both a vertical and horizontal split." />


## Determine your git strategy

A multi-project architecture can exist in a single repo (monorepo) or as multiple projects, each in its own repository (multi-repo).

- If you're a **smaller team** looking primarily to speed up and simplify development, a **monorepo** is likely the right choice, though it can become unwieldy as the number of projects, models, and contributors grows.
- If you’re a **larger team with multiple groups**, and need to decouple projects for security and enablement of different development styles and rhythms, a **multi-repo setup** is your best bet.

## Projects, splits, and teams
Collaborator
👍 good spot for this


Since the launch of dbt Mesh, the most common pattern we've seen is one where **projects are 1:1 aligned to teams**, and each project is its own codebase in its own repository. This isn’t a hard-and-fast rule: Some organizations want multiple teams working out of a monorepo. Some teams own multiple domains that feel awkward to keep combined.

Users may need to contribute models across multiple projects. This is fine! Doing so involves more friction than in a monorepo, but this is **useful** friction, especially when upstreaming a change from a “spoke” to a “hub.” This should be treated like making an API change, one that the other team will be living with for some time to come. You should be concerned if your teammates find they need to make a coordinated change across multiple projects very frequently (every week), or as a key prerequisite for ~20%+ of their work.

### Tips and tricks

On the next page, we'll go through a more in-depth example of how to split a monolithic project into multiple projects. Here are some tips to get you started as you think about applying the splitting methods above to your own projects:

1. Start by drawing a diagram of your teams doing data work. Map each team to a single dbt project. If you already have an existing monolithic project, and you’re onboarding *net-new teams,* this could be as simple as declaring the existing project as your “Hub” and creating new “Spoke” sandbox projects for each of those teams.
2. **Split off common foundations** when you know that multiple downstream teams will require the same data source. Those could be upstreamed into a centralized Hub, or split off into a separate foundational project. Some splits are needed to facilitate other splits: for example, source staging models in project A that are used in both projects B and C, since project dependencies can't form cycles.
3. **Split again** to encapsulate a particular set of models (for example, for external export).
4. **Recombine** if you have “hot path” subsets of the DAG that you need to deploy with low latency because they power in-app reporting or operational analytics. It might make sense to have a different dedicated team own these data models (see principle 1), similar to how software services with significantly different performance characteristics often warrant dedicated infrastructure, architecture, and staffing.
Expand Up @@ -4,17 +4,34 @@ description: Getting started with dbt Mesh patterns
hoverSnippet: Learn how to get started with dbt Mesh
---

As mentioned before, the key decision in migrating to a multi-project architecture is understanding how your project is already being grouped, built, and deployed. We can use this information to inform our decision to split our project apart.
### Where should your mesh journey start?

Moving to a dbt Mesh represents a meaningful change in development and deployment architecture. We’ve seen migrations fail for two reasons:

1. Lack of buy-in that a dbt Mesh is the right long-term architecture
2. Lack of alignment on a well-scoped starting point

Provided you have both buy-in and alignment on a starting point, you’re ready to start your migration.

This migration should be organized and planned according to how your project(s) are already being grouped, built, and deployed. The goal is to define and formalize your organizational interfaces, and you can use this information to inform your decision about how to split your project apart.

This change is **primarily organizational**. The most important component is alignment of your teammates.

- **Examine your jobs** - which sets of models are most often built together?
- **Look at your lineage graph** - how are models connected?
- **Look at your selectors** defined in `selectors.yml` - how do people already define resource groups?
- **Talk to teams** about what sort of separation naturally exists right now.
  - Are there various domains people are focused on?
  - Are there various sizes, shapes, and sources of data that get handled separately (such as click event data)?
  - Are there people focused on separate levels of transformation, such as landing and staging data or building marts?
  - Is there a single team that is *downstream* of your current dbt project, who could more easily migrate onto dbt Mesh as a consumer?

When attempting to define your project interfaces, you should consider investigating:

- **Your jobs** - which sets of models are most often built together?
- **Your lineage graph** - how are models connected?
- **Your selectors** defined in `selectors.yml` - how do people already define resource groups?
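
One rough way to answer "which sets of models are most often built together?" is to count how often pairs of models appear in the same job. The sketch below assumes you have exported a mapping of job names to the models each job builds; the job names, model names, and the `co_built_pairs` helper are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_built_pairs(models_by_job):
    """Count how many jobs build each pair of models together."""
    pair_counts = Counter()
    for models in models_by_job.values():
        # Sort and de-duplicate so each pair has one canonical ordering.
        for pair in combinations(sorted(set(models)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical export of jobs and the models they build.
jobs = {
    "nightly_finance": ["stg_payments", "fct_revenue"],
    "hourly_finance": ["stg_payments", "fct_revenue"],
    "nightly_marketing": ["stg_clicks", "fct_sessions", "stg_payments"],
}
for pair, count in co_built_pairs(jobs).most_common(3):
    print(pair, count)
```

Pairs with high counts are candidates for living in the same project, while a model that shows up alongside many otherwise unrelated groups (here, `stg_payments`) may belong in a shared foundational project.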

Let's go through an example process of taking a monolithic project, using groups and access to define the interfaces, and then splitting it into multiple projects.

## Add groups and access
## Defining project interfaces with groups and access

Once you have a sense of some initial groupings, you can first implement **group and access permissions** within a single project.

2 changes: 1 addition & 1 deletion website/docs/guides/core-to-cloud-3.md
Expand Up @@ -135,7 +135,7 @@ Here are some tips and caveats to consider when using dbt Mesh:
- Project dependencies are uni-directional, meaning they go in one direction. This means dbt checks for cycles across projects (circular dependencies) and raises errors if any are detected. However, we are considering support to allow projects to depend on each other in both directions in the future, with dbt still checking for node-level cycles while allowing cycles at the project level.
- Everyone in the account can view public model metadata, which helps users find data products more easily. This is separate from who can access the actual data, which is controlled by permissions in the data warehouse. For use cases where even metadata about a reusable data asset is sensitive, we are [considering](https://github.com/dbt-labs/dbt-core/issues/9340) an optional extension of protected models.

Refer to the [dbt Mesh FAQs](/best-practices/how-we-mesh/mesh-4-faqs) for more questions.
Refer to the [dbt Mesh FAQs](/best-practices/how-we-mesh/mesh-5-faqs) for more questions.

## dbt Semantic Layer

7 changes: 4 additions & 3 deletions website/sidebars.js
Expand Up @@ -1133,9 +1133,10 @@ const sidebarSettings = {
id: "best-practices/how-we-mesh/mesh-1-intro",
},
items: [
"best-practices/how-we-mesh/mesh-2-structures",
"best-practices/how-we-mesh/mesh-3-implementation",
"best-practices/how-we-mesh/mesh-4-faqs",
"best-practices/how-we-mesh/mesh-2-who-is-dbt-mesh-for",
"best-practices/how-we-mesh/mesh-3-structures",
"best-practices/how-we-mesh/mesh-4-implementation",
"best-practices/how-we-mesh/mesh-5-faqs",
],
},
{