Dc jc/update mesh guide #5600

Merged
merged 20 commits into from
Jun 17, 2024
Changes from 14 commits
Commits
8956744
first pass at who is mesh for
dave-connors-3 May 28, 2024
e5ba69f
updates
dave-connors-3 Jun 4, 2024
f0fb522
Merge branch 'current' into dc-jc/update-mesh-guide
dave-connors-3 Jun 4, 2024
5c94883
update links
dave-connors-3 Jun 4, 2024
592fa87
Merge branch 'dc-jc/update-mesh-guide' of github.com:dbt-labs/docs.ge…
dave-connors-3 Jun 4, 2024
26fc479
Apply suggestions from code review
matthewshaver Jun 6, 2024
7ce23a6
Update website/docs/best-practices/how-we-mesh/mesh-2-who-is-dbt-mesh…
matthewshaver Jun 6, 2024
ada383d
Update mesh-2-who-is-dbt-mesh-for.md
matthewshaver Jun 6, 2024
7d17500
Apply suggestions from code review
matthewshaver Jun 6, 2024
4d4e5f3
Apply suggestions from code review
matthewshaver Jun 6, 2024
bf9830d
Apply suggestions from code review
matthewshaver Jun 6, 2024
0bdd4ce
Apply suggestions from code review
matthewshaver Jun 6, 2024
81edb57
Merge branch 'current' into dc-jc/update-mesh-guide
matthewshaver Jun 6, 2024
204a82f
Update website/docs/best-practices/how-we-mesh/mesh-4-implementation.md
matthewshaver Jun 6, 2024
5eca926
Update website/docs/best-practices/how-we-mesh/mesh-2-who-is-dbt-mesh…
dave-connors-3 Jun 10, 2024
bbf3176
updates
dave-connors-3 Jun 10, 2024
e6374bb
typo
dave-connors-3 Jun 10, 2024
fc657f5
Update website/docs/best-practices/how-we-mesh/mesh-2-who-is-dbt-mesh…
jtcohen6 Jun 12, 2024
8ab039d
Merge branch 'current' into dc-jc/update-mesh-guide
mirnawong1 Jun 13, 2024
12b6ef1
Merge branch 'current' into dc-jc/update-mesh-guide
dave-connors-3 Jun 17, 2024
5 changes: 3 additions & 2 deletions website/docs/best-practices/how-we-mesh/mesh-1-intro.md
@@ -22,19 +22,20 @@ This guide will walk you through the concepts and implementation details needed
- **[Model Versions](/docs/collaborate/govern/model-versions)** - when coordinating across projects and teams, we recommend treating your data models as stable APIs. Model versioning is the mechanism to allow graceful adoption and deprecation of models as they evolve.
- **[Model Contracts](/docs/collaborate/govern/model-contracts)** - data contracts set explicit expectations on the shape of the data to ensure data changes upstream of dbt or within a project's logic don't break downstream consumers' data products.

## Who is dbt Mesh for?
## When is the right time to use dbt Mesh?

The multi-project architecture helps organizations with mature, complex transformation workflows in dbt increase the flexibility and performance of their dbt projects. If you're already using dbt and your project has started to experience any of the following, you're likely ready to start exploring this paradigm:

- The **number of models** in your project is degrading performance and slowing down development.
- Teams have developed **separate workflows** and need to decouple development from each other.
- Teams are experiencing **communication challenges**, and the reliability of some of your data products has started to deteriorate.
- **Security and governance** requirements are increasing and would benefit from increased isolation.

dbt Cloud is designed to coordinate the features above and reduce the complexity of solving these problems.

If you're just starting your dbt journey, don't worry about building a multi-project architecture right away. You can _incrementally_ adopt the features in this guide as you scale. The features in this collection work effectively as independent tools. Familiarizing yourself with the tooling and features that make up a multi-project architecture, and how they can apply to your organization, will help you make better decisions as you grow.

For additional information, refer to the [dbt Mesh FAQs](/best-practices/how-we-mesh/mesh-4-faqs).
For additional information, refer to the [dbt Mesh FAQs](/best-practices/how-we-mesh/mesh-5-faqs).

## Learning goals

@@ -0,0 +1,52 @@
---
title: Who is dbt Mesh for?
description: Understanding if dbt Mesh is the right fit for your team
hoverSnippet: Learn how to get started with dbt Mesh
---

Before embarking on a dbt Mesh implementation, it's important to understand whether dbt Mesh is the right fit for your team. Here, we outline three common organizational structures to help you identify whether dbt Mesh fits your organization's needs.

## The enterprise data mesh

Some data teams operate on a global scale. By definition, the team needs to manage, deploy, and distribute data products across a large number of teams. Central IT may own some data products or simply own the platform upon which data products are built. Often, these organizations have “architects” who can advise line-of-business teams on their work while keeping track of what’s happening globally (regarding tooling and the substance of work). This is a lot like how software organizations work beyond a certain scale.

This is a true data mesh where many teams publish models for each other's consumption. The headcount ratio of domain-aligned data team members to central platform team members is roughly 10:1 or higher: for each member of the central platform team, there might be dozens of members of domain-aligned data teams.

Is dbt Mesh a good fit in this scenario? Absolutely! There is no other way to share data products at scale. One dbt project would not keep up with the global demands of an organization like this.

### Tips and tricks

- **Managing shared macros**: Teams operating at this scale will benefit from a separate repository containing a dbt package of reusable utility macros that all other projects will install. This is different from public models, which provide data-as-a-service (a set of “API endpoints”) — this is distributed as a **library**. This package can also standardize imports of other third-party packages, as well as providing wrappers / shims for those macros. This package should have a dedicated team of maintainers — probably the central platform team, or a set of “superusers” from domain-aligned data modeling teams.

### Adoption challenges

- Onboarding hundreds of people and dozens of projects is full of friction! The challenges of a scaled, global organization are not to be underestimated.
- Bi-directional project dependencies. If projects are aligned to domain teams, they need the ability to have “chatty” APIs; otherwise, they need to split projects beyond the 1:1 mapping with team boundaries. More information about this will be provided in the near future.

If this sounds like your organization, dbt Mesh is the architecture you should pursue. ✅

## Hub and spoke

Some slightly smaller organizations still operate with a central data team serving several business-aligned analytics teams in a ~5:1 headcount ratio. These central teams look less like an IT function and more like a modern data platform team of analytics engineers. This team provides the majority of the data products to the rest of the org, as well as the infrastructure for downstream analytics teams to spin up their own spoke projects to ensure quality and maintenance of the core platform.

Is dbt Mesh a good fit in this scenario? Almost certainly! If your central data team starts to bottleneck analysts’ work, you need a way for those teams to operate relatively independently while still ensuring the quality of the most used data products. dbt Mesh is designed to solve this exact problem.

### Tips and tricks

- **Data products by some, for all:** The spoke teams shouldn’t produce public models. By contrast, development in the hub team project should be slower, more careful, and focus on producing foundational public models shared across domains. We’d recommend giving hub team members access (at least read-only) to downstream projects, which will help with more granular impact analysis within dbt Explorer. If a public model, or a specific column in that model, isn’t used in any downstream project, the hub team can feel better about removing it. However, they should still utilize the dbt governance features like `deprecation_date` and `version` as appropriate to set expectations.
- **Sources:** Spokes should be allowed/encouraged to define and use _domain-specific_ data sources. The platform team should not need to worry about, say, `Thinkific` data when building core data marts, but the Training project may need to. _No two sources anywhere in a dbt mesh should point to the same relation object._ If a spoke feels like they need to use a source the hub already uses, the interfaces should change so that the spoke can get what they need from the platform project.
- **Project quality:** More analyst-focused teams will have different skill levels & quality bars. Owning their data means they own the consequences as well. Rather than being accountable for the end-to-end delivery of data assets, the Hub team is an enablement team: their role is to provide guardrails and quality checks, but not to fix all the issues exactly to their liking (and thereby remain a bottleneck).
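
The "no two sources point to the same relation" rule above lends itself to an automated check. The following is a minimal sketch rather than a dbt feature: it assumes you have already collected each project's declared sources as fully qualified relation names (for example, from each project's manifest), and the project and relation names are hypothetical. Comparing names case-insensitively is also an assumption that may not match your warehouse.

```python
from collections import defaultdict


def find_shared_sources(project_sources):
    """Map each relation declared as a source by more than one project to those projects.

    `project_sources` maps a project name to the fully qualified relations
    (database.schema.table) that the project declares as sources.
    """
    owners = defaultdict(set)
    for project, relations in project_sources.items():
        for relation in relations:
            # Assumption: relation names are compared case-insensitively.
            owners[relation.lower()].add(project)
    return {rel: sorted(projs) for rel, projs in owners.items() if len(projs) > 1}


# Hypothetical hub ("platform") and spoke ("training") projects.
project_sources = {
    "platform": ["raw.stripe.payments", "raw.salesforce.accounts"],
    "training": ["raw.thinkific.enrollments", "RAW.stripe.payments"],
}

shared = find_shared_sources(project_sources)
print(shared)
```

Any relation reported here is a candidate for the interface change described above: the spoke would consume a public model from the platform project instead of declaring the source again.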

### Adoption challenges

There are trade-offs to using this architecture, especially for the hub team managing and maintaining public models. This workflow has intentional friction to reduce the chances of unintentional model changes that break unspoken data contracts. These assurances may come at the cost of faster onboarding or more flexible development workflows.

If this sounds like your organization, it's very likely that dbt Mesh is a good fit for you. ✅

## Single team monolith

Some organizations operate on an even smaller scale. If your data org is a single small team that controls the end-to-end process of building and maintaining all data products at the organization, dbt Mesh may not be required. The complexity in projects comes from having a wide variety of data sources and stakeholders. However, given the team's size, operating on a single codebase may be the most efficient way to manage data products. Generally, if a team of this size and scope is looking to implement dbt Mesh, it's likely that they are looking for better interface design and/or performance improvements for certain parts of their dbt DAG, and not because they necessarily have an organizational pain point to solve.

_Is dbt Mesh a good fit?_ Maybe! There are reasons to separate out parts of a large monolithic project into several to better orchestrate and manage the models. However, if the same people are managing each project, they may find that the overhead of managing multiple projects is not worth the benefits.

If this sounds like your organization, it's worth considering whether dbt Mesh is a good fit for you.
@@ -20,7 +20,11 @@ At a high level, you’ll need to decide:

### Cycle detection

Like resource dependencies, project dependencies are acyclic, meaning they only move in one direction. This prevents `ref` cycles (or loops), which lead to issues with your data workflows. For example, if project B depends on project A, a new model in project A could not import and use a public model from project B. Refer to [Project dependencies](/docs/collaborate/govern/project-dependencies#how-to-write-cross-project-ref) for more information.
Like resource dependencies, project dependencies are acyclic, meaning they only move in one direction. This prevents `ref` cycles (or loops), which lead to issues with your data workflows. For example, if project B depends on project A, a new model in project A could not import and use a public model from project B. Refer to [Project dependencies](/docs/collaborate/govern/project-dependencies#how-to-write-cross-project-ref) for more information.
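
To make the acyclic requirement concrete, here is a minimal sketch of cycle detection over a project dependency graph using a depth-first search. It illustrates the idea only and is not how dbt implements the check; the function name and the `project_a` / `project_b` inputs are hypothetical.

```python
from collections import defaultdict


def find_project_cycle(dependencies):
    """Return one cycle as a list of project names, or None if the graph is acyclic.

    `dependencies` maps each project to the upstream projects it depends on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = defaultdict(int)
    path = []

    def visit(project):
        color[project] = GRAY
        path.append(project)
        for upstream in dependencies.get(project, []):
            if color[upstream] == GRAY:
                # Back edge: the cycle runs from `upstream` to the current project.
                return path[path.index(upstream):] + [upstream]
            if color[upstream] == WHITE:
                cycle = visit(upstream)
                if cycle:
                    return cycle
        color[project] = BLACK
        path.pop()
        return None

    for project in list(dependencies):
        if color[project] == WHITE:
            cycle = visit(project)
            if cycle:
                return cycle
    return None


# Project B depends on project A: allowed.
acyclic = {"project_b": ["project_a"], "project_a": []}
# Project A now also uses a public model from project B: a cycle.
cyclic = {"project_b": ["project_a"], "project_a": ["project_b"]}

print(find_project_cycle(acyclic))
print(find_project_cycle(cyclic))
```

The first call finds no cycle; the second returns the loop between the two projects, which is the situation described in the example above.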

:::note
This requirement is likely to change in the future, so stay tuned for updates!
:::

Collaborator comment: 🚀

## Define your project interfaces by splitting your DAG

@@ -34,6 +38,8 @@ Vertical splits separate out layers of transformation in DAG order. Let's look a
- **Isolating earlier models for security and governance requirements** to separate out and mask PII data so that downstream consumers can't access it is a common use case for a vertical split.
- **Protecting complex or expensive data** to isolate large or complex models that are expensive to run so that they are safe from accidental selection, independently deployable, and easier to debug when they have issues.

<Lightbox src="/img/best-practices/how-we-mesh/vertical_split.png" title="A simplified dbt DAG with a dotted line representing a vertical split." />

### Horizontal splits

Horizontal splits separate your DAG based on source or domain. These splits are often based around the shape and size of the data and how it's used. Let's consider some possibilities for horizontal splitting.
@@ -42,15 +48,37 @@ Horizontal splits separate your DAG based on source or domain. These splits are
- **Data from different sources.** For example, clickstream event data and transactional ecommerce data may need to be modeled independently of each other.
- **Team workflows.** For example, if two embedded groups operate at different paces, you may want to split the projects up so they can move independently.


<Lightbox src="/img/best-practices/how-we-mesh/horizontal_split.png" title="A simplified dbt DAG with a dotted line representing a horizontal split." />

### Combining these strategies

- **These are not either/or techniques**. You should consider both types of splits, and combine them in any way that makes sense for your organization.
- **Pick one type of split and focus on that first**. If you have a hub-and-spoke team topology for example, handle breaking out the central platform project before you split the remainder into domains. Then if you need to break those domains up horizontally you can focus on that after the fact.
- **DRY applies to underlying data, not just code.** Regardless of your strategy, you should not be sourcing the same rows and columns into multiple nodes. When working within a mesh pattern it becomes increasingly important that we don't duplicate logic or data.


<Lightbox src="/img/best-practices/how-we-mesh/combined_splits.png" title="A simplified dbt DAG with two dotted lines representing both a vertical and horizontal split." />


## Determine your git strategy

A multi-project architecture can exist in a single repo (monorepo) or as multiple projects, each in its own repository (multi-repo).

- If you're a **smaller team** looking primarily to speed up and simplify development, a **monorepo** is likely the right choice, but it can become unwieldy as the number of projects, models, and contributors grows.
- If you’re a **larger team with multiple groups**, and need to decouple projects for security and enablement of different development styles and rhythms, a **multi-repo setup** is your best bet.

## Projects, splits, and teams

Collaborator comment: 👍 good spot for this

Since the launch of dbt Mesh, the most common pattern we've seen is one where projects are 1:1 aligned to teams, and each project has its own codebase in its own repository. This isn’t a hard-and-fast rule: Some organizations want multiple teams working out of a single repo, and some teams own multiple domains that feel awkward to keep combined.

Users may need to contribute models across multiple projects and this is fine. There will be some friction doing this, versus a single repo, but this is _useful_ friction, especially if upstreaming a change from a “spoke” to a “hub.” This should be treated like making an API change, one that the other team will be living with for some time to come. You should be concerned if your teammates find they need to make a coordinated change across multiple projects very frequently (every week), or as a key prerequisite for ~20%+ of their work.
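
One rough way to watch for that warning sign is to measure how often a unit of work touches more than one project. The sketch below assumes you can list, for each change (for example, each ticket or coordinated set of pull requests), the projects it touched. The data and function name are hypothetical, and the 20% threshold simply mirrors the rule of thumb above.

```python
def cross_project_share(changes):
    """Return the fraction of changes that touch more than one project."""
    if not changes:
        return 0.0
    multi = sum(1 for projects in changes if len(set(projects)) > 1)
    return multi / len(changes)


# Hypothetical record of which projects each recent change touched.
changes = [
    ["hub"],
    ["finance"],
    ["hub", "finance"],
    ["marketing"],
    ["finance"],
]

share = cross_project_share(changes)
needs_attention = share >= 0.20  # rule-of-thumb threshold from the text above
print(f"{share:.0%} of changes were cross-project")
```

A share at or above the threshold suggests the project boundaries may not match how the team actually works.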

### Tips and tricks

The [implementation](/best-practices/how-we-mesh/mesh-4-implementation) page provides more in-depth examples of how to split a monolithic project into multiple projects. Here are some tips to get you started when considering the splitting methods listed above on your own projects:

1. Start by drawing a diagram of your teams doing data work. Map each team to a single dbt project. If you already have an existing monolithic project, and you’re onboarding _net-new teams,_ this could be as simple as declaring the existing project as your “hub” and creating new “spoke” sandbox projects for each team.
2. Split off common foundations when you know that multiple downstream teams will require the same data source. Those could be upstreamed into a centralized hub or split off into a separate foundational project. You may need some splits to facilitate other splits. For example, because project dependencies cannot form cycles, source staging models that both project B and project C use may need to live in a separate upstream project A.
3. Split again to introduce intentional friction and encapsulate a particular set of models (for example, for external export).
4. Recombine if you have “hot path” subsets of the DAG that you need to deploy with low latency because they power in-app reporting or operational analytics. It might make sense to have a different dedicated team own these data models (see principle 1), similar to how software services with significantly different performance characteristics often warrant dedicated infrastructure, architecture, and staffing.
@@ -4,17 +4,34 @@ description: Getting started with dbt Mesh patterns
hoverSnippet: Learn how to get started with dbt Mesh
---

As mentioned before, the key decision in migrating to a multi-project architecture is understanding how your project is already being grouped, built, and deployed. We can use this information to inform our decision to split our project apart.
### Where should your mesh journey start?

Moving to a dbt Mesh represents a meaningful change in development and deployment architecture. We’ve seen migrations fail for two reasons:

1. Lack of buy-in that a dbt Mesh is the right long-term architecture
2. Lack of alignment on a well-scoped starting point

Provided you're not experiencing those blockers, you’re ready to start your migration.

Collaborator comment: Let's frame this as a "pre-mortem" exercise, so it doesn't sound so negative right out of the gate. "Before any sufficiently complex software refactor or migration, it's important to ask, 'Why might this not work?' The two most common reasons we've seen..."

The migration should be organized and planned according to how your project(s) are being grouped, built, and deployed. The goal is to define and formalize your organizational interfaces and use the information to make your decision to split your project apart.

This change is _primarily_ organizational. The most important component is the alignment of your teammates.

- **Examine your jobs** - which sets of models are most often built together?
- **Look at your lineage graph** - how are models connected?
- **Look at your selectors** defined in `selectors.yml` - how do people already define resource groups?
- **Talk to teams** about what sort of separation naturally exists right now.
- Are there various domains people are focused on?
- Are there various sizes, shapes, and sources of data that get handled separately (such as click event data)?
- Are there people focused on separate levels of transformation, such as landing and staging data or building marts?
- Is there a single team that is *downstream* of your current dbt project, who could more easily migrate onto dbt Mesh as a consumer?

When attempting to define your project interfaces, you should consider investigating:

- **Your jobs:** Which sets of models are most often built together?
- **Your lineage graph:** How are models connected?
- **Your selectors (defined in `selectors.yml`):** How do people already define resource groups?
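
As a starting point for the "which models are built together" question, you could count how often pairs of models appear in the same job. This is a minimal sketch under the assumption that you have already resolved each job's selection to a list of model names; the job and model names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_build_counts(jobs):
    """Count how many jobs build each pair of models together.

    `jobs` maps a job name to the list of model names that job builds.
    """
    counts = Counter()
    for models in jobs.values():
        # Sort so each pair has one canonical ordering regardless of job order.
        for pair in combinations(sorted(set(models)), 2):
            counts[pair] += 1
    return counts


# Hypothetical jobs and the models each one builds.
jobs = {
    "nightly_finance": ["stg_payments", "fct_revenue"],
    "hourly_finance": ["fct_revenue", "stg_payments"],
    "daily_marketing": ["stg_clicks", "fct_sessions"],
}

counts = co_build_counts(jobs)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with high counts are candidates for the same group or project, while models that never co-occur may already sit on opposite sides of a natural boundary.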

Let's go through an example process of taking a monolithic project, using groups and access to define the interfaces, and then splitting it into multiple projects.

## Add groups and access
## Defining project interfaces with groups and access

Once you have a sense of some initial groupings, you can first implement **group and access permissions** within a single project.

2 changes: 1 addition & 1 deletion website/docs/guides/core-to-cloud-3.md
@@ -135,7 +135,7 @@ Here are some tips and caveats to consider when using dbt Mesh:
- Project dependencies are uni-directional, meaning they go in one direction. This means dbt checks for cycles across projects (circular dependencies) and raises errors if any are detected. However, we are considering support to allow projects to depend on each other in both directions in the future, with dbt still checking for node-level cycles while allowing cycles at the project level.
- Everyone in the account can view public model metadata, which helps users find data products more easily. This is separate from who can access the actual data, which is controlled by permissions in the data warehouse. For use cases where even metadata about a reusable data asset is sensitive, we are [considering](https://github.com/dbt-labs/dbt-core/issues/9340) an optional extension of protected models.

Refer to the [dbt Mesh FAQs](/best-practices/how-we-mesh/mesh-4-faqs) for more questions.
Refer to the [dbt Mesh FAQs](/best-practices/how-we-mesh/mesh-5-faqs) for more questions.

## dbt Semantic Layer

7 changes: 4 additions & 3 deletions website/sidebars.js
@@ -1152,9 +1152,10 @@ const sidebarSettings = {
id: "best-practices/how-we-mesh/mesh-1-intro",
},
items: [
"best-practices/how-we-mesh/mesh-2-structures",
"best-practices/how-we-mesh/mesh-3-implementation",
"best-practices/how-we-mesh/mesh-4-faqs",
"best-practices/how-we-mesh/mesh-2-who-is-dbt-mesh-for",
"best-practices/how-we-mesh/mesh-3-structures",
"best-practices/how-we-mesh/mesh-4-implementation",
"best-practices/how-we-mesh/mesh-5-faqs",
],
},
{