Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add namespacing for dbt resources #1269

Closed
drewbanin opened this issue Jan 30, 2019 · 43 comments · Fixed by #7374
Closed

Add namespacing for dbt resources #1269

drewbanin opened this issue Jan 30, 2019 · 43 comments · Fixed by #7374

Comments

@drewbanin
Copy link
Contributor

Placeholder issue for thoughts related to namespacing. When we're able to prioritize work on namespacing, let's convert this to a more actionable issue:

Questions:

  • how can we namespace models to avoid ambiguity?
  • what do package namespaces look like for models?
  • is there a way to "import" resources from a package, rather than injecting everything into the toplevel context?

Other similar questions / use cases welcomed :)

@drewbanin drewbanin added the help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors label Jan 30, 2019
@drewbanin drewbanin changed the title Namespace Resources Add namespacing for dbt resources Jan 30, 2019
@danimcgoo
Copy link

danimcgoo commented Feb 5, 2019

Hi @drewbanin, is this different / related to the comment here "be sure to namespace your models" ?

https://docs.getdbt.com/docs/configuring-models

We're looking for a clean way to separate sets of pipelines, e.g. all pipelines related to a consumer ML service, and pipelines related to analytics. This will be particularly useful as the size of our pipelines grow, the ability to run dbt run/test on a specific namespace will ensure the processing completes faster, as non-related models aren't built.

@drewbanin
Copy link
Contributor Author

hey @danimcgoo - yeah - this would be slightly different than the namespacing described in the link you shared above. The type of namespacing described here would make it possible to refer to models with more specific names. Imagine you wanted to have two models with the same name, but materialized in different schemas. Namespacing in dbt would let you do:

-- models/shopify_uk/sales.sql
select ...
-- models/shopify_fr/sales.sql
select ...

And then later:

select * from {{ ref('shopify_uk.sales') }
union all
select * from {{ ref('shopify_fr.sales') }

Here, there would be two models named sales.sql, but you'd be able to unambiguously refer to either using some sort of "namespace". TBD exactly what shape that will take in dbt, but I think that as dbt projects grow, we'll increasingly be pushed away from our current paradigm of globally unique model names, and this is one way to accomplish that.

I would say that for the moment, dbt's approach to model selection might be sufficient for your case. Have you checked out tagging yet? You can use dbt_project.yml to apply tags to whole directories of models, then these models with those tags on the CLI with --models!

Hope some of this is helpful -- lmk if there's anything else I can clarify

@danimcgoo
Copy link

Hey @drewbanin, tagging looks like it'll do the trick. thanks fo the tip!

@heisencoder
Copy link
Contributor

Hi @drewbanin

Is this "Help Wanted" still current? I'm interested in supporting this feature in our internal projects because we have 100s of sql files that have a hierarchy of subdirectories under the "models" directory to help keep things semantically organized. Also, we have several SQL files with the same name but that exist in different directories (as illustrated above with the sales.sql example).

Offhand, I think that this feature needs to be enabled with a flag somewhere to avoid getting an error during the compile phase when two SQL files have the same name (i.e., the "dbt found two resources with the name" CompilationException. One place to add this flag would be in dbt_project.yml at the top-level. Maybe like this:

dbt_project.yml:

  use_reference_namespace = true

The next trick is to thread this parameter through to the places where the unique identifiers are generated. As one example, in parser/base_sql.py BaseSqlParser.parse_sql_node, a unique_id is generated by joining the resource_type, package_name, and node name (i.e., filename base) using '.'. This unique_id is later split apart in utils.py id_matches(). As such, I don't think that we can use the '.' character as the separate the namespace. Instead, maybe it could be a '/', just like the directory separator:

select * from {{ ref('shopify_uk/sales') }}
union all
select * from {{ ref('shopify_fr/sales') }}

So, I think one solution would approximately be to update the get_path methods (in parser/schemas.py and parser/base.py) to take a node instead of name as the last parameter, and then conditionally include the directory prefix in the name based on the use_reference_namespace flag. All these directories would be relative to the directory specified in dbt_project.yml, and could include subdirectories of subdirectories (e.,g, ref('shopify_uk/sub_dir/sales')).

Thoughts? Thanks!

@heisencoder
Copy link
Contributor

(Actually, in looking closer at the id_matches() logic, there is support for using '.' in the name. For example, NodeType.Source requires that there be a single '.' in the node_name)

@drewbanin
Copy link
Contributor Author

Hey @heisencoder - before we begin any work here, I'd like to better develop our thinking around how namespacing will work in general. In your example (shopify_uk/sales), what is shopify_uk relative to? One obvious answer is the "root" models/ directory, but that doesn't extend super well to a world with packages.

dbt already has a notion of "fully qualified names" -- this looks like package_name.path.to.model_name. It's how the --models selector works, and how models are configured in the dbt_project.yml file. In dbt's selector logic, we have some code to assume that the package_name is the "root" package if one is not provided.

If namespacing is turned on, should it be allowed to ref a model without namespacing it? Or are namespaces solely intended to make it possible to handle models with shared names?

Presently, dbt will assert that no two models share the same name. I think we'll need to defer that validation until all of the refs are evaluated. If a ref can evaluate to more than one model, then we can make that a compile-time error.

I'm definitely more in favor of using dots (.) to represent namespaces compared to slashes (/) just insofar as these won't be "real" file paths.

What do you think about all of this?

@heisencoder
Copy link
Contributor

Thanks @drewbanin for your thoughts!

I haven't looked into the "fully qualified names" that you've mentioned for the --models selector, so don't know how this proposal plays into that. Maybe it can just slide in with minimal changes? The package_name is already included in the node's unique_id, so maybe they can work together?

Here's a proposed approach that hopefully fits in the framework you've outlined:

  1. Update the name computation to use the full path relative to the resources-path in which the model was found. Use '.' to separate path components. e.g. A file in 'models/dir1/dir2/model.sql' would have a name of 'dir1.dir2.model'.
  2. Update utils.find_in_list_by_name and utils.find_in_subgraph_by_name to first attempt to find an exact match, and then later default to looking only at the model name without the package prefix to match current behavior. However, If looking just at the model name, it will be necessary to find all such matches and raise an exception if two or more matches apply.
  3. Hunt down and update any code that relies on the previous naming convention.

We could do these changes without adding a namespacing mode, although adding a namespacing mode could help with forward compatibility issues with the manifest.json format.

As a proof-of-concept, here are examples of the changes I imagined to utils.find_in_list_by_name:

def find_in_list_by_name(haystack, target_name, target_package, nodetype):
    """Find an entry in the given list by name.

    target_name could either be the base of the model name or could include the
    package prefix separated with '.' characters.
    """
    base_name_matches = []
    for model in haystack:
        name = model.get('unique_id')
        if id_matches(name, target_name, target_package, nodetype, model):
            return model
        if id_matches(name, target_name.split('.')[-1], target_package, nodetype, model):
            base_name_matches.append(model)
    if len(base_name_matches >= 2):
        raise_compiler_error(
            'Ambiguous reference %s has multiple matches: %s' % target_name, base_name_matches)
    if len(base_name_matches == 1):
        return base_name_matches[0]
    return None

(And then also update id_matches to not blow up when the target_name contains '.')

Note that as a side-effect of including the full package prefix in the target_name, the compile tests in _load_schema_tests no longer fail because they're using the fully qualified name.

@drewbanin
Copy link
Contributor Author

Hey @heisencoder - I think everything you're saying here is totally reasonable. The big challenge I can imagine is that we've always held as a hard constraint that resource names are unique. As such, there may be pernicious failures in parts of the codebase which rely on these model names being unique. We'll definitely have to figure out how to adjust the unique_id attribute of nodes, as it would be very bad if those unique ids were no longer unique :)

I'm supportive of this idea, but we need to figure out:

  1. how to generate unique ids for nodes (we can probably leverage the FQN path here)
  2. understand the breadth of changes outside of the ref function context that are required:
  • i know the docs site will need to be updated
  • i don't believe we parse values out of the unique_id string anywhere, but we'd need to verify that
  • figure out things around the edges, like how schema.yml works when models can have duplicate names

This is probably going to be a pretty big project!

@heisencoder
Copy link
Contributor

Given the large number of edge cases, I think it makes sense to flag control this new code so that old code doesn't have to worry about these edge cases. So I'm thinking that this could be controlled by a use_namespaces = true option at the top-level of the dbt_project.yml file. We could theoretically allow turning this off or on at different levels of the models section of the dbt_project.yml, but I'm hoping to avoid allowing that level of configuration.

The unique_ids for the nodes would then become the FQN path instead of just the base filename. This should make them be unique in all contexts.

I can provide some suggested documentation text.

I hadn't thought about the name parameter in the schema.yml files, but here are my two thoughts:

  1. The path prefix to the model name can be implied by the path prefix of the schema.yml file. I'm guessing that in general, each directory will have its own schema.yml file that describes the models in the same directory. Since this is convention, I'm guessing that this won't always be followed.
  2. We could expand the syntax to allow the models.name field to include the dotted FQN to the model name. This would allow schema.yml files to describe models in other directories.

@preston-hf
Copy link

Why not just copy what other programming environments do (python) and make the folder the package name? As a newbie, I expected it already worked list this. I think this should be the default with an option to disable it that allows legacy users to migrate.

@preston-hf
Copy link

Also, I will add that the source selector works about like I'd want from a usage point of view. That may provide a better way to migrate as well. Any time the old ref() is called, the uniqueness for that model name is ensured, and generates and exception if there are duplicates. A new ref that uses packages (or maybe just a new arg!) would limit this check to the namespace that was specified.

@rpedela-recurly
Copy link

I love the source('my_schema', 'my_table') syntax. To echo @preston-hf, could we replicate that syntax for ref with ref('my_schema', 'my_model')? If the schema name is set in schema.yml or dbt_project.yml, then that is the model's schema/namespace. If not set, then we keep the current behavior.

@drewbanin
Copy link
Contributor Author

hey @rpedela-recurly - we definitely don't want to namespace things by schema name! Destination schemas are configuration in dbt, so the actual schema that a model renders into should be totally different in dev vs. prod.

You can actually currently supply two arguments to the ref function, eg:

{{ ref('package_name', 'model_name') }}

In practice, this isn't super useful though -- model names must currently be unique across all packages included in a project. Increasingly, my thinking is that we should:

  1. make it easier to define multiple packages inside of a single project
  2. drop the requirement that model names must be unique across packages

If two models share the same name, then you would need to qualify the model name with a package name in the ref statement. If the two models have the same database representation, then dbt could raise a compiler error.

Do you buy that?

@rpedela-recurly
Copy link

I didn't realize you could define package_name. Number 1 would suffice I think then, and at least I would make package_name = schema_name, but there would be more flexibility for others. As far as defining package/model relations, I still like setting the package for a set of models in either dbt_project.yml or schema.yml. But I also like having the ability to keep packages the way they are, as I see them being useful for large projects with many contributors. What if ref was updated to allow this:

ref('package', 'namespace_within_package', 'model')
ref('namespace_within_project', 'model')

Then namespace is defined in schema.yml or dbt_project.yml. I really like having the source/ref syntax look like SQL schema_name.table_name even if the real schema differs because of dev/prod environments.

@mjumbewu
Copy link
Contributor

mjumbewu commented Apr 9, 2020

@drewbanin I like the two notions of making packages within a project easy, and dropping the name uniqueness across packages.

This is similar to ideas that Betterment has been tossing around a lot lately. We have models across a small but slowly growing number of internal domains that all belong in the same project, but have different owners and use cases depending on the domain. We want to relax the requirement for unique model names across these domains (especially for ephemeral models).

We were thinking of suggesting custom schema names (not the actual materialized schema names, which may differ from run to run) as a namespace, but I think if we could easily create intra-project packages, that could suffice. The requirements that we've been thinking about are as follows (let's say for these requirements that namespace could either be the name of a package or of a custom schema):

  • Models can be identified either by "model_name" or ("namespace", "model_name").
  • e.g., If two models have the same name x but different namespaces A and B:
    • DBT does not fail to compile the project, as it does now
    • Using ref('x') from a model in a different namespace (not A or B) fails in compile because the reference is ambiguous
    • Using ref('A', 'x') from. a model in a different namespace than A is fine
    • Using ref('x') from a model in namespace A implies ref('A', 'x') (i.e., DBT first it checks whether x is unambiguous and, if it is not, tries to find 'A', 'x')
  • Two models named y within the same namespace, or that materialize to the same schema, do cause a compile failure (as is the current behavior)

I know everyone on our team would also be thrilled if ephemeral models within different folders of the same package could be similarly named without creating conflicts as well, but the points above are more important.

@mjumbewu
Copy link
Contributor

mjumbewu commented Apr 9, 2020

Also @drewbanin, when you say "make it easier to define multiple packages inside of a single project", what do you imagine that looks like? Something along the lines of the following structure, where DBT perhaps infers the package names of sub-projects, and maybe inherits project config from the "parent"?:

project/
|-- domain1/
|   |-- models/
|   +-- tests/
|
|-- domain2/
|   |-- models/
|   +-- tests/
|
|-- models/
|
|-- tests/
|
|-- dbt_project.yml
+-- packages.yml
# project/packages.yml

packages:
  - local: domain1
  - local: domain2
# project/dbt_project.yml

name: 'company_project'
version: '1.0'

models:
  domain1:
    schema: d1

  domain2:
    schema: d2

Maybe even allowing the packages definition to be folded into the dbt_project.yml file?

@jgysland
Copy link

jgysland commented Apr 13, 2020

I am very interested in being able to namespace objects like models and sources and macros—at least in a rudimentary fashion by removing the requirement that model names be unique even between packages—and I'd be happy to try to put together a PR if someone more familiar with the project can give me a little guidance on where to get started.

@drewbanin
Copy link
Contributor Author

hey @mjumbewu - thanks for writing all of this out!

I really like your suggestions on how we should approach model-level namespacing: it's really sensible and well-specified.

Re: making namespacing easier: I like the general idea of your suggestion:

packages:
  - local: domain1
  - local: domain2

The only issue with this approach is that you need to run dbt deps to create symlinks from dbt_modules/domain1/ to domain1/. This would be a little bit circuitous and it doesn't work incredibly well on Windows (you know, no symlinks).

Instead, I think we could do one of the following:

1. Smarter local packages
We could add a config like symlink to the packages.yml file for local: packages. If this value is false, then dbt wouldn't try to symlink the package into dbt_modules/ and would instead point to the package dir directly.

Pros:

  • Feels pretty natural
  • You could use arbitrary file paths

Cons:

  • I think this would be a big change to dbt. We definitely can read the packages.yml file and use it to inform parsing/running, but I don't think that's how things work today

2. A new dbt_project.yml config
The solution to having too many configs is to add another config. We could add a top-level dbt_project.yml config like package-paths. This config would let you enumerate paths in your project that contain dbt packages. Given the example you showed above, that would look like:

project/
|-- domain1/
|   |-- models/
|   +-- tests/
|
|-- domain2/
|   |-- models/
|   +-- tests/
|
|-- models/
|
|-- tests/
|
|-- dbt_project.yml
+-- packages.yml
# dbt_project.yml

source-paths: ['models']
package-paths: ['domain1', 'domain2']

Alternatively, we could get clever with source-paths and support something like:

source-paths:
  project: ['models']
  packages: ['domain1', 'domain2']

But I don't think I love that approach.

Pros:

  • I think this should be relatively straightforward to implement today. dbt already knows to look in dbt_modules for packages, and this would be an extension to that logic

Cons:

  • ???
  • I can't think of any real downsides to this approach

3. config-level namespaces

This was the one I had in mind when I wrote about making namespacing easier, but increasingly, I don't think this approach is such a good idea. I was picturing a models: level config which supports a namespace as a config value. That would look like:

models:
  my_project:
    path1:
      namespace: domain1
    path2:
      namespace: domain2

All of the models in models/path1 would be namespaced under the package domain1, and all of the models in models/path2 would be namespaced under the package domain2.

Pros:

  • It's pretty flexible!
  • Models/maros/seeds/snapshots/etc can live closer together, rather than in totally separate folders
    Cons:
  • I think an approach like this would make correctly parsing models pretty challenging!
  • The models: config is only intended to configure models, but we'd want to namespace other resources (say, tests/snapshots/seeds/etc) in a package-aware way too

These are the things bouncing around in my head. I think approach 2 outlined above is going to be our best bet, but I'm curious what you all think too.

@dmateusp
Copy link
Contributor

dmateusp commented Jul 8, 2020

Hi there, just wanted to continue that discussion

I'm a beginner on DBT, so I would just like to give my opinion from a "beginner" point of view.

Seeds

First I would like to point out that this would be useful for Seeds as well, consider the following simple use-case:

2 pipelines are being worked on, they target 2 different databases, A and B, this is the folder structure

data/a/calendar.csv
data/b/calendar.csv

The 2 pipelines target 2 different Databases, so their names do not clash when they run against the Warehouse. But A needs a different fiscal calendar than B.

Now, as a beginner, my first intuition would be to use ref("my_proj", "a.calendar") because of the --select argument available to dbt seed

Sub projects

I think the "sub project" abstraction is quite hard to wrap my head around:

  • Now I need to look at the config to figure out if a directory has a special meaning?
  • What if I have a name clash in a large sub-project? Aren't we back to this issue?

Proposal

(I'm aware it's a summary/mix of some other solutions above)

  1. Have the ability to call ref-able objects either by their full "module path" ref("my_proj", "a.calendar") or their name "excluding module" ref("my_proj", "calendar"), note that ref("a.calendar") or ref("calendar") should be valid if it's not ambiguous cross project

  2. In the function to resolve the object, we would look at a.calendar and see that there aren't duplicate objects returned, in case calendar is used, we could return an Exception "Ambiguous ref: use either a.calendar or b.calendar" . I'm actually not familiar with how the code deals with ambiguous objects cross packages but I would imagine part of the logic could be re-used.

  3. I'm not very experienced with the generated documentation, maybe we could display names of objects as full explicit references in case of ambiguity (perhaps by running a version of the above logic on all objects to be rendered), or just the object name otherwise. Again I'm not familiar with how ambiguous objects cross packages are rendered on the docs but it could inform implementation.

Pros:

  • Simplicity
  • No config change
  • Intuitive to beginners
  • Shielded against potential clashes of the sub-projects solution

Cons:

  • Can't assume uniqueness within a project
  • Need to introduce logic to detect ambiguous objects in modules

@bhtucker
Copy link

bhtucker commented Aug 3, 2020

@drewbanin Where does this go from here? Is this the 'DBT extension proposal'? Do you need to make a BDFL pronouncement?

Was shocked to discover that this is an outstanding issue still. I've always used multiple packages with qualified refs and I guess just got lucky that I never repeated a model name.

@hhagblom
Copy link

hhagblom commented Feb 3, 2021

@drewbanin

Hi Drew,

I was horrified when I just noticed that I could not just namespace the same model name by putting these in the different packages. When one references with {{ ref('package1', 'model1') }}and {{ ref('package2', 'model1') }} I definitely assumed that it would indeed be valid to include these in the same top-level dbt project without a clash.

Instead of discussing more advanced ways (as in bigger changes) of namespacing as above, would it be possible to at least make this approach work, it seems to do the trick (for me at least ) ? I would think this makes for less surprises when using the dbt-framework. Is there anything in the way that dbt is structured internally that makes a change like this painful?

I haven't looked into the source code in dbt yet, but I might be able to contribute with a PR if you find that this would be a suitable way forward.

And sorry for whining :) dbt is a great tool, thanks for your contributions!

@drewbanin
Copy link
Contributor Author

oh geez... I missed this one and am disappointed that I have abdicated my BDFL responsibilities. Thanks for the bump @hhagblom and for getting this back on my radar!

Yep.... it is highly weird that dbt supports a two-arg version of ref, but still requires model names to be unique across packages. I'd need to think hard about what kind of downstream effects will come out of lifting the "model names are unique" constraint that's been in place for the entirety of dbt's development, but here's an optimistic scenario:

  • we use a unique_id everywhere and the impacts of making model names non-unique across packages in minimal
  • we remove the validation that ensures that model/node names are globally unique, and instead only require uniqueness within a project
  • we keep the validation which ensures that two nodes do not point to the same database.schema.identifier in the database

@jtcohen6 curious to get your take on this one too. Will poke around here and report back with findings :)

@drewbanin
Copy link
Contributor Author

ok @hhagblom & co - let's not make a habit out of this, but I did open my first PR to dbt in a couple of months over here https://github.com/fishtown-analytics/dbt/pull/3053/files

I'm going to need to spend more time here with testing, but if any of y'all want to take this for a spin locally, feedback is very welcomed :D

@hhagblom
Copy link

hhagblom commented Feb 5, 2021

ok @hhagblom & co - let's not make a habit out of this, but I did open my first PR to dbt in a couple of months over here https://github.com/fishtown-analytics/dbt/pull/3053/files

I'm going to need to spend more time here with testing, but if any of y'all want to take this for a spin locally, feedback is very welcomed :D

@drewbanin Many thanks! What I'll do is that I'll fork the repo and will do a rebase-onto a stable release and take this for a spin in our project. I'll let you know how it works out for us!

hhagblom pushed a commit to hhagblom/dbt that referenced this issue Feb 8, 2021
@jtcohen6
Copy link
Contributor

jtcohen6 commented Jun 1, 2021

@Shadowsong27 The work in #3053 got 90% of the way to supporting this, but it ran into an obstacle around defining model properties in .yml files, since each set of properties is associated with a model name only. While I haven't had a chance to confirm, I believe some of the in-progress changes to file parsing (#3364) may incidentally resolve this by storing a fuller reference and changing the way that dbt adds properties to model nodes. So I'd like to revisit this once v0.20 is out the door.

@mvgijssel
Copy link

Sorry for the +1 🙈 but we just ran into this as well. We're moving through different layers, raw -> clean -> common, where the models are refined in each layer. Currently we have to name the models differently coming from each layer to prevent the name collision.

@louis-vines
Copy link

Is there any news on the status of this issue? This and unit testing are the two big ones for me to make dbt feel like the complete real deal :)

@jtcohen6
Copy link
Contributor

@louis-vines This is still very much top-of-mind for me. I don't have any immediate plans for development that I can share, but I am thinking that this will be an important ingredient for supporting dbt deployments at larger orgs. Probably part of a larger capability around cross-project lineage, which could exist without this capability, but would be much much more desirable with it.

@jtcohen6
Copy link
Contributor

For folks following this issue, and interested in namespacing capabilities as a crucial tool for scaling out dbt deployments: I'd be very curious to hear your thoughts on the proposal outlined in #5244

@extradosages
Copy link

extradosages commented May 24, 2022

How does anyone use dbt at scale without this? My team was hoping to switch over to dbt but without some form of namespacing I don't think we can afford the friction in adjusting our database and workflow!

@tevariou
Copy link

By prefixing every models?

@hernanparra
Copy link

How does anyone use dbt at scale without this? My team was hoping to switch over to dbt but without some form of namespacing I don't think we can afford the friction in adjusting our database and workflow!

Alternatives:

  1. If you need to repeat table names in different layers maybe you can use different projects. (Not ideal)
  2. You can control the table name using alias. You have to prefix the .sql files but you can control the table name: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/using-custom-aliases
    This can be auto generated by a macro.

@extradosages
Copy link

extradosages commented May 24, 2022

How does anyone use dbt at scale without this? My team was hoping to switch over to dbt but without some form of namespacing I don't think we can afford the friction in adjusting our database and workflow!

Alternatives:

1. If you need to repeat table names in different layers maybe you can use different projects. (Not ideal)

2. You can control the table name using alias. You have to prefix the .sql files but you can control the table name: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/using-custom-aliases
   This can be auto generated by a macro.

Oh, I rather like the second option. Thank you for the suggestion, @hernanparra.

@stevenlw-porpoise
Copy link

How does anyone use dbt at scale without this? My team was hoping to switch over to dbt but without some form of namespacing I don't think we can afford the friction in adjusting our database and workflow!

Alternatives:

  1. If you need to repeat table names in different layers maybe you can use different projects. (Not ideal)
  2. You can control the table name using alias. You have to prefix the .sql files but you can control the table name: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/using-custom-aliases
    This can be auto generated by a macro.

I'm not sure I understand point 2. I set up a sample project like

project/
|-- models
|   |-- example
|       |-- my_first_dbt_model.sql
|       |-- my_second_dbt_model.sql
|
|   |-- folder2
|       |-- my_first_dbt_model.sql

where folder2/my_first_dbt_model.sql is

-- My other first model
{{ config(alias='different_model_name') }}
select 1

This still fails because I have two models with the same name even though I wanted to alias one of the models to a different identifier.

19:28:01  Running with dbt=1.1.2
19:28:01  Encountered an error:
Compilation Error
  dbt found two models with the name "my_first_dbt_model".
  
  Since these resources have the same name, dbt will be unable to find the correct resource
  when looking for ref("my_first_dbt_model").
  
  To fix this, change the name of one of these resources:
  - model.dbt_sample.my_first_dbt_model (models/folder2/my_first_dbt_model.sql)
  - model.dbt_sample.my_first_dbt_model (models/example/my_first_dbt_model.sql)

@elyobo
Copy link

elyobo commented Sep 3, 2022

@stevenlw-porpoise I think the suggestion is to use alias in the opposite direction, e.g. call your file different_model_name.sql but use the alias to rename it back to my_first_dbt_model via the alias. It's not a great workaround, as you still need to reference different_model_name in refs which makes migration harder.

This is a bit of a hassle for us migrating some dataform pipelines over (one of the few areas that dataform is better), but we do already break things into a bunch of different pipelines (which has its own problems) which mitigates the problem somewhat.

@lawrence-vo
Copy link

Is there a fix being implemented or are we left to prefix every model? If there is a fix being implemented, is there an ETA?

Thanks!

@rja40
Copy link

rja40 commented Oct 11, 2022

@elyobo Tried that but DBT doesn't let me create it. Model name has to be unique both in custom alias and file name for same schema and database.

@RomantsovArtur
Copy link

This is one of the most significant dbt inconveniences. It would be nice to see this issue progress this year, it's been three years already.

@elyobo
Copy link

elyobo commented Oct 17, 2022

There is some promising looking work as part of the broader cross project work - #5244

@jmg-duarte
Copy link
Contributor

Maybe I am late to the party, but just like there are macros for generating a schema and a database name, why isn't there a macro to generate the model name?

I don't know how complicated it would be to add this feature, but it sure seems like overall, it would be the simplest approach. The previous behavior would be kept as default, not breaking anything and people that wanted to, can customize it to their will.

Currently, my team has a structure similar to what others have shown:

main_folder/
  data_source_X/
    data_source_X_model_1.sql
    ...
  data_source_Y/
    data_source_Y_model_1.sql
    ...

And we would like to have:

main_folder/
  data_source_X/
    model_1.sql
    ...
  data_source_Y/
    model_1.sql
    ...

But have a way to make the second approach be the same from dbt's "point of view".

We thought about something like (very rough sketch):

{% macro generate_model_name %}
  model["name"] = model["unique_id"].split(".")[-2:]
{% endmacro %}

@elyobo
Copy link

elyobo commented Oct 28, 2022

There is one, generate_alias_name, but the model make doesn't change the underlying database table does - model names still need to be unique.

Annoying but I'm sort of used to it now, the unique model makes are easier to find in editors and fits nicely with the "how we structure our projects" guide on the dbt site, and if I really want my db to reflect different names then it's doable.

I'm keen on seeing this fixed properly but hasn't ended up being a blocker on us jumping over from dataform - lack of namespace support was a problem but aliases worked well enough.

@jtcohen6 jtcohen6 removed the help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors label May 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet