Add dbt docs natively in Airflow via plugin#737

Merged
jbandoro merged 33 commits into
astronomer:mainfrom
dwreeves:add-dbt-docs-support
Feb 20, 2024

@dwreeves
Collaborator

@dwreeves dwreeves commented Dec 2, 2023

Description

This PR adds a plugin (via the Airflow plugins entrypoint) that adds a menu item inside of Browse that renders the dbt docs:

image

This is what it looks like (this example is from the dev Docker Compose setup):

image

The docs are rendered via an iframe, with some additional hacks to make the page render in a user-friendly way. I chose an iframe over vendoring the index.html in the templates for a few reasons, but mostly to support custom {% block __overview__ %} text. However, extracting the text from index.html and rendering it in a custom page is certainly an option too.

The dbt docs are specified in the Airflow config with the following parameters:

[cosmos]
dbt_docs_dir = path/to/docs/here
dbt_docs_conn_id = my_conn_id

Note that the path can be a link to any of the following:

  • S3
  • Azure Blob Storage
  • Google Cloud Storage
  • HTTP/HTTPS
  • Local storage
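
To illustrate how a single dbt_docs_dir value can cover all of these backends, here is a minimal sketch that picks a backend from the path's URL scheme. The scheme names (s3, gs, wasb) and the function name classify_docs_dir are assumptions made for illustration; they are not taken from the plugin's code.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URL scheme to storage backend.
# The schemes listed here are assumptions, not the plugin's actual rules.
SCHEME_TO_BACKEND = {
    "s3": "s3",
    "gs": "gcs",
    "wasb": "azure",
    "http": "http",
    "https": "http",
}


def classify_docs_dir(docs_dir: str) -> str:
    """Return the storage backend implied by a dbt_docs_dir value."""
    scheme = urlparse(docs_dir).scheme.lower()
    # A path with no scheme (or an unrecognized one) is treated as local storage.
    return SCHEME_TO_BACKEND.get(scheme, "local")
```

With this sketch, the path/to/docs/here value from the config example above would be classified as local storage, and only the remote backends would need dbt_docs_conn_id.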

This is designed to work with the operators that dump the dbt docs, and the documentation changes I added make that clear.

Lastly, if docs are not hooked up, a message comes up telling the user that they should set their dbt docs up:

image

Current limitations

  • Most importantly, I need help testing the S3 / Azure / GCS integrations. I think I got them right but I'll need someone to actually try them.
  • I also wouldn't mind some help testing the UI on more browsers. I've tested both Firefox and Chrome.
  • The iframe hack is less than ideal; I would prefer the dbt docs to have a fixed height, so that the page uses the scroll bar of the dbt docs UI instead of the scroll bar of the Airflow UI. The issue is basically that I am not an HTML/CSS/JavaScript person. I don't think there is any reason this shouldn't be possible, so I can continue to look into this as the PR is reviewed, or someone else can just do it for me.
  • I cannot run tests locally (lots of issues, mostly the databricks DAG in dev/dags/ fails locally), so I actually have no idea whether the test suite works. I was just planning on letting GitHub Actions take a stab at it.

API Decisions

The core maintainers of the repo should provide some feedback on a few high level API decisions:

  • Config variable names: Let me know if dbt_docs_dir and dbt_docs_conn_id are appropriate names. Alternatives include dbt_docs_path, dbt_docs_dir_conn_id, dbt_docs_path_conn_id, etc.
  • Location in UI: I entertained two ideas: (a) adding a menu button called Cosmos with dbt docs underneath, or (b) adding it under Browse. Ultimately I decided on option (b).

Related Issue(s)

Closes #571.

Breaking Change?

This PR should not cause any breaking changes.

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works

@dwreeves dwreeves requested a review from a team as a code owner December 2, 2023 18:11
@dwreeves dwreeves requested a review from a team December 2, 2023 18:11
@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Dec 2, 2023
@dosubot dosubot Bot added area:docs Relating to documentation, changes, fixes, improvement dbt:docs Primarily related to dbt docs command or functionality execution:docker Related to Docker execution environment labels Dec 2, 2023
@dwreeves dwreeves force-pushed the add-dbt-docs-support branch from 6a0b7a4 to 911e132 Compare December 2, 2023 19:07
@jlaneve
Contributor

jlaneve commented Dec 2, 2023

I've been excited about this one! I haven't looked at this super in-depth yet, but how does the local filesystem option work? Could I add running dbt docs to my build process, bake it into my Airflow image, and read from there?

@dwreeves
Collaborator Author

dwreeves commented Dec 2, 2023

I've been excited about this one! I haven't looked at this super in-depth yet, but how does the local filesystem option work? Could I add running dbt docs to my build process, bake it into my Airflow image, and read from there?

Yes, exactly. For my own professional usage this is the option I am doing.

As part of my deployment process, I dbt compile, the manifest.json gets spit out into the dags directory, and in the code I specify the manifest_path in the ProjectConfig.

I like this deployment approach because it relieves some compute pressure on the scheduler since it doesn't need to dbt ls from a clean slate on every heartbeat. (Actually, this deployment approach isn't documented; could be something I add to the docs...)

So when this feature is in, I would dbt docs generate as part of the deployment process as well. And then I would just use the docs there.
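
As a rough sketch of that deployment step: after dbt docs generate has run, the build could copy the generated files into a directory that gets baked into the image, and dbt_docs_dir would then point at that directory. The function name copy_docs_artifacts and the directory layout are hypothetical; the three file names are the ones dbt docs generate writes to its target directory.

```python
import shutil
from pathlib import Path
from typing import List

# Files that `dbt docs generate` writes to the dbt target directory.
DOCS_ARTIFACTS = ("index.html", "manifest.json", "catalog.json")


def copy_docs_artifacts(target_dir: str, docs_dir: str) -> List[str]:
    """Copy generated dbt docs files from target_dir into docs_dir.

    Hypothetical helper for a build script; not part of Cosmos.
    """
    src = Path(target_dir)
    dst = Path(docs_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in DOCS_ARTIFACTS:
        artifact = src / name
        if not artifact.is_file():
            # Fail the build early rather than ship incomplete docs.
            raise FileNotFoundError(f"missing dbt docs artifact: {artifact}")
        shutil.copy2(artifact, dst / name)
        copied.append(name)
    return copied
```

Failing on a missing file is a deliberate choice here: an incomplete docs directory would otherwise only be noticed when someone opens the page in the Airflow UI.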

This approach does have a small downside as far as dbt docs are concerned (and now that I mention it I should note it in the documentation), which is that the count(*) doesn't get updated in the docs regularly; it's just using whatever the count was at the last deploy. Also, if a model is incremental but hasn't run before, it will render the compiled version as the non-incremental code, which becomes stale after the first run. So basically, values can become stale. But for relieving pressure on the scheduler, and now in the case of docs for reducing cloud infrastructure requirements, I do like it.


Many users will be fine with these limitations of stale caching. If anyone wants low infra but up-to-date docs, you can dbt docs generate on load of dbt_docs_index.html, but:

  • This would be slower. You'd want to add a loading GIF to the iframe to give responsive feedback to users.
  • You'd also need to be super careful about the artifacts in this case too, as just dumping artifacts into a tmp dir is leaky.

Overall no cloud infra + no pre-compile is an option, but it would require additional hacks and probably isn't something I'd encourage.

@dwreeves
Collaborator Author

dwreeves commented Dec 2, 2023

Please bear with me as I just throw some commits at GitHub Actions in a desperate attempt to get the tests I added working. I cannot for the life of me get the tests working on my local machine. 😓

Also, I added the aforementioned caveats regarding local storage to the documentation.

@dwreeves
Collaborator Author

What's the status of this? 🤔 The issue we left off at was that @tatiana was having some issues with the HTTP method for retrieving files, but I was unable to replicate it or figure out what the problem was (it worked on my end...). We both were able to confirm that the local method worked, and the GCS/S3/Azure methods have not been fully integration tested.

Collaborator

@jbandoro jbandoro left a comment

What's the status of this? 🤔 The issue we left off at was that @tatiana was having some issues with the HTTP method for retrieving files, but I was unable to replicate it or figure out what the problem was (it worked on my end...). We both were able to confirm that the local method worked, and the GCS/S3/Azure methods have not been fully integration tested.

Thanks for another great contribution @dwreeves, this is awesome!

I confirmed that the HTTP, local file, and GCS paths work with docker-compose. @tatiana mentioned to me that the HTTP path wasn't working when running airflow standalone. I also had an issue there: the webserver process kept exiting with signal 11 when I tried to view the dbt docs using the example docs dir you put in the docker-compose here, although it did work with a local file path in airflow standalone mode.

Airflow recommends against using airflow standalone in production, and the issues could be related to the local system. Since this worked well in my testing with docker-compose, I'm happy to approve once the minor comment below is fixed.

Comment thread docs/configuration/hosting-docs.rst Outdated
Co-authored-by: Justin Bandoro <79104794+jbandoro@users.noreply.github.com>
@dwreeves
Collaborator Author

Thanks for the approval. Just updated.

@ms32035
Contributor

ms32035 commented Feb 26, 2024

@dwreeves how does this work when there are multiple dbt projects/doc dirs?

@dwreeves
Collaborator Author

@ms32035 I was a little worried someone would ask. It is not supported directly. It could be in theory, although I cannot think of an API+UI design for multiple dbt projects' docs that wouldn't complicate things a lot for users with just one project. I'm all ears for what user-facing interface you'd think is appropriate.

Another solution, which I find reasonably elegant, is to import projects into a "docs project" so to speak. So you create a dbt project that has a packages.yml that looks like this:

packages:
- local: ../project1
- local: ../project2

It's possible that this pattern could/should be documented. Or we just support multiple projects' docs directly, although again, I don't know how to avoid the complications for the API.

@dwreeves
Collaborator Author

I actually do think at the very least this pattern should be documented. I'll open a PR in the next couple days.

I'm aware that, due to very custom setups like passing vars into the dbt compile command, "just create a docs project" could be a nonstarter for many people. At that point though, I'd rather have a variable that turns off the blueprint for the plugin, and let users do their own thing by subclassing the plugin, so they can create their own dropdown with multiple blueprints. Most of the novel JavaScript and iframe machinery is there, after all.

@ms32035
Contributor

ms32035 commented Feb 26, 2024

@dwreeves the only design idea I have is to have a list page as an entry point, where you'd have to pre-configure the folders as a comma-separated string in the Airflow conf. I did something similar here: https://github.com/ms32035/airflow-multirepo-deploy/blob/f08d8a5863b3311bf4210d3dc2c835370dd0296d/multirepo-deploy-plugin/multirepo_deploy_plugin.py#L110

@dwreeves
Collaborator Author

I would only support that solution if, when there is only one project, the dbt docs load normally without a list. So the list is just there if you have multiple projects.

Making each attribute into a comma-separated list that gets zip()'d is something I briefly considered. It does help keep things simple for single-project users, although it feels weird and slightly inexplicit as a data model. That's the trade-off, basically.

@ms32035
Contributor

ms32035 commented Feb 26, 2024

If list length = 1 then redirect to the first element instead of rendering the table :D
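
For reference, the idea discussed in this thread (comma-separated config values that get zip()'d, plus skipping the list page when there is only one project) could look roughly like the sketch below. This is only an illustration of the proposal, not something the merged plugin supports; the function names and the returned view labels are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Project = Dict[str, Optional[str]]


def parse_docs_projects(docs_dirs: str, conn_ids: str = "") -> List[Project]:
    """Pair comma-separated docs dirs with comma-separated conn ids."""
    dirs = [d.strip() for d in docs_dirs.split(",") if d.strip()]
    # An empty entry means "no connection needed" (e.g. local storage).
    conns: List[Optional[str]] = (
        [c.strip() or None for c in conn_ids.split(",")] if conn_ids.strip() else []
    )
    # Pad with None so every docs dir gets a conn id slot before zipping.
    conns += [None] * (len(dirs) - len(conns))
    return [{"dir": d, "conn_id": c} for d, c in zip(dirs, conns)]


def choose_view(projects: List[Project]) -> Tuple[str, Optional[Project]]:
    """Decide what the menu item should render."""
    if not projects:
        return ("not_configured", None)
    if len(projects) == 1:
        # Single project: go straight to the docs, no list page.
        return ("docs", projects[0])
    return ("list", None)
```

The positional pairing is what makes this data model feel inexplicit: a conn id is only tied to its docs dir by its position in the string.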

Labels

area:docs Relating to documentation, changes, fixes, improvement dbt:docs Primarily related to dbt docs command or functionality execution:docker Related to Docker execution environment lgtm This PR has been approved by a maintainer size:XL This PR changes 500-999 lines, ignoring generated files. status:awaiting-reviewer The issue/PR is awaiting for a reviewer input

Development

Successfully merging this pull request may close these issues.

Render DBT Docs

6 participants