Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition DuckDB in Dagster Essentials #22674

Open
kornerc opened this issue Jun 24, 2024 · 3 comments
Open

Race condition DuckDB in Dagster Essentials #22674

kornerc opened this issue Jun 24, 2024 · 3 comments
Labels
area: docs Related to documentation in general

Comments

@kornerc
Copy link

kornerc commented Jun 24, 2024

What's the issue or suggestion?

I am following the Dagster University course Dagster Essentials and encountered a race condition at the end of Lesson 4: Asset dependencies.

The assets have the following lineage:
Global_Asset_Lineage

taxi_trips and taxi_zones both write data to the DuckDB database and manhattan_stats as well as trips_by_week read from it. With DuckDB it is only possible to have one open read/write connection or multiple read connections at the same time. In my case a race condition was triggered due to multiple open open read/write connections at the same time.

It is possible to set read_only=True when connecting to the database in manhattan_stats as well as trips_by_week, to reduce the problem. However, it is still possible to get a race condition with taxi_trips and taxi_zones, since both of them need a read/write connection. Nonetheless, it is possible to circumvent the problem with IOManagers or with using op_tags and setting the maximum parallel execution of assets with this tag to 1. But they are more advanced features.

So maybe it would be a solution to have only read connections for manhattan_stats as well as trips_by_week and add a warning that in case of an error due to a DuckDB connection issue it should be sufficient to re-execute the pipeline?

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

@kornerc kornerc added the area: docs Related to documentation in general label Jun 24, 2024
@kornerc kornerc changed the title Race condition DuckDB inDagster Essentials Race condition DuckDB in Dagster Essentials Jun 24, 2024
@cmpadden
Copy link
Contributor

Thank you @kornerc - your proposition of using read-only connectors and concurrency limits for partitioned assets is quite elegant. I agree that it would be worthwhile to add a note of this in the lesson, as I recall a few other community members running into this issue too.

I will make a ticket for us internally to update the lesson accordingly, and will update here once complete. Thank you!

@kornerc
Copy link
Author

kornerc commented Jun 26, 2024

@cmpadden thank you for your reply and the information.

Another alternative solution came into my mind: using backoff when connecting to the database, as it is done in the DuckDB resource


This might be the simplest solution because it is not necessary to manually limit the concurrency and introducing new Dagster concepts.

@cmpadden
Copy link
Contributor

Good call! I really like that approach. Thanks for following up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: docs Related to documentation in general
Projects
None yet
Development

No branches or pull requests

2 participants