Prefect: Index page and usage guide #299
Conversation
Walkthrough
Replaces an ETL grid entry in docs/ingest/etl/index.md (nifi → prefect) and adds Prefect integration docs: a landing page and a detailed usage tutorial demonstrating a CSV-to-CrateDB ETL flow using Prefect tasks/flow with SQLAlchemy. No code or API changes.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
amotl:
@coderabbitai review and please convert all spots to active voice.
coderabbitai:
@amotl I'll review the PR and identify all instances that need conversion to active voice. Let me start by reviewing all the changes.
Actions performed: Review triggered.
Actionable comments posted: 1
🧹 Nitpick comments (10)
docs/integrate/prefect/tutorial.md (7)
16-19: Add prerequisites for the Python DB stack. Include SQLAlchemy and the CrateDB SQLAlchemy dialect so `to_sql` works out of the box. Apply this diff:
```diff
 * **Python 3.x**: Prefect is a Python-based workflow management system, so you'll need Python installed on your machine.
 * **CrateDB**: To work with CrateDB, create a new cluster in [CrateDB Cloud](https://console.cratedb.cloud/). You can choose the CRFEE tier cluster that does not require any payment information.
-* **Prefect**: Install Prefect using pip by running the following command in your terminal or command prompt: `pip install -U prefect`
+* **Prefect**: `pip install -U prefect`
+* **SQLAlchemy + CrateDB dialect**: `pip install -U sqlalchemy sqlalchemy-cratedb`
```
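To illustrate why the dialect package matters, here is a minimal connection sketch; the host, credentials, and port are placeholders, not values from the PR:

```python
import sqlalchemy as sa

# The "crate://" URL scheme is registered by the sqlalchemy-cratedb package;
# without it, create_engine() fails with NoSuchModuleError for "crate".
# Host and credentials are placeholders for a CrateDB Cloud cluster.
engine = sa.create_engine("crate://admin:<password>@<cluster>.cratedb.net:4200?ssl=true")

with engine.connect() as connection:
    # pandas.DataFrame.to_sql() drives this same engine under the hood.
    print(connection.execute(sa.text("SELECT 1")).scalar())
```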
20-25: Typo: “Perfect” → “Prefect”; simplify tracking link. Fix spelling in the heading and drop tracking parameters from the Cloud URL.
Apply this diff:
```diff
-## Getting started with Perfect
+## Getting started with Prefect
@@
-1. To get started with Prefect, you need to connect to Prefect’s API: the easiest way is to sign up for a free forever Cloud account at [https://app.prefect.cloud/](https://app.prefect.cloud/?deviceId=cfc80edd-a234-4911-a25e-ff0d6bb2c32a&deviceId=cfc80edd-a234-4911-a25e-ff0d6bb2c32a).
+1. To get started with Prefect, connect to Prefect’s API by signing up for a free Cloud account at [https://app.prefect.cloud/](https://app.prefect.cloud/).
```
6-10: Prefer active voice and tighten phrasing. Shift to active voice per PR objective.
Apply this diff:
```diff
-[Prefect](https://www.prefect.io/opensource/) is an open-source workflow automation and orchestration tool for data engineering, machine learning, and other data-related tasks. It allows you to define, schedule, and execute complex data workflows in a straightforward manner.
+[Prefect](https://www.prefect.io/opensource/) is an open-source workflow orchestration tool for data engineering, machine learning, and other data tasks. You define, schedule, and execute complex data workflows with straightforward Python code.
@@
-Prefect workflows are defined using *Python code*. Each step in the workflow is represented as a "task," and tasks can be connected to create a directed acyclic graph (DAG). The workflow defines the sequence of task execution and can include conditional logic and branching. Furthermore, Prefect provides built-in scheduling features that set up cron-like schedules for the flow. You can also parameterize your flow, allowing a run of the same flow with different input values.
+You define Prefect workflows in Python. Each step is a “task,” and tasks form a directed acyclic graph (DAG). Flows can branch and include conditional logic. Prefect also provides built‑in scheduling and flow parameters so you can run the same flow with different inputs.
@@
-This tutorial will explore how CrateDB and Prefect come together to streamline data ingestion, transformation, and loading (ETL) processes with a few lines of Python code.
+This tutorial shows how to combine CrateDB and Prefect to streamline ETL with a few lines of Python.
```
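To ground the task/DAG wording, a minimal illustrative flow; the names `say_hello` and `greeting_flow` are made up for this sketch and are not part of the tutorial:

```python
from prefect import flow, task

@task
def say_hello(name: str) -> str:
    # Each @task function becomes one node in the flow's DAG.
    return f"Hello, {name}!"

@flow(name="greeting-flow")
def greeting_flow(name: str = "CrateDB") -> None:
    # Flow parameters let you rerun the same flow with different inputs.
    print(say_hello(name))

if __name__ == "__main__":
    greeting_flow()
    greeting_flow(name="Prefect")  # same flow, different parameter
```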
29-30: Avoid first‑person “We'll dive …”; use neutral, direct phrasing. Matches the style used across docs.
Apply this diff:
```diff
-We'll dive into the basics of Prefect by creating a simple workflow with tasks that fetch data from a source, perform basic transformations, and load it into CrateDB. For this example, we will use [the yellow taxi trip data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz), which includes pickup time, geo-coordinates, number of passengers, and several other variables. The goal is to create a workflow that does a basic transformation on this data and inserts it into a CrateDB table named `trip_data`:
+This section walks you through a simple workflow that fetches data, applies a basic transformation, and loads it into CrateDB. It uses the [yellow taxi trip dataset](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz), which includes pickup time, geo‑coordinates, passenger count, and other fields. The goal is to write transformed data to a CrateDB table named `trip_data`:
```
63-71: Fix function names, arguments, and list indentation (MD005). Align prose with the code (`extract_data`, not `read_data`; `load_data("trip_data", data)`) and indent the nested list consistently to satisfy markdownlint. Apply this diff:
```diff
-1. We start defining the flow by importing the necessary modules, including `prefect` for working with workflows, `pandas` for data manipulation, and `crate` for interacting with CrateDB.
-2. Next, we specify the connection parameters for CrateDB and the URL for a file containing the dataset. You should modify these values according to your CrateDB Cloud setup.
-3. We define three tasks using the `@task` decorator: `extract_data(url)`, `transform_data(data)`, and `load_data(table_name, transformed_data)`. Each task represents a unit of work in the workflow:
-  1. The `read_data()` task loads the data from the CSV file to a `pandas` data frame.
-  2. The `transform_data(data)` task takes the data frame and returns the data frame with entries where the `passenger_count` value is different than 0.
-  3. The `load_data(transformed_data)` task connects to the CrateDB and loads data into the `trip_data` table.
-4. We define the workflow, name it “ETL workflow“, and specify the sequence of tasks: `extract_data()`, `transform_data(data)`, and `load_data(table_name, transformed_data)`.
-5. Finally, we execute the flow by calling `main_flow()`. This runs the workflow, and each task is executed in the order defined.
+1. Start by importing the necessary modules: `prefect` for workflows, `pandas` for data manipulation, and SQLAlchemy for the database connection.
+2. Specify the CrateDB connection URI and the dataset URL. Modify these values for your CrateDB Cloud setup.
+3. Define three tasks with the `@task` decorator—`extract_data(url)`, `transform_data(df)`, and `load_data(table_name, df)`:
+
+   1. `extract_data()` reads the CSV into a pandas DataFrame.
+   2. `transform_data(df)` filters out rows where `passenger_count` is 0.
+   3. `load_data(table_name, df)` writes the data to the `trip_data` table in CrateDB.
+
+4. Define the flow, name it “ETL workflow,” and order the tasks: `extract_data()`, `transform_data()`, then `load_data()`.
+5. Execute the flow by calling `main_flow()`. Prefect runs each task in order.
```
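For orientation, a condensed sketch of the flow that this list describes, using the tutorial's task names and an SQLAlchemy-based `load_data` as suggested above. The connection URI is a placeholder:

```python
import pandas as pd
import sqlalchemy as sa
from prefect import flow, task

# Placeholder URI; the "crate://" scheme comes from the sqlalchemy-cratedb dialect.
DB_URI = "crate://admin:<password>@<cluster>.cratedb.net:4200?ssl=true"
CSV_URL = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz"

@task
def extract_data(url: str) -> pd.DataFrame:
    # pandas reads gzipped CSVs directly from a URL.
    return pd.read_csv(url, compression="gzip")

@task
def transform_data(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only trips that carried at least one passenger.
    return df[df["passenger_count"] != 0]

@task
def load_data(table_name: str, df: pd.DataFrame) -> None:
    # DataFrame.to_sql() delegates the writes to the SQLAlchemy engine.
    engine = sa.create_engine(DB_URI)
    df.to_sql(table_name, engine, index=False, if_exists="append", chunksize=5000)

@flow(name="ETL workflow")
def main_flow():
    df = extract_data(CSV_URL)
    transformed = transform_data(df)
    load_data("trip_data", transformed)

if __name__ == "__main__":
    main_flow()
```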
72-78: Minor copy edits: “Flow Runs”; tighten sentences. Prefer “CSV” capitalization.
Apply this diff:
```diff
-When you run this Python script, the workflow will read the trip data from a `csv` file, transform it, and load it into the CrateDB table. You can see the state of the flow run in the *Flows Runs* tab in Prefect UI:
+When you run the script, the workflow reads the trip data from a CSV file, transforms it, and loads it into CrateDB. You can see the state of the run in the *Flow Runs* tab in the Prefect UI:
@@
-You can enrich the ETL pipeline with many advanced features available in Prefect such as parameterization, error handling, retries, and more. Finally, after the successful execution of the workflow, you can query the data in the CrateDB:
+You can enrich the pipeline with Prefect features such as parameters, error handling, and retries. After a successful run, query the data in CrateDB:
```
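For example, a verification query could look like the following sketch, reusing the placeholder URI from the sketches above; the `REFRESH TABLE` statement makes freshly inserted rows visible to the query:

```python
import sqlalchemy as sa

# Same placeholder URI as in the earlier sketches.
engine = sa.create_engine("crate://admin:<password>@<cluster>.cratedb.net:4200?ssl=true")

with engine.connect() as conn:
    # CrateDB makes writes visible to reads after a table refresh.
    conn.execute(sa.text("REFRESH TABLE trip_data"))
    count = conn.execute(sa.text("SELECT COUNT(*) FROM trip_data")).scalar()
    print(f"trip_data rows: {count}")
```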
82-84: Active voice in wrap‑up. Small tweak for consistency.
Apply this diff:
```diff
-Throughout this tutorial, you made a simple Prefect workflow, defined tasks, and orchestrated data transformations and loading into CrateDB. Both tools offer extensive feature sets that you can use to optimize and scale your data workflows further.
+In this tutorial, you created a simple Prefect workflow, defined tasks, and orchestrated data transformations and loading into CrateDB. Both tools offer extensive features that help you optimize and scale your data workflows.
```
docs/integrate/prefect/index.md (3)
11-12: Optional: Sharpen tagline (active voice). Not required, but reads more crisply.
Apply this diff:
```diff
-Modern Workflow Orchestration.
+Orchestrate modern data workflows in Python.
```
17-25: Active voice and tighten copy in “About.” Shorten and switch to direct voice.
Apply this diff:
```diff
-[Prefect] is a workflow orchestration framework for building resilient data
-pipelines in Python.
+Use [Prefect] to orchestrate resilient data pipelines in Python.
@@
-Give your team the power to build reliable workflows without sacrificing
-development speed. Prefect Core combines the freedom of pure Python
-development with production-grade resilience, putting you in control of
-your data operations. Transform your code into scalable workflows that
-deliver consistent results.
+Build reliable workflows without sacrificing development speed. Prefect combines the freedom of pure Python with production‑grade resilience, putting you in control of your data operations. Turn code into scalable workflows that deliver consistent results.
```
4-6: Asset sourcing (optional). The logo hotlinks an external host. Consider adding the asset to the repo or using an official brand CDN to avoid broken images during docs builds.
If you want, I can open a follow‑up issue to track moving this asset in‑repo.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- docs/ingest/etl/index.md (1 hunks)
- docs/integrate/index.md (1 hunks)
- docs/integrate/prefect/index.md (1 hunks)
- docs/integrate/prefect/tutorial.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/integrate/prefect/tutorial.md
66-66: Inconsistent indentation for list items at the same level
Expected: 0; Actual: 2
(MD005, list-indent)
67-67: Inconsistent indentation for list items at the same level
Expected: 0; Actual: 2
(MD005, list-indent)
68-68: Inconsistent indentation for list items at the same level
Expected: 0; Actual: 2
(MD005, list-indent)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build docs
🔇 Additional comments (2)
docs/ingest/etl/index.md (1)
56-56: LGTM: Cross-link works and ordering looks correct.
`{ref}`prefect`` resolves to the new integration page, and placement under “Dataflow / Pipeline / Code-first” is consistent.
docs/integrate/index.md (1)
62-62: LGTM: Toctree inclusion is correct.
`prefect/index` appears in alphabetical position between “Power BI” and “prometheus.” No further changes needed.
Force-pushed from c679f36 to add4422.
About
Continue adding integration guides from the community forum.