
Conversation


@amotl amotl commented Sep 14, 2025

About

Continue adding integration guides from the community forum.

Preview

References


coderabbitai bot commented Sep 14, 2025

Walkthrough

Replaces an ETL grid entry in docs/ingest/etl/index.md (nifi → prefect) and adds Prefect integration docs: a landing page and a detailed usage tutorial demonstrating a CSV-to-CrateDB ETL flow using Prefect tasks/flow with SQLAlchemy. No code or API changes.

Changes

  • ETL grid update: docs/ingest/etl/index.md
    Replaced {ref}`nifi` with {ref}`prefect` in the Dataflow / Pipeline / Code-first grid; removed the nifi entry, added the prefect entry; caption unchanged.
  • Prefect integration landing: docs/integrate/prefect/index.md
    New Prefect integration page with header, logo, About, and Learn sections; links to a tutorial and the Prefect site; content-only addition.
  • Prefect tutorial/usage: docs/integrate/prefect/usage.md
    New tutorial showing a CSV → transform → CrateDB ETL using Prefect tasks/flow and SQLAlchemy; includes prerequisites, setup steps, a code example, screenshots, and references.

Sequence Diagram(s)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Suggested reviewers

  • hammerhead
  • kneth
  • karynzv

Poem

I hop through docs with nimble feet,
Swapping NiFi’s beat for Prefect’s suite. 🐇
New flows arise—extract, transform, load—
To CrateDB along the road.
Carrots in cache, tasks in a row,
Ship it, and watch the pipelines flow! 🚀

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title Check: ✅ Passed. The title "Prefect: Index page and usage guide" is concise and accurately summarizes the primary change in the PR (adding a Prefect index page and a usage/tutorial guide), so it is relevant and clear for teammates scanning history.
  • Description Check: ✅ Passed. The PR description states the objective ("Continue adding integration guides"), includes a preview link and a reference to GH-102, and is therefore related to the documentation changes in this changeset.
  • Docstring Coverage: ✅ Passed. No functions found in the changes; docstring coverage check skipped.
✨ Finishing touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch prefect

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c679f36 and b4bc57c.

📒 Files selected for processing (3)
  • docs/ingest/etl/index.md (1 hunks)
  • docs/integrate/prefect/index.md (1 hunks)
  • docs/integrate/prefect/usage.md (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/integrate/prefect/usage.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/integrate/prefect/index.md
  • docs/ingest/etl/index.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build docs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@amotl amotl added the labels "reorganize" (moving content around, inside and between other systems) and "cross linking" (linking to different locations of the documentation) on Sep 15, 2025
@amotl amotl marked this pull request as ready for review September 15, 2025 22:31
@amotl
Member Author

amotl commented Sep 16, 2025

@coderabbitai review and please convert all spots to active voice.

@coderabbitai

coderabbitai bot commented Sep 16, 2025

@amotl I'll review the PR and identify all instances that need conversion to active voice. Let me start by reviewing all the changes.


🧠 Learnings used
Learnt from: amotl
PR: crate/cratedb-guide#0
File: :0-0
Timestamp: 2025-09-15T17:42:25.449Z
Learning: When reviewing documentation, use active voice instead of first-person or passive voice, and emit corresponding suggestions.
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (10)
docs/integrate/prefect/tutorial.md (7)

16-19: Add prerequisites for the Python DB stack.

Include SQLAlchemy and the CrateDB SQLAlchemy dialect so `to_sql` works out of the box.

Apply this diff:

 * **Python 3.x**: Prefect is a Python-based workflow management system, so you'll need Python installed on your machine.
 * **CrateDB**: To work with CrateDB, create a new cluster in [CrateDB Cloud](https://console.cratedb.cloud/). You can choose the FREE tier cluster that does not require any payment information.
-* **Prefect**: Install Prefect using pip by running the following command in your terminal or command prompt: `pip install -U prefect`
+* **Prefect**: `pip install -U prefect`
+* **SQLAlchemy + CrateDB dialect**: `pip install -U sqlalchemy sqlalchemy-cratedb`

20-25: Typo: “Perfect” → “Prefect”; simplify tracking link.

Fix spelling in the heading and drop tracking parameters from the Cloud URL.

Apply this diff:

-## Getting started with Perfect
+## Getting started with Prefect
@@
-1. To get started with Prefect, you need to connect to Prefect’s API: the easiest way is to sign up for a free forever Cloud account at [https://app.prefect.cloud/](https://app.prefect.cloud/?deviceId=cfc80edd-a234-4911-a25e-ff0d6bb2c32a&deviceId=cfc80edd-a234-4911-a25e-ff0d6bb2c32a).
+1. To get started with Prefect, connect to Prefect’s API by signing up for a free Cloud account at [https://app.prefect.cloud/](https://app.prefect.cloud/).

6-10: Prefer active voice and tighten phrasing.

Shift to active voice per PR objective.

Apply this diff:

-[Prefect](https://www.prefect.io/opensource/) is an open-source workflow automation and orchestration tool for data engineering, machine learning, and other data-related tasks. It allows you to define, schedule, and execute complex data workflows in a straightforward manner.
+[Prefect](https://www.prefect.io/opensource/) is an open-source workflow orchestration tool for data engineering, machine learning, and other data tasks. You define, schedule, and execute complex data workflows with straightforward Python code.
@@
-Prefect workflows are defined using *Python code*. Each step in the workflow is represented as a "task," and tasks can be connected to create a directed acyclic graph (DAG). The workflow defines the sequence of task execution and can include conditional logic and branching. Furthermore, Prefect provides built-in scheduling features that set up cron-like schedules for the flow. You can also parameterize your flow, allowing a run of the same flow with different input values.
+You define Prefect workflows in Python. Each step is a “task,” and tasks form a directed acyclic graph (DAG). Flows can branch and include conditional logic. Prefect also provides built‑in scheduling and flow parameters so you can run the same flow with different inputs.
@@
-This tutorial will explore how CrateDB and Prefect come together to streamline data ingestion, transformation, and loading (ETL) processes with a few lines of Python code.
+This tutorial shows how to combine CrateDB and Prefect to streamline ETL with a few lines of Python.

29-30: Avoid first‑person “We’ll dive …”; use neutral, direct phrasing.

Matches the style used across docs.

Apply this diff:

-We'll dive into the basics of Prefect by creating a simple workflow with tasks that fetch data from a source, perform basic transformations, and load it into CrateDB. For this example, we will use [the yellow taxi trip data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz), which includes pickup time, geo-coordinates, number of passengers, and several other variables. The goal is to create a workflow that does a basic transformation on this data and inserts it into a CrateDB table named `trip_data`:
+This section walks you through a simple workflow that fetches data, applies a basic transformation, and loads it into CrateDB. It uses the [yellow taxi trip dataset](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz), which includes pickup time, geo‑coordinates, passenger count, and other fields. The goal is to write transformed data to a CrateDB table named `trip_data`:

63-71: Fix function names, arguments, and list indentation (MD005).

Align prose with the code (extract_data, not read_data; load_data("trip_data", data)) and indent the nested list consistently to satisfy markdownlint.

Apply this diff:

-1. We start defining the flow by importing the necessary modules, including `prefect` for working with workflows, `pandas` for data manipulation, and `crate` for interacting with CrateDB.
-2. Next, we specify the connection parameters for CrateDB and the URL for a file containing the dataset. You should modify these values according to your CrateDB Cloud setup.
-3. We define three tasks using the `@task` decorator: `extract_data(url)`, `transform_data(data)`, and `load_data(table_name, transformed_data)`. Each task represents a unit of work in the workflow:
-  1. The `read_data()` task loads the data from the CSV file to a `pandas` data frame.
-  2. The `transform_data(data)` task takes the data frame and returns the data frame with entries where the `passenger_count` value is different than 0.
-  3. The `load_data(transformed_data)` task connects to the CrateDB and loads data into the `trip_data` table.
-4. We define the workflow, name it “ETL workflow“, and specify the sequence of tasks: `extract_data()`, `transform_data(data)`, and `load_data(table_name, transformed_data)`.
-5. Finally, we execute the flow by calling `main_flow()`. This runs the workflow, and each task is executed in the order defined.
+1. Start by importing the necessary modules: `prefect` for workflows, `pandas` for data manipulation, and SQLAlchemy for the database connection.
+2. Specify the CrateDB connection URI and the dataset URL. Modify these values for your CrateDB Cloud setup.
+3. Define three tasks with the `@task` decorator—`extract_data(url)`, `transform_data(df)`, and `load_data(table_name, df)`:
+
+    1. `extract_data()` reads the CSV into a pandas DataFrame.
+    2. `transform_data(df)` filters out rows where `passenger_count` is 0.
+    3. `load_data(table_name, df)` writes the data to the `trip_data` table in CrateDB.
+
+4. Define the flow, name it “ETL workflow,” and order the tasks: `extract_data()`, `transform_data()`, then `load_data()`.
+5. Execute the flow by calling `main_flow()`. Prefect runs each task in order.

72-78: Minor copy edits: “Flow Runs”; tighten sentences.

And prefer “CSV” capitalization.

Apply this diff:

-When you run this Python script, the workflow will read the trip data from a `csv` file, transform it, and load it into the CrateDB table. You can see the state of the flow run in the *Flows Runs* tab in Prefect UI:
+When you run the script, the workflow reads the trip data from a CSV file, transforms it, and loads it into CrateDB. You can see the state of the run in the *Flow Runs* tab in the Prefect UI:
@@
-You can enrich the ETL pipeline with many advanced features available in Prefect such as parameterization, error handling, retries, and more. Finally, after the successful execution of the workflow, you can query the data in the CrateDB:
+You can enrich the pipeline with Prefect features such as parameters, error handling, and retries. After a successful run, query the data in CrateDB:
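To illustrate the final query step, the same pandas/SQLAlchemy calls can be exercised against a local SQLite stand-in; against CrateDB you would swap in a `crate://` URI (which requires `sqlalchemy-cratedb`). Table and column names follow the tutorial:

```python
import pandas as pd
import sqlalchemy as sa

# Local SQLite stand-in; replace with "crate://<user>:<password>@<host>:4200"
# (plus sqlalchemy-cratedb) to run against CrateDB.
engine = sa.create_engine("sqlite:///:memory:")

# Simulate the table the flow loaded.
df = pd.DataFrame({"passenger_count": [1, 2, 1], "fare_amount": [3.0, 5.5, 7.25]})
df.to_sql("trip_data", engine, if_exists="append", index=False)

# Query it back, as you would after a successful flow run.
result = pd.read_sql("SELECT COUNT(*) AS n FROM trip_data", engine)
print(int(result["n"][0]))  # prints 3
```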

82-84: Active voice in wrap‑up.

Small tweak for consistency.

Apply this diff:

-Throughout this tutorial, you made a simple Prefect workflow, defined tasks, and orchestrated data transformations and loading into CrateDB. Both tools offer extensive feature sets that you can use to optimize and scale your data workflows further.
+In this tutorial, you created a simple Prefect workflow, defined tasks, and orchestrated data transformations and loading into CrateDB. Both tools offer extensive features that help you optimize and scale your data workflows.
docs/integrate/prefect/index.md (3)

11-12: Optional: Sharpen tagline (active voice).

Not required, but reads more crisply.

Apply this diff:

-Modern Workflow Orchestration.
+Orchestrate modern data workflows in Python.

17-25: Active voice and tighten copy in “About.”

Shorten and switch to direct voice.

Apply this diff:

-[Prefect] is a workflow orchestration framework for building resilient data
-pipelines in Python.
+Use [Prefect] to orchestrate resilient data pipelines in Python.
@@
-Give your team the power to build reliable workflows without sacrificing
-development speed. Prefect Core combines the freedom of pure Python
-development with production-grade resilience, putting you in control of
-your data operations. Transform your code into scalable workflows that
-deliver consistent results.
+Build reliable workflows without sacrificing development speed. Prefect combines the freedom of pure Python with production‑grade resilience, putting you in control of your data operations. Turn code into scalable workflows that deliver consistent results.

4-6: Asset sourcing (optional).

The logo hotlinks an external host. Consider adding the asset to the repo or using an official brand CDN to avoid broken images during docs builds.

If you want, I can open a follow‑up issue to track moving this asset in‑repo.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 591e346 and 62f2889.

📒 Files selected for processing (4)
  • docs/ingest/etl/index.md (1 hunks)
  • docs/integrate/index.md (1 hunks)
  • docs/integrate/prefect/index.md (1 hunks)
  • docs/integrate/prefect/tutorial.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/integrate/prefect/tutorial.md

66-66: Inconsistent indentation for list items at the same level
Expected: 0; Actual: 2

(MD005, list-indent)


67-67: Inconsistent indentation for list items at the same level
Expected: 0; Actual: 2

(MD005, list-indent)


68-68: Inconsistent indentation for list items at the same level
Expected: 0; Actual: 2

(MD005, list-indent)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build docs
🔇 Additional comments (2)
docs/ingest/etl/index.md (1)

56-56: LGTM: Cross-link works and ordering looks correct.

{ref}`prefect` resolves to the new integration page, and placement under “Dataflow / Pipeline / Code-first” is consistent.

docs/integrate/index.md (1)

62-62: LGTM: Toctree inclusion is correct.

prefect/index appears in alphabetical position between “Power BI” and “prometheus.” No further changes needed.

@amotl amotl requested a review from karynzv September 17, 2025 20:11
@amotl amotl force-pushed the prefect branch 2 times, most recently from c679f36 to add4422 Compare September 18, 2025 14:45
@amotl amotl requested a review from kneth September 19, 2025 21:25
@amotl amotl changed the title Prefect: Index page and starter tutorial Prefect: Index page and usage guide Sep 23, 2025
@amotl amotl merged commit 8f3d2eb into main Sep 30, 2025
3 checks passed
@amotl amotl deleted the prefect branch September 30, 2025 11:30