-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve and test dataframe API examples in docs #11290
Conversation
|
||
`DataFrame` in `DataFrame` is modeled after the Pandas DataFrame interface, and is a thin wrapper over LogicalPlan that adds functionality for building and executing those plans. | ||
|
||
```rust |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this example to the end as I think it makes more sense once you see the DataFrame in action
); | ||
|
||
#[cfg(doctest)] | ||
doc_comment::doctest!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This runs the examples as part of cargo test --doc
@@ -29,16 +29,15 @@ using the [`SessionContext::sql`] method. For lower level control such as | |||
preventing DDL, you can use [`SessionContext::sql_with_options`] or the | |||
[`SessionState`] APIs | |||
|
|||
[`sessioncontext`]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this down to the bottom of the doc so the link for SessionContext::sql
wasn't duplicated (there was a doc warning about this)
7d3c37c
to
d6540db
Compare
|
||
#[tokio::main] | ||
async fn main() -> Result<()> { | ||
let ctx = SessionContext::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While these examples are now more verbose, they all stand on their own, which I think @efredine suggested would be an improvement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I learnt a lot reading and reviewing this ;-). Left some suggestions as seen through the eyes of a new user like me.
You can use `collect` or `execute_stream` to execute the query. | ||
DataFusion [`DataFrame`]s are modeled after the [Pandas DataFrame] interface, | ||
and is implemented as thin wrapper over a [`LogicalPlan`] that adds | ||
functionality for building and executing those plans. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok taking advantage of my relative ignorance to share some things that might be confusing to new users like me.
I think it might be better to start with a short section called Reading a Dataframe
? Because that's probably the first thing people will want to do. It should show a simple example of reading a csv and displaying it. We just want to establish that when you read from a file the thing you get back is a dataframe! Then I wonder if the next section might be called Generating a New Dataframe
? And I think it might be preferable to have the SQL example second rather than first?
Interestingly, the distinction between a "table" and a "dataframe" is hazy to me. There is also some sort of subtle distinction going on here between a dataframe as a thing that contains some data (which is how I think about it mentally when I read from a csv) and a thing which contains an executable plan that performs a transformation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great feedback -- I looked around and the more basic introduction seems like it is in https://datafusion.apache.org/user-guide/dataframe.html (the "user guide"). I'll add some text that points there and rearrange the content (as well as make a PR to clean up that page)
|
||
#[tokio::main] | ||
async fn main() -> Result<()>{ | ||
let ctx = SessionContext::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok - this comment, really clarified the distinction between a table and a dataframe for me. In essence, the simplest possible dataframe is one that scans a table and that table can be in a file or in memory. I think this might be worth including in the introduction. Maybe worth consistently using scan when reading a file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to clarify and add some additional comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example really drives it home.
And in the near future we'll be able to turn it back into SQL which probably wouldn't belong here but is cool all the same ;-).
// read the contents of a CSV file into a DataFrame | ||
let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?; | ||
// execute the query and collect the results as a Vec<RecordBatch> | ||
let batches = df.collect().await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency with the next example, it might be worth iterating the batch here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @efredine for the insightful feedback - I have tried to address all comments. I would appreciate knowing what you think of the new version if you get a chance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks great!
Feel free to tag me on these example changes. I share you view that reviewing and refining documentation and examples is high impact and it's a great way for me to continue learning more.
|
||
#[tokio::main] | ||
async fn main() -> Result<()>{ | ||
let ctx = SessionContext::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example really drives it home.
And in the near future we'll be able to turn it back into SQL which probably wouldn't belong here but is cool all the same ;-).
Co-authored-by: Eric Fredine <[email protected]>
Co-authored-by: Eric Fredine <[email protected]>
I actually think we can do it now: datafusion/datafusion-examples/examples/plan_to_sql.rs Lines 118 to 139 in 6f330c9
Maybe we should add a function to |
Thank you very much @efredine -- I couldn't agree more. Now I just need to find more time (it takes me much longer to update docs typically than it does to work on code :) ) |
// ---|------------- | ||
// 1 | 9000 | ||
// 2 | 8000 | ||
// 3 | 7000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The in-memory examples are concise and its easy to get the gist of what's going on. But it also throws people in to the deep end of the Arrow format which lacks a gentle introduction IMO. The Arrow-rs documentation gets immediately into the weeds!
It's likely that many users might never even need to know or access the arrow format directly. They will just read and write to csv or parquet.
I don't think this needs to change, but perhaps what's missing is a section on how and when to use the Arrow format? A gentler introduction to Record Batches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a gentle arrow introduction would be awesome -- here is a ticke tracking such a thing upstream: apache/arrow-rs#4071
I actually think the basic content / structure could be copied from https://jorgecarleitao.github.io/arrow2/main/guide/ with the examples being updated to reflect arrow-rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also add a small section in the DataFusion docs about record batches as well - filed #11336 to track that idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm thanks @alamb
* Improve and test dataframe API examples in docs * Update introduction with pointer to user guide * Make example consistent * Make read_csv comment consistent * clarifications * prettier + tweaks * Update docs/source/library-user-guide/using-the-dataframe-api.md Co-authored-by: Eric Fredine <[email protected]> * Update docs/source/library-user-guide/using-the-dataframe-api.md Co-authored-by: Eric Fredine <[email protected]> --------- Co-authored-by: Eric Fredine <[email protected]>
* Improve and test dataframe API examples in docs * Update introduction with pointer to user guide * Make example consistent * Make read_csv comment consistent * clarifications * prettier + tweaks * Update docs/source/library-user-guide/using-the-dataframe-api.md Co-authored-by: Eric Fredine <[email protected]> * Update docs/source/library-user-guide/using-the-dataframe-api.md Co-authored-by: Eric Fredine <[email protected]> --------- Co-authored-by: Eric Fredine <[email protected]>
Which issue does this PR close?
Part of #11172
Rationale for this change
I am trying to make the documentation better and examples easier to find
I would like to consolidate the examples for using DataFrames from examples https://github.com/apache/datafusion/tree/main/datafusion-examples/examples (e.g dataframe.rs and dataframe_in_memory and dataframe_output
However, first I wanted to make sure that the existing DataFrame API docs were in good shape -- and it turns out they needed some attention to get the examples compiling
So to keep the size of this PR small, I started with some improvements to the initial content.
I will actually consolidate some of the dataframe examples in a follow on PR
What changes are included in this PR?
Are these changes tested?
Yes, now these examples run as part of the CI
Are there any user-facing changes?