[data] update datasets API structure by matthewdeng · Pull Request #27592 · ray-project/ray

matthewdeng · 2022-08-06T00:46:50Z

Signed-off-by: Matthew Deng <matt@anyscale.com>

Why are these changes needed?

Refactor Datasets API docs for easier navigation: Ray Datasets API

Changes

1. Create a new Datasets API base page.
2. Split existing APIs into separate pages.
3. Split Dataset and DatasetPipeline methods into separate sections.
   1. Used autosummary to generate overview tables at the top of each of these pages. Open to other suggestions, e.g. moving the summary to the top of each section instead.
   2. Note: Every time we add a new method, we need to explicitly add it here as well.
4. Add Input/Output APIs.
   1. I chose to split these primarily by data format rather than type, since it's easier to navigate, and the existing Creating Datasets User Guide already does the latter.
5. Add Block and DataBatch (should we add these aliases?)
6. Remove existing package-ref.
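
The note that every new method must also be added to the autosummary listing by hand suggests a simple consistency check. Below is a minimal sketch, assuming the listing uses indented `module.Class.method` entries as in this PR. `ToyDataset`, the `toy` prefix, and `missing_from_autosummary` are hypothetical stand-ins for illustration, not part of Ray or Sphinx.

```python
import inspect
import re


def missing_from_autosummary(cls, qualified_prefix, rst_text):
    """Return public callables of `cls` that are not listed in `rst_text`."""
    # Match indented entries such as "    toy.ToyDataset.map".
    entry = re.compile(
        r"^[ \t]+" + re.escape(qualified_prefix) + r"\.(\w+)[ \t]*$",
        re.MULTILINE,
    )
    listed = set(entry.findall(rst_text))
    # Public API here means: callable attributes without a leading underscore.
    public = {
        name
        for name, _ in inspect.getmembers(cls, callable)
        if not name.startswith("_")
    }
    return sorted(public - listed)


class ToyDataset:
    """Hypothetical stand-in for a documented class."""

    def map(self): ...
    def repartition(self): ...
    def write_csv(self): ...


RST = """
.. autosummary::
    :nosignatures:

    toy.ToyDataset.map
    toy.ToyDataset.repartition
"""

print(missing_from_autosummary(ToyDataset, "toy.ToyDataset", RST))
# prints ['write_csv']
```

A check along these lines could run in CI against the real `.rst` sources, though wiring that up is outside the scope of this PR.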

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

clarkzinzow · 2022-08-08T21:43:41Z

This is looking awesome! The main thing that needs to be tweaked is the CSS around the autosummary tables; it looks like it's bleeding into the right sidebar, which is causing the docstring summaries to be cut off, but if it were properly constrained to its box, those docstring summaries should properly wrap.

Also, I don't know if the tabbed view is going to work, since there are a lot of tabs and the tab titles are too long... I think that a flat layout at the top of the page, or breaking up the autosummary into each section, would be less clunky.

doc/source/data/api/dataset_pipeline.rst

doc/source/data/api/dataset.rst

jianoaix · 2022-08-08T21:53:40Z

doc/source/data/api/dataset.rst

+        ray.data.Dataset.randomize_block_order
+        ray.data.Dataset.repartition
+
+.. tabbed:: Splitting and Merging Datasets


This may be a bit tricky going forward: merging ops are transformations, while splitting ops are consumptions.

Any concrete suggestions for how to structure this in this PR? (As a user, I don't think I'd immediately go to Consuming Datasets if I were looking to split my dataset.)

matthewdeng · 2022-08-08T22:08:42Z

@clarkzinzow

> This is looking awesome! The main thing that needs to be tweaked is the CSS around the autosummary tables; it looks like it's bleeding into the right sidebar, which is causing the docstring summaries to be cut off, but if it were properly constrained to its box, those docstring summaries should properly wrap.

Yep, this just got fixed in master yesterday (#27611)! Will merge master to include this.

> Also, I don't know if the tabbed view is going to work, since there are a lot of tabs and the tab titles are too long... I think that a flat layout at the top of the page, or breaking up the autosummary into each section, would be less clunky.

Agreed, do you have a preference?

clarkzinzow · 2022-08-08T23:45:48Z

@matthewdeng How about a single mono-table at the top, ordered alphabetically, with just the method names instead of the f"{module}.{class}.{method}" scheme?
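
To make the suggested ordering concrete, here is a small sketch that strips the module/class prefix from the fully qualified names and sorts the bare method names. It only illustrates the resulting order. It does not show how to make autosummary render short names, and `short_names_sorted` is a hypothetical helper.

```python
def short_names_sorted(qualified_names):
    """Strip the module/class prefix and sort the bare method names."""
    return sorted(name.rsplit(".", 1)[-1] for name in qualified_names)


entries = [
    "ray.data.Dataset.repartition",
    "ray.data.Dataset.map",
    "ray.data.Dataset.randomize_block_order",
]

print(short_names_sorted(entries))
# prints ['map', 'randomize_block_order', 'repartition']
```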

matthewdeng · 2022-08-09T00:43:28Z

@clarkzinzow updated the summary tables!

Decided to keep the same ordering/format, but use bold "headers" instead of a table. Thought this would help with logical navigation.
Set :nosignatures: since the parameter names aren't very insightful.
I couldn't figure out how to get rid of the module/class names...

clarkzinzow · 2022-08-09T15:18:30Z

@matthewdeng Awesome, that's way better! I'll do a full review of this this morning.

The table width still seems to be off (running under the right sidebar) but it looks like you already merged master. 🤔 Any idea what's going on there?

matthewdeng · 2022-08-09T15:40:29Z

@clarkzinzow yeah, looks like the generated docs are from the commit before I merged master: https://readthedocs.org/projects/ray/builds/17659166/

Let me try to trigger it again...

matthewdeng · 2022-08-09T15:42:18Z

doc/source/data/api/dataset_pipeline.rst

+
+.. automethod:: ray.data.DatasetPipeline.stats
+
+.. automethod:: ray.data.DatasetPipeline.sum


Note: I wasn't sure where to put this one. It seems kind of out of place, but I also didn't want to create an entire Grouped and Global Aggregations section just for this.

clarkzinzow

LGTM overall, mostly just a few nits

clarkzinzow · 2022-08-09T15:33:22Z

doc/source/data/api/input_output.rst


-Ray Datasets API
-================
+Input/Output


I see that you went with a flat list rather than grouping these APIs by data type, e.g. tabular, tensor, text, binary, etc. Looking at it now, I think that the flat list is better for quick discoverability, but we might want to reexamine such a grouping if we add support for a few more file formats to keep this section's size in the ToC manageable.

But yeah, the flat list looks good to me for now!

Yeah, basically we can follow the headers in https://docs.ray.io/en/master/data/creating-datasets.html. Let me know if you think it's worthwhile to add in this PR!

I think the current flat list looks good to me!

doc/source/data/api/input_output.rst

doc/source/data/api/dataset.rst

clarkzinzow · 2022-08-09T15:48:49Z

doc/source/data/api/dataset_pipeline.rst

+    ray.data.DatasetPipeline.split
+    ray.data.DatasetPipeline.split_at_indices
+
+**Creating DatasetPipelines**


I feel like there's a better name for this section, but I haven't been able to think of one.

doc/source/data/api/grouped_dataset.rst

doc/source/data/api/data_representations.rst

clarkzinzow · 2022-08-09T15:57:06Z

doc/source/data/api/dataset_pipeline.rst

+
+.. automethod:: ray.data.DatasetPipeline.add_column
+
+.. automethod:: ray.data.DatasetPipeline.drop_columns


It would be great if these could pop up in the right sidebar, nested under the section title, while scrolling as is done for subsections, but unfortunately that doesn't appear to be possible either: sphinx-doc/sphinx#6316

☹️

doc/source/data/api/dataset.rst

c21

LGTM. Seeing a conflict with master that needs to be resolved.

clarkzinzow

LGTM, awesome work! 🎉

jianoaix · 2022-08-12T01:16:09Z

doc/source/data/api/dataset.rst

+.. autosummary::
+    :nosignatures:
+
+    ray.data.Dataset.map


A stretch request: add a ref to each API, so that we can point to an individual API, which is useful. I think what's needed is adding something like ".. _dataset-map-ref:".

This should already be natively supported!

:py:meth:`ray.data.Dataset.map`

@jianoaix For the individual APIs, that should already be doable with cross-referencing via e.g. :meth:`ray.data.Dataset.map`

OK, can it be linked with a URL? E.g. https://docs.ray.io/en/master/data/package-ref.html#dataset-api. If we point a user to map(), can we give a URL pointing to it?

Found it: https://docs.ray.io/en/master/data/package-ref.html#ray.data.Dataset.map. I guess I was looking at the navigation list on the right-hand-side which doesn't show it.
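
The URL found above uses the fully qualified method name as the fragment. A minimal sketch of building such a link, assuming every documented method on the page gets an anchor equal to its fully qualified name, as that URL suggests (`api_anchor_url` is a hypothetical helper, not a Sphinx or Ray function):

```python
def api_anchor_url(page_url, qualified_name):
    """Point `page_url` at the anchor for `qualified_name`."""
    # Drop any existing fragment (e.g. "#dataset-api") before appending.
    base = page_url.split("#", 1)[0]
    return f"{base}#{qualified_name}"


print(
    api_anchor_url(
        "https://docs.ray.io/en/master/data/package-ref.html#dataset-api",
        "ray.data.Dataset.map",
    )
)
# prints https://docs.ray.io/en/master/data/package-ref.html#ray.data.Dataset.map
```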

ericl · 2022-08-12T06:09:50Z

doc/source/data/api/dataset.rst

+    ray.data.Dataset.write_csv
+    ray.data.Dataset.write_numpy
+    ray.data.Dataset.write_datasource
+    ray.data.Dataset.to_torch


Should we actually deprecate these two in 2.0?

Refactor Datasets API docs for easier navigation: [Ray Datasets API](https://ray--27592.org.readthedocs.build/en/27592/data/api/api.html)

### Changes

1. Create a new Datasets API base page.
2. Split existing APIs into separate pages.
3. Split `Dataset` and `DatasetPipeline` methods into separate sections.
   1. Used `autosummary` to generate overview tables at the top of each of these pages. Open to other suggestions e.g. moving the summary to the top of each section instead.
   2. **Note:** Every time we add a new method we need to explicitly add it here as well.
4. Add Input/Output APIs.
   1. I chose to split these primarily by data format rather than type, since it's easier to navigate, and the existing [Creating Datasets](https://docs.ray.io/en/master/data/creating-datasets.html) User Guide already does the latter.
5. Add `Block` and `DataBatch` (should we add these aliases?)
6. Remove existing `package-ref`.

Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>

matthewdeng added 2 commits August 5, 2022 17:10

[data] update datasets API structure

b3c14fd

Signed-off-by: Matthew Deng <matt@anyscale.com>

split dataset/pipeline methods

a5ee632

Signed-off-by: Matthew Deng <matt@anyscale.com>

matthewdeng assigned clarkzinzow Aug 6, 2022

matthewdeng added 2 commits August 7, 2022 18:57

address TODOs

3182738

Signed-off-by: Matthew Deng <matt@anyscale.com>

remove torch and tf from io

f207f95

Signed-off-by: Matthew Deng <matt@anyscale.com>

matthewdeng changed the title ~~[WIP][data] update datasets API structure~~ [data] update datasets API structure Aug 8, 2022

matthewdeng marked this pull request as ready for review August 8, 2022 03:15

matthewdeng requested review from a team, clarkzinzow, ericl, jianoaix, jjyao, maxpumperla and scv119 as code owners August 8, 2022 03:15

stephanie-wang added the copyediting-required label Aug 8, 2022

matthewdeng added the tests-ok label Aug 8, 2022

jianoaix reviewed Aug 8, 2022

doc/source/data/api/dataset_pipeline.rst (outdated)

doc/source/data/api/dataset.rst

jianoaix reviewed Aug 8, 2022

matthewdeng added 3 commits August 8, 2022 17:24

address comments

7357bae

Signed-off-by: Matthew Deng <matt@anyscale.com>

Merge branch 'master' of github.com:ray-project/ray into datasets-api…

85afda2

…-docs

remove signatures

ab87b62

Signed-off-by: Matthew Deng <matt@anyscale.com>

Merge branch 'master' of github.com:ray-project/ray into datasets-api…

daa65f8

…-docs

matthewdeng commented Aug 9, 2022

clarkzinzow reviewed Aug 9, 2022

c21 reviewed Aug 9, 2022

doc/source/data/api/dataset.rst

address comments

2fdcd31

Signed-off-by: Matthew Deng <matt@anyscale.com>

matthewdeng requested review from c21, clarkzinzow and jianoaix August 11, 2022 16:25

c21 approved these changes Aug 12, 2022

Merge branch 'master' into datasets-api-docs

5684334

clarkzinzow approved these changes Aug 12, 2022

jianoaix reviewed Aug 12, 2022

jianoaix approved these changes Aug 12, 2022

ericl reviewed Aug 12, 2022

ericl approved these changes Aug 12, 2022

ericl merged commit 9a0c1f5 into ray-project:master Aug 12, 2022

matthewdeng mentioned this pull request Aug 12, 2022

[data][docs] fix broken links #27818

Merged

7 tasks

matthewdeng mentioned this pull request Aug 12, 2022

[cherry-pick][data] update datasets API structure #27836

Merged

7 tasks

matthewdeng added a commit that referenced this pull request Aug 13, 2022

[cherry-pick][data] update datasets API structure (#27836)

3901e66

* [data] update datasets API structure (#27592) * [data][docs] fix broken links (#27818)


