Skip to content

[data] update datasets API structure#27592

Merged
ericl merged 10 commits intoray-project:masterfrom
matthewdeng:datasets-api-docs
Aug 12, 2022
Merged

[data] update datasets API structure#27592
ericl merged 10 commits intoray-project:masterfrom
matthewdeng:datasets-api-docs

Conversation

@matthewdeng
Copy link
Contributor

@matthewdeng matthewdeng commented Aug 6, 2022

Signed-off-by: Matthew Deng matt@anyscale.com

Why are these changes needed?

Refactor Datasets API docs for easier navigation: Ray Datasets API

Changes

  1. Create a new Datasets API base page.
  2. Split existing APIs into separate pages.
  3. Split Dataset and DatasetPipeline methods into separate sections.
    1. Used autosummary to generate overview tables at the top of each of these pages. Open to other suggestions e.g. moving the summary to the top of each section instead.
    2. Note: Every time we add a new method we need to explicitly add it here as well.
  4. Add Input/Output APIs.
    1. I chose to split these primarily by data format rather than type, since it's easier to navigate, and the existing Creating Datasets User Guide already does the latter.
  5. Add Block and DataBatch (should we add these aliases?)
  6. Remove existing package-ref.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
@matthewdeng matthewdeng changed the title [WIP][data] update datasets API structure [data] update datasets API structure Aug 8, 2022
@matthewdeng matthewdeng marked this pull request as ready for review August 8, 2022 03:15
@matthewdeng matthewdeng added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Aug 8, 2022
@clarkzinzow
Copy link
Contributor

clarkzinzow commented Aug 8, 2022

This is looking awesome! The main thing that needs to be tweaked is the CSS around the autosummary tables; it looks like its bleeding into the right sidebar which is causing the docstring summaries to be cut off, but if it were properly constrained to its box, those docstring summaries should properly wrap.

Also I don't know if the tabbed view is going to work since there are a lot of tabs and the tab titles are too long... I think that a flat layout at the top of the page, or breaking up the autosummary into each section, would be less clunky.

ray.data.Dataset.randomize_block_order
ray.data.Dataset.repartition

.. tabbed:: Splitting and Merging Datasets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be a bit tricky going forward, merging ops are transformations and splitting ops are consumptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any concrete suggestions for how to structure this in this PR? (As a user I don't think I'd immediately go to Consuming Datasets if I were looking to split my dataset.

@matthewdeng
Copy link
Contributor Author

@clarkzinzow

This is looking awesome! The main thing that needs to be tweaked is the CSS around the autosummary tables; it looks like its bleeding into the right sidebar which is causing the docstring summaries to be cut off, but if it were properly constrained to its box, those docstring summaries should properly wrap.

Yep this just got fixed in master yesterday! #27611 Will merge master to include this.

Also I don't know if the tabbed view is going to work since there are a lot of tabs and the tab titles are too long... I think that a flat layout at the top of the page, or breaking up the autosummary into each section, would be less clunky.

Agreed, do you have a preference?

@clarkzinzow
Copy link
Contributor

clarkzinzow commented Aug 8, 2022

@matthewdeng How about a single mono-table at the top, ordered alphabetically, with just the method names instead of the f"{module}.{class}.{method}" scheme?

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
@matthewdeng
Copy link
Contributor Author

matthewdeng commented Aug 9, 2022

@clarkzinzow updated the summary tables!

  1. Decided to keep the same ordering/format, but use bold "headers" instead of a table. Thought this would help with logical navigation.
  2. Set :nosignatures: since the parameter names aren't very insightful.
  3. I couldn't figure out how to get read of the module/class names...

image

@clarkzinzow
Copy link
Contributor

@matthewdeng Awesome, that's way better! I'll do a full review of this this morning.

The table width still seems to be off (running under the right sidebar) but it looks like you already merged master. 🤔 Any idea what's going on there?

@matthewdeng
Copy link
Contributor Author

@clarkzinzow yeah looks like the generated docs are from the commit before
I merged master: https://readthedocs.org/projects/ray/builds/17659166/

Let me try to trigger it again...


.. automethod:: ray.data.DatasetPipeline.stats

.. automethod:: ray.data.DatasetPipeline.sum No newline at end of file
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I wasn't sure where to put this one, seems kind of out of place but I also didn't want to create an entire Grouped and Global Aggregations section just for this.

Copy link
Contributor

@clarkzinzow clarkzinzow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, mostly just a few nits


Ray Datasets API
================
Input/Output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you went with a flat list rather than grouping these APIs by data type, e.g. tabular, tensor, text, binary, etc. Looking at it now, I think that the flat list is better for quick discoverability, but we might want to reexamine such a grouping if we add support for a few more file formats to keep this section's size in the ToC manageable.

But yeah, the flat list looks good to me for now!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, basically we can follow the headers in https://docs.ray.io/en/master/data/creating-datasets.html, let me know if you think it's worthwhile to add in this PR!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current flat list looks good to me!

ray.data.DatasetPipeline.split
ray.data.DatasetPipeline.split_at_indices

**Creating DatasetPipelines**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there's a better name for this section, but I haven't been able to think of one.


.. automethod:: ray.data.DatasetPipeline.add_column

.. automethod:: ray.data.DatasetPipeline.drop_columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if these could pop up in the right sidebar, nested under the section title, while scrolling as is done for subsections, but unfortunately that doesn't appear to be possible either: sphinx-doc/sphinx#6316

☹️

Signed-off-by: Matthew Deng <matt@anyscale.com>
Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Seeing a conflict with master needed to be resolved.

Copy link
Contributor

@clarkzinzow clarkzinzow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, awesome work! 🎉

.. autosummary::
:nosignatures:

ray.data.Dataset.map
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A stretch request: add a ref to each API, so that we can pinpoint to individual API, which is useful. I think what's needed is adding something like ".. _dataset-map-ref:".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should already be natively supported!

:py:meth:`ray.data.Dataset.map`

Copy link
Contributor

@clarkzinzow clarkzinzow Aug 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jianoaix For the individual APIs, that should already be doable with cross-referencing via e.g. :meth:`ray.data.Dataset.map`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, can it be linked with a url? e.g. https://docs.ray.io/en/master/data/package-ref.html#dataset-api, if we point a user to map(), can we give a ulr pointing to it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found it: https://docs.ray.io/en/master/data/package-ref.html#ray.data.Dataset.map. I guess I was looking at the navigation list on the right-hand-side which doesn't show it.

ray.data.Dataset.write_csv
ray.data.Dataset.write_numpy
ray.data.Dataset.write_datasource
ray.data.Dataset.to_torch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we actually deprecate these two in 2.0?

@ericl ericl merged commit 9a0c1f5 into ray-project:master Aug 12, 2022
@matthewdeng matthewdeng mentioned this pull request Aug 12, 2022
7 tasks
matthewdeng added a commit to matthewdeng/ray that referenced this pull request Aug 12, 2022
Refactor Datasets API docs for easier navigation: [Ray Datasets API](https://ray--27592.org.readthedocs.build/en/27592/data/api/api.html)

1. Create a new Datasets API base page.
2. Split existing APIs into separate pages.
3. Split `Dataset` and `DatasetPipeline` methods into separate sections.
     1. Used `autosummary` to generate overview tables at the top of each of these pages. Open to other suggestions e.g. moving the summary to the top of each section instead.
     2. **Note:** Every time we add a new method we need to explicitly add it here as well.
4. Add Input/Output APIs.
     1. I chose to split these primarily by data format rather than type, since it's easier to navigate, and the existing [Creating Datasets](https://docs.ray.io/en/master/data/creating-datasets.html) User Guide already does the latter.
6. Add `Block` and `DataBatch` (should we add these aliases?)
7. Remove existing `package-ref`.
matthewdeng added a commit that referenced this pull request Aug 13, 2022
* [data] update datasets API structure (#27592)
* [data][docs] fix broken links (#27818)
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Refactor Datasets API docs for easier navigation: [Ray Datasets API](https://ray--27592.org.readthedocs.build/en/27592/data/api/api.html)

### Changes

1. Create a new Datasets API base page.
2. Split existing APIs into separate pages.
3. Split `Dataset` and `DatasetPipeline` methods into separate sections.
     1. Used `autosummary` to generate overview tables at the top of each of these pages. Open to other suggestions e.g. moving the summary to the top of each section instead.
     2. **Note:** Every time we add a new method we need to explicitly add it here as well.
4. Add Input/Output APIs.
     1. I chose to split these primarily by data format rather than type, since it's easier to navigate, and the existing [Creating Datasets](https://docs.ray.io/en/master/data/creating-datasets.html) User Guide already does the latter.
6. Add `Block` and `DataBatch` (should we add these aliases?)
7. Remove existing `package-ref`.

Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tests-ok The tagger certifies test failures are unrelated and assumes personal liability.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants