Further optimize `from_pandas_edgelist` with cudf by eriknw · Pull Request #4528 · rapidsai/cugraph

eriknw · 2024-07-09T11:58:13Z

This continues #4525 (and this comment) to avoid copies and to be more efficient whether using pandas, cudf, or cudf.pandas. Notably, using s.to_numpy with cudf will return a numpy array, but with cudf.pandas it may return a cupy array (proxy).

Also, s.to_numpy(copy=False) (from comment) is not used, b/c cudf's to_numpy raises if copy=False. We get the behavior we want by not specifying copy=.

I don't know if this is the best way to determine whether a copy occurred or not, but this seems like a useful pattern to establish, because we want to make ingest more efficient.
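One way to picture "determine whether a copy occurred" is sketched below. This is not the PR's `_cp_iscopied_asarray`; the helper name `asarray_iscopied` is hypothetical, and NumPy stands in for CuPy so the snippet runs without a GPU. It assumes that comparing memory with `np.shares_memory` is an acceptable way to detect a copy.

```python
import numpy as np


def asarray_iscopied(values, dtype=None):
    """Hypothetical helper: convert to an ndarray and report whether data was copied."""
    arr = np.asarray(values, dtype=dtype)
    if isinstance(values, np.ndarray):
        # No shared memory between input and output means a copy was made
        # (for example, because of a dtype conversion).
        is_copied = not np.shares_memory(arr, values)
    else:
        # Non-array inputs such as lists are always materialized into new memory.
        is_copied = True
    return arr, is_copied


src = np.arange(5, dtype=np.int64)
same, copied_same = asarray_iscopied(src)                  # no conversion needed
cast, copied_cast = asarray_iscopied(src, dtype=np.int32)  # dtype change forces a copy
```

Returning the boolean alongside the array lets the caller decide later whether it still needs to make its own copy.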

CC @rlratzel

rlratzel · 2024-07-23T15:30:44Z

python/nx-cugraph/nx_cugraph/convert_matrix.py

+        if is_src_copied:
+            src_indices = src_array
+        else:
+            src_indices = cp.array(src_array)


Maybe I'm missing something obvious, but couldn't the same behavior be achieved by just doing the following?

Suggested change

-        if is_src_copied:
-            src_indices = src_array
-        else:
-            src_indices = cp.array(src_array)
+        src_indices = cp.array(src_array, copy=False)

I think you meant:

src_indices = cp.array(src_array, copy=not is_src_copied)

which should work as expected.

I found the if-else branches clearer here, and they are easy to write given that we already have the booleans around.

And once again, if you didn't catch it from one of my previous comments in previous PRs, numpy 2 changed semantics:
https://numpy.org/doc/stable/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword
so I think it's okay to be extra clear here.
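A minimal sketch of the explicit-branch pattern, with NumPy standing in for CuPy so it runs anywhere. The function name `take_ownership` is hypothetical and is not part of nx-cugraph. It relies only on the default `copy=True` of `np.array`, which always copies on both NumPy 1.x and 2.x, so it does not depend on the changed meaning of `copy=False`.

```python
import numpy as np


def take_ownership(array, is_copied):
    """Hypothetical helper: return an array the caller can own, copying at most once."""
    if is_copied:
        # The extraction step already produced a fresh buffer; reuse it.
        return array
    # Default copy=True always copies, on NumPy 1.x and NumPy 2.x alike.
    return np.array(array)


external = np.arange(4)
owned = take_ownership(external, is_copied=False)   # fresh copy
reused = take_ownership(external, is_copied=True)   # same object, no second copy
owned[0] = 99                                       # does not affect `external`
```

Spelling out the two branches keeps the intent readable without relying on how a particular library version interprets the `copy` keyword.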

Thanks for the explanation, but no, that snippet wasn't actually what I was thinking. After re-reading, I think there are some necessary side effects that I was missing, but I'd rather not assume I know what they are. Let's chat offline and I'll update the comment afterwards.

I questioned why we can't just use copy=False, and therefore why _cp_iscopied_asarray is needed at all, because I didn't initially realize that this function intentionally copies the incoming dataframe series so the graph can own them. That makes sense, and I confirmed it with Erik offline. (We should probably think about a future improvement to avoid yet another copy when the PLC graph is made, but that can be for later.) My only request is a comment mentioning this. Perhaps something like this here:

# Now that the arrays have been extracted from the input
# dataframe, create copies the graph instance can own. If a
# copy already occurred during the extraction step above,
# don't copy again.

Or maybe something even shorter and better, just in the docstring for the function, could simply be:

This function will create a copy of the input dataframe data source and target series to be owned by the returned Graph object.

Thanks, code comments added. NetworkX doesn't share ownership with input objects, so I figure neither should we.

rlratzel · 2024-07-23T15:38:12Z

Thanks @eriknw, I really like the attention to detail with the tests! I just had one question (see review), and I'm wondering if the answer to it could significantly change the PR.

rlratzel

Thanks for meeting offline and clarifying things! I updated the review with the explanation. LGTM overall, but I have just one request for a comment to add.

rlratzel · 2024-07-25T09:50:46Z


rlratzel

Thanks for the new comments.

rlratzel · 2024-07-25T14:45:26Z

/merge

rlratzel · 2024-07-30T13:51:57Z

/merge

Further optimize from_pandas_edgelist with cudf

e56ca08

eriknw added the improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), python, and nx-cugraph labels Jul 9, 2024

eriknw requested a review from a team as a code owner July 9, 2024 11:58

eriknw added 6 commits July 9, 2024 05:01

Oops fix copyright (and install/run pre-commit)

6261969

Merge branch 'branch-24.08' into df_avoid_copies

8676c9b

Add basic tests that smoked out a couple issues :)

768b44d

Test create_using too

530b482

Merge branch 'branch-24.08' into df_avoid_copies

20ad596

Merge branch 'branch-24.08' into df_avoid_copies

2b18d39

rlratzel reviewed Jul 23, 2024

eriknw added 2 commits July 24, 2024 11:51

Merge branch 'branch-24.08' into df_avoid_copies

82d076d

Merge branch 'branch-24.08' into df_avoid_copies

f397d7f

rlratzel reviewed Jul 25, 2024

eriknw added 2 commits July 25, 2024 06:26

Merge branch 'df_avoid_copies' of github.com:eriknw/cugraph into df_avoid_copies

eb981ef

Add comments about not sharing ownership of arrays. NetworkX doesn't share ownership, so neither should we

3b047ec

rlratzel approved these changes Jul 25, 2024

Merge branch 'branch-24.08' into df_avoid_copies

7a3f74c

rapids-bot bot merged commit 4ff7acb into rapidsai:branch-24.08 Jul 30, 2024

rapids-bot[bot] merged 12 commits into rapidsai:branch-24.08 from eriknw:df_avoid_copies
