Further optimize from_pandas_edgelist with cudf#4528
Further optimize from_pandas_edgelist with cudf#4528rapids-bot[bot] merged 12 commits intorapidsai:branch-24.08from
from_pandas_edgelist with cudf#4528Conversation
| if is_src_copied: | ||
| src_indices = src_array | ||
| else: | ||
| src_indices = cp.array(src_array) |
There was a problem hiding this comment.
Maybe I'm missing something obvious, but couldn't the same behavior be achieved by just doing the following?
| if is_src_copied: | |
| src_indices = src_array | |
| else: | |
| src_indices = cp.array(src_array) | |
| src_indices = cp.array(src_array, copy=False) |
There was a problem hiding this comment.
I think you meant:
src_indices = cp.array(src_array, copy=not is_src_copied)which should work as expected.
I found if-else branches more clear to use here, and easy for us to do given that we already have the booleans around.
There was a problem hiding this comment.
And once again, if you didn't catch it from one of my previous comments in previous PRs, numpy 2 changed semantics:
https://numpy.org/doc/stable/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword
so I think it's okay to be extra clear here.
There was a problem hiding this comment.
Thanks for the explanation, but no, actually that snippet wasn't what I was thinking. But after re-reading again I think there's some necessary side effects that I was missing, but I'd rather not assume I know what they are. Let's chat offline and I'll update the comment afterwards.
There was a problem hiding this comment.
I think my questioning of why not just use copy=False and therefore why _cp_iscopied_asarray is needed at all is because I didn't initially realize this function was intentionally making copies of the incoming dataframe series for the graph to own. That makes sense and I confirmed this with Erik offline (although we should probably think about a future improvement to prevent yet another copy happening when the PLC graph is made, but that can be for later) so my only request is a comment mentioning that. Perhaps something like this here:
# Now that the arrays have been extracted from the input
# dataframe, create copies the graph instance can own. If a
# copy already occurred during the extraction step above,
# don't copy again.
but maybe something even shorter and better, maybe just in the docstring for the function could simply be
This function will create a copy of the input dataframe data
source and target series to be owned by the returned Graph object.
There was a problem hiding this comment.
Thanks, code comments added. NetworkX doesn't share ownership with input objects, so I figure neither should we.
|
Thanks @eriknw , I really like the attention to detail with the tests! I just had one question (see review) and I'm wondering if the answer to it could significantly change the PR. |
rlratzel
left a comment
There was a problem hiding this comment.
Thanks for meeting offline and clarifying things! I updated the review with the explanation. LGTM overall, but I have just one request for a comment to add.
| if is_src_copied: | ||
| src_indices = src_array | ||
| else: | ||
| src_indices = cp.array(src_array) |
There was a problem hiding this comment.
I think my questioning of why not just use copy=False and therefore why _cp_iscopied_asarray is needed at all is because I didn't initially realize this function was intentionally making copies of the incoming dataframe series for the graph to own. That makes sense and I confirmed this with Erik offline (although we should probably think about a future improvement to prevent yet another copy happening when the PLC graph is made, but that can be for later) so my only request is a comment mentioning that. Perhaps something like this here:
# Now that the arrays have been extracted from the input
# dataframe, create copies the graph instance can own. If a
# copy already occurred during the extraction step above,
# don't copy again.
but maybe something even shorter and better, maybe just in the docstring for the function could simply be
This function will create a copy of the input dataframe data
source and target series to be owned by the returned Graph object.
…share ownership, so neither should we
rlratzel
left a comment
There was a problem hiding this comment.
Thanks for the new comments.
|
/merge |
|
/merge |
This continues #4525 (and this comment) to avoid copies and to be more optimal whether using pandas, cudf, or cudf.pandas. Notably, using
s.to_numpywith cudf will return a numpy array, butcudf.pandasmay return a cupy array (proxy).Also,
s.to_numpy(copy=False)(from comment) is not used, b/c cudf'sto_numpyraises ifcopy=False. We get the behavior we want by not specifyingcopy=.I don't know if this is the best way to determine whether a copy occurred or not, but this seems like a useful pattern to establish, because we want to make ingest more efficient.
CC @rlratzel