Discuss handling N/A or NULL values for Columnar Data in Eager #383
Comments
Currently, in Arrow Datasets, each record batch is checked for NULL values and an error is raised if any are found, which isn't the ideal way to handle them.
Handling NULL values under tf.data.Dataset might not be very easy, as tf.data.Dataset is an iterable and lacks indexing. On the other hand, processing NULL values with pure Tensors should be straightforward: assuming NULL values are sparse, an array could store only the indices of the locations where the value is NULL. At runtime it is then a matter of using tf.scatter (or tf.scatter_nd) to filter out the NULL values. When a file such as Parquet or Arrow is read, we could effectively return two Tensors:
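A minimal sketch of the sparse-index idea above, assuming TensorFlow 2.x in eager mode. The column contents, the `0.0` placeholder for NULL slots, and the variable names are illustrative assumptions, not part of any existing reader API:

```python
import tensorflow as tf

# Hypothetical column of six values; positions 1 and 4 were NULL in the file.
# Assumption: the reader fills NULL slots with a placeholder (0.0 here).
values = tf.constant([1.0, 0.0, 3.0, 4.0, 0.0, 6.0])

# Sparse list of NULL positions, shaped [num_nulls, 1] as tf.scatter_nd expects.
null_indices = tf.constant([[1], [4]])

# Scatter ones into a zero vector to get a dense NULL mask.
is_null = tf.cast(
    tf.scatter_nd(
        null_indices,
        tf.ones_like(null_indices[:, 0]),
        shape=tf.shape(values),
    ),
    tf.bool,
)

# Keep only the entries that are not NULL.
non_null = tf.boolean_mask(values, tf.logical_not(is_null))
```

Storing only the NULL indices keeps the second Tensor small when NULLs are rare; the dense mask is built only when it is needed.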
@BryanCutler I tried NULL values with CSV. Arrow already handles NULL values nicely, so this is not a big issue if the data source is indexable: indexable means random access, and we could always have a separate access function to take a
Looks like pandas does have
The issue is more or less related to iterables: an iterable assumes no random access and no look-back, so the NULL flag has to be obtained at the same time the value is retrieved.
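To illustrate the iterable constraint, here is a small pure-Python sketch in which each element carries its NULL flag alongside the value, since there is no way to look it up later. The helper name `with_null_flags` and the `fill` placeholder are hypothetical and used only for illustration:

```python
from typing import Iterable, Iterator, Optional, Tuple


def with_null_flags(
    rows: Iterable[Optional[float]], fill: float = 0.0
) -> Iterator[Tuple[float, bool]]:
    """Yield (value, is_null) pairs in a single forward pass."""
    for row in rows:
        is_null = row is None
        # Substitute a placeholder so the value slot always holds a float.
        yield (fill if is_null else row, is_null)


# A forward-only stream: no random access, no look-back.
stream = iter([1.5, None, 3.0])
pairs = list(with_null_flags(stream))
```

Here `pairs` holds one `(value, is_null)` tuple per input element, so a downstream consumer never needs to revisit the source to learn which entries were NULL.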
Following the discussion on #366, using eager mode gives more flexibility to handle certain aspects that are common in columnar data, such as N/A or NULL values.