
Discuss handling N/A or NULL values for Columnar Data in Eager #383

Open
BryanCutler opened this issue Jul 26, 2019 · 3 comments

Comments

@BryanCutler
Member

Following the discussion on #366, using eager mode gives more flexibility to handle certain aspects that are common in columnar data, such as N/A or NULL values.

Previously, when we created a dataset, we created it in one shot with every column. For example, for CsvDataset, we had to specify EVERY column beforehand, before it even ran.

That requirement existed because we used to implement against TF 1.13/1.14, where the TF graph is static. So we needed to know EVERYTHING beforehand in order to run the graph (or pass it to tf.keras).

Now as we move to TF 2.0, knowing everything beforehand is no longer necessary. We could just parse the file and find the metadata in eager mode, then build the dataset to pass to tf.keras.
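The eager workflow described above can be sketched with plain Python (using the stdlib `csv` module and an in-memory file as stand-ins; the hypothetical `per_column` lists play the role of per-column datasets):

```python
import csv
import io

# A tiny in-memory CSV standing in for a file on disk (hypothetical data).
raw = "a,b,c\n1,2,3\n4,5,6\n"

# Eagerly inspect the header to discover the columns -- no need to
# declare the full schema up front as in graph mode.
reader = csv.DictReader(io.StringIO(raw))
columns = reader.fieldnames  # discovered at runtime: ['a', 'b', 'c']

# With the metadata known, one dataset per column could then be built
# and zipped together (here plain lists stand in for tf.data datasets).
per_column = {col: [] for col in columns}
for row in reader:
    for col in columns:
        per_column[col].append(int(row[col]))

print(columns)      # ['a', 'b', 'c']
print(per_column)   # {'a': [1, 4], 'b': [2, 5], 'c': [3, 6]}
```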

In this situation, I am wondering if it makes sense to focus on "building a dataset with one column at a time"? Something like:

# read a parquet file and find all columns,
# then build one dataset per column and zip them together
dataset = tf.data.Dataset.zip(tuple(ParquetDataset(filename, column) for column in columns))

The reason is that, when we try to build a dataset from ALL columns, we assume all columns have the same number of records. But this is not the case for many formats such as HDF5 or Feather (if I understand correctly).

I noticed this issue when I tried to play with pandas. I just realized that in our current implementation it is hard to handle N/A or null fields.

But with TF 2.0 and eager execution, we actually have more freedom to handle those situations. For example, we could do additional bit masking before merging different columns.
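A minimal sketch of that bit-masking step, in NumPy (the column values and validity mask are hypothetical, shaped like what an Arrow validity bitmap would provide):

```python
import numpy as np

# Hypothetical column values where NULLs arrived as NaN,
# plus a validity mask as Arrow would provide it.
values = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
valid = np.array([True, False, True, False, True])

# Apply the mask before merging with other columns:
# either drop the NULL slots, or substitute a fill value.
filtered = values[valid]               # drop NULL slots
filled = np.where(valid, values, 0.0)  # or replace with a default

print(filtered)   # [1. 3. 5.]
print(filled)     # [1. 0. 3. 0. 5.]
```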

From that standpoint, maybe it makes more sense to focus on building a dataset with only one column at a time?

@BryanCutler
Member Author

Currently, in Arrow Datasets each record batch is checked for NULL values, and an error is raised if any are found, which isn't the ideal way to handle it.

@yongtang
Member

Handling NULL values under tf.data.Dataset might not be very easy, as tf.data.Dataset is an iterable and lacks indexing.

On the other hand, processing NULL values with a pure Tensor should be straightforward: assuming NULL values are sparse, an array could be used to store only the indices of the locations where the value could be NULL. At runtime it is then a matter of tf.scatter (or tf.scatter_nd) to filter out the NULL values.

When a file like Parquet or Arrow is read, we could effectively return two Tensors:

  1. The value Tensor, where NULL slots could hold anything (masked by (2))
  2. The mask index Tensor indicating the locations of the NULL values.
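The two-tensor scheme above can be sketched as follows (in NumPy for brevity, mirroring what tf.scatter_nd would do with an index tensor; the values and indices are hypothetical):

```python
import numpy as np

# (1) the value tensor: NULL slots may hold arbitrary data (-1 here)
values = np.array([10, -1, 30, -1, 50])
# (2) the mask index tensor: locations of the NULL values
null_idx = np.array([[1], [3]])

# Scatter ones into the NULL positions to build a boolean mask --
# the analogue of tf.scatter_nd(null_idx, updates, shape).
mask = np.zeros(values.shape[0], dtype=bool)
mask[null_idx.ravel()] = True

# At runtime, filter out the NULL values using the mask.
non_null = values[~mask]
print(non_null)   # [10 30 50]
```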

@yongtang
Member

yongtang commented Sep 12, 2019

@BryanCutler I tried NULL values with CSV. In Arrow, NULL values are already handled nicely, so it is not a big issue if the data source is indexable: indexable means random access, and we could always have a separate access function that takes null() and returns a bool mask.

I did notice that in pandas there is only a concept of NaN, not null. For us, I think having a separate access function for indexable sources to get a bool mask of NULL values would be good enough.

It looks like pandas does have isnull, but it does not detect a literal None unless the array is an object array (in a numeric array None is coerced to NaN).
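A quick illustration of that coercion (shown with plain NumPy arrays, which behave the same way as pandas columns backed by them):

```python
import numpy as np

# In an object array, None survives as None; in a float array,
# it is silently coerced to NaN -- which is why a None check only
# works on object arrays, and NaN is the NULL stand-in otherwise.
obj_arr = np.array([1, None, 3], dtype=object)
flt_arr = np.array([1.0, None, 3.0], dtype=float)

print(obj_arr[1] is None)     # True
print(np.isnan(flt_arr[1]))   # True -- None became NaN
```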

The issue is more or less related to iterables: an iterable assumes no random access and no look-back, so the NULL mask has to be obtained at the same time the value is retrieved.
