
Discuss handling N/A or NULL values for Columnar Data in Eager #383

Open
BryanCutler opened this issue Jul 26, 2019 · 3 comments

Comments

@BryanCutler
Member

Following the discussion on #366, using eager mode gives more flexibility to handle certain aspects that are common in columnar data, such as N/A or NULL values.

Previously, when we created a dataset, we created it in one shot with every column. For example, for CsvDataset, we had to specify EVERY column beforehand, before it even ran.

That requirement existed because we used to implement against TF 1.13/1.14, where the TF graph is static. So we needed to know EVERYTHING beforehand in order to run the graph (or pass it to tf.keras).

Now as we move to TF 2.0, knowing everything beforehand is no longer necessary. We could just parse the file and find the metadata in eager mode, then build the dataset to pass to tf.keras.
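The eager workflow described above can be sketched with plain Python (using the stdlib `csv` module and an in-memory file as stand-ins; the hypothetical `per_column` lists play the role of per-column datasets):

```python
import csv
import io

# A tiny in-memory CSV standing in for a file on disk (hypothetical data).
raw = "a,b,c\n1,2,3\n4,5,6\n"

# Eagerly inspect the header to discover the columns -- no need to
# declare the full schema up front as in graph mode.
reader = csv.DictReader(io.StringIO(raw))
columns = reader.fieldnames  # discovered at runtime: ['a', 'b', 'c']

# With the metadata known, one dataset per column could then be built
# and zipped together (here plain lists stand in for tf.data datasets).
per_column = {col: [] for col in columns}
for row in reader:
    for col in columns:
        per_column[col].append(int(row[col]))

print(columns)      # ['a', 'b', 'c']
print(per_column)   # {'a': [1, 4], 'b': [2, 5], 'c': [3, 6]}
```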

In this situation, I am wondering if it makes sense to focus on "building a dataset with one column at a time"? Something like:

# read a parquet file and find all columns,
# then build one dataset per column and zip them together
dataset = tf.data.Dataset.zip(tuple(ParquetDataset(filename, column) for column in columns))

The reason is that, when we try to build a dataset from ALL columns, we assume all columns have the same number of records. But this is not the case for many formats such as HDF5 or Feather (if I understand correctly).

I noticed this issue when I tried to play with pandas. I just realized that in our current implementation it is hard to handle N/A or null fields.

But with TF 2.0 and eager execution, we actually have more freedom to handle those situations. For example, we could do additional bit masking before merging different columns.
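A minimal sketch of that bit-masking step, in NumPy (the column values and validity mask are hypothetical, shaped like what an Arrow validity bitmap would provide):

```python
import numpy as np

# Hypothetical column values where NULLs arrived as NaN,
# plus a validity mask as Arrow would provide it.
values = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
valid = np.array([True, False, True, False, True])

# Apply the mask before merging with other columns:
# either drop the NULL slots, or substitute a fill value.
filtered = values[valid]               # drop NULL slots
filled = np.where(valid, values, 0.0)  # or replace with a default

print(filtered)   # [1. 3. 5.]
print(filled)     # [1. 0. 3. 0. 5.]
```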

From that standpoint, maybe it makes more sense to focus on building a dataset with only one column at a time?

@BryanCutler
Member Author

Currently, in Arrow Datasets each record batch is checked for NULL values, and an error is raised if any are found, which isn't the ideal way to handle it.

@yongtang
Member

Handling NULL values under tf.data.Dataset might not be very easy, as tf.data.Dataset is an iterable and lacks indexing.

On the other hand, processing NULL values with a pure Tensor should be straightforward: assuming NULL values are sparse, an array could be used to store only the indices of the locations where the value could be NULL. At runtime it is then a matter of tf.scatter (or tf.scatter_nd) to filter out the NULL values.

When a file like Parquet or Arrow is read, we could effectively return two Tensors:

  1. The value Tensor, where NULL slots could hold anything (masked by (2))
  2. The mask index Tensor indicating the locations of the NULL values.
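The two-tensor scheme above can be sketched as follows (in NumPy for brevity, mirroring what tf.scatter_nd would do with an index tensor; the values and indices are hypothetical):

```python
import numpy as np

# (1) the value tensor: NULL slots may hold arbitrary data (-1 here)
values = np.array([10, -1, 30, -1, 50])
# (2) the mask index tensor: locations of the NULL values
null_idx = np.array([[1], [3]])

# Scatter ones into the NULL positions to build a boolean mask --
# the analogue of tf.scatter_nd(null_idx, updates, shape).
mask = np.zeros(values.shape[0], dtype=bool)
mask[null_idx.ravel()] = True

# At runtime, filter out the NULL values using the mask.
non_null = values[~mask]
print(non_null)   # [10 30 50]
```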

@yongtang
Member

yongtang commented Sep 12, 2019

@BryanCutler I tried NULL values with CSV. In Arrow, NULL values are already handled nicely, so it is not a big issue if the data source is indexable: indexable means random access, and we could always have a separate access function that takes null() and returns a bool mask.

I did notice that in pandas there is only a concept of NaN, not null. For us, I think having a separate access function for indexable sources to get a bool mask of NULL values would be good enough.

It looks like pandas does have isnull, but it does not detect a literal None unless the array is an object array (in a numeric array None is coerced to NaN).
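A quick illustration of that coercion (shown with plain NumPy arrays, which behave the same way as pandas columns backed by them):

```python
import numpy as np

# In an object array, None survives as None; in a float array,
# it is silently coerced to NaN -- which is why a None check only
# works on object arrays, and NaN is the NULL stand-in otherwise.
obj_arr = np.array([1, None, 3], dtype=object)
flt_arr = np.array([1.0, None, 3.0], dtype=float)

print(obj_arr[1] is None)     # True
print(np.isnan(flt_arr[1]))   # True -- None became NaN
```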

The issue is more or less related to iterables: an iterable assumes no random access and no look-back, so the NULL mask has to be obtained at the same time the value is retrieved.
