Added readdlm option to ignore empty columns. #5403

tanmaykm · 2014-01-15T10:26:04Z

This adds a new option ignore_empty_columns for readdlm. Setting it to true squashes adjoining column delimiters into one instead of producing cells that contain an empty string. This option can be used to read fixed width format data files.

For example, with the delimiter set to the default (whitespace) and ignore_empty_columns set to false (the default):

julia> iob = IOBuffer("  1  2  3  4\n 11 12 13 14");

julia> readdlm(iob)
2x9 Array{Any,2}:
 ""    " "   1.0    ""  2.0  ""  3.0  ""   4.0
 ""  11.0   12.0  13.0   ""  ""   ""  ""  14.0

With ignore_empty_columns set to true:

julia> iob = IOBuffer("  1  2  3  4\n 11 12 13 14");

julia> readdlm(iob; ignore_empty_columns=true)
2x4 Array{Float64,2}:
  1.0   2.0   3.0   4.0
 11.0  12.0  13.0  14.0

Also updated tests and docs for this option and added some new tests for utf8 strings.

nalimilan · 2014-01-15T10:42:29Z

How about calling this skip_repeated_separators or something like that? When you have such a file, you typically do not think of it as containing empty columns. You only think about it that way because readdlm() treated them as separate columns.

JeffBezanson · 2014-01-15T21:58:33Z

This is a great feature. I'm not sure what to call the option though; three words with underscores is pretty awkward.

johnmyleswhite · 2014-01-15T22:07:25Z

Sorry to be contrary, but I'm not sure this is such a great feature. Fixed-width files aren't that similar to delimited files, so I don't think readdlm should be made capable of reading them. After all, the defining characteristic of fixed width files is that they don't ever contain delimiters, not that they contain a haphazard number of consecutive delimiters.
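To make the distinction concrete, here is a minimal sketch of how a fixed-width reader locates fields by position rather than by delimiter. The helper `split_fixed` and its `widths` argument are hypothetical names used only for illustration; they are not part of readdlm or readtable. The sketch assumes ASCII input and a recent Julia 1.x.

```julia
# Hypothetical fixed-width splitter: each field occupies a known number of
# characters, so no delimiter is consulted at all. Assumes ASCII input.
function split_fixed(line::AbstractString, widths::Vector{Int})
    fields = String[]
    start = 1
    for w in widths
        # Clamp the end of the slice so short lines do not raise an error.
        stop = min(start + w - 1, lastindex(line))
        push!(fields, start <= stop ? strip(line[start:stop]) : "")
        start += w
    end
    return fields
end

# The rows from this pull request, read as four columns of width 3.
split_fixed("  1  2  3  4", [3, 3, 3, 3])
split_fixed(" 11 12 13 14", [3, 3, 3, 3])
```

The padding inside each slice is stripped after slicing, which is why a fixed-width file never needs a notion of "repeated delimiters".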

readtable handles the specific example given in this pull request by treating whitespace separation as a highly idiosyncratic special case in which the delimiter is allowed to be repeated arbitrarily many times. This wouldn't make sense for files separated by commas, but it is how R processes whitespace separated files.

JeffBezanson · 2014-01-15T22:13:19Z

I'm sure you're right, but I think this feature has other uses. For example, a delimited file might have empty columns that you'd rather throw out than get an Array{Any} back.

Whether we expose this flag or not, should we effectively set it to true when the delimiter(s) are whitespace?

johnmyleswhite · 2014-01-15T22:17:45Z

Based on the large number of files in the wild with erratic amounts of whitespace in them, something like this flag should probably be turned on by default for whitespace-separated files.

Regarding your Array{Any} point, throwing out an entire column definitely seems reasonable. But what if only one entry is missing?

JeffBezanson · 2014-01-15T22:24:40Z

True, the Array{Any} argument is a bit weak; you also have the option of specifying the element type, allowing you to get NaNs instead of empty strings.
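As a rough illustration of that point, the sketch below converts the cells of one comma-separated row to Float64 and maps an empty cell to NaN. The helper `to_float` is a hypothetical name and this is not how readdlm is implemented internally; it only shows why a fixed element type avoids Array{Any}. It assumes a recent Julia 1.x.

```julia
# Hypothetical cell converter: empty (or blank) cells become NaN,
# everything else is parsed as a Float64.
to_float(cell::AbstractString) = isempty(strip(cell)) ? NaN : parse(Float64, cell)

row = split("1,2,,3", ',')      # the third field is an empty string
parsed = map(to_float, row)     # a Vector{Float64}; the empty cell is NaN
```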

@tanmaykm why is the second entry above " " instead of ""?

Anyway, for now, let's turn this on by default for whitespace-separated files, and not yet expose the option in the interface.

mschauer · 2014-01-15T22:50:34Z

There is a bug in the old readdlm; one also gets:

julia> iob = IOBuffer(",,,1,,,,2,3,,,4,\n,,,1,,,1\n");

julia> readdlm(iob, ',', Any)
2x13 Array{Any,2}:
 ""  ","  ""  1.0  ""  ""   ""  2.0  3.0  ""  ""  4.0  ""
 ""  ""   ""  1.0  ""  ""  1.0   ""   ""  ""  ""   ""  ""

tanmaykm · 2014-01-16T04:10:29Z

Yes, that seems to be a bug. I shall update the PR with the bug fixed, this option turned on by default for whitespace separated files and not exposed in the API.

tanmaykm · 2014-01-16T10:33:53Z

Updated. The ignore_empty_columns option is not exposed in the API, but its behavior is turned on implicitly for whitespace delimited data.

Here are the results with this change:

julia> iob = IOBuffer("  1  2  3  4\n 11 12 13 14");

julia> readdlm(iob)
2x4 Array{Float64,2}:
  1.0   2.0   3.0   4.0
 11.0  12.0  13.0  14.0

julia> iob = IOBuffer(",,,1,,,,2,3,,,4,\n,,,1,,,1\n");

julia> readdlm(iob, ',')
2x13 Array{Any,2}:
 ""  ""  ""  1.0  ""  ""   ""  2.0  3.0  ""  ""  4.0  ""
 ""  ""  ""  1.0  ""  ""  1.0   ""   ""  ""  ""   ""  ""

kmsquire · 2014-01-16T14:02:02Z

Can you clarify what happens with tab-delimited data (i.e., reading splits on tabs, but not spaces)? I work frequently with tab-delimited data, where two tabs imply the same behavior as two commas in csv files.

tanmaykm · 2014-01-16T15:19:24Z

They will be treated similarly to space delimited data. That is:

julia> readdlm(IOBuffer("1\t2\t\t3\n4\t5\t6\n"), '\t')
2x3 Array{Float64,2}:
 1.0  2.0  3.0
 4.0  5.0  6.0

Whereas the same data delimited with comma would yield:

julia> readdlm(IOBuffer("1,2,,3\n4,5,6\n"), ',')
2x4 Array{Any,2}:
 1.0  2.0   ""  3.0
 4.0  5.0  6.0   ""

Would it be preferable instead to turn this option on only when no delimiters are specified by the caller and default delimiters are used?

kmsquire · 2014-01-16T18:21:07Z

To me that seems reasonable. If others disagree, I would ask to special-case tab-separated files, since they are pretty common, but the fewer special cases, the better.

johnmyleswhite · 2014-01-16T23:27:14Z

I don't think tab-separated files need to be treated as a special case. In my experience, whitespace separated means "any whitespace character any number of times without need for consistency within a file". People and systems (e.g. Hive) that use tabs as separators usually mean something much stricter than this, where only tabs count and repetition signals column changes.

tanmaykm · 2014-01-17T05:32:37Z

Agreed. Ignoring adjoining delimiters only when no delimiter character is specified by the caller would appropriately handle both types of files: those where tabs/spaces are significant and those with padded columns.

I shall update the PR with this change.
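A minimal sketch of the rule described above, written with Base's `split` rather than readdlm's actual parser. The helper `split_row` is a hypothetical name for illustration only, and the sketch assumes a recent Julia 1.x where `split` accepts the `keepempty` keyword.

```julia
# Hypothetical helper: with no delimiter given, runs of whitespace act as a
# single separator; with an explicit delimiter, every occurrence counts.
function split_row(line::AbstractString, delim::Union{Char,Nothing}=nothing)
    if delim === nothing
        return split(line)                     # empty fields are dropped
    end
    return split(line, delim; keepempty=true)  # empty fields are kept
end

split_row("1\t2\t\t3", '\t')   # 4 fields, the third one empty
split_row("1\t2\t\t3")         # 3 fields
split_row("  1  2  3  4")      # 4 fields, padding ignored
```

Keying the behavior on whether a delimiter was passed, rather than on which character it is, keeps strict tab-separated files and padded whitespace files from interfering with each other.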

tanmaykm · 2014-01-18T08:18:07Z

Updated. Now:

julia> readdlm(IOBuffer("1\t2\t\t3\n4\t5\t6\n"), '\t')
2x4 Array{Any,2}:
 1.0  2.0   ""  3.0
 4.0  5.0  6.0   ""

julia> readdlm(IOBuffer("1\t2\t\t3\n4\t5\t6\n"))
2x3 Array{Float64,2}:
 1.0  2.0  3.0
 4.0  5.0  6.0

julia> readdlm(IOBuffer("  1  2  3  4\n 11 12 13 14"))
2x4 Array{Float64,2}:
  1.0   2.0   3.0   4.0
 11.0  12.0  13.0  14.0

kmsquire · 2014-01-18T15:20:59Z

That works for me.

Added readdlm option to ignore empty columns.

when no delimiter is specified, delimiters are taken as one or more adjoining whitespaces. fixed bug in handling empty columns. updated tests and docs. fixes JuliaLang#5391

def6c1d

JeffBezanson added a commit that referenced this pull request Jan 18, 2014

Merge pull request #5403 from tanmaykm/readcsv

b465c3e

Added readdlm option to ignore empty columns.

JeffBezanson merged commit b465c3e into JuliaLang:master Jan 18, 2014
