Added readdlm option to ignore empty columns. #5403

Merged Jan 18, 2014 (1 commit)


tanmaykm
Member

This adds a new option, ignore_empty_columns, for readdlm. Setting it to true collapses adjacent column delimiters into one instead of producing cells that contain empty strings. This option can be used to read fixed-width format data files.

For example, with the delimiter set to the default (whitespace) and ignore_empty_columns set to false (the default):

julia> iob = IOBuffer("  1  2  3  4\n 11 12 13 14");

julia> readdlm(iob)
2x9 Array{Any,2}:
 ""    " "   1.0    ""  2.0  ""  3.0  ""   4.0
 ""  11.0   12.0  13.0   ""  ""   ""  ""  14.0

With ignore_empty_columns set to true:

julia> iob = IOBuffer("  1  2  3  4\n 11 12 13 14");

julia> readdlm(iob; ignore_empty_columns=true)
2x4 Array{Float64,2}:
  1.0   2.0   3.0   4.0
 11.0  12.0  13.0  14.0

I also updated the tests and docs for this option and added some new tests for UTF-8 strings.

@nalimilan
Member

How about calling this skip_repeated_separators or something like that? When you have such a file, you typically do not think of it as containing empty columns. You only think of it that way because readdlm() treats them as separate columns.

@JeffBezanson
Member

This is a great feature. I'm not sure what to call the option though; three words with underscores is pretty awkward.

@johnmyleswhite
Member

Sorry to be contrary, but I'm not sure this is such a great feature. Fixed-width files aren't that similar to delimited files, so I don't think readdlm should be made capable of reading them. After all, the defining characteristic of fixed width files is that they don't ever contain delimiters, not that they contain a haphazard number of consecutive delimiters.

readtable handles the specific example given in this pull request by treating whitespace separation as a highly idiosyncratic special case in which the delimiter is allowed to be repeated arbitrarily many times. This wouldn't make sense for files separated by commas, but it is how R processes whitespace separated files.

@JeffBezanson
Member

I'm sure you're right, but I think this feature has other uses. For example a delimited file might have empty columns that you'd rather throw out than get an Array{Any} back.

Whether we expose this flag or not, should we effectively set it to true when the delimiter(s) are whitespace?

@johnmyleswhite
Member

Based on the large number of files in the wild with erratic amounts of whitespace in them, something like this flag should probably be turned on by default for whitespace-separated files.

Regarding your Array{Any} point, throwing out an entire column definitely seems reasonable. But what if only one entry is missing?

@JeffBezanson
Member

True, the Array{Any} argument is a bit weak; you also have the option of specifying the element type, allowing you to get NaNs instead of empty strings.
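
As a rough illustration of the "NaNs instead of empty strings" idea, here is a minimal sketch written against Julia 1.x (an assumption; the thread predates it). The helper name `cell_to_float` is hypothetical and is not part of readdlm; it only shows how an empty cell can map to NaN so the row stays Float64.

```julia
# Hypothetical helper, not readdlm's implementation: parse one cell as
# Float64, mapping empty or non-numeric cells to NaN.
cell_to_float(s::AbstractString) = something(tryparse(Float64, strip(s)), NaN)

row = split("1,2,,3", ',')       # four fields; the third is the empty string
parsed = cell_to_float.(row)     # Vector{Float64}; the empty cell becomes NaN
```

With a float element type the result stays a homogeneous numeric array, at the cost of no longer distinguishing a missing cell from a literal NaN.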

@tanmaykm why is the second entry above " " instead of ""?

Anyway, for now, let's turn this on by default for whitespace-separated files, and not yet expose the option in the interface.

@mschauer
Contributor

There is also a bug in the old readdlm; one gets

julia> iob = IOBuffer(",,,1,,,,2,3,,,4,\n,,,1,,,1\n");

julia> readdlm(iob, ',', Any)
2x13 Array{Any,2}:
 ""  ","  ""  1.0  ""  ""   ""  2.0  3.0  ""  ""  4.0  ""
 ""  ""   ""  1.0  ""  ""  1.0   ""   ""  ""  ""   ""  ""

@tanmaykm
Member Author

Yes, that seems to be a bug. I shall update the PR with the bug fixed, this option turned on by default for whitespace separated files and not exposed in the API.

@tanmaykm
Member Author

Updated. The ignore_empty_columns option is not exposed in the API, but its behavior is turned on implicitly for whitespace-delimited data.

Here are the results with this change:

julia> iob = IOBuffer("  1  2  3  4\n 11 12 13 14");

julia> readdlm(iob)
2x4 Array{Float64,2}:
  1.0   2.0   3.0   4.0
 11.0  12.0  13.0  14.0

julia> iob = IOBuffer(",,,1,,,,2,3,,,4,\n,,,1,,,1\n");

julia> readdlm(iob, ',')
2x13 Array{Any,2}:
 ""  ""  ""  1.0  ""  ""   ""  2.0  3.0  ""  ""  4.0  ""
 ""  ""  ""  1.0  ""  ""  1.0   ""   ""  ""  ""   ""  ""

@kmsquire
Member

Can you clarify what happens with tab-delimited data (i.e., reading splits on tabs, but not spaces)? I work frequently with tab-delimited data, where two tabs imply the same behavior as two commas in CSV files.

@tanmaykm
Member Author

They will be treated similarly to space-delimited data. That is:

julia> readdlm(IOBuffer("1\t2\t\t3\n4\t5\t6\n"), '\t')
2x3 Array{Float64,2}:
 1.0  2.0  3.0
 4.0  5.0  6.0

Whereas the same data delimited with comma would yield:

julia> readdlm(IOBuffer("1,2,,3\n4,5,6\n"), ',')
2x4 Array{Any,2}:
 1.0  2.0   ""  3.0
 4.0  5.0  6.0   ""

Would it be preferable instead to turn this option on only when no delimiters are specified by the caller and default delimiters are used?

@kmsquire
Member

To me that seems reasonable. If others disagree, I would ask to special-case tab-separated files, since they are pretty common, but the fewer special cases, the better.

@johnmyleswhite
Member

I don't think tab-separated files need to be treated as a special case. In my experience, whitespace-separated means "any whitespace character any number of times without need for consistency within a file". People and systems (e.g. Hive) who use tabs as separators usually mean something much stricter than this, where only tabs count and repetition signals column changes.

@tanmaykm
Member Author

Agreed. Ignoring adjoining delimiters only when the caller specifies no delimiter character would handle both types of files appropriately: those where tabs/spaces are significant and those with padded columns.

I shall update the PR with this change.
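
The rule can be sketched per line as follows. This is only an illustration in Julia 1.x (an assumption), not the code in the PR; `split_fields` is a hypothetical name, and it leans on `Base.split`, which drops empty fields when called without a delimiter.

```julia
# Hypothetical helper, not the PR's code: split one line into fields.
# No delimiter given: any run of whitespace counts as a single separator.
# Explicit delimiter: every occurrence starts a new field, so two
# adjacent delimiters yield an empty field.
function split_fields(line::AbstractString, delim::Union{Char,Nothing}=nothing)
    if delim === nothing
        return split(line)                       # empty fields are dropped
    else
        return split(line, delim; keepempty=true)
    end
end

strict = split_fields("1\t2\t\t3", '\t')   # 4 fields, the third is ""
loose  = split_fields("1\t2\t\t3")         # 3 fields
padded = split_fields("  1  2  3  4")      # 4 fields, padding ignored
```

Keeping the explicit-delimiter path strict means that tab-separated files in which an empty cell is meaningful retain their column alignment.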

…djoining whitespaces.

fixed bug in handling empty columns.
updated tests and docs.
fixes JuliaLang#5391
@tanmaykm
Member Author

Updated. Now:

julia> readdlm(IOBuffer("1\t2\t\t3\n4\t5\t6\n"), '\t')
2x4 Array{Any,2}:
 1.0  2.0   ""  3.0
 4.0  5.0  6.0   ""

julia> readdlm(IOBuffer("1\t2\t\t3\n4\t5\t6\n"))
2x3 Array{Float64,2}:
 1.0  2.0  3.0
 4.0  5.0  6.0

julia> readdlm(IOBuffer("  1  2  3  4\n 11 12 13 14"))
2x4 Array{Float64,2}:
  1.0   2.0   3.0   4.0
 11.0  12.0  13.0  14.0

@kmsquire
Member

That works for me.

JeffBezanson added a commit that referenced this pull request Jan 18, 2014
Added readdlm option to ignore empty columns.
@JeffBezanson merged commit b465c3e into JuliaLang:master Jan 18, 2014
6 participants