Added readdlm option to ignore empty columns. #5403
How about calling this |
This is a great feature. I'm not sure what to call the option though; three words with underscores is pretty awkward. |
Sorry to be contrary, but I'm not sure this is such a great feature. Fixed-width files aren't that similar to delimited files, so I don't think
|
I'm sure you're right, but I think this feature has other uses. For example a delimited file might have empty columns that you'd rather throw out than get an Whether we expose this flag or not, should we effectively set it to |
Based on the large number of files in the wild with erratic numbers of whitespace in them, something like this flag should probably be turned on by default for whitespace-separated files. Regarding your |
True the @tanmaykm why is the second entry above Anyway, for now, let's turn this on by default for whitespace-separated files, and not yet expose the option in the interface. |
A bug in the old readdlm, one also gets
|
Yes, that seems to be a bug. I shall update the PR with the bug fixed, this option turned on by default for whitespace separated files and not exposed in the API. |
Updated. The Here are the results with this change:
|
Can you clarify what happens with tab-delimited data (i.e., reading splits |
They will be treated the same way as space-delimited data. That is:
Whereas the same data delimited with comma would yield:
Would it be preferable instead to turn this option on only when no delimiters are specified by the caller and default delimiters are used? |
To me that seems reasonable. If others disagree, I would ask to special-case tab-separated files, since they are pretty common, but the fewer special cases, the better. |
I don't think tab-separated files need to be treated as a special case. In my experience, whitespace separated means "any whitespace character any number of times without need for consistency within a file". People and systems (e.g. Hive) that use tabs as separators usually mean something much stricter than this, where only tabs count and repetition signals column changes. |
Agree. Ignoring adjoining delimiters only when no delimiter character is specified by the caller would appropriately handle both types of files - where tabs/spaces are significant and the ones with padded columns. I shall update the PR with this change. |
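To make the agreed rule concrete, here is a minimal sketch of the intended behaviour. It assumes a recent Julia, where readdlm lives in the DelimitedFiles standard library (it was part of Base when this PR was written), and the sample data is made up for illustration rather than taken from the PR's tests.

```julia
using DelimitedFiles  # assumption: Julia >= 0.7; readdlm was in Base at the time of this PR

# No delimiter given: runs of adjoining whitespace act as a single separator,
# so padded columns do not produce empty cells.
padded = readdlm(IOBuffer("1  2   3\n4 5 6\n"))

# Delimiter given explicitly: adjoining delimiters are significant,
# so the doubled comma yields an empty-string cell at row 1, column 2.
explicit = readdlm(IOBuffer("1,,3\n4,5,6\n"), ',')

println(size(padded))           # both inputs parse as 2 rows by 3 columns
println(explicit[1, 2] == "")   # the empty field is kept as an empty string
```

Under this rule, a tab-separated file where tabs are significant is read correctly by passing '\t' as the delimiter, while padded whitespace files need no delimiter argument at all.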
…djoining whitespaces. fixed bug in handling empty columns. updated tests and docs. fixes JuliaLang#5391
Updated. Now:
|
That works for me. On Saturday, January 18, 2014, Tanmay Mohapatra [email protected]
|
Added readdlm option to ignore empty columns.
This adds a new option ignore_empty_columns for readdlm. Setting this to true results in adjoining column delimiters being squashed instead of producing a cell with an empty string. This option can be used to read fixed-width format data files.

E.g., with delimiter set to default (whitespace), and ignore_empty_columns set to false (default):

With ignore_empty_columns set to true:

Also updated tests and docs for this option and added some new tests for utf8 strings.