Names are dropped when converting a NamedArray to a DataFrame #76

Open
arnaudmgh opened this issue Mar 24, 2019 · 3 comments

arnaudmgh commented Mar 24, 2019

I ran into a problem when writing the result of freqtable to a CSV file: I converted it to a DataFrame and lost all the names.

The solution I came up with was to overload CSV.write as follows:

function CSV.write(file::Union{String, IO}, named::NamedArray)
  nsc = names(named)
  named2 = hcat(nsc[1], named)
  named2 = DataFrame(named2)
  names!(named2, Symbol.(vcat("row_names", string.(nsc[2]))))
  CSV.write(file, named2)
end

I'd be willing to help, for example by submitting a PR, depending on what you suggest.

  • this is a problem with the DataFrame function, so maybe overloading DataFrame is better;
  • on the other hand, this solution adds a dependency on DataFrames.jl, so using Tables.jl may be more appropriate;
  • this may be a special solution for FreqTables.jl, so should it belong in FreqTables.jl instead?

Please let me know what would help and make sense. Thanks!

@nalimilan
Contributor

This is definitely not specific to FreqTables (implementing the method there would be type piracy), so it should either use Tables.jl or a special DataFrames constructor. Tables.jl doesn't support arrays, so that leaves DataFrames.

Though there's some tension with the way AbstractMatrix behaves: DataFrame(::AbstractMatrix) gives a data frame with the same dimensions as the input. Yet DataFrame(::NamedMatrix) would have an additional column giving the row names. That means NamedArray wouldn't completely work like other AbstractArray objects. A solution would be to have a keyword argument to add row names, which would be off by default.
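To make the keyword idea concrete, here is a minimal sketch in plain Julia. It is only an illustration under stated assumptions: `wide_columns` and its `add_rownames` keyword are hypothetical names, not part of NamedArrays.jl or DataFrames.jl, and a plain matrix plus two name vectors stand in for a NamedMatrix. The result is a NamedTuple of column vectors, which has the same number of columns as the input unless row names are explicitly requested.

```julia
# Hypothetical helper, not an existing NamedArrays.jl or DataFrames.jl function.
# A plain matrix and two name vectors stand in for a NamedMatrix.
function wide_columns(values::AbstractMatrix, rownames::AbstractVector,
                      colnames::AbstractVector; add_rownames::Bool=false)
    size(values) == (length(rownames), length(colnames)) ||
        throw(DimensionMismatch("names do not match the matrix size"))
    colsyms = Symbol.(colnames)                      # fresh vector of column names
    cols = Any[values[:, j] for j in 1:size(values, 2)]
    if add_rownames
        # Opt-in only: the default keeps the same shape as the input matrix.
        pushfirst!(colsyms, :row_names)
        pushfirst!(cols, collect(rownames))
    end
    # A NamedTuple of equal-length vectors can be used as a column table.
    return NamedTuple{Tuple(colsyms)}(Tuple(cols))
end

m = [1 2; 3 4]
plain = wide_columns(m, ["r1", "r2"], ["a", "b"])
named = wide_columns(m, ["r1", "r2"], ["a", "b"]; add_rownames=true)
```

With the default, `plain` has only the columns `a` and `b`; with `add_rownames=true`, `named` gains a leading `row_names` column. Keeping the extra column opt-in is what preserves the usual AbstractMatrix behavior.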

Another consideration is that a different conversion rule can be considered for higher-dimensional NamedArray objects: have one column per dimension and one row per cell. This is how it works for example in R if you call as.data.frame on a table object (but not on an R named array). This is useful in particular for frequency tables. Maybe we can find a different solution for that, though (something like stack).
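The "one column per dimension, one row per cell" rule can be sketched in plain Julia for any number of dimensions. Again this is an assumption-laden illustration rather than an existing API: `long_rows`, `labels`, `dimnames` and `valuecol` are hypothetical names, and a plain array with one label vector per dimension stands in for a NamedArray.

```julia
# Hypothetical helper, not an existing NamedArrays.jl or DataFrames.jl function.
# `labels[d]` holds the names along dimension d; `dimnames[d]` names that dimension.
function long_rows(values::AbstractArray{T,N}, labels::NTuple{N,AbstractVector},
                   dimnames::NTuple{N,Symbol}; valuecol::Symbol=:value) where {T,N}
    size(values) == map(length, labels) ||
        throw(DimensionMismatch("labels do not match the array size"))
    rowkeys = (dimnames..., valuecol)
    # One row per cell, visited in column-major order (first dimension varies fastest).
    return map(vec(collect(CartesianIndices(values)))) do I
        cell_labels = ntuple(d -> labels[d][I[d]], N)
        NamedTuple{rowkeys}((cell_labels..., values[I]))
    end
end

counts = [1 2; 3 4]
rows = long_rows(counts, (["r1", "r2"], ["a", "b"]), (:row, :col))
```

Each element of `rows` is a NamedTuple with one field per dimension plus the cell value, so a 2×2 table yields four rows. A vector of NamedTuples like this is the kind of row-oriented data that table packages generally accept.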

@arnaudmgh
Author

Thanks for the explanations and the good points @nalimilan. I agree the transformation of higher-dimensional arrays performed by R's as.data.frame looks very much like a stack operation.

So, indeed there is some tension between the intuitive two-dimensional solution and the higher-dimensional tables - the function I wrote above would ignore higher dimensions.

One possibility would be to stack by default, even for 2D arrays. A user can always unstack if necessary.

@dietercastel
Contributor

dietercastel commented Jul 2, 2020

This should be solved with #99 for arbitrary dimensions.
