Proposal: easier column extraction and data cleaning #146

@piever

I think something like the @df macro in StatPlots would benefit many different packages: it lets the output of a query be fed directly to a function as normal arrays (and this can be done without even explicitly collecting the query, see here). I was wondering whether something similar could live in Query as well. I'm thinking of a macro along these lines:

@replace_complete_cols df f(_..a, _..b, _..c .+ 1)

which would replace each _.. expression with the respective column converted to a regular Array, excluding any row where one of the columns being used has missing data. Two more tools would help implement this functionality and would go well together with it:

  • a stand-alone @dropna macro that would keep only the rows with no missing values
  • as mentioned here, the possibility of having a tuple in a @select statement, which could then be collected as a tuple of Arrays. Without that, selecting an arbitrary number of columns is a bit cumbersome: I haven't found a way of selecting multiple columns with a NamedTuple iterator, because there doesn't seem to be a type-stable way of generating a NamedTuple without manually typing each element, whereas a list comprehension works just fine for tuples.
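
To make the intended behaviour concrete, here is a minimal sketch in plain Julia of what the macro and the two tools above would roughly amount to. It is not Query.jl code: complete_rows and complete_cols are hypothetical names, the table is assumed to be a NamedTuple of equal-length vectors, missing data is assumed to be Base's missing, and f is just a stand-in function.

```julia
# Hypothetical helpers for illustration only; neither is part of Query.jl.
# Assumes the table is a NamedTuple of equal-length vectors and that missing
# data is represented by Base's `missing`.

# Row mask (the @dropna idea): true where every requested column has a value.
function complete_rows(table::NamedTuple, names::Symbol...)
    cols = map(n -> getproperty(table, n), names)
    return [all(c -> !ismissing(c[i]), cols) for i in eachindex(first(cols))]
end

# Tuple of plain Vectors, one per requested column, restricted to complete rows.
function complete_cols(table::NamedTuple, names::Symbol...)
    keep = complete_rows(table, names...)
    return map(n -> collect(skipmissing(getproperty(table, n)[keep])), names)
end

df = (a = [1, missing, 3, 4], b = [1.0, 2.0, missing, 4.0], c = [10, 20, 30, 40])

a, b, c = complete_cols(df, :a, :b, :c)

# Stand-in for the f(_..a, _..b, _..c .+ 1) call in the proposal.
f(x, y, z) = sum(x) + sum(y) + sum(z)
result = f(a, b, c .+ 1)
```

Here rows 2 and 3 are dropped because a or b is missing, so a, b and c come back as two-element vectors whose element types no longer include Missing. The macro would essentially generate the complete_cols call from the _.. expressions it finds in the function call.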

Do you believe that this kind of macro belongs in Query.jl, or should it live somewhere else?
Also, what syntax do you think would be best? What I put here is pretty much a placeholder.
