ENH: Allow use of colon operator to slice ranges by column names #393

sglyon · 2013-11-06T04:30:23Z

This seems like reasonable functionality that is currently not implemented:

julia> df = DataFrame(quote
       A = [1:10]
       B = [1:10] .* 2
       C = [1:10] .* 3
       end
       )
10x3 DataFrame:
          A  B  C
[1,]      1  2  3
[2,]      2  4  6
[3,]      3  6  9
[4,]      4  8 12
[5,]      5 10 15
[6,]      6 12 18
[7,]      7 14 21
[8,]      8 16 24
[9,]      9 18 27
[10,]    10 20 30


julia> df["A":"B"]
ERROR: no method colon(ASCIIString,ASCIIString)

I would expect the return value to be something like:

10x2 DataFrame:
          A  B
[1,]      1  2
[2,]      2  4
[3,]      3  6
[4,]      4  8
[5,]      5 10
[6,]      6 12
[7,]      7 14
[8,]      8 16
[9,]      9 18
[10,]    10 20


johnmyleswhite · 2013-11-06T05:17:43Z

I don't think this is feasible given Julia's semantics. Let me explain my concerns:

Julia doesn't treat length-1 strings and characters as equivalent. Characters have a natural counting order, so that 'a':'d' actually makes sense. In contrast, "a":"d" doesn't really make sense because Julia doesn't impose any counting order on strings: Julia implements only a lexicographic ordering on strings.
Even if we were to invent a meaning for "a":"d", I would be fairly opposed to any attempt to make it work for DataFrames, but not work elsewhere in the language. In general, I think Julia has the great virtue of almost purely local semantics, which ensure that expressions have well-defined meanings in all contexts and don't vary based on surrounding factors. Making "a":"d" mean something inside of brackets that it doesn't mean outside of them would break this contract. If "a":"d" were to acquire meaning, that meaning should be defined in the core language, not in a library. If we could get all of the people in charge of Julia's core language to agree on a proper counting ordering for strings, then we could try to do this.
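
The first point can be checked with Base alone. This is a small sketch, not part of the original comment; `string_range_throws` is a hypothetical helper name used only here.

```julia
# Characters have a successor, so a character range can be enumerated.
char_range = collect('a':'d')

# Strings can be compared lexicographically...
is_ordered = "a" < "d"

# ...but there is no "next string" to step to, so building a range fails.
# `string_range_throws` is a hypothetical helper for this sketch.
function string_range_throws()
    try
        collect("a":"d")
        return false
    catch
        return true
    end
end
```

Here `char_range` holds the four characters from 'a' to 'd', `is_ordered` is true, and `string_range_throws()` reports that the string range could not be built.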

kmsquire · 2013-11-06T05:43:39Z

Hi John, I think that "a" and "d" are meant to be column labels in a DataFrame, and that "a":"d" is specifically meant to change meaning depending on the order of the columns in a dataframe (as in pandas). For example, one DataFrame might define columns as ["a", "mean", "var", "d"], and "a":"d" would be interpreted as columns 1 through 4, but a different DataFrame could have ["a", "z", "d"], and "a":"d" would be columns 1:3. I don't think there's any intent or use for "a":"d" to have a global meaning.
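
A minimal sketch of that container-dependent reading, using plain vectors of labels rather than a DataFrame. `name_range` is a hypothetical helper, not a DataFrames function.

```julia
# `name_range` is a hypothetical helper, not part of DataFrames.
# It maps a pair of column labels to the positional range between them.
function name_range(colnames::Vector{String}, from::String, to::String)
    start = findfirst(==(from), colnames)
    stop = findfirst(==(to), colnames)
    (start === nothing || stop === nothing) &&
        throw(ArgumentError("unknown column label"))
    return start:stop
end

# The same pair of labels resolves differently in different tables.
wide = name_range(["a", "mean", "var", "d"], "a", "d")  # 1:4
narrow = name_range(["a", "z", "d"], "a", "d")          # 1:3
```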

The only trick in actually defining the function here is that it has the name colon(a::String, b::String), since colons are a special syntax used to define symbols.

sglyon · 2013-11-06T05:45:22Z

Thanks @kmsquire, that is exactly what I intended in the original post

johnmyleswhite · 2013-11-06T05:50:05Z

I'm not sure I like the idea of allowing the meaning of an expression like "A":"D" to vary depending on the surrounding container. It starts to require something like delayed evaluation. I suppose Julia does already have end, but I'm kind of loath to encourage that kind of magic to spread outside of the core language.

That said, I'll defer to majority opinion if other people really like this idea.

simonster · 2013-11-06T06:18:47Z

This proposal bears some ideological similarity to JuliaLang/julia#1032, but I agree with @johnmyleswhite that it's a little awkward, especially outside of Base. From an implementation standpoint, there's no way to avoid giving "a":"d" global meaning; if we define colon(a::String, b::String), no other code can define its own meaning for that syntax.

kmsquire · 2013-11-07T01:44:20Z

> if we define colon(a::String, b::String), no other code can define its own meaning for that syntax.

Yeah, I see your point. (Actually, I think other code could define the same thing, it's just that the code last compiled would win, which wouldn't be good for consistency...)

One way forward would be to decide in Base that colon(a::String, b::String) always produces a StrColon(a,b) type, and then other code could dispatch on that.
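
A rough sketch of that idea, with the caveat that everything here is hypothetical: `StrColon` does not exist in Base, and `ToyTable` stands in for a real table type. The range object only records the two labels, and each container decides what they mean when it is indexed.

```julia
# Hypothetical types for illustration only; neither exists in Base or DataFrames.
struct StrColon
    first::String
    last::String
end

# A toy table: column labels plus one vector per column.
struct ToyTable
    colnames::Vector{String}
    columns::Vector{Vector{Int}}
end

# The container, not the range object, decides what the labels mean.
function Base.getindex(t::ToyTable, r::StrColon)
    start = findfirst(==(r.first), t.colnames)
    stop = findfirst(==(r.last), t.colnames)
    (start === nothing || stop === nothing) &&
        throw(KeyError((r.first, r.last)))
    return ToyTable(t.colnames[start:stop], t.columns[start:stop])
end

t = ToyTable(["A", "B", "C"], [collect(1:3), collect(2:2:6), collect(3:3:9)])
sub = t[StrColon("A", "B")]   # keeps columns "A" and "B"
```

The sketch constructs `StrColon("A", "B")` explicitly; making the literal syntax "A":"B" produce it would still require a definition in Base, which is the concern raised above.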

johnmyleswhite · 2013-11-07T06:24:53Z

I spent today thinking about this. In addition to @simonster's concerns about introducing a meaning for "a":"d" that's not in Base, what makes me uncomfortable about this is that it will make code hard to reason about in isolation. If this were in Julia code, you would need to know a lot about the context of an indexing operation to know what "a":"d" would evaluate to. I think that's bad for people reading code. It would probably also make it hard to write static analysis tools for Julia.

These concerns are actually not a problem for a construct like a[1:end], which can always be rewritten as a[1:length(a)] without any knowledge about the context in which they are evaluated.
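
As a quick illustration of that rewrite, using only Base:

```julia
a = [10, 20, 30, 40]

# Inside brackets, `end` stands for the last index of the object being indexed,
# so all three forms select the same elements.
with_end = a[1:end]
with_length = a[1:length(a)]
with_lastindex = a[1:lastindex(a)]
```

The translation needs only the name of the indexed object, which is visible in the expression itself.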

HarlanH · 2013-11-07T12:51:43Z

Another option that doesn't require funny syntax is to put back the group feature in column names. Spencer, for a while, before it got hard to manage, we had a feature where you could give a name to a group of columns, then use that as a reference, or a formula in glm or whatever: outcome ~ Manipulated + Context or whatever.

sglyon · 2013-11-07T14:37:41Z

I actually came across this feature in an old PR while searching for hierarchical indexing. I noticed that the PR was merged, but was surprised to see that I couldn't find the functionality.

Why did it get removed?

tshort · 2013-11-07T14:54:44Z

The grouping feature added quite a bit of complexity that was difficult to support as the code base changed.

kmsquire · 2013-11-07T22:47:26Z

Countering @johnmyleswhite, the purpose of df["a":"d"] would be to include all columns from "a" to "d". The alternative, currently, would be

julia> df[index(df)["A"]:index(df)["B"]]
10x2 DataFrame:
          A  B
[1,]      1  2
[2,]      2  4
[3,]      3  6
[4,]      4  8
[5,]      5 10
[6,]      6 12
[7,]      7 14
[8,]      8 16
[9,]      9 18
[10,]    10 20

Although I can certainly reason about what it means, I find that notation rather ugly.

Another option, of course, is just to use numbers, as df[1:3]. I find that much harder to reason about, and much harder to write if I want something beyond the 7th column.

johnmyleswhite · 2013-11-08T00:33:00Z

I'd like to hear what someone in Julia core thinks of this, since this change might end up affecting the whole language and not just this package.

For me, what's not so great about this approach is that I use strings as indices when I don't care about the order of columns in the DataFrame and I use numbers when I do care.

But to use this syntax, I have to care about the order of the strings -- saying "a":"d" only makes sense if you have perfect knowledge of the order of the columns. What happens when someone adds a new column between "a" and "d"? Your old code breaks unexpectedly?

Without knowing something about all of the columns in the DataFrame, you don't even know how many columns you'll get back. That's a non-trivial change from all of the non-expression based indexing we currently have.

Anyway, I'll back down and merge this kind of change if others really want it.

tshort · 2013-11-08T01:14:23Z

Here are other ideas on this theme.

df[ cols"colZ:colB" ]
df[ :(colZ : colB) ]
df[ colrange(df, "colZ", "colB") ]  # you can do this now, but you might be better off with:
colrange(df, "colZ", "colB")  # again, you can do this now
df[ colrange("colZ", "colB") ]  # here colrange() is a curried function

If I were to need this a lot (and I don't), I'd probably use the colrange(df, "colZ", "colB") option.

The first two of these ideas could also be used to give column names without quotes like:

df[ cols"colZ, colB, colA" ]
df[ :(colZ, colB, colA) ]

The curried function option is interesting in that you could have a numerics function that selects numeric columns, and it could be used as df[ numerics ].

Anyway, I think Stefan said once that we already have too many ways to do things, so I probably shouldn't fan the fire:)
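
For concreteness, here is one way the curried function and the string macro could be sketched in plain Julia. All names are hypothetical and operate on a vector of labels, not on a DataFrame; this is not DataFrames API.

```julia
# All names here are hypothetical sketches of the ideas above, not DataFrames API.

# Curried form: returns a function that resolves the range once it sees the labels.
colrange(from::String, to::String) = colnames -> begin
    start = findfirst(==(from), colnames)
    stop = findfirst(==(to), colnames)
    (start === nothing || stop === nothing) &&
        throw(ArgumentError("unknown column label"))
    start:stop
end

# String-macro form: cols"colZ:colB" splits the labels at macro-expansion time
# and produces the same kind of deferred selector.
macro cols_str(spec)
    from, to = strip.(split(spec, ':'))
    return :(colrange($(String(from)), $(String(to))))
end

labels = ["colA", "colZ", "colM", "colB", "colQ"]
from_function = colrange("colZ", "colB")(labels)  # 2:4
selector = cols"colZ:colB"
from_macro = selector(labels)                     # 2:4
```

Both forms defer the lookup until a concrete list of labels is supplied, so the expression itself has the same meaning everywhere.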

kmsquire · 2013-11-08T01:58:33Z

@johnmyleswhite, it might just be that I use DataFrames in a slightly different way than you're used to.

I have some tables where the format is prespecified (e.g., chromosome name, location, + specific columns with information about those regions), which I mostly interact with in pandas. Order matters, at least for the first 3-8 columns, and ordering within groups somewhat matters after that. There may be 250-300 columns. Of course, I don't want to look at all columns at once, but sometimes I want a group of them where I know the first and last label. Plus I want the genomic location, and possibly some other info from the first few columns. So, e.g., I'd like to be able to do:

df[["CHROM", "POS", "REF", "ALT", "DISEASES_PHENOTYPES":"Consequence_severest"], :]

This tells me a lot about what's in the resulting table (genomic location and disease information).

There might be other ways to do this in julia, and if so, that's great. (@tshort, thanks for the colrange pointer!) I'd just like the method to be not too much less flexible, expressive, or understandable than what I do now in pandas.

johnmyleswhite · 2013-11-08T02:09:08Z

That use case does make this seem much more reasonable.

Let's see what @StefanKarpinski, @ViralBShah or @JeffBezanson think. If any of them are on board, I'll stop complaining.

StefanKarpinski · 2013-11-08T22:33:30Z

Overloading : like this seems like a big no-no to me. However, the use-case does make some sense. One thought is to use "bar".."foo" to mean the interval of strings that are lexicographically between "bar" and "foo", but that only helps here if the column names are lexicographically ordered. I kind of think that explicitly taking indices is kind of a good thing since otherwise it's a bit weird for column ordering to be significant. Maybe there could be a convenience function for this?
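
To show how a lexicographic interval differs from a positional one, here is a small sketch on a made-up list of labels. It uses `filter` with string comparisons in place of a `..` operator, which Base does not define for strings.

```julia
# Made-up column labels, deliberately not in sorted order.
colnames = ["bar", "zeta", "cat", "foo", "egg"]

# Labels that sort between "bar" and "foo" inclusive, regardless of position.
lexicographic = filter(n -> "bar" <= n <= "foo", colnames)
# ["bar", "cat", "foo", "egg"]

# Labels positioned from "bar" through "foo" in this particular table.
start = findfirst(==("bar"), colnames)
stop = findfirst(==("foo"), colnames)
positional = colnames[start:stop]
# ["bar", "zeta", "cat", "foo"]
```

The two selections agree only when the columns happen to be stored in sorted order.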

nalimilan · 2013-11-09T20:03:38Z

+1. Lexicographic order sounds more robust than order of columns in the DataFrame. I think such a feature is supported in common statistical software (SAS, Stata IIRC). A separate colrange() function for the latter would be useful, but not for [].

johnmyleswhite · 2013-11-09T22:32:54Z

I'm glad other people are also a little turned off by this suggestion.

Let's bikeshed the best name for colrange to determine how to do this. This should be easy to implement once we agree on the interface.

kmsquire · 2013-11-09T22:51:06Z

I'd only ask that something like this be permissible:

df[["CHROM", "POS", "REF", "ALT", 
    colrange("DISEASES_PHENOTYPES","Consequence_severest")], :]

quinnj · 2017-09-07T04:18:32Z

Yeah, supporting (:col1):(:col3) is clearly not going to happen these days, but if someone wants to take a stab at a string macro or use of the .. operator, I think it could be entertained. Part of the issue is that DataFrames has moved to symbol indexing, and :col1..:col3 won't work because it tries to parse the ..: operator, which isn't valid.

bkamins · 2019-07-25T01:07:55Z

I would close it. We have a fixed standard indexing API. If someone needs to do this, Tables.columnindex can be used to get what is needed (it is not the shortest syntax imaginable, but it is good enough IMO):

start, stop = Tables.columnindex.(Ref(df), (:col1, :col2))
select(df, start:stop)

Feel free to reopen this if you disagree.

nalimilan · 2019-07-25T13:27:29Z

I think we should support something like JuliaDB's select(t, Between(start, stop)). That's something that also exists in dplyr.

bkamins · 2019-07-25T13:46:03Z

OK. Then Between should be moved to DataAPI.jl first. I am OK to add this.

bkamins · 2019-07-25T13:57:16Z

@quinnj + @piever: do you think it should go to DataAPI.jl or Tables.jl?

piever · 2019-07-26T19:58:22Z

I don't have a strong preference either way; maybe DataAPI makes the most sense, as it really is just an API. Slightly off-topic: I would also suggest adding the All selector, which takes the union of all selectors: https://juliacomputing.github.io/JuliaDB.jl/latest/api/#IndexedTables.All (if one wants to select two intervals, for example).

bkamins · 2019-07-26T20:06:32Z

Sure - adding All has been a pending request. So let us move both Between and All to DataAPI.jl.

@quinnj - are you OK with this?

quinnj · 2019-07-26T20:13:23Z

Sure

bkamins · 2019-08-11T08:50:23Z

Added in #1914
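
For readers arriving later, a short usage sketch, assuming a DataFrames.jl release that includes the Between selector and reusing the table from the original post. `Cols` is an assumption on my part: it is the union selector in recent releases and is not mentioned in this thread.

```julia
using DataFrames

df = DataFrame(A = 1:10, B = 2 .* (1:10), C = 3 .* (1:10))

# Select every column positioned from :A through :B, inclusive.
ab = select(df, Between(:A, :B))

# The same selector works in indexing.
ab_indexed = df[:, Between(:A, :B)]

# Assumption: Cols takes the union of several selectors,
# e.g. a single label plus a positional range.
mixed = select(df, Cols(:C, Between(:A, :B)))
```

As with the colon proposal, Between resolves by column position, so inserting a column between the two labels changes what is returned.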

bkamins mentioned this issue Jan 15, 2019

DataFrames.jl roadmap #1678

Closed

31 tasks

bkamins closed this as completed Jul 25, 2019

nalimilan reopened this Jul 25, 2019

bkamins mentioned this issue Jul 26, 2019

Integration with DataAPI.jl JuliaData/IndexedTables.jl#261

Closed

bkamins closed this as completed Aug 11, 2019
