ENH: Allow use of colon operator to slice ranges by column names #393
Comments
I don't think this is feasible given Julia's semantics. Let me explain my concerns:
|
Hi John, I think that "a" and "d" are meant to be column labels in a DataFrame, and that The only trick in actually defining the function here is that it has the name |
Thanks @kmsquire, that is exactly what I intended in the original post |
I'm not sure I like the idea of allowing the meaning of an expression like That said, I'll defer to majority opinion if other people really like this idea. |
This proposal bears some ideological similarity to JuliaLang/julia#1032, but I agree with @johnmyleswhite that it's a little awkward, especially outside of Base. From an implementation standpoint, there's no way to avoid giving |
Yeah, I see your point. (Actually, I think other code could define the same thing, it's just that the code last compiled would win, which wouldn't be good for consistency...) One way forward would be to decide in |
I spent today thinking about this. In addition to @simonster's concerns about introducing a meaning for These concerns are actually not a problem for a construct like |
Another option that doesn't require funny syntax is to put back the group On Thu, Nov 7, 2013 at 1:24 AM, John Myles White
|
I actually came across this feature in an old PR while searching for hierarchical indexing. I noticed that the PR was merged, but was surprised to see that I couldn't find the functionality. Why did it get removed? |
The grouping feature added quite a bit of complexity that was difficult to |
Countering @johnmyleswhite, the purpose of
julia> df[index(df)["A"]:index(df)["C"]]
10x2 DataFrame:
        A  B
[1,]    1  2
[2,]    2  4
[3,]    3  6
[4,]    4  8
[5,]    5 10
[6,]    6 12
[7,]    7 14
[8,]    8 16
[9,]    9 18
[10,]  10 20
Although I can certainly reason about what it means, I find that notation rather ugly. Another option, of course, is just to use numbers, as |
I'd like to hear what someone in Julia core thinks of this, since this change might end up affecting the whole language and not just this package. For me, what's not so great about this approach is that I use strings as indices when I don't care about the order of columns in the DataFrame and I use numbers when I do care. But to use this syntax, I have to care about the order of the strings -- saying "a":"d" only makes sense if you have perfect knowledge of the order of the columns. What happens when someone adds a new column between "a" and "d"? Your old code breaks unexpectedly? Without knowing something about all of the columns in the DataFrame, you don't even know how many columns you'll get back. That's a non-trivial change from all of the non-expression based indexing we currently have. Anyway, I'll back down and merge this kind of change if others really want it. -- John
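To make the ordering concern above concrete, here is a minimal sketch in plain Julia. It assumes a name range is resolved by position in the current column order. The helper `names_between` is hypothetical and works on a plain vector of names; it is not part of DataFrames.jl.

```julia
# Hypothetical helper (not DataFrames.jl API): return the names spanned by
# from..to, using the position of each name in the current column order.
function names_between(cols::Vector{String}, from::String, to::String)
    i = findfirst(==(from), cols)
    j = findfirst(==(to), cols)
    (i === nothing || j === nothing) && throw(ArgumentError("column not found"))
    return cols[i:j]
end

cols = ["a", "b", "c", "d"]
before = names_between(cols, "a", "d")

# Someone later inserts a new column between "a" and "d".
cols_new = copy(cols)
insert!(cols_new, 3, "extra")
after = names_between(cols_new, "a", "d")

println(before)
println(after)
```

The same `"a"` to `"d"` request now returns one more column than before, which is the silent change in behavior the comment worries about.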
|
Here are other ideas on this theme.
df[ cols"colZ:colB" ]
df[ :(colZ : colB) ]
df[ colrange(df, "colZ", "colB") ] # you can do this now, but you might be better off with:
colrange(df, "colZ", "colB") # again, you can do this now
df[ colrange("colZ", "colB") ] # here colrange() is a curried function
If I were to need this a lot (and I don't), I'd probably use the
The first two of these ideas could also be used to give column names without quotes like:
df[ cols"colZ, colB, colA" ]
df[ :(colZ, colB, colA) ]
The curried function option is interesting in that you could have a
Anyway, I think Stefan said once that we already have too many ways to do things, so I probably shouldn't fan the fire :) |
@johnmyleswhite, it might just be that I use DataFrames in a slightly different way than you're used to. I have some tables where the format is prespecified (e.g., chromosome name, location, + specific columns with information about those regions), which I mostly interact with in pandas. Order matters, at least for the first 3-8 columns, and ordering within groups somewhat matters after that. There may be 250-300 columns. Of course, I don't want to look at all columns at once, but sometimes I want a group of them where I know the first and last label. Plus I want the genomic location, and possibly some other info from the first few columns. So, e.g., I'd like to be able to do:
df[["CHROM", "POS", "REF", "ALT", "DISEASES_PHENOTYPES":"Consequence_severest"], :]
This tells me a lot about what's in the resulting table (genomic location and disease information). There might be other ways to do this in julia, and if so, that's great. (@tshort, thanks for the |
That use case does make this seem much more reasonable. Let's see what @StefanKarpinski, @ViralBShah or @JeffBezanson think. If any of them are on board, I'll stop complaining. |
Overloading : like this seems like a big no-no to me. However, the use-case does make some sense. One thought is to use |
+1. Lexicographic order sounds more robust than order of columns in the DataFrame. I think such a feature is supported in common statistical software (SAS, Stata IIRC). A separate |
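One possible reading of the lexicographic idea, sketched in plain Julia: keep every name that sorts between the two bounds, wherever it sits in the table. The helper `lex_between` is hypothetical and is not part of DataFrames.jl.

```julia
# Hypothetical helper (not DataFrames.jl API): keep the names that compare
# between lo and hi (inclusive) under Julia's default string ordering.
lex_between(cols::Vector{String}, lo::String, hi::String) =
    [c for c in cols if lo <= c <= hi]

cols = ["d", "a", "zeta", "b", "c"]

# Matches are returned in storage order; only membership depends on sorting.
picked = lex_between(cols, "a", "d")

# Reordering the columns changes the order of the result, not its contents.
picked_reordered = lex_between(reverse(cols), "a", "d")

println(picked)
println(picked_reordered)
```

Under this reading the set of selected columns no longer depends on column order, though a newly added column whose name sorts inside the bounds would still be picked up.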
I'm glad other people are also a little turned off by this suggestion. Let's bikeshed the best name for |
I'd only ask that something like this be permissible:
df[["CHROM", "POS", "REF", "ALT",
    colrange("DISEASES_PHENOTYPES","Consequence_severest")], :] |
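A rough sketch of how a curried `colrange`, as floated earlier in the thread, could sit inside a list of plain names. Everything here is an assumption for illustration: `colrange` returns a closure that is resolved later against the column order, `resolve` and `select_names` are hypothetical helpers, and the `QUAL` and `GENE` columns are invented. None of this is the DataFrames.jl API.

```julia
# Hypothetical curried colrange: capture the two end labels now and
# resolve them against the column order later.
function colrange(from::String, to::String)
    return function (cols::Vector{String})
        i = findfirst(==(from), cols)
        j = findfirst(==(to), cols)
        (i === nothing || j === nothing) && throw(ArgumentError("unknown column in range"))
        return cols[i:j]
    end
end

# A plain name selects itself; a deferred range is applied to the column order.
resolve(cols::Vector{String}, name::String) = [name]
resolve(cols::Vector{String}, r::Function) = r(cols)

# Expand a mixed list of names and deferred ranges into a flat list of names.
select_names(cols::Vector{String}, selectors) =
    reduce(vcat, [resolve(cols, s) for s in selectors])

# Invented column order for illustration only.
cols = ["CHROM", "POS", "REF", "ALT", "QUAL",
        "DISEASES_PHENOTYPES", "GENE", "Consequence_severest"]

picked = select_names(cols,
    ["CHROM", "POS", colrange("DISEASES_PHENOTYPES", "Consequence_severest")])

println(picked)
```

Deferring the lookup keeps `:` untouched, at the cost of one extra function name for users to learn.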
Yeah, supporting |
I would close it. We have a fixed standard indexing API. If someone needs to do it
Feel free to reopen this if you disagree. |
I think we should support something like JuliaDB's |
OK. Then |
I don't have a strong preference either way, maybe DataAPI makes the most sense as it really is just an API. Slightly off-topic, I would suggest to also add the |
Sure - adding @quinnj - are you OK with this? |
Sure |
Added in #1914 |
This seems like reasonable functionality that is currently not implemented:
I would expect the return value to be something like: