CSV Column Rename #530

Closed
milafrerichs opened this issue Jan 27, 2016 · 15 comments

Comments

@milafrerichs

Hello @onyxfish ,

thanks for csvkit and your other cool tools like agate and proof.

I wanted to ask if you would integrate a new part into csvkit.

CSVRename

csvrename would allow you to change the column headers of your dataset. agate has a similar method for agate.Table, and I often have to rename columns.

Right now I'm using the header shell script from the book Data Science for the Command Line Toolkit.
It works, but everybody has to install it themselves, so I cannot share my data pipelines (Makefiles).

I have already created a csvrename, and it would work as follows:

Rename/Replace all headers

Replace all the column headers with new ones, as long as the list of new names has the same length as the number of columns.

csvrename -n e,d,c,b,a

Rename specific column headers

csvrename -n d,c -c b,a

I could also add another argument to select the columns by index:

csvrename -n d,c -i 2,1
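The three invocations above can be read as one header-rewriting rule. Below is a minimal Python sketch of that rule, not the csvrename implementation mentioned in this issue. The function name `rename_header` and its parameters are hypothetical, and it assumes that `-i` indices are 1-based and that each new name is paired in order with the selected column.

```python
def rename_header(header, new_names, columns=None, indices=None):
    """Return a copy of `header` with some or all names replaced.

    Hypothetical helper: `columns` mirrors the proposed -c option and
    `indices` mirrors the proposed -i option (assumed to be 1-based).
    """
    if columns is not None and indices is not None:
        raise ValueError("pass either columns or indices, not both")

    # No selection: replace every header, so the lengths must match.
    if columns is None and indices is None:
        if len(new_names) != len(header):
            raise ValueError("need exactly one new name per column")
        return list(new_names)

    if columns is not None:
        # list.index raises ValueError if a named column does not exist.
        positions = [header.index(name) for name in columns]
    else:
        for i in indices:
            if not 1 <= i <= len(header):
                raise ValueError("column index out of range: %d" % i)
        positions = [i - 1 for i in indices]

    if len(positions) != len(new_names):
        raise ValueError("need exactly one new name per selected column")

    result = list(header)
    for position, new_name in zip(positions, new_names):
        result[position] = new_name
    return result
```

With a header of `a,b,x`, both `rename_header(["a", "b", "x"], ["d", "c"], columns=["b", "a"])` and `rename_header(["a", "b", "x"], ["d", "c"], indices=[2, 1])` rename `b` to `d` and `a` to `c` and leave `x` alone.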

What do you think? Or do you have another easy way to do it?
Thanks.

@jpmckinney
Member

@onyxfish Is your opinion the same as in #310 (comment)?

@onyxfish
Collaborator

onyxfish commented Feb 5, 2016

I think so. This still feels like it crosses the line over into the realm of things the command line is a bad environment for. Two comma-separated, quoted lists of column names are just not a very clear way of expressing this behavior, and the length of the commands gets unwieldy very fast.

That being said, I had a need for something like this just this week, so I can feel the pain.

@jpmckinney What do you think?

@jpmckinney
Member

I think the common case would be to rename one or two columns, not all the columns, in which case the length of the command is fine. I have needed this, too, when, for whatever reason, the government changed one header in one file in a set of files.

@onyxfish
Collaborator

onyxfish commented Feb 5, 2016

It's been my experience that when those kinds of changes happen, columns are typically also inserted and removed, as was the case in #310. For instance, this is the case with the Census Bureau County Business Patterns data files, which pick up a new column suddenly in 2008. It's opening the door to that cascade of related "slight tweak" problems that I'm leery of.

@jpmckinney
Member

I prefer #245 over #310 (and #245 would fix the underlying issue that led to #310). I would like a solution for #245.

@onyxfish
Collaborator

onyxfish commented Feb 5, 2016

That's reasonable. That would also have resolved my issue with the CBP data. Happy to consider that as an extension to agate.Table.merge.

@jpmckinney
Member

csvstack is currently streaming, and I'd like to preserve that.

@onyxfish
Collaborator

onyxfish commented Feb 5, 2016

Well for what it's worth this is now implemented in agate. It should be pretty straightforward to duplicate the logic for the csvkit streaming interface.
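To illustrate why a streaming version should be straightforward: only the first row needs to change, and every other row can be passed through untouched. The sketch below uses only the standard library `csv` module and is not how agate or csvkit implement renaming. The function name `stream_rename` and the old-name-to-new-name `mapping` argument are assumptions made for this example.

```python
import csv
import io


def stream_rename(infile, outfile, mapping):
    """Copy CSV rows from infile to outfile, renaming header cells.

    `mapping` is a hypothetical {old_name: new_name} dict. Rows are
    handled one at a time, so the whole file is never held in memory.
    """
    reader = csv.reader(infile)
    writer = csv.writer(outfile, lineterminator="\n")

    header = next(reader, None)
    if header is None:
        return  # empty input, nothing to write

    # Names missing from the mapping are kept unchanged.
    writer.writerow([mapping.get(name, name) for name in header])

    for row in reader:
        writer.writerow(row)


if __name__ == "__main__":
    source = io.StringIO("a,b\n1,2\n")
    target = io.StringIO()
    stream_rename(source, target, {"b": "c"})
    print(target.getvalue(), end="")
```

In a real command-line tool the two file objects would typically be standard input and standard output, so the function could sit in the middle of a shell pipeline.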

@jpmckinney
Member

Noting that the method is agate.Table.rename

@jpmckinney
Member

Thank you for suggesting this new CSV tool. However, the maintainers have decided to not author, merge or maintain new tools; there is simply not enough time to do so. Our focus is instead on making the existing tools as good as possible.

We encourage you to create and maintain your own tool as a separate Python package. You may want to use the agate library, which csvkit uses for most of its CSV reading and writing. Doing so will make it easier to maintain common behavior with csvkit’s tools.

@metasoarous

metasoarous commented Mar 8, 2017

This is disappointing, quite frankly. This would be a very useful feature for quickly cleaning up data from the command line. I don't at all see how this crosses into "the realm of things the command line is a bad environment for". On the contrary, I've been able to write rather intricate scripts/pipelines for data munging, and this is the sort of thing I hate having to drop down to sed for.

@jpmckinney
Member

@metasoarous That was not the reason for closing the issue - re-read the last comment before closing. If you want this feature, implement it. The maintainers are not your free labor.

@metasoarous

metasoarous commented Mar 8, 2017

I read the comment before writing. I also didn't demand that anyone work on it. The original poster indicated they'd already created an implementation. I was only making a plea/suggestion that it be considered for inclusion, and am saddened not only about this issue but that you and the rest of the team are categorically opposed to any new tools. It's your project though. Obviously you have the right to do what you like with it. As a user, I just wanted to let you know how I feel about it.

@cosmoKenney

Renaming columns can be done with csvsql. Just alias the names:

select "My First Column" as "FirstColumn", "My Second Column" as "SecondColumn" --...
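The aliasing idea can be tried without csvsql by loading a CSV into an in-memory SQLite database and reading the output header from the query result. This is only a sketch of the SQL technique, not of how csvsql works internally. The function `rename_with_sql` and the table name `data` are hypothetical.

```python
import csv
import io
import sqlite3


def rename_with_sql(csv_text, query, table="data"):
    """Load CSV text into SQLite, run `query`, and return the result as CSV.

    Hypothetical helper: the table name "data" is an assumption, and all
    values are stored as text exactly as they appear in the input.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]

    conn = sqlite3.connect(":memory:")
    # Quote identifiers so that names containing spaces are accepted.
    columns = ", ".join('"{}"'.format(name.replace('"', '""')) for name in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), body
    )

    cursor = conn.execute(query)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # cursor.description carries the aliased column names from the query.
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    conn.close()
    return out.getvalue()


if __name__ == "__main__":
    text = "My First Column,My Second Column\n1,2\n"
    sql = (
        'SELECT "My First Column" AS "FirstColumn", '
        '"My Second Column" AS "SecondColumn" FROM data'
    )
    print(rename_with_sql(text, sql), end="")
```

The drawback raised earlier in the thread still applies: every column that should survive has to be listed in the query, so the statement grows with the width of the file.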

@jpmckinney
Member

Open issue: #396

5 participants