Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize (or rename) headers #396

Open
pudo opened this issue Apr 7, 2015 · 16 comments
Open

Normalize (or rename) headers #396

pudo opened this issue Apr 7, 2015 · 16 comments

Comments

@pudo
Copy link

pudo commented Apr 7, 2015

This is specifically with regards to csvsql, where loading a CSV file with Some manually entered - header (TM) will give you a data structure that is really hard to query. But I think having the ability to essentially slugify and transliterate a headers would be useful for other tools like csvclean aswell.

I tried looking into how this could be done, and it looks like parse_column_identifiers (here) could be an appropriate place, but that would require pulling the option through many intermediate functions. Is there a better place?

@jeroenjanssens
Copy link
Contributor

A non-csvkit solution would be to first pipe the CSV file through a little script I wrote called header (https://github.com/jeroenjanssens/data-science-at-the-command-line/blob/master/tools/header). For example:

cat manual.csv | header -r "some,sane,column,names" | csvsql --query "..."

Hope this helps in the meantime.

@fitnr
Copy link
Contributor

fitnr commented Aug 11, 2015

Other options if you don't have access to additional utilities:

$ { head -1 input.csv | sed -e 's/bad/good/g' ; tail +2 input.csv ; } | csvsql

or

$ { echo 'sane,column,names' ; tail +2 input.csv ; } | csvsql

@smari
Copy link

smari commented Sep 28, 2015

I would also like this feature -- specifically for making CSV header rows lowercase when appropriate.

@jpmckinney
Copy link
Member

@pudo Want to take a stab at a PR?

@onyxfish
Copy link
Collaborator

onyxfish commented Feb 5, 2016

I'd be in favor of finding a way to make this part of agate, either as a flag to the table constructor or as a slugify method that forks a table. That would make it a trivial addition to any tool that uses the table. Will open a ticket on agate.

@jeroenjanssens
Copy link
Contributor

@smari , in case you want a solution now [edit: I should have payed attention to when you actually commented here], here's a way to lowercase the column names using header (which, I now see, I also mentioned above):

$ cat iris.csv | head -n 5
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
$ cat iris.csv | header -e "tr '.[:upper:]' '_[:lower:]'" | head -n 5
"sepal_length","sepal_width","petal_length","petal_width","species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"

PS. I'm not saying anything about whether or not this should be a feature of either csvkit or agate, I'm just trying to be helpful :)

@danvk
Copy link

danvk commented Oct 17, 2016

Copying over this comment from #525...

In case it helps anyone else, you can lowercase all column names in a Postgres database after you import using this magical incantation from Stack Overflow:

\t on
select 'ALTER TABLE '||'"'||table_name||'"'||' RENAME COLUMN '||'"'||column_name||'"'||' TO ' || lower(column_name)||';' 
from information_schema.columns 
where table_schema = 'public' and lower(column_name) != column_name
\g /tmp/go_to_lower
\i /tmp/go_to_lower

@onyxfish
Copy link
Collaborator

I'm very much in favor of this feature, but I'm not sure where it should live since it affects so many things. Maybe it's part of csvformat. Implementation should be pretty straightforward, leveraging agate.utils.slugify and the column_names argument to Table.from_csv.

@jpmckinney
Copy link
Member

Waiting on wireservice/agate#660 in case agate API changes.

@jpmckinney jpmckinney changed the title Feature request: normalizing header names Normalize (or rename) header names May 21, 2018
@jpmckinney jpmckinney changed the title Normalize (or rename) header names Normalize (or rename) headers May 21, 2018
@jpmckinney
Copy link
Member

May be able to be inspired from master...culebron:master

@smnorris
Copy link
Contributor

@majestique
Copy link

majestique commented Oct 11, 2018

Another way to lowercase header:

awk 'NR==1{$0=tolower($0)} 1' input.csv > output.csv

@jpmckinney
Copy link
Member

Related: wireservice/agate#668

@iainelder
Copy link

iainelder commented Jan 21, 2022

A non-csvkit solution would be to first pipe the CSV file through a little script I wrote called header

@jeroenjanssens, I found it in your dsutils repo: https://github.com/jeroenjanssens/dsutils/header

I have list of account IDs in a bash variable that I want to write to CSV:

111111111111
222222222222
333333333333

Now I can use your header command to set the header like this.

printf "$accounts" \
| in2csv --format csv --no-header-row --no-inference \
| header -r account_id

Note that I'm actually using in2csv first to be reasonably sure that the input is a valid, single-column CSV file without a header. in2csv even adds the header "a" here, but I can't find a way to change that value!

So I use header -r to replace the header with the one I want.

Here's what in2csv would output:

a
111111111111
222222222222
333333333333

And here's what header would output:

account_id
111111111111
222222222222
333333333333

It seems like the right place for csvkit to support would be a new option in the in2csv command that works like the header -a and header -r options. Something like --add-header-row and --replace-header-row? These options along with --no-header-row would be mutually exclusive.

@jpmckinney jpmckinney added this to the Priority milestone Oct 18, 2023
@Morikko
Copy link

Morikko commented Nov 16, 2023

A way to add a header on a pipe output in a single command without external tooling:

csvcut -c '3,10,7' input.csv \
    | ( echo "col1,col2,col3" ; cat -; ) > output.csv

@jpmckinney
Copy link
Member

jpmckinney commented May 2, 2024

csvkit 2.0.0 adds csvclean --header-normalize-space as one simple normalization.

Some other potential normalizations implemented here: https://github.com/dannguyen/csvmedkit/blob/main/csvmedkit/utils/csvnorm.py

For inspiration, csvmedkit has a csvheader command: https://github.com/dannguyen/csvmedkit/blob/main/csvmedkit/utils/csvheader.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.