Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

csvstack: handle reordered columns automatically #245

Closed
metasoarous opened this issue Dec 17, 2013 · 29 comments
Closed

csvstack: handle reordered columns automatically #245

metasoarous opened this issue Dec 17, 2013 · 29 comments
Labels

Comments

@metasoarous
Copy link

I would expect that csvstack would look at headers and stack data intelligently, but it does not. It simply cats the first file together with all but the first line of remaining files.

In particular, if column names occur in a different order, or if some file has columns that another does not, the results are not consistent with what one would expect.

For example:

csvlook a.csv =>
|----+----|
|  x | y  |
|----+----|
|  1 | 2  |
|  3 | 4  |
|  5 | 6  |
|----+----|


csvlook b.csv =>
|----+-------|
|  y | z     |
|----+-------|
|  8 | this  |
|  9 | that  |
|----+-------|

csvstack a.csv b.csv | csvlook =>
|----+-------|
|  x | y     |
|----+-------|
|  1 | 2     |
|  3 | 4     |
|  5 | 6     |
|  8 | this  |
|  9 | that  |
|----+-------|

I would expect the following:

csvstack a.csv b.csv | csvlook =>
|----+---+-------|
|  x | y | z     |
|----+---+-------|
|  1 | 2 |       |
|  3 | 4 |       |
|  5 | 6 |       |
|    | 8 | this  |
|    | 9 | that  |
|----|---+-------|
@jpmckinney
Copy link
Member

Look at implementation of agate.Table.merge (but don't use it to preserve streaming).

@jpmckinney jpmckinney added this to the Quick milestone Jun 8, 2016
@benjaminsims
Copy link

I was looking for this functionality; I worked around it by combining into JSONLines format, then back into CSV:

csvjson --stream file_1.csv >> combined.jsonl
csvjson --stream file_2.csv >> combined.jsonl
in2csv --format ndjson combined.jsonl >> combined.csv

@onyxfish onyxfish changed the title csvstack is rather unintelligent when it comes to files with different headers csvstack: handle reordered columns automatically Dec 29, 2016
@jpmckinney jpmckinney modified the milestone: Picks Dec 29, 2016
@matsen
Copy link

matsen commented Nov 28, 2018

I needed this, and implemented it like so with click and pandas:

#! /usr/bin/env python

import click
import pandas as pd


@click.command()
@click.option('--out', type=click.File('w'), help='Output file path.', required=True)
@click.argument('in_paths', nargs=-1)
def csvstack(out, in_paths):
    """
    Like csvkit's csvstack, but can deal with varying columns.
    Note that this sorts the columns by name (part of merging columns).
    """
    pd.concat([pd.read_csv(path) for path in in_paths], sort=True).to_csv(out, index=False)


if __name__ == '__main__':
    csvstack()

@nick-hahner
Copy link

@matsen you might want to change it to pd.read_csv(path, dtype=str) because pandas might otherwise mangle your data a little. Another improvement that I've found helpful is adding line_terminator='\r\n' to to_csv(). Reason is that sometimes there's a \r somewhere in a text column and pandas won't quote that properly otherwise so downstream tools might get confused and think the row is truncated.

Also, hoping someone finds time to work on this feature in csvstack. It'd be helpful!

@pombredanne
Copy link

pombredanne commented Feb 27, 2019

@matsen pandas concat/append effectively does the job alright. In fact I think using pandas.append(df, sort=False) instead may be a more generic solution.... but each use case varies.
The renaming with a 2 suffix of agate.join https://agate.readthedocs.io/en/1.6.1/api/table.html#agate.Table.join is a likely problem as well as the fact that the row header is truncated to keep only the first csv.

@jpmckinney I would assume that adding pandas is not an option as this is kinda does double duty with agate, correct?

@pombredanne
Copy link

@matsen actually I meant using DataFrame.append() but it does not really behave as concat. A sort=False keeps the original order of the columns which is best IMHO (e.g. new columns are appended where they first show up in a CSV from the list)

@jpmckinney
Copy link
Member

csvkit has the same original author as agate, and has the same design principles, so I don't anticipate using pandas in csvkit: https://agate.readthedocs.io/en/1.6.1/about.html#principles

Folks are welcome to use pandas for some things and csvkit for other things. This issue remains open because it's possible to fix in agate/csvkit – we just lack implementation time.

@tlongers
Copy link

tlongers commented Apr 9, 2019

For now perhaps tweak the csvstack documentation to say that the column headings that will appear in the output are taken from the first input file, irrespective of whether subsequent files have different column headings.

@jpmckinney jpmckinney modified the milestone: 1.0.5 Apr 28, 2019
@jpmckinney
Copy link
Member

Thanks - I added "Files are assumed to have the same columns in the same order."

@ad-si
Copy link

ad-si commented Jan 7, 2020

I was really surprised to learn that this is not the default behavior! 😳

@alexivkin
Copy link

Yes, without this behaviour "csvstack" is no better than a (cat file1;tail +1 file2) > joinedfile

lcorbasson pushed a commit to lcorbasson/csvkit that referenced this issue Sep 7, 2020
@metasoarous
Copy link
Author

metasoarous commented Nov 17, 2020

It looks like this has been resolved by @onyxfish in f698221? Any chance of getting this merged in here and deployed?

@jpmckinney
Copy link
Member

@metasoarous I don't think that commit from 2015 solved this issue... despite the commit message.

@metasoarous
Copy link
Author

Roger

@panozzaj
Copy link

FYI the following seems to work for me for picking a specific file's columns, if they match exactly but might be in a different order:

fruits1.csv:

name,col1,col2,col3
apple,1,2,3
banana,2,4,6

fruits2.csv:

col3,col2,col1,name
9,6,3,cherry
$ csvstack <(csvcut -c `head -1 fruits1.csv` fruits1.csv) <(csvcut -c `head -1 fruits1.csv` fruits2.csv)
name,col1,col2,col3
apple,1,2,3
banana,2,4,6
cherry,3,6,9

Basically this says to cut the columns (in order) from fruits1.csv for each of the CSVs, and then merge them together.

@jpmckinney
Copy link
Member

jpmckinney commented Sep 21, 2021

Thank you @panozzaj ! I've added the example to the docs in 33646ed

It can be simplified as:

$ csvstack fruits1.csv <(csvcut -c `head -1 fruits1.csv` fruits2.csv)

@panozzaj
Copy link

Ah, even better! Thanks for the simplification and adding to the docs!

@iainelder
Copy link

Thanks @panozzaj and @jpmckinney . I just got tripped up by this in my own data.

When will it be available in the published documentation? I don't see your latest updates here.

https://csvkit.readthedocs.io/en/latest/scripts/csvstack.html

@jpmckinney
Copy link
Member

I've rebuilt the docs now!

@executionByFork
Copy link

executionByFork commented Jul 8, 2022

I am in need of this functionality too. I have several CSV files with partially matching headers.

Given the following inputs

Header1 Header2 Header3
1x 2x 3x
1y 2y 3y
Header1 Header3 Header4
1z 3z 4z
Header4 Header1
4a 1a

I would expect

Header1 Header2 Header3 Header4
1x 2x 3x
1y 2y 3y
1z 3z 4z
1a 4a

However, this doesn't seem possible with the current toolset, and requires workaround solutions. I believe the code logic could go through each file sequentially. The table columns would be initialized with the headers from the first csv, and the data rows added as normal. For subsequent files, the headers would be checked. if any new ones are found, a column is appended to the end of the table and the data for all existing rows set to empty (or even an argument value). From there, every header from the second csv would exist in the stacked table, and it would just be a matter of inserting each row's data into the proper column. Rinse and repeat.

Are there any plans to implement this? Are there problems I am not seeing? This seems like a basic feature with a straightforward implementation, but this issue has been open for over eight years... I notice that there is a partially complete PR open for this issue but it is unmerged with no activity from the author since submitting it.

@jpmckinney
Copy link
Member

Are there any plans to implement this? Are there problems I am not seeing?

Yes, the problem is that the data from each file should be streamed to the output, without holding all the data in memory. Streaming is possible as long as there are the same number of columns. Otherwise, you end up creating a CSV with ragged rows (later rows have more columns than the header row).

if any new ones are found, a column is appended to the end of the table and the data for all existing rows set to empty

This can't be done if you're streaming the data.

@jpmckinney
Copy link
Member

jpmckinney commented Jul 17, 2022

I'm closing this issue (same number of columns, but in different order) as fixed via documentation: https://csvkit.readthedocs.io/en/latest/scripts/csvstack.html

Joining files with the same columns but in different orders, in Bash, assuming the header row does not contain newlines:

csvstack file1.csv <(csvcut -c `head -1 file1.csv` file2.csv)

Handling additional columns is out of scope for this issue. You would have to pre-process each file to collect the headings (only need to read the header row), then read the files again to write the output. One wrinkle is that, csvkit, in general, is designed to support standard input, which of course can't be pre-processed.

@jpmckinney
Copy link
Member

The PR #1146 remains open, which follows the approach described in the previous comment. Its checklist items still need to be completed.

@wavexx
Copy link

wavexx commented Jul 18, 2022

I'm not too fond of the solution proposed in the example. I've been lurking in this issue for a long time, but IMHO the proper solution is in #1146 as the proposed example is too fragile.

In the past years I extended my own toolkit with "tblstack" (which is how I needed csvstack to behave in the first place). https://github.com/EuracBiomedicalResearch/tblutils/blob/master/tblstack (this one is actually self-contained if someone is interested). It does require files to be pre-labeled and have constant column counts. When I was writing this some years ago (and the rest of the tools contained in that repo), perl IO was quite a bit faster for most operations, which is kind of sad. Nowdays I mostly migrated my workflows to use "miller", which I highly recommend as one of the most comprehensive toolkits out there (no affiliation here - I'm just a dev tired of munging text files).

@jpmckinney
Copy link
Member

@wavexx Can you add documentation for tblstack? I'm happy to link to it from the docs. miller is already linked: https://csvkit.readthedocs.io/en/latest/cli.html#processing

@jpmckinney
Copy link
Member

jpmckinney commented Jul 18, 2022

@wavexx Hmm, nevermind, from a quick read of the tblstack code, it doesn't seem to support quoting (i.e. it doesn't support multi-line values, or use of the separator within values), so it only works with a subset of CSVs.

@jpmckinney
Copy link
Member

I'm looking at miller's cat command, and the example (at least) has the same limitation as csvstack: https://miller.readthedocs.io/en/latest/reference-verbs/#cat

@wavexx
Copy link

wavexx commented Jul 18, 2022

Yup, tblutils doesn't actually support csv quoting directly. I was working with genomics data, where performance and consistency was the main concern: the idea was to unquote/label (if needed) the files as a first step using tbl2tbl, then never handle quotes or reference columns by numbers ever again, so there are no really plans to change it either - I'm no longer directly involved. I just wanted to provide a couple of alternatives.

@jpmckinney
Copy link
Member

This is now part of csvkit 1.1.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests