csvstack: handle reordered columns automatically #245

metasoarous · 2013-12-17T22:15:06Z

I would expect that csvstack would look at headers and stack data intelligently, but it does not. It simply cats the first file together with all but the first line of remaining files.

In particular, if column names occur in a different order, or if some file has columns that another does not, the results are not consistent with what one would expect.

For example:

csvlook a.csv =>
|----+----|
|  x | y  |
|----+----|
|  1 | 2  |
|  3 | 4  |
|  5 | 6  |
|----+----|


csvlook b.csv =>
|----+-------|
|  y | z     |
|----+-------|
|  8 | this  |
|  9 | that  |
|----+-------|

csvstack a.csv b.csv | csvlook =>
|----+-------|
|  x | y     |
|----+-------|
|  1 | 2     |
|  3 | 4     |
|  5 | 6     |
|  8 | this  |
|  9 | that  |
|----+-------|

I would expect the following:

csvstack a.csv b.csv | csvlook =>
|----+---+-------|
|  x | y | z     |
|----+---+-------|
|  1 | 2 |       |
|  3 | 4 |       |
|  5 | 6 |       |
|    | 8 | this  |
|    | 9 | that  |
|----|---+-------|

The text was updated successfully, but these errors were encountered:

jpmckinney · 2016-02-09T04:00:36Z

Look at implementation of agate.Table.merge (but don't use it to preserve streaming).

benjaminsims · 2016-12-08T17:48:40Z

I was looking for this functionality; I worked around it by combining into JSONLines format, then back into CSV:

csvjson --stream file_1.csv >> combined.jsonl
csvjson --stream file_2.csv >> combined.jsonl
in2csv --format ndjson combined.jsonl >> combined.csv

matsen · 2018-11-28T22:22:31Z

I needed this, and implemented it like so with click and pandas:

#! /usr/bin/env python

import click
import pandas as pd


@click.command()
@click.option('--out', type=click.File('w'), help='Output file path.', required=True)
@click.argument('in_paths', nargs=-1)
def csvstack(out, in_paths):
    """
    Like csvkit's csvstack, but can deal with varying columns.
    Note that this sorts the columns by name (part of merging columns).
    """
    pd.concat([pd.read_csv(path) for path in in_paths], sort=True).to_csv(out, index=False)


if __name__ == '__main__':
    csvstack()

nick-hahner · 2019-01-18T16:05:16Z

@matsen you might want to change it to pd.read_csv(path, dtype=str) because pandas might otherwise mangle your data a little. Another improvement that I've found helpful is adding line_terminator='\r\n' to to_csv(). Reason is that sometimes there's a \r somewhere in a text column and pandas won't quote that properly otherwise so downstream tools might get confused and think the row is truncated.

Also, hoping someone finds time to work on this feature in csvstack. It'd be helpful!

pombredanne · 2019-02-27T08:09:15Z

@matsen pandas concat/append effectively does the job alright. In fact I think using pandas.append(df, sort=False) instead may be a more generic solution.... but each use case varies.
The renaming with a 2 suffix of agate.join https://agate.readthedocs.io/en/1.6.1/api/table.html#agate.Table.join is a likely problem as well as the fact that the row header is truncated to keep only the first csv.

@jpmckinney I would assume that adding pandas is not an option as this is kinda does double duty with agate, correct?

pombredanne · 2019-02-27T08:20:36Z

@matsen actually I meant using DataFrame.append() but it does not really behave as concat. A sort=False keeps the original order of the columns which is best IMHO (e.g. new columns are appended where they first show up in a CSV from the list)

jpmckinney · 2019-02-27T16:08:57Z

csvkit has the same original author as agate, and has the same design principles, so I don't anticipate using pandas in csvkit: https://agate.readthedocs.io/en/1.6.1/about.html#principles

Folks are welcome to use pandas for some things and csvkit for other things. This issue remains open because it's possible to fix in agate/csvkit – we just lack implementation time.

tlongers · 2019-04-09T11:16:05Z

For now perhaps tweak the csvstack documentation to say that the column headings that will appear in the output are taken from the first input file, irrespective of whether subsequent files have different column headings.

…o csvstack, #245

jpmckinney · 2019-04-29T00:35:32Z

Thanks - I added "Files are assumed to have the same columns in the same order."

ad-si · 2020-01-07T14:31:47Z

I was really surprised to learn that this is not the default behavior! 😳

alexivkin · 2020-05-06T18:16:58Z

Yes, without this behaviour "csvstack" is no better than a (cat file1;tail +1 file2) > joinedfile

metasoarous · 2020-11-17T23:11:45Z

It looks like this has been resolved by @onyxfish in f698221? Any chance of getting this merged in here and deployed?

jpmckinney · 2020-11-22T22:43:21Z

@metasoarous I don't think that commit from 2015 solved this issue... despite the commit message.

metasoarous · 2020-11-23T19:00:34Z

Roger

panozzaj · 2021-09-20T22:01:31Z

FYI the following seems to work for me for picking a specific file's columns, if they match exactly but might be in a different order:

fruits1.csv:

name,col1,col2,col3
apple,1,2,3
banana,2,4,6

fruits2.csv:

col3,col2,col1,name
9,6,3,cherry

$ csvstack <(csvcut -c `head -1 fruits1.csv` fruits1.csv) <(csvcut -c `head -1 fruits1.csv` fruits2.csv)
name,col1,col2,col3
apple,1,2,3
banana,2,4,6
cherry,3,6,9

Basically this says to cut the columns (in order) from fruits1.csv for each of the CSVs, and then merge them together.

jpmckinney · 2021-09-21T16:54:22Z

Thank you @panozzaj ! I've added the example to the docs in 33646ed

It can be simplified as:

$ csvstack fruits1.csv <(csvcut -c `head -1 fruits1.csv` fruits2.csv)

panozzaj · 2021-09-22T17:17:24Z

Ah, even better! Thanks for the simplification and adding to the docs!

closes wireservice#245

iainelder · 2021-12-22T11:45:20Z

Thanks @panozzaj and @jpmckinney . I just got tripped up by this in my own data.

When will it be available in the published documentation? I don't see your latest updates here.

https://csvkit.readthedocs.io/en/latest/scripts/csvstack.html

jpmckinney · 2021-12-23T19:36:36Z

I've rebuilt the docs now!

executionByFork · 2022-07-08T22:51:33Z

I am in need of this functionality too. I have several CSV files with partially matching headers.

Given the following inputs

Header1	Header2	Header3
1x	2x	3x
1y	2y	3y

Header1	Header3	Header4
1z	3z	4z

Header4	Header1
4a	1a

I would expect

Header1	Header2	Header3	Header4
1x	2x	3x
1y	2y	3y
1z		3z	4z
1a			4a

However, this doesn't seem possible with the current toolset, and requires workaround solutions. I believe the code logic could go through each file sequentially. The table columns would be initialized with the headers from the first csv, and the data rows added as normal. For subsequent files, the headers would be checked. if any new ones are found, a column is appended to the end of the table and the data for all existing rows set to empty (or even an argument value). From there, every header from the second csv would exist in the stacked table, and it would just be a matter of inserting each row's data into the proper column. Rinse and repeat.

Are there any plans to implement this? Are there problems I am not seeing? This seems like a basic feature with a straightforward implementation, but this issue has been open for over eight years... I notice that there is a partially complete PR open for this issue but it is unmerged with no activity from the author since submitting it.

jpmckinney · 2022-07-17T20:52:26Z

Are there any plans to implement this? Are there problems I am not seeing?

Yes, the problem is that the data from each file should be streamed to the output, without holding all the data in memory. Streaming is possible as long as there are the same number of columns. Otherwise, you end up creating a CSV with ragged rows (later rows have more columns than the header row).

if any new ones are found, a column is appended to the end of the table and the data for all existing rows set to empty

This can't be done if you're streaming the data.

jpmckinney · 2022-07-17T20:53:23Z

I'm closing this issue (same number of columns, but in different order) as fixed via documentation: https://csvkit.readthedocs.io/en/latest/scripts/csvstack.html

Joining files with the same columns but in different orders, in Bash, assuming the header row does not contain newlines:
csvstack file1.csv <(csvcut -c `head -1 file1.csv` file2.csv)

Handling additional columns is out of scope for this issue. You would have to pre-process each file to collect the headings (only need to read the header row), then read the files again to write the output. One wrinkle is that, csvkit, in general, is designed to support standard input, which of course can't be pre-processed.

jpmckinney · 2022-07-17T21:02:06Z

The PR #1146 remains open, which follows the approach described in the previous comment. Its checklist items still need to be completed.

wavexx · 2022-07-18T09:27:22Z

I'm not too fond of the solution proposed in the example. I've been lurking in this issue for a long time, but IMHO the proper solution is in #1146 as the proposed example is too fragile.

In the past years I extended my own toolkit with "tblstack" (which is how I needed csvstack to behave in the first place). https://github.com/EuracBiomedicalResearch/tblutils/blob/master/tblstack (this one is actually self-contained if someone is interested). It does require files to be pre-labeled and have constant column counts. When I was writing this some years ago (and the rest of the tools contained in that repo), perl IO was quite a bit faster for most operations, which is kind of sad. Nowdays I mostly migrated my workflows to use "miller", which I highly recommend as one of the most comprehensive toolkits out there (no affiliation here - I'm just a dev tired of munging text files).

jpmckinney · 2022-07-18T17:43:40Z

@wavexx Can you add documentation for tblstack? I'm happy to link to it from the docs. miller is already linked: https://csvkit.readthedocs.io/en/latest/cli.html#processing

jpmckinney · 2022-07-18T17:46:06Z

@wavexx Hmm, nevermind, from a quick read of the tblstack code, it doesn't seem to support quoting (i.e. it doesn't support multi-line values, or use of the separator within values), so it only works with a subset of CSVs.

jpmckinney · 2022-07-18T17:51:54Z

I'm looking at miller's cat command, and the example (at least) has the same limitation as csvstack: https://miller.readthedocs.io/en/latest/reference-verbs/#cat

wavexx · 2022-07-18T18:02:39Z

Yup, tblutils doesn't actually support csv quoting directly. I was working with genomics data, where performance and consistency was the main concern: the idea was to unquote/label (if needed) the files as a first step using tbl2tbl, then never handle quotes or reference columns by numbers ever again, so there are no really plans to change it either - I'm no longer directly involved. I just wanted to provide a couple of alternatives.

jpmckinney · 2023-02-22T18:23:46Z

This is now part of csvkit 1.1.1.

onyxfish added Normal Priority labels Sep 5, 2014

jpmckinney mentioned this issue Nov 23, 2015

Ability to add new columns before using csvstack to merge file with extra columns #310

Closed

jpmckinney added best-practice and removed feature labels Feb 2, 2016

This was referenced Feb 5, 2016

CSV Column Rename #530

Closed

csvstack: Use agate.Table.merge #562

Closed

jpmckinney added feature Low Priority and removed best-practice labels Jun 3, 2016

jpmckinney added this to the Quick milestone Jun 8, 2016

onyxfish mentioned this issue Dec 29, 2016

csvstack: handle reordered columns automatically #688

Closed

onyxfish changed the title ~~csvstack is rather unintelligent when it comes to files with different headers~~ csvstack: handle reordered columns automatically Dec 29, 2016

jpmckinney modified the milestone: Picks Dec 29, 2016

jpmckinney added Normal Priority and removed Low Priority labels Jul 25, 2017

jpmckinney mentioned this issue Sep 26, 2017

csvstack with varying headers/columns doesn't merge correctly #888

Closed

jpmckinney modified the milestone: 1.0.5 Apr 28, 2019

jpmckinney added a commit that referenced this issue Apr 29, 2019

Add "Files are assumed to have the same columns in the same order." t…

48584f1

…o csvstack, #245

lcorbasson pushed a commit to lcorbasson/csvkit that referenced this issue Sep 7, 2020

Aggregation caching. Closes wireservice#245.

f698221

fgregg mentioned this issue Sep 9, 2021

heterogeous headers in csvstack #1141

Closed

jpmckinney added a commit that referenced this issue Sep 21, 2021

docs: Add example for files with columns in disorder, #245

33646ed

jpmckinney added a commit that referenced this issue Sep 21, 2021

docs: Add clarification, #245

1f2f322

fgregg added a commit to datamade/csvkit that referenced this issue Sep 26, 2021

if csvstack files have headers, union the headers and align data

4b09c15

closes wireservice#245

fgregg mentioned this issue Sep 26, 2021

if csvstack files have headers, union the headers and align data #1146

Closed

2 tasks

jpmckinney closed this as completed Jul 17, 2022

jpmckinney mentioned this issue Mar 20, 2024

csvstack is column agnostic and corrupts output #1233

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

csvstack: handle reordered columns automatically #245

csvstack: handle reordered columns automatically #245

metasoarous commented Dec 17, 2013

jpmckinney commented Feb 9, 2016

benjaminsims commented Dec 8, 2016

matsen commented Nov 28, 2018

nick-hahner commented Jan 18, 2019

pombredanne commented Feb 27, 2019 •

edited

Loading

pombredanne commented Feb 27, 2019

jpmckinney commented Feb 27, 2019

tlongers commented Apr 9, 2019

jpmckinney commented Apr 29, 2019

ad-si commented Jan 7, 2020

alexivkin commented May 6, 2020

metasoarous commented Nov 17, 2020 •

edited

Loading

jpmckinney commented Nov 22, 2020

metasoarous commented Nov 23, 2020

panozzaj commented Sep 20, 2021

jpmckinney commented Sep 21, 2021 •

edited

Loading

panozzaj commented Sep 22, 2021

iainelder commented Dec 22, 2021

jpmckinney commented Dec 23, 2021

executionByFork commented Jul 8, 2022 •

edited

Loading

jpmckinney commented Jul 17, 2022

jpmckinney commented Jul 17, 2022 •

edited

Loading

jpmckinney commented Jul 17, 2022

wavexx commented Jul 18, 2022 •

edited

Loading

jpmckinney commented Jul 18, 2022

jpmckinney commented Jul 18, 2022 •

edited

Loading

jpmckinney commented Jul 18, 2022

wavexx commented Jul 18, 2022

jpmckinney commented Feb 22, 2023

csvstack: handle reordered columns automatically #245

csvstack: handle reordered columns automatically #245

Comments

metasoarous commented Dec 17, 2013

jpmckinney commented Feb 9, 2016

benjaminsims commented Dec 8, 2016

matsen commented Nov 28, 2018

nick-hahner commented Jan 18, 2019

pombredanne commented Feb 27, 2019 • edited Loading

pombredanne commented Feb 27, 2019

jpmckinney commented Feb 27, 2019

tlongers commented Apr 9, 2019

jpmckinney commented Apr 29, 2019

ad-si commented Jan 7, 2020

alexivkin commented May 6, 2020

metasoarous commented Nov 17, 2020 • edited Loading

jpmckinney commented Nov 22, 2020

metasoarous commented Nov 23, 2020

panozzaj commented Sep 20, 2021

jpmckinney commented Sep 21, 2021 • edited Loading

panozzaj commented Sep 22, 2021

iainelder commented Dec 22, 2021

jpmckinney commented Dec 23, 2021

executionByFork commented Jul 8, 2022 • edited Loading

jpmckinney commented Jul 17, 2022

jpmckinney commented Jul 17, 2022 • edited Loading

jpmckinney commented Jul 17, 2022

wavexx commented Jul 18, 2022 • edited Loading

jpmckinney commented Jul 18, 2022

jpmckinney commented Jul 18, 2022 • edited Loading

jpmckinney commented Jul 18, 2022

wavexx commented Jul 18, 2022

jpmckinney commented Feb 22, 2023

pombredanne commented Feb 27, 2019 •

edited

Loading

metasoarous commented Nov 17, 2020 •

edited

Loading

jpmckinney commented Sep 21, 2021 •

edited

Loading

executionByFork commented Jul 8, 2022 •

edited

Loading

jpmckinney commented Jul 17, 2022 •

edited

Loading

wavexx commented Jul 18, 2022 •

edited

Loading

jpmckinney commented Jul 18, 2022 •

edited

Loading