csvstack: handle reordered columns automatically #245
Look at implementation of |
I was looking for this functionality; I worked around it by combining into JSONLines format, then back into CSV:
|
I needed this, and implemented it like so with click and pandas:
#!/usr/bin/env python
import click
import pandas as pd

@click.command()
@click.option('--out', type=click.File('w'), help='Output file path.', required=True)
@click.argument('in_paths', nargs=-1)
def csvstack(out, in_paths):
    """
    Like csvkit's csvstack, but can deal with varying columns.
    Note that this sorts the columns by name (part of merging columns).
    """
    pd.concat([pd.read_csv(path) for path in in_paths], sort=True).to_csv(out, index=False)

if __name__ == '__main__':
    csvstack() |
@matsen you might want to change it to Also, hoping someone finds time to work on this feature in csvstack. It'd be helpful! |
@matsen pandas concat/append effectively does the job alright. In fact I think using @jpmckinney I would assume that adding pandas is not an option, as it kind of does double duty with agate, correct? |
@matsen actually I meant using DataFrame.append() but it does not really behave as concat. A sort=False keeps the original order of the columns which is best IMHO (e.g. new columns are appended where they first show up in a CSV from the list) |
csvkit has the same original author as agate, and has the same design principles, so I don't anticipate using pandas in csvkit: https://agate.readthedocs.io/en/1.6.1/about.html#principles Folks are welcome to use pandas for some things and csvkit for other things. This issue remains open because it's possible to fix in agate/csvkit – we just lack implementation time. |
For now perhaps tweak the |
Thanks - I added "Files are assumed to have the same columns in the same order." |
I was really surprised to learn that this is not the default behavior! 😳 |
Yes, without this behaviour "csvstack" is no better than a |
@metasoarous I don't think that commit from 2015 solved this issue... despite the commit message. |
Roger |
FYI the following seems to work for me for picking a specific file's columns, if they match exactly but might be in a different order: fruits1.csv:
fruits2.csv:
Basically this says to cut the columns (in order) from fruits1.csv for each of the CSVs, and then merge them together. |
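For readers who want to see the same idea outside the command line, here is a minimal Python sketch of "take the column order from the first file, then stack". It is not csvkit's implementation; the function name `stack_reordered` and the sample fruit data are made up for illustration, and it assumes every input has exactly the same column names.

```python
import csv
import io


def stack_reordered(sources):
    """Stack CSV texts whose headers match as a set but may differ in order.

    The column order of the first source is used for the output.
    (Hypothetical helper, not part of csvkit.)
    """
    out = io.StringIO()
    writer = None
    fieldnames = None
    for text in sources:
        reader = csv.DictReader(io.StringIO(text))
        if writer is None:
            # The first input decides the output column order.
            fieldnames = reader.fieldnames
            writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
            writer.writeheader()
        elif set(reader.fieldnames) != set(fieldnames):
            # This sketch only covers reordered columns, not extra or missing ones.
            raise ValueError("column names differ from the first file")
        for row in reader:
            # DictWriter emits values in the order given by fieldnames.
            writer.writerow(row)
    return out.getvalue()


# Made-up sample data standing in for two files with reordered columns.
fruits1 = "name,color\napple,red\n"
fruits2 = "color,name\nyellow,banana\n"
result = stack_reordered([fruits1, fruits2])
print(result)
```

Because rows are looked up by column name rather than position, the second file's `color,name` order does not matter. Inputs with a different set of columns are rejected rather than silently misaligned.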
Ah, even better! Thanks for the simplification and adding to the docs! |
Thanks @panozzaj and @jpmckinney . I just got tripped up by this in my own data. When will it be available in the published documentation? I don't see your latest updates here. https://csvkit.readthedocs.io/en/latest/scripts/csvstack.html |
I've rebuilt the docs now! |
I am in need of this functionality too. I have several CSV files with partially matching headers. Given the following inputs
I would expect
However, this doesn't seem possible with the current toolset, and requires workaround solutions.

I believe the code logic could go through each file sequentially. The table columns would be initialized with the headers from the first CSV, and the data rows added as normal. For subsequent files, the headers would be checked. If any new ones are found, a column is appended to the end of the table and the data for all existing rows set to empty (or even an argument value). From there, every header from the second CSV would exist in the stacked table, and it would just be a matter of inserting each row's data into the proper column. Rinse and repeat.

Are there any plans to implement this? Are there problems I am not seeing? This seems like a basic feature with a straightforward implementation, but this issue has been open for over eight years... I notice that there is a partially complete PR open for this issue but it is unmerged with no activity from the author since submitting it. |
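The logic described above could be sketched roughly as follows with Python's standard `csv` module. This is only an illustration of the proposal, not code from csvkit or from the open PR; `stack_in_memory`, its `fill` parameter, and the sample data are hypothetical. Note that it holds every row in memory before writing anything.

```python
import csv
import io


def stack_in_memory(sources, fill=""):
    """Stack CSV texts with partially matching headers (hypothetical helper).

    Columns appear in the order they are first seen; cells a file does not
    provide are written as `fill`.
    """
    columns = []  # output header, in order of first appearance
    rows = []     # every data row, kept as a dict keyed by column name
    for text in sources:
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)  # new header: append to the end
        rows.extend(reader)
    out = io.StringIO()
    # restval supplies the value for columns missing from a given row.
    writer = csv.DictWriter(out, fieldnames=columns, restval=fill, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up inputs with one shared column ("name").
first = "id,name\n1,apple\n"
second = "name,price\nbanana,0.5\n"
stacked = stack_in_memory([first, second])
print(stacked)
```

Writing is deferred until all headers are known, which is what keeps the header row and the data rows the same width.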
Yes, the problem is that the data from each file should be streamed to the output, without holding all the data in memory. Streaming is possible as long as there are the same number of columns. Otherwise, you end up creating a CSV with ragged rows (later rows have more columns than the header row).
This can't be done if you're streaming the data. |
I'm closing this issue (same number of columns, but in different order) as fixed via documentation: https://csvkit.readthedocs.io/en/latest/scripts/csvstack.html
Handling additional columns is out of scope for this issue. You would have to pre-process each file to collect the headings (only need to read the header row), then read the files again to write the output. One wrinkle is that csvkit, in general, is designed to support standard input, which of course can't be pre-processed. |
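A rough sketch of that two-pass idea, using only Python's standard `csv` module, might look like the following. It is not csvkit's code and not the code in the open PR; the names `collect_columns` and `stack_two_pass` and the `fill` parameter are hypothetical. It assumes every input is a regular file on disk that can be opened twice, which is exactly what standard input cannot offer.

```python
import csv


def collect_columns(paths):
    """First pass: read only the header row of each file."""
    columns = []
    for path in paths:
        with open(path, newline="") as f:
            header = next(csv.reader(f), [])  # [] if the file is empty
        for name in header:
            if name not in columns:
                columns.append(name)
    return columns


def stack_two_pass(paths, out, fill=""):
    """Second pass: re-read each file and stream rows to `out` one at a time."""
    columns = collect_columns(paths)
    # restval supplies the value for columns a given file does not have.
    writer = csv.DictWriter(out, fieldnames=columns, restval=fill, lineterminator="\n")
    writer.writeheader()
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                writer.writerow(row)
```

Calling `stack_two_pass(["a.csv", "b.csv"], sys.stdout)` (with placeholder file names) would write the combined header first and then stream each file's rows, so only one row is held in memory at a time.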
The PR #1146 remains open, which follows the approach described in the previous comment. Its checklist items still need to be completed. |
I'm not too fond of the solution proposed in the example. I've been lurking in this issue for a long time, but IMHO the proper solution is in #1146 as the proposed example is too fragile. In the past years I extended my own toolkit with "tblstack" (which is how I needed csvstack to behave in the first place). https://github.com/EuracBiomedicalResearch/tblutils/blob/master/tblstack (this one is actually self-contained if someone is interested). It does require files to be pre-labeled and have constant column counts. When I was writing this some years ago (and the rest of the tools contained in that repo), Perl IO was quite a bit faster for most operations, which is kind of sad. Nowadays I have mostly migrated my workflows to "miller", which I highly recommend as one of the most comprehensive toolkits out there (no affiliation here - I'm just a dev tired of munging text files). |
@wavexx Can you add documentation for |
@wavexx Hmm, never mind, from a quick read of the |
I'm looking at miller's |
Yup, tblutils doesn't actually support CSV quoting directly. I was working with genomics data, where performance and consistency were the main concerns: the idea was to unquote/label (if needed) the files as a first step using tbl2tbl, then never handle quotes or reference columns by numbers ever again. There are no real plans to change it either, as I'm no longer directly involved. I just wanted to provide a couple of alternatives. |
This is now part of csvkit 1.1.1. |
I would expect that csvstack would look at headers and stack data intelligently, but it does not. It simply cats the first file together with all but the first line of remaining files.
In particular, if column names occur in a different order, or if some file has columns that another does not, the results are not consistent with what one would expect.
For example:
I would expect the following: