"skip lines" feature #336

Closed
amnonkhen opened this issue Sep 15, 2014 · 8 comments · Fixed by #775

@amnonkhen

Sometimes the first few lines of a csv need to be skipped (header comments, copyright lines). It would be nice to have this capability in csvkit.
I think it belongs in the reader's constructor.

@eyeseast

You can do this with tail on *nix systems. Just add it to your pipeline:

in2csv somefile.xls | tail -n +7 | csvcut -c 1,2,3 > newfile.csv

That would convert somefile.xls to CSV, start output at line 7 (note the + sign, which skips the first six lines rather than counting from the end of the file), cut away all but the first three columns, and finally save the result to newfile.csv.

I just ran into the same issue with school data in Massachusetts.

@amnonkhen
Author

I know about tail. However, it only helps with command-line usage, while I suggested this feature for the programmatic interface of csvkit, namely the Reader classes.
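
For illustration, a minimal sketch of the kind of interface being asked for, using the stdlib csv module as a stand-in for csvkit's reader classes; the skip_lines parameter and file name are hypothetical, not part of any existing csvkit API:

import csv
from itertools import islice

def reader_with_skip(f, skip_lines=0, **kwargs):
    # Consume and discard the leading header comments / copyright lines,
    # then hand the rest of the file to a normal csv.reader.
    for _ in islice(f, skip_lines):
        pass
    return csv.reader(f, **kwargs)

with open("somefile.csv", newline="") as f:
    for row in reader_with_skip(f, skip_lines=6):
        print(row)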


@onyxfish onyxfish added this to the 1.0 milestone Sep 20, 2014
@kiranpalla

There are many kinds of junk lines that can appear in a CSV / text file, especially in fixed-width reports. They generally appear as report headers/footers, page headers/footers, or blank/junk rows. I handle such reports regularly, and some old reports cause a lot of trouble because they have junk rows after each data row. I think there should be a way to identify all those junk rows via the schema file while running in2csv and eliminate them while importing, as they serve no purpose.
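
As a rough illustration of pre-filtering such rows before parsing (not tied to the schema-file idea above), assuming a hypothetical report.txt and a junk heuristic that would have to be adapted to the actual report layout:

import csv

def is_junk(line):
    # Illustrative heuristic: blank lines and typical page-header/footer prefixes.
    stripped = line.strip()
    return not stripped or stripped.startswith(("Page ", "Report:", "---"))

with open("report.txt") as f:
    clean_lines = (line for line in f if not is_junk(line))
    for row in csv.reader(clean_lines):
        print(row)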

@mpschr

mpschr commented Sep 12, 2015

Hi

I'd like to weigh in on this issue as well. I work with tabular files almost 24/7 and have found csvkit very powerful for looking at and selecting data from a file in the terminal before continuing to work with it. I have a habit of adding important information about how a file was generated as commented lines at the top, which is very useful for tracking down issues later. An example is below.

I am currently working around this with e.g. grep -v "#" some.tsv | csvstat/csvlook/csvcut, which works but is quite a usability pitfall. I therefore suggest implementing a feature such as comment or skip_lines to improve usability, similar to e.g. pandas.read_csv, which provides several of these options; I think it should be fairly easy to implement. (A pandas comparison is sketched after the example below.)

What are your thoughts?

# 37512 variants loaded from from ~/Documents/projects/exon/out_trios/SAMPLEID/mutations/sample-ALLMUTS.tsv
# -4074 variants discarded with: reads_REL >= 10 & reads_CR >= 10
# -98 variants discarded with: alt_reads_REL >= 3
# -2470 variants discarded with: alt_freq_REL >= 0.1
# -27151 variants discarded with: is_coding != 'not_coding'
# 3719 variants returned (9.91%)
chrom   start   end ref alt called_by   caller_count    base_qual   mapping_qual    alt_freq_CR alt_freq_REL    alt_reads_CR    alt_reads_REL   reads_CR    reads_REL   source_readcounts_CR    source_readcounts_REL   gene    exon    aa_change   rs_ids  cosmic_ids  aaf_adj_exac_all    type    sub_type    impact_so   is_coding
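
For comparison, the pandas behaviour referred to above; the file name and separator are illustrative, and either comment= or skiprows= would handle a header block like the one shown:

import pandas as pd

# Drop everything from a '#' character onward on each line ...
df = pd.read_csv("some.tsv", sep="\t", comment="#")
# ... or, when the number of leading comment lines is known:
# df = pd.read_csv("some.tsv", sep="\t", skiprows=6)
print(df.columns.tolist())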

@ostrokach

@mpschr I have a similar problem when working with VCF files, which have comments prefixed with ## and a header line prefixed with # (see example below). Something like a comment option would be a welcome addition; right now I manually delete those lines before importing the files into a database. (A possible workaround is sketched after the example.)

##fileformat=VCFv4.1
##source=COSMICv75
##reference=GRCh38
##fileDate=20151124
##comment="Missing nucleotide details indicate ambiguity during curation process"
##comment="URL stub for COSN ID field (use numeric portion of ID)='http://cancer.sanger.ac.uk/cosmic/ncv/overview?id='"
##comment="REF and ALT sequences are both forward strand
##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">
##INFO=<ID=STRAND,Number=1,Type=String,Description="Gene strand">
##INFO=<ID=CDS,Number=1,Type=String,Description="CDS annotation">
##INFO=<ID=AA,Number=1,Type=String,Description="Peptide annotation">
##INFO=<ID=CNT,Number=1,Type=Integer,Description="How many samples have this mutation">
##INFO=<ID=SNP,Number=0,Type=Flag,Description="classified as SNP">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       10151   COSN14661299    T       A       .       .       SNP
1       10175   COSN14519186    T       A       .       .       SNP
1       10181   COSN14600774    A       T       .       .       SNP
1       10181   COSN7167327     A       T       .       .       SNP
1       10237   COSN8882887     A       C       .       .       SNP
1       10333   COSN8883341     C       T       .       .       SNP
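
One possible workaround for the VCF case above, assuming a file like the one shown (the file name is illustrative): drop the ## metadata lines but keep the #CHROM header row, then parse the rest as tab-separated values:

import csv

with open("cosmic.vcf") as f:
    data_lines = (line for line in f if not line.startswith("##"))
    reader = csv.reader(data_lines, delimiter="\t")
    header = next(reader)  # ['#CHROM', 'POS', 'ID', 'REF', 'ALT', ...]
    for row in reader:
        print(dict(zip(header, row)))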

@onyxfish
Collaborator

onyxfish commented Dec 29, 2016

agate.Table.from_csv supports this (skip_lines), so it should be simple to support it in csvkit as well.
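
For reference, a minimal call using the agate parameter mentioned above; the file name and skip count are illustrative:

import agate

# Skip the six comment lines at the top of the file before reading the header.
table = agate.Table.from_csv("annotated.csv", skip_lines=6)
table.print_table()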

@jpmckinney jpmckinney modified the milestone: Picks Dec 29, 2016
@jpmckinney
Member

Related: #669

@jpmckinney
Member

Noting that in2csv still needs to handle this for the fixed (fixed-width) format; otherwise, I have a local branch ready.
