"skip lines" feature #336

Closed
amnonkhen opened this issue Sep 15, 2014 · 8 comments · Fixed by #775

@amnonkhen

Sometimes the first few lines of a csv need to be skipped (header comments, copyright lines). It would be nice to have this capability in csvkit.
I think it belongs in the reader's constructor.

@eyeseast

You can do this with tail on *nix systems. Just add it to your pipeline:

in2csv somefile.xls | tail -n +7 | csvcut -c 1,2,3 > newfile.csv

That would convert somefile.xls to CSV, start output at line 7 (note the + sign, which skips the first six lines rather than counting from the end of the file), cut away all but the first three columns, and finally save the result to newfile.csv.

I just ran into the same issue with school data in Massachusetts.

@amnonkhen
Author

I know about tail. However, it only helps with command-line usage, while I suggested this feature for the programmatic interface of csvkit, namely the Reader classes.
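
For illustration, a minimal sketch of the kind of interface being asked for, using the stdlib csv module as a stand-in for csvkit's reader classes; the skip_lines parameter and file name are hypothetical, not part of any existing csvkit API:

import csv
from itertools import islice

def reader_with_skip(f, skip_lines=0, **kwargs):
    # Consume and discard the leading header comments / copyright lines,
    # then hand the rest of the file to a normal csv.reader.
    for _ in islice(f, skip_lines):
        pass
    return csv.reader(f, **kwargs)

with open("somefile.csv", newline="") as f:
    for row in reader_with_skip(f, skip_lines=6):
        print(row)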


@onyxfish onyxfish added this to the 1.0 milestone Sep 20, 2014
@kiranpalla

There are many kinds of junk lines that can appear in a CSV / text file, especially in fixed-width reports. They generally appear as report headers/footers, page headers/footers, or blank/junk rows. I handle such reports regularly, and some old reports cause a lot of trouble because they have junk rows after each data row. I think there should be a way to identify all those junk rows via the schema file while running in2csv and eliminate them while importing, as they serve no purpose.
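
As a rough illustration of pre-filtering such rows before parsing (not tied to the schema-file idea above), assuming a hypothetical report.txt and a junk heuristic that would have to be adapted to the actual report layout:

import csv

def is_junk(line):
    # Illustrative heuristic: blank lines and typical page-header/footer prefixes.
    stripped = line.strip()
    return not stripped or stripped.startswith(("Page ", "Report:", "---"))

with open("report.txt") as f:
    clean_lines = (line for line in f if not is_junk(line))
    for row in csv.reader(clean_lines):
        print(row)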

@mpschr

mpschr commented Sep 12, 2015

Hi

I'd like to weigh in on this issue as well. I work with tabular files almost 24/7 and have found csvkit very powerful for looking at and selecting data from a file in the terminal before continuing to work with it. I have a habit of adding important information about how a file was generated as commented lines at the top, which is very useful for tracking down issues later. An example is below.

I am currently working around this with e.g. grep -v "#" some.tsv | csvstat/csvlook/csvcut, which works but is quite a usability pitfall. I therefore suggest implementing a feature such as comment or skip_lines to improve usability, similar to e.g. pandas.read_csv, which provides several of these options; I think it should be fairly easy to implement. (A pandas comparison is sketched after the example below.)

What are your thoughts?

# 37512 variants loaded from from ~/Documents/projects/exon/out_trios/SAMPLEID/mutations/sample-ALLMUTS.tsv
# -4074 variants discarded with: reads_REL >= 10 & reads_CR >= 10
# -98 variants discarded with: alt_reads_REL >= 3
# -2470 variants discarded with: alt_freq_REL >= 0.1
# -27151 variants discarded with: is_coding != 'not_coding'
# 3719 variants returned (9.91%)
chrom   start   end ref alt called_by   caller_count    base_qual   mapping_qual    alt_freq_CR alt_freq_REL    alt_reads_CR    alt_reads_REL   reads_CR    reads_REL   source_readcounts_CR    source_readcounts_REL   gene    exon    aa_change   rs_ids  cosmic_ids  aaf_adj_exac_all    type    sub_type    impact_so   is_coding
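
For comparison, the pandas behaviour referred to above; the file name and separator are illustrative, and either comment= or skiprows= would handle a header block like the one shown:

import pandas as pd

# Drop everything from a '#' character onward on each line ...
df = pd.read_csv("some.tsv", sep="\t", comment="#")
# ... or, when the number of leading comment lines is known:
# df = pd.read_csv("some.tsv", sep="\t", skiprows=6)
print(df.columns.tolist())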

@ostrokach

@mpschr I have a similar problem when working with VCF files, which have comments prefixed with ## and a header line prefixed with # (see example below). Something like a comment option would be a welcome addition; right now I manually delete those lines before importing the files into a database. (A possible workaround is sketched after the example.)

##fileformat=VCFv4.1
##source=COSMICv75
##reference=GRCh38
##fileDate=20151124
##comment="Missing nucleotide details indicate ambiguity during curation process"
##comment="URL stub for COSN ID field (use numeric portion of ID)='http://cancer.sanger.ac.uk/cosmic/ncv/overview?id='"
##comment="REF and ALT sequences are both forward strand
##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">
##INFO=<ID=STRAND,Number=1,Type=String,Description="Gene strand">
##INFO=<ID=CDS,Number=1,Type=String,Description="CDS annotation">
##INFO=<ID=AA,Number=1,Type=String,Description="Peptide annotation">
##INFO=<ID=CNT,Number=1,Type=Integer,Description="How many samples have this mutation">
##INFO=<ID=SNP,Number=0,Type=Flag,Description="classified as SNP">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       10151   COSN14661299    T       A       .       .       SNP
1       10175   COSN14519186    T       A       .       .       SNP
1       10181   COSN14600774    A       T       .       .       SNP
1       10181   COSN7167327     A       T       .       .       SNP
1       10237   COSN8882887     A       C       .       .       SNP
1       10333   COSN8883341     C       T       .       .       SNP
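
One possible workaround for the VCF case above, assuming a file like the one shown (the file name is illustrative): drop the ## metadata lines but keep the #CHROM header row, then parse the rest as tab-separated values:

import csv

with open("cosmic.vcf") as f:
    data_lines = (line for line in f if not line.startswith("##"))
    reader = csv.reader(data_lines, delimiter="\t")
    header = next(reader)  # ['#CHROM', 'POS', 'ID', 'REF', 'ALT', ...]
    for row in reader:
        print(dict(zip(header, row)))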

@onyxfish
Collaborator

onyxfish commented Dec 29, 2016

agate.Table.from_csv supports this (skip_lines), so it should be simple to support it in csvkit as well.
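
For reference, a minimal call using the agate parameter mentioned above; the file name and skip count are illustrative:

import agate

# Skip the six comment lines at the top of the file before reading the header.
table = agate.Table.from_csv("annotated.csv", skip_lines=6)
table.print_table()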

@jpmckinney jpmckinney modified the milestone: Picks Dec 29, 2016
@jpmckinney
Member

Related: #669

@jpmckinney
Member

Noting that in2csv still needs to handle this for the fixed (fixed-width) format; otherwise, I have a local branch ready.
