This repository has been archived by the owner on May 24, 2022. It is now read-only.

Be more verbose when normalizing #21

Open
ngirard opened this issue Mar 21, 2021 · 2 comments

Comments


ngirard commented Mar 21, 2021

In addition to #13, it would be useful if Scrubcsv could output details about which data was normalized and for what reason.
This could be enabled with a dedicated command-line option such as --verbose, for instance.

Author

ngirard commented Mar 21, 2021

As food for thought, see Csvlint.

Contributor

emk commented Mar 22, 2021

Thank you for the suggestion!

For the most part, the first pass of normalization in scrubcsv is actually performed by the csv parser. This handles things like normalization of quoting.

But there's no easy way to keep track of the decisions that csv makes under the hood.

One of the underlying issues here is that CSV is a poorly-defined format. (There's a spec. Actually, there are multiple specs and they don't always agree, as far as I know. And many real-world implementations have issues you wouldn't expect from the specs.)

scrubcsv was designed to be run on hundreds of millions of rows, or occasionally tens of billions of rows. And the input files may come from a wide variety of different sources that produce subtly corrupt CSV files. At that scale, corrupt input is pretty much a given. The primary goal of parsing is to produce standards-compliant output, and to fail if too many rows are corrupt. So the underlying goals, in order of importance, look something like:

  1. Speed.
  2. Valid output.
  3. Detection of large-scale systemic errors, as opposed to 1-in-a-million scattered errors.
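To make goal 3 concrete, here is a minimal sketch of a "fail if too many rows are corrupt" check. It is not scrubcsv's actual code: `ScanSummary`, `scan_rows`, and `max_bad_fraction` are hypothetical names, and "bad" is assumed to mean only "wrong number of columns".

```rust
/// Hypothetical summary of one pass over already-parsed rows.
#[derive(Debug, PartialEq)]
pub struct ScanSummary {
    pub total_rows: u64,
    pub bad_rows: u64,
}

/// Count rows whose column count differs from `expected_cols`, and fail
/// only when the bad fraction exceeds `max_bad_fraction`.
/// (Assumption: a real tool may define "bad" differently.)
pub fn scan_rows(
    expected_cols: usize,
    rows: &[Vec<&str>],
    max_bad_fraction: f64,
) -> Result<ScanSummary, String> {
    let mut summary = ScanSummary { total_rows: 0, bad_rows: 0 };
    for row in rows {
        summary.total_rows += 1;
        // The only per-row work is a length comparison and a counter bump.
        if row.len() != expected_cols {
            summary.bad_rows += 1;
        }
    }
    if summary.total_rows > 0 {
        let fraction = summary.bad_rows as f64 / summary.total_rows as f64;
        if fraction > max_bad_fraction {
            return Err(format!(
                "{} of {} rows are bad, above the allowed fraction {}",
                summary.bad_rows, summary.total_rows, max_bad_fraction
            ));
        }
    }
    Ok(summary)
}
```

Scattered 1-in-a-million errors stay under the threshold and are simply counted, while a systemic problem (say, every row missing a column) trips the error.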

I'm not necessarily opposed to adding more detailed reporting of errors, but not at the cost of performance on mostly-valid data. scrubcsv is often run on distributed batch jobs spread across dozens of servers, and performance is important.
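One way such reporting might stay cheap on mostly-valid data is to keep a counter per kind of fix plus a small, bounded sample of row numbers, so that valid rows never touch the report at all. The sketch below is an illustration under that assumption, not a proposal for scrubcsv's real internals: `NormalizationReport`, `record`, and the kind labels are all made up. It could also only cover checks made outside the csv parser, since the parser's own decisions are not visible.

```rust
use std::collections::BTreeMap;

/// Keep at most this many example row numbers per kind (arbitrary choice).
const MAX_SAMPLES_PER_KIND: usize = 3;

/// Hypothetical low-overhead report: counts per kind, plus a few samples.
#[derive(Debug, Default)]
pub struct NormalizationReport {
    counts: BTreeMap<&'static str, u64>,
    samples: BTreeMap<&'static str, Vec<u64>>,
}

impl NormalizationReport {
    /// Called only when a row was actually changed or rejected.
    pub fn record(&mut self, kind: &'static str, row_number: u64) {
        *self.counts.entry(kind).or_insert(0) += 1;
        let samples = self.samples.entry(kind).or_default();
        // Memory stays bounded no matter how many rows are affected.
        if samples.len() < MAX_SAMPLES_PER_KIND {
            samples.push(row_number);
        }
    }

    /// Number of times `kind` was recorded (0 if never).
    pub fn count(&self, kind: &str) -> u64 {
        self.counts.get(kind).copied().unwrap_or(0)
    }

    /// The first few row numbers recorded for `kind`.
    pub fn samples(&self, kind: &str) -> &[u64] {
        self.samples.get(kind).map(|v| v.as_slice()).unwrap_or(&[])
    }
}
```

With this shape, a --verbose style option would print the counts and sample row numbers at the end of a run instead of logging every affected row.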
