Be more verbose when normalizing #21

ngirard · 2021-03-21T08:11:29Z

In addition to #13, it would be useful if Scrubcsv could output details about which data was normalized and for what reason.
This could be enabled with a dedicated --verbose command-line option, for instance.

ngirard · 2021-03-21T08:37:16Z

As food for thought, see Csvlint.

emk · 2021-03-22T13:11:09Z

Thank you for the suggestion!

For the most part, the first pass of normalization in scrubcsv is actually performed by the csv parser. This handles things like normalization of quoting.

But there's no easy way to keep track of the decisions that csv makes under the hood.
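To illustrate why those decisions are hard to report, here is a minimal sketch that uses Python's standard csv module as a stand-in for the parser scrubcsv actually uses. The normalize helper is hypothetical. The reader hands back only field values, so by the time a row is re-serialized, nothing records whether a field was originally quoted.

```python
import csv
import io


def normalize(raw: str) -> str:
    """Parse CSV text and write it back out with minimal quoting.

    Hypothetical helper for illustration only; not scrubcsv's code.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # csv.reader yields plain lists of strings. The original quoting
    # of each field is not exposed, so it cannot be reported later.
    for row in csv.reader(io.StringIO(raw, newline="")):
        writer.writerow(row)
    return out.getvalue()


raw = '"a","b c","say ""hi"""\n1,2,"x,y"\n'
print(normalize(raw), end="")
```

The unnecessary quotes around a and b c disappear, while the fields that contain a quote or a comma stay quoted. Reporting each such change would require hooks inside the parser itself.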

One of the underlying issues here is that CSV is a poorly-defined format. (There's a spec. Actually, there are multiple specs and they don't always agree, as far as I know. And many real-world implementations have issues you wouldn't expect from the specs.)

scrubcsv was designed to be run on hundreds of millions of rows, or occasionally tens of billions of rows. And the input files may come from a wide variety of different sources that produce subtly corrupt CSV files. At that scale, corrupt input is pretty much a given. The primary goal of parsing is to produce standards-compliant output, and to fail if too many rows are corrupt. So the underlying goals, in order of importance, look something like:

1. Speed.
2. Valid output.
3. Detection of large-scale systemic errors, as opposed to 1-in-a-million scattered errors.

I'm not necessarily opposed to adding more detailed reporting of errors, but not at the cost of performance on mostly-valid data. scrubcsv is often run on distributed batch jobs spread across dozens of servers, and performance is important.
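One way to picture that trade-off is the rough sketch below, again in Python rather than scrubcsv's own code. It assumes a "bad" row is one whose column count differs from the header, and the names scrub, max_bad_ratio, verbose, and TooManyBadRows are all hypothetical. Valid rows take the short path; a reason string is built only for bad rows, and only when reporting is switched on.

```python
import csv
import io


class TooManyBadRows(Exception):
    """Raised when the share of bad rows exceeds the allowed ratio."""


def scrub(raw, max_bad_ratio=0.01, verbose=False):
    """Return (good_rows, reasons); fail if too many rows are bad.

    Assumption for this sketch: a row is bad when its column count
    differs from the header's. The threshold value is made up.
    """
    reader = csv.reader(io.StringIO(raw, newline=""))
    header = next(reader)
    good, reasons = [header], []
    total = bad = 0
    # The header is record 1, so data records start at 2.
    for record_no, row in enumerate(reader, start=2):
        total += 1
        if len(row) == len(header):
            good.append(row)  # fast path: no reporting work at all
            continue
        bad += 1
        if verbose:
            reasons.append(
                f"record {record_no}: expected {len(header)} columns, "
                f"found {len(row)}"
            )
    if total and bad / total > max_bad_ratio:
        raise TooManyBadRows(f"{bad} of {total} rows were bad")
    return good, reasons


rows, reasons = scrub("a,b\n1,2\n3\n4,5\n", max_bad_ratio=0.5, verbose=True)
print(rows)
print(reasons)
```

With the default threshold, the same input raises TooManyBadRows, since one bad row in three is far above 1%. This only covers row-level problems; it does not capture the quoting decisions made inside the parser.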
