This repository has been archived by the owner on May 24, 2022. It is now read-only.
In addition to #13, it would be useful if Scrubcsv could output details about which data was normalized and for what reason.
This could be enabled with a dedicated --verbose command-line option, for instance.
For the most part, the first pass of normalization in scrubcsv is actually performed by the csv parser. This handles things like normalization of quoting.
But there's no easy way to keep track of the decisions that csv makes under the hood.
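To illustrate the point, here is a minimal sketch of a parse-and-rewrite round trip. It uses Python's standard csv module purely as a stand-in for the parser scrubcsv relies on, so the details are an assumption and not scrubcsv's actual code. The reader hands back plain field values, and nothing in them records which fields were quoted in the input.

```python
import csv
import io

# Input with redundant quotes around b, plus quotes that are really needed.
raw = 'a,"b",c\n"x, y",z,"he said ""hi"""\n'

# Parsing yields plain strings; the original quoting is already gone here.
rows = list(csv.reader(io.StringIO(raw)))

# Writing the rows back out applies the writer's own (minimal) quoting rules.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
normalized = out.getvalue()

print(rows)
print(normalized)
```

The redundant quotes around b disappear in the output, while the fields that need quoting keep it. Reporting that change would mean comparing each raw record with its re-serialized form, which is extra work on every row.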
One of the underlying issues here is that CSV is a poorly-defined format. (There's a spec. Actually, there are multiple specs and they don't always agree, as far as I know. And many real-world implementations have issues you wouldn't expect from the specs.)
scrubcsv was designed to be run on hundreds of millions of rows, or occasionally tens of billions of rows. And the input files may come from a wide variety of different sources that produce subtly corrupt CSV files. At that scale, corrupt input is pretty much a given. The primary goal of parsing is to produce standards-compliant output, and to fail if too many rows are corrupt. So the underlying goals, in order of importance, look something like:
1. Speed.
2. Valid output.
3. Detection of large-scale systemic errors, as opposed to 1-in-a-million scattered errors.
I'm not necessarily opposed to adding more detailed reporting of errors, but not at the cost of performance on mostly-valid data. scrubcsv is often run in distributed batch jobs spread across dozens of servers, and performance is important.
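As a rough sketch of how these priorities can fit together, the following Python function drops rows whose column count does not match the header, keeps only two counters, and fails when the share of bad rows crosses a threshold. The function name scrub, the max_bad_fraction parameter, and the 1% default are all hypothetical and are not taken from scrubcsv itself.

```python
import csv
import io


def scrub(text, max_bad_fraction=0.01):
    """Drop rows whose column count differs from the header's.

    Returns (clean_csv_text, total_rows, bad_rows) and raises ValueError
    when the fraction of bad rows exceeds max_bad_fraction. The name and
    the 1% default are illustrative assumptions, not scrubcsv's interface.
    """
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")

    header = next(reader)
    writer.writerow(header)

    total = bad = 0
    for row in reader:
        total += 1
        if len(row) != len(header):
            bad += 1  # only a counter; no per-row detail is kept
            continue
        writer.writerow(row)

    # Scattered bad rows pass; a systemic problem aborts the run.
    if total and bad / total > max_bad_fraction:
        raise ValueError(f"{bad} of {total} rows were bad")
    return out.getvalue(), total, bad
```

With one malformed row in a thousand, the call returns the cleaned text along with the counts. With two malformed rows out of three, it raises. Counting is cheap on mostly-valid data, whereas recording a reason for every normalized field would add work to every row.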