Support "Integer" data types when data has no fractional numbers (floats, decimals) #43
Comments
Transferred to agate-sql, as it's responsible for the However, agate has only a number type, which doesn't track whether decimals were observed. So, agate probably needs to change first. Ah, so looking at agate's history, I see that there had been an IntColumn, but it was removed in wireservice/agate#64. The explanation is in wireservice/agate#35. Some of the relevant discussion:
So, I think you'll just have to run an Reintroducing an integer type in agate is too major a change. As described in the original issue, the main downside could be performance: in SQL, DECIMAL takes up more space than INTEGER, so if you're doing huge computations, a DECIMAL-based table might not fit in memory where an INTEGER-based table would. But... csvkit is designed for relatively small data. For big data, you don't want to use Python at all (unless you're handing all the computation to C, as numpy does, etc.). qsv uses Rust and has a
Bummer. :-( What about another boolean (like "Contains nulls"), say "Contains fractions", that would only be true if the type were numeric and at least one data value wasn't a whole number? Then users can create their own SQL, but at least know which fields were actually integers. Combined with a "Unique" boolean, the developer could determine the appropriate primary key (and unique indexes).
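To make the suggestion concrete, here is a rough sketch of what such per-column flags could look like. It uses only the Python standard library and is not how csvkit or agate compute their statistics; the function name `profile_column` and the dictionary keys are hypothetical, and empty strings are assumed to represent nulls.

```python
from decimal import Decimal, InvalidOperation


def profile_column(values):
    """Compute hypothetical per-column flags from raw CSV strings.

    Assumption: empty strings and None stand for nulls.
    """
    non_null = [v for v in values if v not in ("", None)]

    # Try to read every non-null value as a number; one failure means
    # the column is treated as non-numeric.
    try:
        numbers = [Decimal(v) for v in non_null]
    except InvalidOperation:
        numbers = []
    numeric = bool(numbers)

    # Compare numeric columns by value, so "2" and "2.0" count as duplicates.
    comparable = numbers if numeric else non_null

    return {
        "numeric": numeric,
        "contains_nulls": len(non_null) < len(values),
        # True only for numeric columns with at least one non-whole value.
        "contains_fractions": any(n != n.to_integral_value() for n in numbers),
        "unique": len(set(comparable)) == len(comparable),
    }
```

Under these assumptions, a column that is numeric, has no nulls, no fractions, and only unique values would be a candidate for an INTEGER primary key.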
Sure, agate has a MaxPrecision aggregation, so I've now added that to the output. If it's 0, then there are only whole numbers.
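For readers unfamiliar with the idea, the following is a minimal standard-library sketch of a "maximum precision" calculation. It is not agate's implementation: the helper name `max_precision` is hypothetical, values are assumed to be finite numeric strings, and trailing zeros (as in `2.50`) are assumed not to count toward precision, which may differ from what agate reports.

```python
from decimal import Decimal


def max_precision(values):
    """Return the largest number of digits after the decimal point.

    Assumptions: every value parses as a finite Decimal, and trailing
    zeros are ignored (so "2.50" has precision 1).
    """
    precisions = []
    for value in values:
        # normalize() strips trailing zeros; the exponent is then the
        # negated count of fractional digits (or positive for e.g. 1E+2).
        exponent = Decimal(value).normalize().as_tuple().exponent
        precisions.append(max(0, -exponent))
    return max(precisions, default=0)


# A result of 0 means the column holds only whole numbers, so an
# INTEGER column would be enough in hand-written SQL.
ids = ["1952", "1953", "2001"]
sql_type = "INTEGER" if max_precision(ids) == 0 else "DECIMAL"
```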
Looks like there was an earlier open issue here: wireservice/csvkit#1070
When the values are all integers, it would be nice to know that, instead of just "Number".
The daily subtitle file from opensubtitles.org is less than 100k in size, about 2000 lines, and is a nice dataset for showing off csvkit.
IDSubtitle, ImdbID and MovieYear would be better represented in the database as integer values, rather than decimal.
Thanks for your consideration.