Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support "Integer" data types when data has no fractional numbers (floats, decimals) #43

Closed
tacman opened this issue Oct 14, 2023 · 4 comments

Comments

@tacman
Copy link

tacman commented Oct 14, 2023

When the values are all integer, it would be nice to know that, instead of just "Number"

The daily subtitle file from opensubtitles.org is less than 100k is size, about 2000 lines, and is a nice dataset for showing off csvkit.

curl "http://dl.opensubtitles.org/addons/export/subtitles_day.txt.gz" |  gunzip -c > subtitles.txt
  
csvcut -c IDSubtitle,MovieYear subtitles.txt -t | csvstat -y 0
  1. "IDSubtitle"

	Type of data:          Number
	Contains null values:  False
	Unique values:         1856
	Smallest value:        9,747,231
	Largest value:         9,749,339
	Sum:                   18,092,851,467
	Mean:                  9,748,303.592
	Median:                9,748,352.5
	StDev:                 628.279
	Most common values:    9,747,231 (1x)
	                       9,747,232 (1x)
	                       9,747,233 (1x)
	                       9,747,234 (1x)
	                       9,747,235 (1x)

  2. "MovieYear"

	Type of data:          Number
	Contains null values:  True (excluded from calculations)
	Unique values:         76
	Smallest value:        1,931
	Largest value:         2,023
	Sum:                   3,619,369
	Mean:                  2,016.362
	Median:                2,022
	StDev:                 14.175
	Most common values:    2,023 (861x)
	                       2,016 (115x)
	                       2,015 (95x)
	                       2,021 (93x)
	                       2,019 (85x)

Row count: 1856
csvsql subtitles.txt 
CREATE TABLE subtitles (
	"IDSubtitle" DECIMAL NOT NULL, 
	"MovieName" VARCHAR NOT NULL, 
	"MovieYear" DECIMAL, 
	"LanguageName" VARCHAR NOT NULL, 
	"ISO639" VARCHAR NOT NULL, 
	"SubAddDate" TIMESTAMP, 
	"ImdbID" DECIMAL NOT NULL, 
	"SubFormat" VARCHAR NOT NULL, 
	"SubSumCD" DECIMAL NOT NULL, 
	"MovieReleaseName" VARCHAR, 
	"MovieFPS" DECIMAL NOT NULL, 
	"SeriesSeason" DECIMAL, 
	"SeriesEpisode" DECIMAL, 
	"SeriesIMDBParent" DECIMAL, 
	"MovieKind" VARCHAR NOT NULL, 
	"URL" VARCHAR NOT NULL
);

IDSubtitle, ImdbID and MovieYear would be better represented in the database as integer values, rather than decimal.

Thanks for your consideration.

@jpmckinney jpmckinney transferred this issue from wireservice/csvkit Oct 17, 2023
@jpmckinney
Copy link
Member

jpmckinney commented Oct 17, 2023

Transferred to agate-sql, as it's responsible for the csvsql database schema (DECIMAL vs INTEGER).

However, agate has only a number type, which doesn't track decimals observed.

So, agate probably needs to change first.


Ah, so looking at agate's history, I see that there had been an IntColumn, but it was removed wireservice/agate#64 The explanation is in wireservice/agate#35. Some of the relevant discussion:

I think it makes more sense to just have NumberColumn treat everything as Decimal type

The only thing that comes to mind is somtimes folk from the not-so-computational side on NICAR list get finicky about the display of things/getting rid of unwanted decimal bits and/or leading/trailing zeroes, which I guess ends up being either an issue of handling in the templating or just having to grok the difference between integer and decimal from the outset.

I am sensitive to the "no unnecessary rounding/conversion" issues, however, I'm less worried about that with a purely analytical library than I would be if there was a presentation component to this. Eliminating the extra type has one major benefit: you couldn't specify the wrong type when computing a new column. So for instance, if you create a column by dividing two integers, you need to make sure you specify the new column is Decimal. I can see people getting this wrong and it creating errors.

As best I can discern, Excel, OO, Google, etc all treat numbers internally as decimals and it's only at presentation time that something is "int-ifed".

the only reason I've been able to come up with not to eliminate IntColumn is that sometimes it might be more performant. That's not a good enough reason, so I'll probably pull the trigger on this today.


So, I think you'll just have to run an ALTER TABLE if you want INTEGER columns.

Reintroducing an integer type in agate is too major. As described in the original issue, the main downside could be performance, which in SQL is that DECIMAL takes up more space than INTEGER, so if you're doing huge computations, a DECIMAL-based table might not fit in memory where an INTEGER-based table can.

But... csvkit is designed for relatively small data. For big data, you don't want to use Python at all (unless you're handing all the computation to C, like numpy does, etc.). qsv uses Rust and has a to command to convert to SQL. I can't tell if it handles integers, but it might be worth looking into if you do need the columns to be integers.

@tacman
Copy link
Author

tacman commented Oct 17, 2023

Bummer. :-(

What about another boolean (like "Contains nulls"), that is "Contains fractions" that would only be true if type were numeric and there was at least one data value wasn't a whole number. Then users can create their own SQL, but at least know which fields were actually integers.

Combined with a "Unique" boolean, the developer could determine the appropriate primary key (and unique indexes).

@jpmckinney
Copy link
Member

jpmckinney commented Oct 17, 2023

Sure, agate has a MaxPrecision aggregation, so I now added that to the output.

If it's 0, then there are only whole numbers.

@jpmckinney
Copy link
Member

Looks like there was an earlier open issue here: wireservice/csvkit#1070

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants