Support "Integer" data types when data has no fractional numbers (floats, decimals) #43

tacman · 2023-10-14T12:51:44Z

When the values are all integer, it would be nice to know that, instead of just "Number"

The daily subtitle file from opensubtitles.org is less than 100k is size, about 2000 lines, and is a nice dataset for showing off csvkit.

curl "http://dl.opensubtitles.org/addons/export/subtitles_day.txt.gz" |  gunzip -c > subtitles.txt
  
csvcut -c IDSubtitle,MovieYear subtitles.txt -t | csvstat -y 0
  1. "IDSubtitle"

	Type of data:          Number
	Contains null values:  False
	Unique values:         1856
	Smallest value:        9,747,231
	Largest value:         9,749,339
	Sum:                   18,092,851,467
	Mean:                  9,748,303.592
	Median:                9,748,352.5
	StDev:                 628.279
	Most common values:    9,747,231 (1x)
	                       9,747,232 (1x)
	                       9,747,233 (1x)
	                       9,747,234 (1x)
	                       9,747,235 (1x)

  2. "MovieYear"

	Type of data:          Number
	Contains null values:  True (excluded from calculations)
	Unique values:         76
	Smallest value:        1,931
	Largest value:         2,023
	Sum:                   3,619,369
	Mean:                  2,016.362
	Median:                2,022
	StDev:                 14.175
	Most common values:    2,023 (861x)
	                       2,016 (115x)
	                       2,015 (95x)
	                       2,021 (93x)
	                       2,019 (85x)

Row count: 1856

csvsql subtitles.txt 
CREATE TABLE subtitles (
	"IDSubtitle" DECIMAL NOT NULL, 
	"MovieName" VARCHAR NOT NULL, 
	"MovieYear" DECIMAL, 
	"LanguageName" VARCHAR NOT NULL, 
	"ISO639" VARCHAR NOT NULL, 
	"SubAddDate" TIMESTAMP, 
	"ImdbID" DECIMAL NOT NULL, 
	"SubFormat" VARCHAR NOT NULL, 
	"SubSumCD" DECIMAL NOT NULL, 
	"MovieReleaseName" VARCHAR, 
	"MovieFPS" DECIMAL NOT NULL, 
	"SeriesSeason" DECIMAL, 
	"SeriesEpisode" DECIMAL, 
	"SeriesIMDBParent" DECIMAL, 
	"MovieKind" VARCHAR NOT NULL, 
	"URL" VARCHAR NOT NULL
);

IDSubtitle, ImdbID and MovieYear would be better represented in the database as integer values, rather than decimal.

Thanks for your consideration.

The text was updated successfully, but these errors were encountered:

jpmckinney · 2023-10-17T14:52:50Z

Transferred to agate-sql, as it's responsible for the csvsql database schema (DECIMAL vs INTEGER).

However, agate has only a number type, which doesn't track decimals observed.

So, agate probably needs to change first.

Ah, so looking at agate's history, I see that there had been an IntColumn, but it was removed wireservice/agate#64 The explanation is in wireservice/agate#35. Some of the relevant discussion:

I think it makes more sense to just have NumberColumn treat everything as Decimal type

The only thing that comes to mind is somtimes folk from the not-so-computational side on NICAR list get finicky about the display of things/getting rid of unwanted decimal bits and/or leading/trailing zeroes, which I guess ends up being either an issue of handling in the templating or just having to grok the difference between integer and decimal from the outset.

I am sensitive to the "no unnecessary rounding/conversion" issues, however, I'm less worried about that with a purely analytical library than I would be if there was a presentation component to this. Eliminating the extra type has one major benefit: you couldn't specify the wrong type when computing a new column. So for instance, if you create a column by dividing two integers, you need to make sure you specify the new column is Decimal. I can see people getting this wrong and it creating errors.

As best I can discern, Excel, OO, Google, etc all treat numbers internally as decimals and it's only at presentation time that something is "int-ifed".

the only reason I've been able to come up with not to eliminate IntColumn is that sometimes it might be more performant. That's not a good enough reason, so I'll probably pull the trigger on this today.

So, I think you'll just have to run an ALTER TABLE if you want INTEGER columns.

Reintroducing an integer type in agate is too major. As described in the original issue, the main downside could be performance, which in SQL is that DECIMAL takes up more space than INTEGER, so if you're doing huge computations, a DECIMAL-based table might not fit in memory where an INTEGER-based table can.

But... csvkit is designed for relatively small data. For big data, you don't want to use Python at all (unless you're handing all the computation to C, like numpy does, etc.). qsv uses Rust and has a to command to convert to SQL. I can't tell if it handles integers, but it might be worth looking into if you do need the columns to be integers.

tacman · 2023-10-17T15:10:10Z

Bummer. :-(

What about another boolean (like "Contains nulls"), that is "Contains fractions" that would only be true if type were numeric and there was at least one data value wasn't a whole number. Then users can create their own SQL, but at least know which fields were actually integers.

Combined with a "Unique" boolean, the developer could determine the appropriate primary key (and unique indexes).

…tion, wireservice/agate-sql#43

jpmckinney · 2023-10-17T18:48:39Z

Sure, agate has a MaxPrecision aggregation, so I now added that to the output.

If it's 0, then there are only whole numbers.

jpmckinney · 2023-10-17T19:43:27Z

Looks like there was an earlier open issue here: wireservice/csvkit#1070

jpmckinney transferred this issue from wireservice/csvkit Oct 17, 2023

jpmckinney closed this as completed Oct 17, 2023

jpmckinney added a commit to wireservice/csvkit that referenced this issue Oct 17, 2023

csvstat: Add a "Most decimal places" statistic and --max-precision op…

cc3dabb

…tion, wireservice/agate-sql#43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support "Integer" data types when data has no fractional numbers (floats, decimals) #43

Support "Integer" data types when data has no fractional numbers (floats, decimals) #43

tacman commented Oct 14, 2023

jpmckinney commented Oct 17, 2023 •

edited

Loading

tacman commented Oct 17, 2023

jpmckinney commented Oct 17, 2023 •

edited

Loading

jpmckinney commented Oct 17, 2023

Support "Integer" data types when data has no fractional numbers (floats, decimals) #43

Support "Integer" data types when data has no fractional numbers (floats, decimals) #43

Comments

tacman commented Oct 14, 2023

jpmckinney commented Oct 17, 2023 • edited Loading

tacman commented Oct 17, 2023

jpmckinney commented Oct 17, 2023 • edited Loading

jpmckinney commented Oct 17, 2023

jpmckinney commented Oct 17, 2023 •

edited

Loading

jpmckinney commented Oct 17, 2023 •

edited

Loading