Related to conversation in #737, I'd really like a memory-efficient (and faster) version of schema generation. Let's start by describing how csvkit currently generates a schema (this is a bit simplified):

1. `csvsql` passes a file object to the `from_csv` class method of `Table`
2. `from_csv` reads the file as a csv and saves all rows to a `rows` object
3. `from_csv` passes `rows` to `Table.__init__`
4. `Table.__init__` iterates through all rows to do type inference
5. `Table.__init__` iterates through all rows again, casting fields to the appropriate type and saving the result to a `new_rows` object. This effectively doubles the memory footprint until we exit from `__init__` and `rows` is garbage collected
6. The `Table.__init__` method exits and control is passed to the `csvsql` script
7. `csvsql` calls the `to_sql_create_statement` method on the `Table` object
8. `to_sql_create_statement` calls the `make_sql_table` method
9. If the user is not using the `--no-constraints` flag on `csvsql`, `make_sql_table` calculates precision and length constraints for every column. This effectively doubles the memory footprint again, because the agate aggregation methods call `column.values` and `column.values_without_nulls`, which create list objects of the values in a column. These methods are memoized, so the lists will remain in memory until the parent column object is garbage collected, which won't happen until `csvsql` exits
10. <agate-sql> The column information is passed to `make_sql_column`, which makes the appropriate SQL schema entry for that column
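For comparison, the current non-streaming path boils down to roughly this (a simplified sketch of what `csvsql` does under the hood; the file name is just an example):

```python
import agate
import agatesql  # noqa: F401 -- importing agate-sql adds to_sql_create_statement() to Table

# Every row is read, type-inferred, and cast into an in-memory Table up front...
table = agate.Table.from_csv('data.csv')

# ...and only then is the schema generated; computing length/precision
# constraints here walks column.values / column.values_without_nulls.
print(table.to_sql_create_statement('data'))
```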
Here's an alternative flow that I would suggest for a streaming process:
1. The csv reader is passed to a type-inference routine similar to the existing one, and all the rows are exhausted. In addition to inferring types, this single pass also gathers max length and precision information for each column
2. This information is passed to `make_sql_column`

So, basically, this bypasses creation of the `Table` object and expands the functionality of the type tester.
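To make the proposal concrete, here is a minimal sketch of that single pass with deliberately naive inference. The `streaming_column_info` and `_widen` helpers and the int/decimal/text ladder are hypothetical stand-ins for an expanded agate `TypeTester`; the point is only the shape of the output: one small record of type, max length, and decimal digits per column, ready to hand to `make_sql_column`.

```python
import csv
import decimal

def streaming_column_info(f):
    """One pass over a csv file object: guess a type per column and record
    the max length and decimal digits needed for constraints, without ever
    building an agate Table."""
    reader = csv.reader(f)
    names = next(reader)
    info = [{'type': None, 'max_length': 0, 'precision': 0} for _ in names]

    for row in reader:
        for i, value in enumerate(row):
            col = info[i]
            if value == '':
                continue  # treat empty strings as nulls
            col['max_length'] = max(col['max_length'], len(value))
            try:
                int(value)
                guess = 'int'
            except ValueError:
                try:
                    d = decimal.Decimal(value)
                    guess = 'decimal'
                    if d.is_finite():
                        # digits after the decimal point; a stand-in for the
                        # precision/scale constraint agate-sql computes
                        col['precision'] = max(col['precision'],
                                               -d.as_tuple().exponent)
                except decimal.InvalidOperation:
                    guess = 'text'
            col['type'] = _widen(col['type'], guess)

    return dict(zip(names, info))

def _widen(current, guess):
    """int can widen to decimal, and anything can widen to text."""
    order = ['int', 'decimal', 'text']
    if current is None:
        return guess
    return max(current, guess, key=order.index)

# Example usage:
# with open('data.csv', newline='') as f:
#     for name, col in streaming_column_info(f).items():
#         print(name, col)
```

Because the per-column state is just a few counters, nothing proportional to the number of rows stays in memory once the pass finishes.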
Thoughts?