
Chunking of larger tables in DB


This page documents a major development in this project that took a long time to fine-tune. It was finalized with v2.0.0 in Sep 2018, after being in back-burner development since May 2018.

Basic problem statement

The stop_times.txt file (and in some cases shapes.txt too) can be huge: a million-plus rows. In the older json-based DB system, reading and writing these tables took an extremely long time and in many cases simply failed with a memory error, and the json files created were also huge. This is described in detail in #4.

Change in DB

This prompted a transition from the json-based DB to the HDF5 format, which took care of a lot at the storage and retrieval level. But the problem of working with a very large table in memory remained: very long waits and a constant threat of out-of-memory errors.
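For reference, pandas (with PyTables installed) can write a table to HDF5 and read it back in a couple of calls. A minimal sketch, not necessarily the project's exact calls; file paths and the key name are illustrative:

```python
import pandas as pd

# Read a GTFS table as strings (illustrative path).
stops = pd.read_csv('gtfs/stops.txt', dtype=str)

# Store it in an HDF5 file; 'df' is just an arbitrary key name.
stops.to_hdf('db/stops.h5', key='df', mode='w')

# Later, read that one table back without touching anything else.
stops = pd.read_hdf('db/stops.h5', key='df')
```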

Chunking method

So from there, an idea for CHUNKING emerged.

  • Tables like stop_times.txt and shapes.txt are never actually handled in full anyway; at no point in managing GTFS are you processing the entire table. You're only working with one trip_id's rows in stop_times, or one shape_id's rows in shapes.
  • So why not store them in chunks?
  • Split the stop_times data into multiple chunks: stop_times_1.h5, stop_times_2.h5, ... each having half a million to a million rows.
  • Make a lookup json that maps each trip_id to the chunk file containing it (illustrated below).

Note: from this point onwards, stop_times and trip_id are used as the example, but the same applies to any table being chunked and its corresponding primary key.
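For illustration only (hypothetical trip_ids and file names), the lookup json is conceptually just a flat map from each trip_id to the chunk file holding its rows:

```python
import json

# Hypothetical contents of stop_times_lookup.json
lookup = {
    "trip_0001": "stop_times_1.h5",
    "trip_0002": "stop_times_1.h5",
    "trip_9876": "stop_times_7.h5",
}

with open('db/stop_times_lookup.json', 'w') as f:
    json.dump(lookup, f, indent=2)
```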

Then, when reading the table:

  • Load the lookup json and figure out which chunk our trip_id is mapped to.
  • Load only that chunk and leave the remaining data alone (see the sketch below).
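A minimal sketch of that read path, assuming the lookup structure shown earlier and chunk files stored under a db folder (names and the helper function are illustrative, not the project's actual API):

```python
import json
import pandas as pd

def read_trip_stop_times(trip_id, db_folder='db/'):
    # Find which chunk file this trip_id lives in.
    with open(db_folder + 'stop_times_lookup.json') as f:
        lookup = json.load(f)
    chunk_file = lookup[trip_id]

    # Load only that chunk; all other chunks stay on disk.
    chunk = pd.read_hdf(db_folder + chunk_file, key='df')

    # Return just this trip's rows.
    return chunk[chunk['trip_id'] == trip_id]
```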

Chunking the tables at import

When creating the data for the first time during a feed import, this chunking feature is leveraged: the csv itself is read in chunks, and each chunk is processed separately and deleted from memory before the next chunk is loaded. This way, huge feeds with huge stop_times.txt files can be imported by the program without requiring large-memory hardware.
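A simplified sketch of the chunked read, using pandas' chunksize parameter (chunk size and paths are illustrative, and the trip_id carry-over described in the next section is left out here):

```python
import pandas as pd

CHUNK_SIZE = 500000  # rows per csv chunk

# read_csv with chunksize returns an iterator, so only one chunk of the
# huge stop_times.txt sits in memory at any time.
for i, chunk in enumerate(pd.read_csv('gtfs/stop_times.txt', dtype=str,
                                      chunksize=CHUNK_SIZE), start=1):
    # Simplified: write each raw chunk straight to its own .h5 file.
    chunk.to_hdf('db/stop_times_{}.h5'.format(i), key='df', mode='w')
    del chunk  # free memory before loading the next chunk
```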

Contiguous Chunks

  • One challenge: how to ensure that one trip_id's rows don't spill over from one chunk to the next.
  • See the code for this in the GTFS import function:

https://github.com/WRI-Cities/static-GTFS-manager/blob/8c9a8f30e482a3dc1e4462ebba85beb9c180ce86/GTFSserverfunctions.py#L168

Sample console output of an import:

Storing stop_times in chunks.
stop_times_1.h5: 499987 rows
stop_times_2.h5: 500004 rows
stop_times_3.h5: 499994 rows
stop_times_4.h5: 500007 rows
stop_times_5.h5: 499993 rows
stop_times_6.h5: 500007 rows
stop_times_7.h5: 500004 rows
stop_times_8.h5: 499970 rows
stop_times_9.h5: 500013 rows
stop_times_10.h5: 500020 rows
stop_times_11.h5: 4333 rows
Appending the 22 rows of last ID to last chunk stop_times_11.h5
Lookup json: stop_times_lookup.json created for mapping ID trip_id to stop_times_n.h5 chunk files.

An explanation of the workflow:

  • From the loaded csv chunk, take all trip_ids except the last one (which may be continued in the next chunk) and store their rows in a .h5 file.
  • Store the last trip_id's rows in a "CarryOver" dataframe.
  • When the next csv chunk is loaded, concatenate the CarryOver rows with it, then repeat the same process.
  • This results in chunks having slightly fewer or more rows than the chosen chunk size.
  • After processing the last csv chunk, append the final set of CarryOver rows to the last chunk.

(this is a lot easier to code than it is to explain, so apologies if this is still confusing)
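To make it more concrete, here is a simplified, illustrative sketch of the carry-over logic under the same assumptions as the earlier snippets (rows grouped by trip_id in the csv, chunks keyed as 'df'); the actual implementation is the GTFSserverfunctions.py code linked above:

```python
import json
import pandas as pd

def chunk_stop_times(csv_path, db_folder='db/', chunk_size=500000):
    lookup = {}        # trip_id -> chunk file name
    carry_over = None  # rows of the last trip_id from the previous csv chunk
    file_num = 0

    for chunk in pd.read_csv(csv_path, dtype=str, chunksize=chunk_size):
        # Prepend the rows carried over from the previous csv chunk.
        if carry_over is not None:
            chunk = pd.concat([carry_over, chunk], ignore_index=True)

        # Hold back the last trip_id's rows: they may continue in the
        # next csv chunk, so don't write them yet.
        last_trip = chunk['trip_id'].iloc[-1]
        carry_over = chunk[chunk['trip_id'] == last_trip]
        to_write = chunk[chunk['trip_id'] != last_trip]

        file_num += 1
        filename = 'stop_times_{}.h5'.format(file_num)
        to_write.to_hdf(db_folder + filename, key='df', mode='w')
        for trip_id in to_write['trip_id'].unique():
            lookup[trip_id] = filename
        print('{}: {} rows'.format(filename, len(to_write)))

    # After the last csv chunk, append the final CarryOver rows to the
    # last chunk file, as in the sample console output above.
    if carry_over is not None and len(carry_over):
        last_file = 'stop_times_{}.h5'.format(file_num)
        existing = pd.read_hdf(db_folder + last_file, key='df')
        combined = pd.concat([existing, carry_over], ignore_index=True)
        combined.to_hdf(db_folder + last_file, key='df', mode='w')
        lookup[carry_over['trip_id'].iloc[0]] = last_file

    with open(db_folder + 'stop_times_lookup.json', 'w') as f:
        json.dump(lookup, f)
```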