Replies: 2 comments
- That all sounds nice, I'm just concerned with how much re-tinkering that would require on the part of the listener. We'd have to make the changes on both sides.
- I'll make sure to look at that before I start doing anything then
---
During experiments on improving import/cache-rebuild, I realized that several assumptions made during the original design no longer hold.
1- Json isn't the worst, any more.
Python's `json` performance is now pretty good, `ijson` saves memory/cpu (small sketch below), and while the `csv` parser has gotten a little slower, we have to wrap it in a couple of layers of code to make it actually usable relative to the database.
2- We have more storage, and it's faster.
json does use more disk than csv, and I was trying to avoid filling my own disk back then. It uses more bandwidth, except that it compresses really well if you use sensible constraints.
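To make the first point concrete, here's a minimal sketch of the `ijson` side (the file name and the top-level-array layout are assumptions, not the actual feed format): stream elements one at a time instead of loading the whole document up front.

```python
import ijson

# Minimal sketch: count records from a hypothetical large JSON array on disk
# without ever holding the whole document in memory.
count = 0
with open("listings.json", "rb") as f:
    # "item" yields each element of a top-level array: [{...}, {...}, ...]
    for listing in ijson.items(f, "item"):
        count += 1
print(f"streamed {count} listings")
```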
3- json vs jsonl
using nested, structured json can be "cool" at the cost of overheads: you have to load all the child elements before you can yield an object, whereas with jsonl each line is a complete record you can yield as soon as it's read.
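A minimal sketch of the jsonl side, assuming a hypothetical `listings.jsonl` with one object per line:

```python
import json

def iter_jsonl(path):
    """Yield one decoded record per line; each object is usable as soon as
    its line has been read, no need to parse the rest of the file first."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)

for record in iter_jsonl("listings.jsonl"):
    print(record)
```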
4- The sql database was a cache
At one point the .prices file was authoritative, the sql backing was just a binarization of it, hence the frenzied efforts to keep the cache rebuilt.
At this point, the code is hardy enough that the database is definitive (as ticketed). I have a branch-in-progress that aims to introduce schema versioning and the ability to do in-place imports that don't require rebuilding the database. That is, instead of creating a new file every time, it will empty all the tables, and only actually delete the .db file when the schema version changes. That will need to be refined before there's ever a schema change, so that it can migrate the data - probably something like `templates/TradeDangerous.migrate.0001-0002.sql`.
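A minimal sketch of how the versioning side could work, using SQLite's `PRAGMA user_version` (the pragma choice, the db path and the "pre-versioning means version 1" assumption are mine, not settled design; only the migration-template naming comes from above):

```python
import sqlite3
from pathlib import Path

SCHEMA_VERSION = 2  # hypothetical current schema version

def migrate(db_path="data/TradeDangerous.db"):
    """Bring an existing database up to SCHEMA_VERSION in place, one step at a time."""
    db = sqlite3.connect(db_path)
    (current,) = db.execute("PRAGMA user_version").fetchone()
    current = current or 1  # treat a pre-versioning database as schema version 1
    # Apply each migration template in order, e.g.
    #   templates/TradeDangerous.migrate.0001-0002.sql
    for version in range(current, SCHEMA_VERSION):
        script = Path(f"templates/TradeDangerous.migrate.{version:04d}-{version + 1:04d}.sql")
        db.executescript(script.read_text())
        db.execute(f"PRAGMA user_version = {version + 1}")
        db.commit()
    return db
```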
I also have a little module that you give two row-sources to, and it sequences through them to tell you what's new/modified/removed. That would allow us to reduce the amount of thrash we do at the database - instead of having to load everything in and do lookups, we only ever have to worry about records one at a time.
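The idea, very roughly (this is a from-memory sketch of the approach, not the actual module): walk two key-sorted row sources in lockstep and emit one change per record.

```python
def diff_rows(old_rows, new_rows, key=lambda row: row[0]):
    """Both iterables must be sorted by `key`. Yields ("added", row),
    ("removed", row) or ("modified", row) one record at a time, so neither
    side ever needs to be fully loaded or looked up against the other."""
    old_iter, new_iter = iter(old_rows), iter(new_rows)
    old, new = next(old_iter, None), next(new_iter, None)
    while old is not None or new is not None:
        if old is None or (new is not None and key(new) < key(old)):
            yield "added", new
            new = next(new_iter, None)
        elif new is None or key(old) < key(new):
            yield "removed", old
            old = next(old_iter, None)
        else:  # same key on both sides: compare the payloads
            if old != new:
                yield "modified", new
            old, new = next(old_iter, None), next(new_iter, None)
```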
Unfortunately, trying to wrangle the various CSV files ended up making this too much work to do nicely, pushing me to fall back to the old code for eddblink.
A final issue I ran into with that experiment was trying to break up and encapsulate the steps of getting the remote data to the local machine. I wanted to make it robust against download or import failures, but that left me with lots of versions of each file: "incoming" for a file being downloaded, "new" for a downloaded file waiting to be processed, "old" for the .csv file being backed up while we're working, etc.
I realized that what we perhaps ought to do is make better use of sqlite and use a database to store the incoming data as rows.
Sure, we could do that with CSV - either creating an "import.db" or an import schema in the main database - but essentially we could either throw the csv text in there and continue doing manual parsing, or...
If we use jsonl we have three options:
1/ store the text as is and decode it at some later point,
2/ use json.loads() to create a dict of the values and then pickle-store them using the sqlite binary blob/pickle decoder,
3/ use json.loads() to create a dict and then field-wise populate the table
With (3), that means we now have a real table, and once the data is loaded in there we can use sql to do the transferring and the conditions for us:
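A minimal sketch of what that could look like - the table and column names here (`StationItem_import`, `StationItem`, the price fields) are placeholders rather than the real schema, and it assumes a unique constraint on (station_id, item_id):

```python
import json
import sqlite3

db = sqlite3.connect("data/TradeDangerous.db")  # path is an assumption

def load_import(jsonl_lines):
    """Option 3: field-wise populate the import table from jsonl records."""
    rows = (json.loads(line) for line in jsonl_lines if line.strip())
    db.executemany(
        "INSERT INTO StationItem_import (station_id, item_id, buy_price, sell_price)"
        " VALUES (:station_id, :item_id, :buy_price, :sell_price)",
        rows,
    )

def merge_import():
    """Let SQL do the transferring and the conditions: upsert only what changed."""
    db.execute("""
        INSERT INTO StationItem (station_id, item_id, buy_price, sell_price)
        SELECT station_id, item_id, buy_price, sell_price
          FROM StationItem_import
         WHERE true
        ON CONFLICT (station_id, item_id) DO UPDATE SET
            buy_price  = excluded.buy_price,
            sell_price = excluded.sell_price
        WHERE (StationItem.buy_price, StationItem.sell_price)
              <> (excluded.buy_price, excluded.sell_price)
    """)
    db.commit()
```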
This means we no longer have to have files lying around; we read straight into the import tables/schema.
Secondly, if we stop nuking the main database and do these incremental imports, we can do a ton of the import work without blocking the main database: "import" will download data straight into the database without storing it to a file, populate the Import tables without interfering with the main db, and it can read the main db to reduce the data to be imported.
We could even set it up to do something like:
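One shape that could take, as a rough sketch only (the URL, the column names and the import-table name are all made up, and retries/error handling are glossed over): stream the jsonl download straight into the import table, never touching a file on disk.

```python
import json
import sqlite3
import requests

FEED_URL = "https://example.com/listings.jsonl"  # hypothetical jsonl feed

def download_into_import(db_path="data/TradeDangerous.db"):
    db = sqlite3.connect(db_path)
    with requests.get(FEED_URL, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        # Each line is a complete record: decode and insert as it arrives.
        rows = (
            json.loads(line)
            for line in resp.iter_lines(decode_unicode=True)
            if line
        )
        db.executemany(
            "INSERT INTO StationItem_import (station_id, item_id, buy_price, sell_price)"
            " VALUES (:station_id, :item_id, :buy_price, :sell_price)",
            rows,
        )
    db.commit()
```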
jsonl will be much faster to parse than csv, so getting it into the database should be a lot easier. We may have to write some temporary code that reads specific .csv files and regurgitates them as jsonl while we wait for sources to switch to generating jsonl in the first place.
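That shim could be as small as something like this (simplified: a real version would coerce the numeric columns rather than leaving every field as a string):

```python
import csv
import json
import sys

def csv_to_jsonl(csv_path, out=sys.stdout):
    """Re-emit a .csv file as jsonl: one JSON object per input row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            out.write(json.dumps(row) + "\n")

if __name__ == "__main__":
    csv_to_jsonl(sys.argv[1])
```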