Replies: 2 comments
- That all sounds nice, I'm just concerned with how much re-tinkering that would require on the part of the listener. We'd have to make the changes on both sides.
- I'll make sure to look at that before I start doing anything then
---
During experiments on improving import/cache-rebuild, I realized that several assumptions made during the original design no longer hold.
1- Json isn't the worst, any more.
Python's `json` performance is now pretty good, `ijson` saves memory/cpu (small sketch below), and while the `csv` parser has gotten a little slower, we have to wrap it in a couple of layers of code to make it actually usable relative to the database.
2- We have more storage, and it's faster.
json does use more disk than csv, and I was trying to avoid filling my own disk back then. It uses more bandwidth, except that it compresses really well if you use sensible constraints.
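To make the first point concrete, here's a minimal sketch of the `ijson` side (the file name and the top-level-array layout are assumptions, not the actual feed format): stream elements one at a time instead of loading the whole document up front.

```python
import ijson

# Minimal sketch: count records from a hypothetical large JSON array on disk
# without ever holding the whole document in memory.
count = 0
with open("listings.json", "rb") as f:
    # "item" yields each element of a top-level array: [{...}, {...}, ...]
    for listing in ijson.items(f, "item"):
        count += 1
print(f"streamed {count} listings")
```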
3- json vs jsonl
using nested, structured json can be "cool" at the cost of overheads: you have to load all the child elements before you can yield an object, whereas with jsonl each line is a complete record you can yield as soon as it's read.
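A minimal sketch of the jsonl side, assuming a hypothetical `listings.jsonl` with one object per line:

```python
import json

def iter_jsonl(path):
    """Yield one decoded record per line; each object is usable as soon as
    its line has been read, no need to parse the rest of the file first."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)

for record in iter_jsonl("listings.jsonl"):
    print(record)
```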
4- The sql database was a cache
At one point the .prices file was authoritative, the sql backing was just a binarization of it, hence the frenzied efforts to keep the cache rebuilt.
At this point, the code is hardy enough that the database is definitive (as ticketed). I have a branch-in-progress that aims to introduce schema versioning and the ability to do in-place imports that don't require rebuilding the database. That is, instead of creating a new file every time, it will empty all the tables, and only actually delete the .db file when the schema version changes. That will need to be refined before there's ever a schema change, so that it can migrate the data - probably something like `templates/TradeDangerous.migrate.0001-0002.sql`.
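A minimal sketch of how the versioning side could work, using SQLite's `PRAGMA user_version` (the pragma choice, the db path and the "pre-versioning means version 1" assumption are mine, not settled design; only the migration-template naming comes from above):

```python
import sqlite3
from pathlib import Path

SCHEMA_VERSION = 2  # hypothetical current schema version

def migrate(db_path="data/TradeDangerous.db"):
    """Bring an existing database up to SCHEMA_VERSION in place, one step at a time."""
    db = sqlite3.connect(db_path)
    (current,) = db.execute("PRAGMA user_version").fetchone()
    current = current or 1  # treat a pre-versioning database as schema version 1
    # Apply each migration template in order, e.g.
    #   templates/TradeDangerous.migrate.0001-0002.sql
    for version in range(current, SCHEMA_VERSION):
        script = Path(f"templates/TradeDangerous.migrate.{version:04d}-{version + 1:04d}.sql")
        db.executescript(script.read_text())
        db.execute(f"PRAGMA user_version = {version + 1}")
        db.commit()
    return db
```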
I also have a little module that you give two row-sources to, and it sequences through them to tell you what's new/modified/removed. That would allow us to reduce the amount of thrash we do at the database - instead of having to load everything in and do lookups, we only ever have to worry about records one at a time.
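The idea, very roughly (this is a from-memory sketch of the approach, not the actual module): walk two key-sorted row sources in lockstep and emit one change per record.

```python
def diff_rows(old_rows, new_rows, key=lambda row: row[0]):
    """Both iterables must be sorted by `key`. Yields ("added", row),
    ("removed", row) or ("modified", row) one record at a time, so neither
    side ever needs to be fully loaded or looked up against the other."""
    old_iter, new_iter = iter(old_rows), iter(new_rows)
    old, new = next(old_iter, None), next(new_iter, None)
    while old is not None or new is not None:
        if old is None or (new is not None and key(new) < key(old)):
            yield "added", new
            new = next(new_iter, None)
        elif new is None or key(old) < key(new):
            yield "removed", old
            old = next(old_iter, None)
        else:  # same key on both sides: compare the payloads
            if old != new:
                yield "modified", new
            old, new = next(old_iter, None), next(new_iter, None)
```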
Unfortunately, trying to wrangle the various CSV files ended up making this too much work to do nicely, pushing me to fall back to the old code for eddblink.
A final issue I ran into with that experiment was trying to break up and encapsulate the steps of getting the remote data to the local machine. I wanted to make it robust against download or import failures, but that left me with lots of versions of each file: "incoming" for a file being downloaded, "new" for a downloaded file waiting to be processed, "old" for the .csv file being backed up while we're working, etc.
I realized that what we perhaps ought to do is make better use of sqlite and use a database to store the incoming data as rows.
Sure, we could do that with CSV - either creating an "import.db" or an import schema in the main database - but essentially we could either throw the csv text in there and continue doing manual parsing, or...
If we use jsonl we have three options:
1/ store the text as is and decode it at some later point,
2/ use json.loads() to create a dict of the values and then pickle-store them using the sqlite binary blob/pickle decoder,
3/ use json.loads() to create a dict and then field-wise populate the table
With (3), that means we now have a real table, and once the data is loaded in there we can use sql to do the transferring and the conditions for us:
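A minimal sketch of what that could look like - the table and column names here (`StationItem_import`, `StationItem`, the price fields) are placeholders rather than the real schema, and it assumes a unique constraint on (station_id, item_id):

```python
import json
import sqlite3

db = sqlite3.connect("data/TradeDangerous.db")  # path is an assumption

def load_import(jsonl_lines):
    """Option 3: field-wise populate the import table from jsonl records."""
    rows = (json.loads(line) for line in jsonl_lines if line.strip())
    db.executemany(
        "INSERT INTO StationItem_import (station_id, item_id, buy_price, sell_price)"
        " VALUES (:station_id, :item_id, :buy_price, :sell_price)",
        rows,
    )

def merge_import():
    """Let SQL do the transferring and the conditions: upsert only what changed."""
    db.execute("""
        INSERT INTO StationItem (station_id, item_id, buy_price, sell_price)
        SELECT station_id, item_id, buy_price, sell_price
          FROM StationItem_import
         WHERE true
        ON CONFLICT (station_id, item_id) DO UPDATE SET
            buy_price  = excluded.buy_price,
            sell_price = excluded.sell_price
        WHERE (StationItem.buy_price, StationItem.sell_price)
              <> (excluded.buy_price, excluded.sell_price)
    """)
    db.commit()
```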
This means we no longer have to have files lying around; we read straight into the import tables/schema.
Secondly, if we stop nuking the main database and do these incremental imports, we can do a ton of the import work without blocking the main database: "import" will download data straight into the database without storing it to a file, populate the Import tables without interfering with the main db, and it can read the main db to reduce the data to be imported.
We could even set it up to do something like:
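One shape that could take, as a rough sketch only (the URL, the column names and the import-table name are all made up, and retries/error handling are glossed over): stream the jsonl download straight into the import table, never touching a file on disk.

```python
import json
import sqlite3
import requests

FEED_URL = "https://example.com/listings.jsonl"  # hypothetical jsonl feed

def download_into_import(db_path="data/TradeDangerous.db"):
    db = sqlite3.connect(db_path)
    with requests.get(FEED_URL, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        # Each line is a complete record: decode and insert as it arrives.
        rows = (
            json.loads(line)
            for line in resp.iter_lines(decode_unicode=True)
            if line
        )
        db.executemany(
            "INSERT INTO StationItem_import (station_id, item_id, buy_price, sell_price)"
            " VALUES (:station_id, :item_id, :buy_price, :sell_price)",
            rows,
        )
    db.commit()
```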
jsonl will be much faster to parse than csv, so getting it into the database should be a lot easier. We may have to write some temporary code that reads specific .csv files and regurgitates them as jsonl while we wait for sources to switch to generating jsonl in the first place.
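That shim could be as small as something like this (simplified: a real version would coerce the numeric columns rather than leaving every field as a string):

```python
import csv
import json
import sys

def csv_to_jsonl(csv_path, out=sys.stdout):
    """Re-emit a .csv file as jsonl: one JSON object per input row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            out.write(json.dumps(row) + "\n")

if __name__ == "__main__":
    csv_to_jsonl(sys.argv[1])
```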