SEVERE performance issues doing EDDBlink import when DB is large. #126
The problem is cumulative: |
Can I suggest we expand on this concept: the server side works. It works. So I suggest concentrating on optimization (memory use, wherever that remains an issue, and then speed) rather than adding any new features. Draw a line under it, call it feature complete, until we have something which can process imports and output runs in a more reasonable timeframe. |
I saw we recently enabled incremental vacuuming and some other sqlite options to improve performance? It's probably worth turning those off locally and investigating SQLite's features for reporting on what the vacuum will do (keywords: sqlite3, pragma, vacuum, analyze, optimize), which may explain the issue - maybe there's an index that needs tweaking or scrapping. You also made it more aggressive about flushing the cache (closing the db, even) yesterday, to counter the memory bloat that my simple executemany was causing. So this makes it smell like a transaction issue - either an absence of them at some point or an unexpected nesting of them. Look for manual transactions (execute "BEGIN"/"COMMIT") and switch those to context-managed cursors so that the sqlite3 lib isn't taken by surprise. I also strongly recommend looking at the |
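A minimal sketch of the two transaction styles being contrasted, using a scratch in-memory table with made-up columns rather than TD's real StationItem schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")   # scratch db for the sketch
db.execute("CREATE TABLE StationItem (station_id INTEGER, item_id INTEGER, price INTEGER)")
row = (1, 10, 100)                 # made-up values

# Manual style - easy for sqlite3's own transaction tracking to be surprised by:
cur = db.cursor()
cur.execute("BEGIN")
cur.execute("INSERT OR IGNORE INTO StationItem VALUES (?, ?, ?)", row)
cur.execute("COMMIT")

# Context-managed style - the connection opens the transaction and commits
# (or rolls back on an exception) for us:
with db:
    db.execute("INSERT OR IGNORE INTO StationItem VALUES (?, ?, ?)", row)
```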
https://stackoverflow.com/a/17794805/257645 I strongly recommend using executemany whenever there's a loop, if possible. If you have to cap it, that's fine, but you save a bunch of overhead. Did you try running the imports with pyinstrument?
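Something along these lines with the pyinstrument library API - run_import is just a stand-in for whatever kicks off the eddblink/spansh import, and I'm going from memory on the speedscope renderer name:

```python
from pyinstrument import Profiler
from pyinstrument.renderers import SpeedscopeRenderer

def run_import():
    ...  # stand-in for the actual eddblink/spansh import call

profiler = Profiler()
profiler.start()
run_import()
profiler.stop()

# Text report (the "blurb" dumped at the end of a run):
print(profiler.output_text(unicode=True, color=True))

# Speedscope-compatible dump you can drag into https://www.speedscope.app/
with open("speed.scope", "w") as out:
    out.write(profiler.output(renderer=SpeedscopeRenderer()))
```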
at the end it will give you a bunch of blurb and you can customize how it does that, but I tend to run it like the above and then grab the session from the last two lines:
and run something like:
Sample speed.scope from an eddblink -O solo run: speedscope.zip. Unzip it, hit https://www.speedscope.app/ and drag the file into the browse button. Use the buttons in the top left to change the type of view. |
The only transaction being performed when starting from an empty database is:
Removing the "OR IGNORE" part doesn't have any effect. eddblink used to use the executemany function after it finished gathering the list of all changes to be made, by appending the values for each change to the list. It was changed to what it is now, where each listing is inserted during processing. The pragmas that were introduced knocked it down to a little over 2 hours. |
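A minimal sketch of the trade-off being described, using an in-memory table with made-up columns rather than TD's real StationItem schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE StationItem (station_id INTEGER, item_id INTEGER, price INTEGER)")
listings = [(1, 10, 100), (1, 11, 120), (2, 10, 95)]   # made-up rows

# Old approach: gather everything first, then one big executemany.
# Fast per row, but the Python list holds the whole import in memory.
with db:
    db.executemany("INSERT OR IGNORE INTO StationItem VALUES (?, ?, ?)", listings)

# Current approach: insert each listing as it is parsed.
# Low memory, but each execute pays per-statement overhead (and, without an
# enclosing transaction, its own commit).
for row in listings:
    db.execute("INSERT OR IGNORE INTO StationItem VALUES (?, ?, ?)", row)
db.commit()
```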
I've been working on TD for a week straight now, during any free time I had outside of work and sleep. I'm taking a break for a while. If it ain't a bug, it can wait or someone else can tackle it, is my current mode. |
The pragmas were me. WAL journal has a massive speed benefit on Linux, though it seems less so on Windows, and the vacuum keeps the db size to a minimum. The server is set up to vacuum every couple of days and I do want it to do incremental, as a full vac takes a chunk of time. That said, the first vac and optimise on the server's db reduced it from 6GB and change to 4GB and change, so it's definitely worth it. |
I hope I didn't sound like I was handing out assignments :( if you can run the pyinstrument command on the server and snag a .scope file it might be really interesting. |
I hope I didn't sound like I was feeling put upon. I'm just so tired from working on this for a week solid and need a break for a while. I'm running pyinstrument; I'll post the results for you if you want to have a look at it. It definitely seems to be more of an issue on my *nix box than on Tromador's Win one. |
Jonathon first, trade calculator for online game later, man. TD is a vague distant memory for me - I honestly thought one of you added the plugin system after I'd ambled off, and I was kinda wondering wtf the saitek crap was until I looked at one of the files and saw the first line comment. cough Point: I don't know what the priorities are or the full details of what's wear and what's tear. If you think I can productively be of use somewhere specific, you'll need to point me, and while I'm human and will get my knickers in a twist or my butt hurt in the moment when it comes up that I'm responsible for any particular cretinous decision or demonstration of derp, I'm not a diva: my first job was for a UK "national" contract software company as an embedded engineer, and when the company went belly up I got offers from each customer I'd worked for to go finish development, but I either didn't really "get" the companies or feel adequate, except for one, in my home town, on the fishing docks. So I accepted their offer, which came with a caveat: they weren't big enough to pay for a programmer, they were a cold-store/blast-freezer/fish packing plant. So if it got busy, I worked packing fish, loading/unloading 50lb boxes of frozen shark bellies or halibut. Yeah, I've worked some fancy places since, but I didn't stay. |
We aren't on identical hardware though, so although I agree from our testing that Linux and Windows do seem to behave differently, don't forget I have lots of CPU; on the other hand you have more RAM than me. It's hard to draw comparisons based simply on OS, therefore. I do tend to waffle when I am tired; point being, there are many variables outside of OS, particularly hardware. |
I quite deliberately put Est 2015 at the start of the thread on E:D forums. Other tools may be quicker, flashier whatever, but nothing else can do what we can and nothing else has our stubborn will to survive. |
This is my list of things needing doing at some point. The Niceness value is how important I consider each item, with lower being more important, just like *nix CPU prioritizing.

1) N = -20: Anything you can do to improve import performance is welcome. On my box, with the current TD, both spansh and eddblink take a bit over 3 hours to do an import. Obviously different machines will take different times, but ideally anything but a potato should have sub 1 hour import duration. My box has an 8-core CPU with 64GB RAM, so it's not a potato. (Also I checked the drive, and it has 45MB/s+ write speeds, but neither import ever went above 5MB/s, so I highly doubt that's the bottleneck for me.)

2) N = 5: The spicing up of the visuals is certainly welcome. I like what you've done there and would be pleased to see the visual improvements you've done to spansh done to eddblink (and the rest of TD) as well.

3) N = -10: TD needs to be able to build the other Tables. We have two reliable sources, those being the daily dumps from Spansh, and the EDCD lists, such as rare_items.csv. We currently use the Spansh data (and EDDN via the listener) to fill out System, Station, Item, and StationItem, and could potentially use it to fill out Ship, ShipVendor, Upgrade, and UpgradeVendor. We currently use EDCD to fill out FDevOutfitting and FDevShipyard, and could potentially use it to fill out RareItem, Item, and Category. There's obviously some overlap, so in the case where we can use either, I think EDCD is the better option. That leaves Added as the only one that can't be easily updated, but as that's only used in System, to indicate (I think?) what version of Elite Dangerous a system was added to the galaxy, I feel like it should individually get N = 20. Getting everything populated via Spansh means updating to utilize the information provided by the dump that is currently being ignored, as well as pulling from the EDCD sources when building the DB from scratch or when they've been updated. Getting everything populated via the TD listener means listening to the other EDDN messages besides just the commodity ones, and processing them accordingly. Since the listener uses Spansh, this only matters for things which change more often than once a day. The eddblink plugin pulls from Trom's server, so getting this into the listener automatically gets this into eddblink as well.

4) N = -18: Rescue Ships are EVIL to TD, because not only do they move, they also change their name, as they are named after whatever station they are currently rescuing. This results in duplicate entries in the prices file whenever two different rescue ships go to the same station.

5) N = 12: Working GUI, as in feature complete with TDHelper.

6) N = 20: Update the wiki to account for changes made to TD since the last time it was updated.

I can probably get all that done myself, but I know it'll take a long time and be coded poorly, because I'm entirely self-taught on this stuff. I'm currently in college for computer science, so I'm better at this stuff than I was when I picked up the torch, but I'm still learning, and most of it I'm learning the hard way. |
Which reminds me, I set this to run before I left for work. |
I'd venture to add. Import performance is the biggest problem as eyeonus says. No doubt. But... I reinstalled E:D being as we had working TD. I guess I was feeling nostalgic. So I have been using it in anger and there are in fact two things which leave me twiddling my thumbs. The first is absolutely importing data. The second is exporting data, by which I mean getting query results from trade run. The size of the dataset has grown in just the same way for making queries, so that the tree of possibilities which must be examined has become vast. This can be mitigated by filtering down, excluding by age for example, or no odyssey is a good one, but if you want to do a full check with a huge hold, enough money to fill it with whatever, and a jump range sufficient to cross the bubble? It's not usable. I am taking to lying about my real jump range to constrain the tree size. So yes. Import speed does need optimization, but so does the output. |
@eyeonus The visual stuff would ultimately be about code reduction - if rich had existed 10 years ago I'd not have written progressbar etc. I was looking at the transfers code, and it feels like half that module is just about trying to get "requests" imported. That's covered now by the requests.txt file, right?

GUI: I always figured that would end up being web-based, but the web-app concepts at the time weren't really much use and I had reasons not to really want to get too much into html. But perhaps 'gradio' would be an option.

So that brings us to the meaningful work, and I guess I always suspected TD would find itself struggling for trying to be a cold-runner rather than some kind of service. Just thinking out loud here: We pay an entry fee for storing our data: every operation we need to do has to be expressed textually as SQL, parsed by the SQLite code; sqlite has to plan what to do and then execute on it. Lot of cpu cycles just to start reading. The ordinary payoff is a sort of storage-jit: tell the engine what you want in the abstract and it can figure out the fastest way to do it. We're not doing elaborate queries with sophisticated multi-table joins, groups, sorts, wheres, etc. I mean, spitballs - consider each individually, not as a sequence:
for system in systems: ...

Using the temp table means we can avoid either building our own list or the cost of per-item queries or database indexes while we're doing this work. We can also parallelize the work here; we fully populate the temporary table without index overheads etc, and then have sqlite move things over. Now we're making SQL pay for itself because the query planner/executor can use its own internal data about quantities and stuff to do the transfer. That's the theory - I don't know if it's true for SQLite.
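A sketch of the staging idea, under the assumption that the final INSERT ... SELECT is where SQLite's planner earns its keep (made-up columns again, not TD's real schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE StationItem ("
    " station_id INTEGER, item_id INTEGER, price INTEGER,"
    " PRIMARY KEY (station_id, item_id))"
)
staged_rows = [(1, 10, 100), (1, 11, 120), (2, 10, 95)]  # built while parsing the import file

# 1. Pour rows into an unindexed TEMP table - no constraint/index upkeep per row.
db.execute("CREATE TEMP TABLE StationItem_stage (station_id INTEGER, item_id INTEGER, price INTEGER)")
with db:
    db.executemany("INSERT INTO StationItem_stage VALUES (?, ?, ?)", staged_rows)

# 2. One planner-driven bulk transfer into the real, indexed table.
with db:
    db.execute("INSERT OR REPLACE INTO StationItem SELECT * FROM StationItem_stage")
    db.execute("DROP TABLE StationItem_stage")
```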
We intuitively get why a movie ticket costs $25 when renting the film costs $5, but TD is paying $25 to rent. SQLite's benefit is that you don't need a server; its con is that you don't get a server doing all the heavy lifting in another thread/process or possibly on a different cpu. That's a fundamental part of the hitch here. We've turned on indexes and joins. We've started doing vacuums, and the problem is they can only happen while there's a process running, and we're it. We could:

a) have TD split itself in two, a TradeAdder that implements TradeDB and soaks up the costs of keeping the database fresh, and a TradeGecko that shuttles stuff back and forth between the user and the backend; not entirely implausible, mostly separating concerns and building facades/bridges

b) Make it possible, or even a default, to have a persistent database spun up one of several ways - docker, podman, vm, local or remote url - and use that, with sqlite as a naive fallback. This is where we might look to something like SQLAlchemy, PeeWee or pony. It's also where we might actually change some of the operations encumbered on TradeDB so that it doesn't have to in-memory everything.
A variant of (a) in the above: we literally just make TD keep itself alive over probably a non-sql persistent store (either embedded or a second process; key value or nosql: leveldb, rocksdb, memcache) and then have a very simple api for talking to it that it can use on itself; either thru json-rpc with something like flask/eel, or literally have it read stdin and accept/process trade commands and pass them to the existing command-line arg parser:
via pseudo code:
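Something like this is what's being pictured; serve_commands and run_command are hypothetical names, the point being that the parser and the warm TradeDB outlive a single command:

```python
import shlex
import sys

def serve_commands(run_command):
    """Read trade commands from stdin and hand them to the existing CLI parser,
    so the database connection and caches stay warm between commands."""
    for line in sys.stdin:
        args = shlex.split(line.strip())
        if not args:
            continue
        if args[0] in ("quit", "exit"):
            break
        run_command(args)   # e.g. the existing argparse entry point

# Usage sketch:
# serve_commands(lambda args: print("would run:", args))
```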
I don't know off the top of my head where we can leverage this right now to any good use; I created an experimental Rust-based python-package providing a rust version of the "count lines in file" method. That's a hard one to optimize because cpu isn't the main bottleneck. I did a version where it would on startup create a number of counting workers, and then you simply gave them byte-regions to count from. And there was no version where it wasn't slower with concurrency. I wasn't exhaustive by any means, but you're trying to apply a very simple set of logic and rush thru memory, so there's not a lot of juggle room. |
that is uniquely ... lacking info lol. I've never seen it give that little information. what are the specs on the server? |
That's not the server, that's my machine. Tromador would have to provide you with any server runs. |
??? What requests.txt file?
I was considering using pySimpleGUI, I'll have to look at gradio, never heard of it.
Since the speed of the import goes down as the table gets bigger, this sounds like a great idea. The vast majority of the information goes into StationItem: all the other tables, completely filled with the spansh data, don't amount to even 0.5GB, but add in the market data and it grows to ~7GB, so while I doubt we'd see much improvement doing this on any of the other tables, it might be a big help for that particular one.
I"m sorry, that's all over my head. I defer to your expertise. |
| I"m sorry, that's all over my head. I defer to your expertise. Oh, just to make you regret saying that: | ??? What requests.txt file? ... was a typo of "requirements.txt". What I mean't was, in order for anyone to pip install Trade-Dangerous, it depends on requirements.txt on its own, right? Duh. Gonna check that real quick, dum-me: so it should be safe to remove all that crap screwing around to make sure requests is installed (in transfers.py), haha |
Oh. Yes. |
| I was considering using pySimpleGUI, I'll have to look at gradio, never heard of it.

I don't have useful input on GUIs really; 'gradio' is - I believe - what people are using to build web-based-gui-apps for things in python, so it's not sexy but you spend a lot more time on what the gui elements do than on the screen itself. So, for instance, "StableDiffusion" has a gradio-based UI, as does the popular "run your own local GPT" https://github.com/oobabooga/text-generation-webui... https://www.gradio.app/ Yeah, I know it says machine learning, but they're pitching to their new market because of those last two.

I've used flask for building UIs before, but then you need to build an entire front-end application in angular or react or cut-my-wrists-for-me-now-please. If you're going to be old school about it and do a desktopy app: wxWidgets is shit, but you can keep it relatively simple and use 'wxFormBuilder' to design the app - I've done that a few times lately: https://github.com/wxFormBuilder/wxFormBuilder. So pySimpleGui might be rockstar - check that out before wxWidgets, because I hate it as much as I use it. You could conceivably do something with Dear imgui - I found it easy enough to use that I wrote wrappers for it in C++ (https://github.com/kfsone/imguiwrap) when I used it a few years ago; the python version is ok, but its best use case imho is for in-game debug screens/menus etc.

So I'll look at doing some experiments on the item import and how it is currently working. How often have frontier been adding items? I wonder if it would make sense to actually alter the row format of StationItem so that there are two rows per station: one for selling, one for buying. |
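To picture the gradio suggestion above, a minimal sketch; trade_run and its parameters are invented for illustration, not TD's actual API:

```python
import gradio as gr

def trade_run(origin: str, capacity: float, ly_per_jump: float) -> str:
    # invented stand-in for calling TD's "run" command and returning its text output
    return f"would compute a run from {origin} with {capacity}t and {ly_per_jump}ly/jump"

demo = gr.Interface(
    fn=trade_run,
    inputs=[
        gr.Textbox(label="Origin station"),
        gr.Number(label="Capacity (t)"),
        gr.Number(label="Jump range (ly)"),
    ],
    outputs=gr.Textbox(label="Route"),
)
demo.launch()  # serves a local web UI in the browser
```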
Right now I'm trying a little experiment that reintroduces a per-station cursor; we're not building a list of items for executemany, but we're also not paying the full cost of adding each and every item in its own mini-transaction. Also breaking up the code a little bit, in places where we're not invoking methods in loops and can afford the extra 40-50ns it takes to call a method in python lol. So for right now I think it would be useful for us to look at:
|
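A rough sketch of the per-station middle ground described above - one executemany per station inside a single outer transaction, rather than one statement per item or one giant list (columns made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE StationItem (station_id INTEGER, item_id INTEGER, price INTEGER)")

def station_batches():
    # stand-in for the parser: yields (station_id, [(station_id, item_id, price), ...])
    yield 1, [(1, 10, 100), (1, 11, 120)]
    yield 2, [(2, 10, 95)]

with db:                      # one outer transaction for the whole import
    cursor = db.cursor()
    for station_id, items in station_batches():
        cursor.executemany("INSERT INTO StationItem VALUES (?, ?, ?)", items)
```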
forget the row-shape idea then lol |
Just so you know - I'm not remotely above figuring out how to get an AI to help me, but I'm also not expecting it to be actual help. However... Giving it the ability to read the repository definitely makes it more useful: |
Do we care about the added column any more? |
I gave the Added TABLE an N=20. I never cared about that column. :D |
Just want to say, the recent PR is definitely an improvement. Even with the additional 'optimization' step, it finishes processing both listings and listings-live faster than the old way could get through just listings. I'm happy, and extremely grateful. |
@ultimatespirit appreciate the diligence and the detailed analysis; apologies to both of you for the churn I've caused lately - @eyeonus I'll try to do better about making less work for you man. I wanna make sure you don't think I'm gonna be a diva if you have to put a flea in my ear or point out I'm a nub.

| Oh. So the filesystem-eliminates-empty-pages thing, and the hash-bucketing thing, wouldn't eliminate the empty bits in that case? Or am I misunderstanding something?

The eliminates-empty-pages thing would be "sparse files", but when I say "empty pages" I'm thinking in terms of a handful, as opposed to the file starting with 3906 empty pages (pages required to represent the 128000 unused IDs at the start of the ID-space). But I'm spitballing there. The hash-bucket has pros and cons, primarily that it requires a computation to get from actual-fdev-id to bit-no in the bitmap, and after doing it the resulting bitmap is only good for a one-way lookup: the "stations selling this item" bitmap on an item object can be used to go from a station id to do a quick bit-check of "does this station have this item", but it might not be possible to do the reverse, to say "bit 40401 is set, what does that mean?", because there isn't guaranteed to be a meaningful way to reverse the algorithm and calculate what station id bit 40401 represents.

I did some experimentation to try and find algorithms that might work, but at the end of the day I think it will be easiest to just have some kind of per-instance hidden column or table that collapses fdev_ids into near-sequential smaller numbers. I'll table those for an actual presentation of implementation and thinking when I have the opportunity to put it together. |
The reason there's no progress display during the listings.csv download is that @Tromador's server isn't giving a size on the download, and the transfers.py code decides not to try to represent the progress. The rich progress.py changes I have in the "import-via-table" branch at least have a go at this, and we can improve on it; and the Processing stage is night-and-day. The nice thing about using rich is that the progress stuff happens in a thread, so you aren't bleeding away performance in the tight loop. (video: Recording.2024-04-28.213320.mp4) It also uses these to let you know it's still working when it gets to rebuilding the index at the end, too. |
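Roughly the shape of it; with rich's default auto-refresh the repainting happens on a background thread, so the loop itself only pays for advance(). The names below are placeholders, not the actual import-via-table code:

```python
from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn, TimeElapsedColumn

rows = range(100_000)   # placeholder for parsed listings rows

with Progress(
    SpinnerColumn(),
    TextColumn("[progress.description]{task.description}"),
    BarColumn(),
    TimeElapsedColumn(),
) as progress:
    task = progress.add_task("Processing listings", total=len(rows))
    for row in rows:
        pass                      # placeholder for the per-row work
        progress.advance(task)
```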
(didn't mean to throw you under the bus @Tromador - I should have said: our call to trom's server for the listings.csv isn't getting length information -- likely that there's a header missing in the request or something) |
Not sure what the pattern is for fdev IDs, but you can remove the starting empty spots by just reducing all IDs by whatever the lowest fdev ID is. I.e. if the IDs start at 128,001, just offset all IDs by 128,000 as your zero index. From there, though, it's still not necessarily efficient to use a bitmap: are IDs always increasing and densely packed? If they aren't, and there are holes in the range, you'd need to non-trivially map them by order, 128,001 -> 0, 128,010 -> 1, etc., and that has no easy conversion back that doesn't involve an array access at least (have an array of IDs matching indices to bitmap positions). Which then runs into the next question: if it's sparsely populated, does fdev fill in holes later? Because if they ever fill in a hole, well... you have to completely rebuild that mapping, which is not fun to update an existing database to handle, though perhaps a worthwhile one-time cost if it has enough performance benefits and is considered a rare event.
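A sketch of the offset/ordering idea: collapse the sparse fdev ids into dense positions once, and keep the sorted list around so the reverse lookup stays possible (the ids below are invented):

```python
station_ids = sorted([128000001, 128000010, 128049552])   # invented fdev ids, as pulled from Station
index_of = {fdev_id: i for i, fdev_id in enumerate(station_ids)}   # id -> bit position
# reverse lookup is just station_ids[i], so "bit 2 is set" stays meaningful

def stations_bitmap(selling_ids):
    """Pack 'these stations sell this item' into len(station_ids) bits."""
    bits = bytearray((len(station_ids) + 7) // 8)
    for fdev_id in selling_ids:
        i = index_of[fdev_id]
        bits[i // 8] |= 1 << (i % 8)
    return bytes(bits)

def sells_here(bitmap, fdev_id):
    i = index_of[fdev_id]
    return bool(bitmap[i // 8] & (1 << (i % 8)))

bitmap = stations_bitmap([128000001, 128049552])
assert sells_here(bitmap, 128049552) and not sells_here(bitmap, 128000010)
```

As noted above, the mapping has to be rebuilt (or very carefully appended to) whenever a new id shows up, which is the real catch.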
Also, in case that was in response to my talking about progress above, I was actually referring to the DB rebuild steps / prices file processing steps, less so the downloads. But your prototype for more progress messages looks great. I noticed that new UI when spinning up the listener earlier today (though had to turn it off shortly after as it seems to download the spansh dump every hour even if it was listening to the EDDN the entire time in between, which is what spansh uses too?? I'm guessing there's some edge case requiring that...). |
| I was actually referring to the DB rebuild steps / prices file processing steps

The 10s video I posted is actually it processing the csv file, not downloading it. I didn't record the tail part because that's the part I'm working on, and at the moment it just sits there for 25 minutes showing this:
(but the spinner is animated the entire time, as is the timer) @eyeonus the cause of this is these two indexes in particular: I can look at splitting the data into two tables so it doesn't need that index, if that's useful. It'll mean my doing some diligence on the various commands that access price information from the StationItem table to get it right. |
That's certainly possible, the header we pass is |
I'm fine with using a row_id if you think it'll help things. StationItem currently uses (station_id, item_id) as the PK, so adding a row_id to the table shouldn't bork anything, except maybe the two unique ids thing with the export? |
They would be something totally transient/private to the data set, almost even to the particular generation. I've been running various experiments/investigations, but I probably need to actually create a couple of branches in my fork and try some of them out in more detail.

For instance - I took the hash-bucket thing for a spin, trying to find if there are numbers that would actually work for us, and I actually found some cheap algorithms for transforming the range of fdev_ids in my local Station table into a 0-N range of unique numbers that essentially required 3 bits of bitmap per station to be able to create a bitmap for each item saying "these are all my stations" - so the indexes we currently have could be condensed down from 64-bits-per-station-vs-item to ~3-bits-per-station on each item's row. I ran an import, which added a station ... ran my test again, and the new station collided with another id after running the algo again. son of a...

there's a python version here https://gist.github.com/kfsone/f7f7de6d7c0216cf44cdf97b526a474e and a rust version (cause the python version was gonna take several hours to run, I should have tried it under pypy instead) https://github.com/kfsone/tinker/tree/rust

it's late, I'm blathering ... sorry. Current big hurt has to be the two indexes that have to be present on the StationItem table to make it practical for buy/sell to distinguish only rows important to them. At a smaller scale, it was actually better having one table with sell/buy prices because you avoided a join, but now it forces us to maintain a massive conditional index. power hurt assemble

So I can focus on splitting that into two tables on that import-via-table branch over the next several days, if you like. It currently has the new progress bar stuff in it because I wanted/needed the extra visibility while running the long tests.
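For concreteness, the sort of split being proposed might look like this; illustrative DDL only, not TD's actual schema (the real StationItem carries more columns than shown):

```python
import sqlite3

db = sqlite3.connect(":memory:")   # scratch db; the DDL is the point
db.executescript("""
CREATE TABLE StationSelling (
    station_id INTEGER NOT NULL,
    item_id    INTEGER NOT NULL,
    price      INTEGER NOT NULL,
    units      INTEGER NOT NULL,
    PRIMARY KEY (station_id, item_id)
) WITHOUT ROWID;

CREATE TABLE StationBuying (
    station_id INTEGER NOT NULL,
    item_id    INTEGER NOT NULL,
    price      INTEGER NOT NULL,
    units      INTEGER NOT NULL,
    PRIMARY KEY (station_id, item_id)
) WITHOUT ROWID;

-- plain per-table indexes replace one big conditional index over a combined table
CREATE INDEX idx_selling_by_item ON StationSelling (item_id);
CREATE INDEX idx_buying_by_item  ON StationBuying (item_id);
""")
```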
I had written various bits and pieces of this stuff into TD early on, but I'm awful at UI so none of it is great, and switching it over to using rich should improve quality/appearance/speed and reduce the amount of maintainable code at the same time. |
That all sounds great. Feel free to work on the two table thing, and if it turns out to be much better than the current implementation, I consider that a win. |
So long as it's an old school Routemaster, I will happily be under the bus. The reason the server doesn't send a content length header is because it streams listings.csv as gzipped content, and because it's doing the compression on the fly, it doesn't actually know the size of the data it's ultimately going to send. I guess the listener could gzip listings.csv and then apache would know how much data it's sending? I have a terrible feeling of deja vu here. See, I have sarcoidosis and one of the many symptoms is brain fog and I have become much more forgetful than I once was, but I have an itch in the back of my brain which suggests we've been here before. Why did we choose to have the webserver do on the fly compression instead of just creating compressed files? |
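If that route were taken, the listener side is only a few lines - gzip the file once after writing it, so Apache can serve a static .gz whose size it knows (paths invented for the sketch):

```python
import gzip
import shutil

# After the listener rewrites listings.csv, produce listings.csv.gz alongside it.
with open("listings.csv", "rb") as src, gzip.open("listings.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```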
I guess if the listener compressed the files and apache sent listings.csv.gz, then eddblink would have to uncompress it. As it is, the download is handled by urllib, which reads the headers and uncompresses on the fly as the file downloads. Ultimately it's about saving a lot of bandwidth (and thus download time) given the innate compressibility of text. |
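The client side stays simple either way; with requests, for example, the gzip layer is undone transparently while streaming (the URL below is a placeholder, not the real server path):

```python
import requests

url = "https://server.example/listings.csv"   # placeholder URL
with requests.get(url, stream=True, timeout=90) as resp:
    resp.raise_for_status()
    with open("listings.csv", "wb") as out:
        # iter_content() inflates Content-Encoding: gzip on the fly as it reads
        for chunk in resp.iter_content(chunk_size=1 << 20):
            out.write(chunk)
```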
I name this branch: make imports lightning fast |
Yeah, that's totally my fault. I didn't know any other way to do it. |
Server hosting kindly donated by Alpha Omega Computers Ltd who are in the UK. Until I got sick, I was a company director there. The deal is that I get free hosting and they get to call me to ask dumb questions once in a while. |
Pretty sure that 'text' is not a valid encoding standard - see the IANA HTTP content coding registry. If it is allowed (or works anyway), make sure you don't use it for the download, as you will be telling the server you only grok plain text and can't accept compressed content, and it will presumably comply, sending the whole thing in the requested, uncompressed format. |
I tested it, and it didn't give any errors. Progress bar showed up and everything. |
@Tromador Just wasn't sure how to tell it to not encode. And this is only for an HTTP HEAD request:
Should probably give transfers.py the ability to open and maintain a "Connection" so that it can turn around the queries more rapidly - at the minute it has to do the https handshake on all of them which takes time and costs cpu/bandwidth it doesn't strictly need to, but I don't think it's hurting anyone atm so nice +19 :) |
I think you want 'identity' then, which says "I want the original with no encoding whatsoever". @eyeonus It may work, but that doesn't make it right, I have been through the RFC and it refers right back to the list in the link I gave above. Doing things correctly per standards is important for the day such a loophole is plugged in a patch and then suddenly it doesn't work. I had a good trawl on the net looking for "text" in the context of an "accept-encoding" header and can't find it. If you can, I am happy to be educated. |
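In requests terms that would be roughly the following (placeholder URL; the point is just the header value):

```python
import requests

resp = requests.head(
    "https://server.example/listings.csv",             # placeholder
    headers={"Accept-Encoding": "identity"},            # "no transformation", per the IANA registry
    timeout=30,
)
content_length = resp.headers.get("Content-Length")     # only meaningful if the server sends it
```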
No, in this case I'm the one to be educated. |
No! It were me as wuz educated. Ready for a round of "wen ah were a lad?" |
Well, I am a Yorkshireman :) |
I'm from a place called Grimsby which is neither Up North, Down South, Off East or Out West. It has its own ordinality: Theranaz. People from Grimsby are Up-In Theranaz.
-- Douglas Adams, Meaning of Liff |
@kfsone I'm in the process of making the spansh plugin capable of filling the Ship, ShipVendor, Upgrade, and UpgradeVendor tables from the spansh data. Because of the data that is (not) provided by spansh for the ships (cost) and upgrades (weight, cost), I'm wanting to change those tables to include the information that is available. Current testing indicates that the new code works as expected on a clean run (no pre-existing DB), but with an already existing DB the tables won't match, which borks everything. Do you know if TD already has code to do this? If not, where would be the best place to have TD check if the DB needs updating, and if so drop the old table and add the new version? I could put it in the plugin itself, but it might be better to put it somewhere in tradedb or tradeenv? |
Ideally, I'd like to see if |
Yes there's a schema-migration system, but it's not particularly smart; basically it operates on the sqlite "pragma user_version". Take a look in cache.py. In a bit of karmic bloody-nosing while I was working on it (4 weeks ago?), we suddenly needed to change our sqlite schema at work for the first time in 15 years, and I ended up writing a more advanced version there, but I'll see if the boss is OK with me donating it. From the runbook here's a "change 3" which is recovering from a previous mistake but compounds it, and change 4 that fixes things finally.
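For anyone following along, the "pragma user_version" pattern is roughly this - a generic sketch, not cache.py's actual code or the work runbook, with invented migrations:

```python
import sqlite3

# Each entry bumps the schema exactly one step; migrations here are invented for the sketch.
MIGRATIONS = {
    1: "ALTER TABLE Ship ADD COLUMN cost INTEGER",
    2: "ALTER TABLE Upgrade ADD COLUMN weight REAL",
}

def migrate(db: sqlite3.Connection) -> None:
    current = db.execute("PRAGMA user_version").fetchone()[0]
    for version in sorted(v for v in MIGRATIONS if v > current):
        db.execute(MIGRATIONS[version])                    # apply one step...
        db.execute(f"PRAGMA user_version = {version}")     # ...then record it
        db.commit()

# Scratch demo:
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Ship (ship_id INTEGER PRIMARY KEY)")
db.execute("CREATE TABLE Upgrade (upgrade_id INTEGER PRIMARY KEY)")
migrate(db)
assert db.execute("PRAGMA user_version").fetchone()[0] == 2
```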
|
Not sure what the problem is, but there's a huge difference in processing time when doing an import using the EDDBlink plugin when the database file is large vs. small. To wit:
TradeDangerous.db file size: 6.5 GiB
listings-live.csv file size: 10.3 MiB
Time to completion: ~24 minutes
versus:
TradeDangerous.db file size: 43.7 MiB (empty StationItem table, otherwise identical to above database)
listings-live.csv file size: 10.3 MiB (Same file as above)
Time to completion: ~7 seconds
Everything is exactly the same in both cases except for the size of the StationItem table in the database being imported to.