Create index inside writer and transformation #319
Conversation
Force-pushed 2f2858d to e0f2173
Force-pushed 56a724c to 9e11dde
@liquidaty this is done from my POV; I'll continue work in another branch on top of this. EDIT: I added a retry function for testing because timeouts keep causing issues.
Force-pushed 14e1533 to f9eb48f
Thank you @richiejp. Would it be possible to set up some sort of simple benchmark test to measure the impact of this PR on index creation speed and/or other targeted metrics? This would be useful not only for understanding performance in general, but also for sanity-checking that the changes have the intended/expected effect.
Force-pushed 8e891c5 to dcb7935
Yes, I think it should be possible, but it's a little more difficult than I expected. So far I added a new expect function to all of the sheet tests (this dramatically sped up the tests), took measurements of each stage, and output the timings to a CSV file.
This is not very accurate, but if the indexing performance regressed by an order of magnitude, we would see the indexing time increase across all tests. Possibly we could check this data in to Git and look for regressions. Probably the best way to improve this for the least amount of effort would be to repeatedly open a file in the same sheet session and index it. Completely isolating the index code into a benchmark is tempting, but the full system performance is what the user sees. Another option would be to add a command line option to the CLI to output timings for various operations, or to instrument it. I'll think about it a bit more and probably add an extra test with repeated indexing unless you have some feedback.
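For illustration only, a minimal C sketch of what timing repeated index runs and emitting CSV rows could look like; `build_index()` is a hypothetical stand-in, not the actual zsv entry point:

```c
/* Minimal timing sketch: repeatedly index the same file and print one
 * CSV row per run. build_index() is a hypothetical stand-in for the
 * indexing entry point in this PR, not the real zsv API. */
#include <stdio.h>
#include <time.h>

extern int build_index(const char *path); /* hypothetical */

int main(int argc, char **argv) {
  const char *path = argc > 1 ? argv[1] : "test.csv";
  printf("run,seconds\n");
  for (int run = 0; run < 10; run++) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (build_index(path))
      return 1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d,%.3f\n", run, s);
  }
  return 0;
}
```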
Force-pushed dcb7935 to f334c84
@richiejp: The last three workflows are still running for @liquidaty:
Thank you, yes please add a timeout... start with 15 mins?
Thanks, that could be it; there are differences in the timeout command across platforms that I already ran into. @iamazeem BTW, I can't cancel the job.
@richiejp two thoughts come to mind. Apologies if I created a lot more work by not clarifying earlier.
What do you think?
@liquidaty makes sense! Sorry, I could have asked for clarification, but on the other hand, once I fix the infinite loop issue and test 6, this eliminates slow testing and manually setting sleep times, which have been an issue for me while testing changes to sheet. So I think the time will be recouped fairly quickly.
@richiejp: Canceled the last three workflows. CC: @liquidaty
Force-pushed f334c84 to 5e4994a
@iamazeem thanks, rebased on main.
@richiejp: For scripts, there's
Force-pushed 5e4994a to 17e9948
This creates the index while writing the file, saving an extra pass.
Helps with missing symbol errors.
Allows the index to be used in a thread-safe way, so that entries can be read in the main thread while more are being added in the worker thread.
This also reduces test time dramatically on faster hardware.
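As a rough illustration of the thread-safety model this commit describes (the struct and function names below are hypothetical, not the actual zsv code), a mutex-guarded index that the worker appends to while the main thread reads:

```c
/* Sketch of a mutex-guarded row index: the worker thread appends file
 * offsets while the main (UI) thread reads entries that already exist.
 * The struct layout and names are hypothetical, not the actual zsv code. */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

struct row_index {
  pthread_mutex_t lock;
  uint64_t *offsets; /* byte offset of each indexed row */
  size_t count, cap;
};

/* Worker thread: record the offset of a newly indexed row. */
void index_append(struct row_index *ix, uint64_t offset) {
  pthread_mutex_lock(&ix->lock);
  if (ix->count == ix->cap) { /* grow; error handling elided for brevity */
    ix->cap = ix->cap ? ix->cap * 2 : 1024;
    ix->offsets = realloc(ix->offsets, ix->cap * sizeof(*ix->offsets));
  }
  ix->offsets[ix->count++] = offset;
  pthread_mutex_unlock(&ix->lock);
}

/* Main thread: returns 1 and sets *offset if the row is already indexed. */
int index_lookup(struct row_index *ix, size_t row, uint64_t *offset) {
  int found = 0;
  pthread_mutex_lock(&ix->lock);
  if (row < ix->count) {
    *offset = ix->offsets[row];
    found = 1;
  }
  pthread_mutex_unlock(&ix->lock);
  return found;
}
```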
Force-pushed 190868c to a91ccd4
Allows us to wait for indexing to finish during testing without needing to sleep to avoid matching the initial status.
For some reason, timing `timeout` works, but timing out `time` does not.
Force-pushed a91ccd4 to 3c686f5
The UI buffer frees this on closing, and the indexing thread will use it, so it can't point to a stack variable or to memory that gets overwritten after the file is opened.
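A minimal sketch of the ownership rule this commit implies, assuming a hypothetical `start_indexer_thread()`: the path must be heap-allocated so it outlives the caller's stack frame:

```c
/* Sketch: duplicate the path onto the heap before starting the indexing
 * thread, so it stays valid after the caller's stack frame is gone.
 * start_indexer_thread() is a hypothetical stand-in, not the zsv API. */
#include <stdlib.h>
#include <string.h>

extern int start_indexer_thread(char *heap_owned_path); /* hypothetical */

int open_and_index(const char *path) {
  char *owned = strdup(path); /* indexing thread / UI buffer frees this */
  if (!owned)
    return -1;
  return start_indexer_thread(owned); /* do NOT pass `path` itself */
}
```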
Force-pushed a5d9b7d to ee8efe5
I added a benchmark test and ran it on a 13GB file; it takes approximately 36-38 seconds both with and without these changes on my dev laptop. For comparison, csvindex takes 11 seconds on my machine. @liquidaty the test-timestamp test failed on Mac; I haven't had a chance to look yet, but I don't think my changes should affect this feature, so it's possibly a random failure @CobbCoding1
Great work, that is very helpful. While it suggests this PR is within expectations (i.e. no worse than before), it raises the question of why the original implementation takes > 3x longer than the (single-threaded) csvindex run (which in theory should be a tad slower, since it also writes the index to file). Any thoughts as to what is driving that and how it might be modified to match csvindex performance?
Thanks, it turns out that increasing the read buffer size to 2MB reduces the time to ~13 seconds. This appears to be the optimal buffer size on my laptop, and I imagine it's because it reduces how often we read from the file and take locks to communicate progress with the main thread. I'd suggest the library default should be 2MB, or it could dynamically resize the buffer after reading from the stream a couple of times.
Thank you @richiejp, that is interesting. The original csvindex code only uses the default buffer size (256k), and increasing the buffer size with other zsv operations (e.g. count) has not had a similar effect. On my machine, the "count" operation scans about 1GB per second. Since the index update is only for user interaction, and assuming the average machine we want to target is 4x slower and an update every quarter second is probably well within acceptable limits, we would only have to update every 32MB or 64MB. If you haven't already explored this route, would you mind doing so?
@richiejp I'm finding a bug that appears to be related to the index implementation-- running
This reduces the indexing time by 3-4x on a 13GB file with a Ryzen 7 6800U.
Force-pushed 6e66cc2 to d8cc28b
Yes, that's a good point. I have just pushed a version which leaves the buffer size the same but only updates the UI every 32MB. Performance is about the same. I find it interesting that the buffer size has no other impact. I'm not sure whether I prefer this version, because it complicates the code slightly to save <2MB during indexing; however, I really don't feel strongly about that.
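For reference, a minimal sketch of the byte-threshold approach described above; `notify_main_thread()` is a hypothetical placeholder for whatever takes the lock and updates the UI:

```c
/* Sketch: report indexing progress to the main thread only once per
 * 32MB of input, so lock traffic is independent of the buffer size.
 * notify_main_thread() is hypothetical, not the actual zsv code. */
#include <stddef.h>
#include <stdint.h>

#define PROGRESS_INTERVAL (32 * 1024 * 1024) /* 32MB, per the discussion above */

extern void notify_main_thread(uint64_t bytes_indexed); /* hypothetical; takes the lock */

void on_chunk_indexed(uint64_t *bytes_since_update, uint64_t *total_bytes,
                      size_t chunk_len) {
  *total_bytes += chunk_len;
  *bytes_since_update += chunk_len;
  if (*bytes_since_update >= PROGRESS_INTERVAL) {
    notify_main_thread(*total_bytes); /* expensive part happens rarely */
    *bytes_since_update = 0;
  }
}
```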
My first thought is that there is a difference in how 2-byte line endings are handled, and maybe it needs to handle invalid UTF-8 just before a line ending. I'll take a look, thanks!
Actually, it appears to be related to line endings. Removing non-UTF-8 characters has no effect, but replacing '\r\n' with '\n' stops the error. Looking at
Also add a test on a file with only \r\n line endings, as the mixed line endings test did not reproduce this bug.
Force-pushed 7f56aaf to 5dd0e1a
I added a test for files with only \r\n line endings, which reproduces the bug. The CI failure appears to be unrelated.
Great! Thank you.
That is expected; the row handler is called immediately when the line end is detected, without waiting to process the next byte in the case of a 2-byte newline. csvindex uses zsv_peek() to handle this, which will return the next byte in a performant way (from the buffer if available, or, in the unlikely case that the first char of the line end was split exactly at the end of a buffer, from the next stream read).
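A sketch of that idea (`peek_byte()` is a hypothetical stand-in for zsv_peek(), whose real signature may differ): only advance past the '\n' when it completes a 2-byte line ending:

```c
/* Sketch of the peek-based fix: when a row ends in '\r', look ahead one
 * byte so a following '\n' is counted as part of the same line ending.
 * peek_byte() stands in for zsv_peek(); the real signature may differ. */
#include <stdint.h>

extern int peek_byte(void *stream); /* hypothetical: next byte, or -1 at EOF */

/* Returns the byte offset where the next row starts. */
uint64_t next_row_offset(void *stream, uint64_t row_end, int last_char) {
  uint64_t next = row_end + 1;
  if (last_char == '\r' && peek_byte(stream) == '\n')
    next++; /* consume the '\n' of a 2-byte "\r\n" ending */
  return next;
}
```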
One of the zsv design goals is to always minimize the memory footprint, in case anyone ever, for example, wants to run a zillion concurrent instances.
Who knows, maybe in the future we might want to try to parallelize indexing-- for example, this article describes a feasible way to parallelize without an initial sequential parse, but I haven't considered it because it is still not 100% accurate. But I could imagine a super-optimized (probably over-optimized) approach where the parallelized pass runs first, and then a background thread does a sanity check on its results (which can probably be basically instantaneous, as it only needs to check that the ending in-quote status of a chunk parse is equivalent to what the next chunk parse guessed as the correct status) and, in the rare case of an error, handles it via some retry or slow path.
We may also want to reconsider the internal index API to make it easily usable by other utilities, for example by changing the main thread update from being hardcoded in the indexer to being a progress callback that is part of the options passed to the indexer. Since that is a built-in zsv option, it may consolidate the related indexer code as well, and provide a more versatile API (in which case the lock handling would only need to exist in the caller code, which also seems to simplify things)-- but that can definitely be a different PR/feature, and the end goal might not be worth pursuing if it turns out that real-world utility is not compelling enough. I suspect it might be compelling enough, however, if only for searching/filtering.
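A rough sketch of what such an options-based progress callback might look like; all names here are hypothetical, not the current zsv API, and any locking would live inside the caller's callback:

```c
/* Sketch: move the main-thread update out of the indexer and into a
 * progress callback supplied via the options struct. Hypothetical names. */
#include <stdint.h>

struct index_opts {
  /* called periodically by the indexer; ctx is caller-owned, so any
   * lock handling lives entirely in the caller's callback */
  void (*on_progress)(void *ctx, uint64_t bytes_indexed);
  void *progress_ctx;
  uint64_t progress_interval; /* e.g. 32MB, as discussed above */
};

extern int build_index_opts(const char *path,
                            const struct index_opts *opts); /* hypothetical */
```

With this shape, other utilities could reuse the indexer and decide for themselves how (or whether) to report progress.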
@richiejp when I build/run locally, none of the tests using
I'm not sure how the CI is able to pass, but even so, if at some point the tests fail for valid reasons, we'll need the ability to examine and manually replicate what they are doing. For that purpose:
This creates the index while writing the file, saving an extra pass. In addition, it allows reading the index while a file is still being indexed or transformed.
TODO: