
Create index inside writer and transformation #319

Merged
merged 13 commits into from
Dec 11, 2024

Conversation

richiejp
Contributor

@richiejp richiejp commented Dec 3, 2024

This creates the index while writing the file, saving an extra pass. In addition, it allows the index to be read while a file is still being indexed or transformed.

TODO:

  • Fix the extension example build where libzsv symbols are missing
  • Fix a bug where the cursor cannot reach the bottom row after a filter
  • Allow the index to be used before transformation or indexing is complete (if feasible)
  • Fix CI failures
  • Fix search on filter results being disallowed
  • Decide whether some flags are now redundant and can be removed

@richiejp richiejp force-pushed the transformation-index branch 3 times, most recently from 2f2858d to e0f2173 Compare December 3, 2024 17:11
@richiejp richiejp marked this pull request as ready for review December 3, 2024 17:12
@richiejp richiejp force-pushed the transformation-index branch 4 times, most recently from 56a724c to 9e11dde Compare December 4, 2024 15:53
@richiejp
Contributor Author

richiejp commented Dec 5, 2024

@liquidaty this is done from my POV, I'll continue work in another branch on top of this.

EDIT: I added a retry function for testing because timeouts kept causing issues.

@richiejp richiejp force-pushed the transformation-index branch 3 times, most recently from 14e1533 to f9eb48f Compare December 5, 2024 14:22
@liquidaty
Owner

Thank you @richiejp. Would it be possible to set up some sort of simple benchmark test to measure the impact of this PR on index creation speed and/or other targeted metrics? This will be useful not only for understanding performance in general, but also for sanity-checking that the changes have the intended/expected effect.

@richiejp richiejp force-pushed the transformation-index branch 2 times, most recently from 8e891c5 to dcb7935 Compare December 6, 2024 15:15
@richiejp
Contributor Author

richiejp commented Dec 6, 2024

Yes, I think it should be possible, but it's a little more difficult than I expected. So far I have added a new expect function to all of the sheet tests (this dramatically sped up the tests) and took measurements of each stage, then output the timings to a CSV file like this:

> cli pretty tmp/timings-2024-12-06-15-15-18.csv
Test         |Stage       |Time
test-sheet-2 |indexed     |0.04
test-sheet-2 |            |0.22
test-sheet-3 |indexed     |0.04
test-sheet-3 |            |0.15
test-sheet-4 |indexed     |0.04
test-sheet-4 |            |0.19
test-sheet-5 |            |0.04
test-sheet-6 |indexed     |0.04
test-sheet-6 |            |
test-sheet-7 |indexed     |0.04
test-sheet-7 |            |0.01
test-sheet-8 |indexed     |0.04
test-sheet-8 |filtered    |0.42
test-sheet-8 |            |0.01
test-sheet-9 |indexed     |0.04
test-sheet-9 |            |0.24
test-sheet-10|indexed     |0.04
test-sheet-10|filtered    |0.42
test-sheet-10|            |0.01
test-sheet-11|indexed     |0.05
test-sheet-11|            |0.27
test-sheet-12|indexed     |0.04
test-sheet-12|filtered-el |0.23
test-sheet-12|            |0.23
test-sheet-13|            |0.05
test-sheet-14|help        |0.05
test-sheet-14|            |0.25

This is not very accurate, but if the indexing performance regressed by an order of magnitude we would see the indexing time increase across all tests. Possibly we could check this data in to Git and look for regressions.

Probably the best way to improve this for the least amount of effort would be to repeatedly open a file in the same sheet session and index it. Completely isolating the index code into a benchmark is tempting, but the full system performance is what the user sees. Another option would be to add a command line option to the CLI to output timings for various operations, or to instrument it.

I'll think about it a bit more and probably add an extra test with repeated indexing unless you have some feedback.

@richiejp richiejp force-pushed the transformation-index branch from dcb7935 to f334c84 Compare December 6, 2024 16:03
@iamazeem
Collaborator

iamazeem commented Dec 6, 2024

@richiejp: The last three workflows are still running for the ci-musl job.
Workflows: https://github.com/liquidaty/zsv/actions
Looks like an infinite loop.
Apparently, it's something to do with the test-expect.sh and/or test-retry-capture-cmp.sh scripts. 🤔
The ci-musl job runs in an alpine:latest container whose default shell is ash, not sh.
I haven't looked into it much, though. You might want to cancel the workflows.

@liquidaty:
A timeout could be added to avoid such accidentally long-running workflows, e.g. 30 minutes max.
What do you think?
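For reference, GitHub Actions supports a per-job timeout via `timeout-minutes`. A minimal sketch of what that could look like for this job (the job and container details are assumptions based on the discussion above):

```yaml
jobs:
  ci-musl:
    runs-on: ubuntu-latest
    timeout-minutes: 30   # cancel the job automatically if it hangs
    container: alpine:latest
```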

@liquidaty
Owner

Thank you, yes please add a timeout... start with 15 mins?

@richiejp
Contributor Author

richiejp commented Dec 6, 2024

its default shell is ash, not sh.

Thanks @iamazeem, that could be it; there are differences in the timeout command across platforms that I have already run into.

BTW, I can't cancel the job.

@liquidaty
Owner

@richiejp two thoughts come to mind. Apologies if I created a lot more work by not clarifying earlier.

  1. Can this be simplified and clearly demonstrated by just manually running 1 test, whereby two different versions are built (one before changes, one after), and then their performance is compared on a very large file, over say 10 runs?
  2. A separate concern is whether the tests should be part of the automated test suite. Of course, it is nice if they can be, but for this type of test I think it's acceptable if they are not, given that it requires two different builds. Even if it's just a one-time test manually run prior to the code merge that is not easily replicable later (maybe because the before-change build is no longer supported), I think that could still be acceptable.

What do you think?

@richiejp
Contributor Author

richiejp commented Dec 6, 2024

@liquidaty makes sense!

Sorry, I could have asked for clarification, but on the other hand, once I fix the infinite loop issue and test 6, this eliminates the slow testing and the manual tuning of sleep times, which have been an issue for me while testing changes to sheet. So I think the time will be recouped fairly quickly.

@iamazeem
Collaborator

iamazeem commented Dec 6, 2024

@richiejp: Canceled the last three workflows.
15m timeout added with 90d7bc1 for CI jobs.
Pushed to main.
Please sync this PR. Thanks!

CC: @liquidaty

@richiejp richiejp force-pushed the transformation-index branch from f334c84 to 5e4994a Compare December 6, 2024 19:11
@richiejp
Contributor Author

richiejp commented Dec 6, 2024

@iamazeem thanks, rebased on main.

@iamazeem
Collaborator

iamazeem commented Dec 6, 2024

@richiejp: For scripts, there's a shellcheck report on the workflow run summary page.
For the latest workflow run, here it is: https://github.com/liquidaty/zsv/actions/runs/12204702991/attempts/1#summary-34050376445.
Currently, the CI pipeline just reports the issues rather than failing on them (making it fail is a TODO for the future).
Please fix the shell script issues if it's decided to push those scripts.
VSCode has a shellcheck extension; I'm not sure about other editors.
The online version (https://www.shellcheck.net/) may also be used to identify and fix issues.

@richiejp richiejp force-pushed the transformation-index branch from 5e4994a to 17e9948 Compare December 9, 2024 09:35
This creates the index while writing the file, saving an extra pass
Helps with missing symbol errors
Allows the index to be used in a thread-safe way, so that entries can
be read in the main thread while more are being added in the worker thread.
This also reduces test time dramatically on faster hardware.
@richiejp richiejp force-pushed the transformation-index branch 3 times, most recently from 190868c to a91ccd4 Compare December 9, 2024 10:59
Allows us to wait for indexing to finish during testing without
needing to sleep to avoid matching the initial status.
For some reason timing timeout works but not timing out time
@richiejp richiejp force-pushed the transformation-index branch from a91ccd4 to 3c686f5 Compare December 9, 2024 11:42
UI buffer frees this on closing and the indexing thread will use it so
it can't point to a stack variable or memory that gets overwritten
after the file is opened.
@richiejp richiejp force-pushed the transformation-index branch from a5d9b7d to ee8efe5 Compare December 9, 2024 13:14
@richiejp
Contributor Author

richiejp commented Dec 9, 2024

I added a benchmark test and ran it on a 13GB file; it takes approximately 36-38 seconds both with and without these changes on my dev laptop. For comparison, csvindex takes 11 seconds on my machine. @liquidaty

The test-timestamp test failed on Mac. I haven't had a chance to look yet, but I don't think my changes should affect this feature, so it's possibly a random failure. @CobbCoding1

@liquidaty
Owner

approx between 36-38 seconds with and without these changes on my dev laptop. For comparison csvindex takes 11 seconds on my machine

Great work, that is very helpful. While it suggests this PR is within expectations (i.e. no worse than before), it does beg the question of why the original implementation is taking > 3x longer than the (single-threaded) csvindex run (which in theory should be a tad slower since it also writes the index to file). Any thoughts as to what is driving that and how it might be modified to match csvindex performance?

@richiejp
Contributor Author

Thanks, it turns out that increasing the read buffer size to 2MB reduces the time to ~13 seconds. This appears to be the optimal buffer size on my laptop, and I imagine it's because it reduces how often we read from the file and take locks to communicate progress with the main thread.

I'd suggest the library default should be 2MB, or it could dynamically resize the buffer after reading from the stream a couple of times.

@liquidaty
Owner

Thank you @richiejp, that is interesting.

The original csvindex code only uses the default buffer size (256k), and increasing the buffer size with other zsv operations (e.g. zsv count bigfile.csv --buffsize 2097152) does not have nearly the same performance impact (though there is a little, maybe ~2%). Given that, it seems possible that the root performance issue is primarily the locks, not the buffer size (though increasing the latter decreases the frequency of the former). If that is the case, the same impact as increasing the buffer size, but without any memory increase, could be achieved by limiting the locking inside build_memory_index() to only occur after each [2MB] chunk has been scanned (e.g. using zsv_scanned_length()).

On my machine, the "count" operation scans about 1GB per second. Since the index update is only for user interaction, and assuming the average machine we want to target is 4x slower and an update every quarter second is well within acceptable limits, we would only need to update every 32MB or 64MB.

If you haven't already explored this route, would you mind doing so?

@liquidaty
Owner

@richiejp I'm finding a bug that appears to be related to the index implementation: running sheet on the attached file will cause indexer errors, as it appears to be miscalculating the line ending on/after line 4401. I've tried using the same file with csvindex and it appears to work correctly (using e.g. csvindex slice), so maybe the cause can be identified by comparing their implementations. Possibly it is related to invalid utf8 in the input data around byte 155825, and while that is unfortunate, it shouldn't break the indexer. Could you take a look?
converted.csv

This reduces the indexing time by 3-4x on a 13GB file with a Ryzen 7 6800U.
@richiejp richiejp force-pushed the transformation-index branch from 6e66cc2 to d8cc28b Compare December 11, 2024 09:24
@richiejp
Contributor Author

assuming the average machine we want to target is 4x slower and an update every quarter second is probably well within acceptable limits, then we could only have to update every 32MB or 64MB.

Yes, that's a good point. I have just pushed a version which leaves the buffer size the same but only updates the UI every 32MB. Performance is about the same.

I find it interesting that the buffer size has no other impact. I'm not sure whether I prefer this version, because it complicates the code slightly to save <2MB during indexing; however, I really don't feel strongly about that.

Possibly it is related to invalid utf8 in the input data around byte 155825, and while that is unfortunate, it shouldn't break the indexer

My first thought is that there is a difference in how 2-byte line endings are handled and maybe it needs to handle invalid utf8 just before a line ending. I'll take a look, thanks!

@richiejp
Contributor Author

Actually, it appears to be related to line endings. Removing the non-utf8 bytes has no effect, but replacing '\r\n' with '\n' stops the error. Looking at seek_and_check_newline, it appears it never seeks on a two-byte ending, which is my mistake.

zsv_cum_scanned_length returns one character less than what is needed when the parser starts on \r\n, so the second call to seek_and_check_newline fails. I'm not sure why that is or whether it is expected. In any case, the bug should be fixed with the commit I added.

Also add a test on a file with only \r\n line endings as the mixed
line endings test did not reproduce this bug.
@richiejp richiejp force-pushed the transformation-index branch from 7f56aaf to 5dd0e1a Compare December 11, 2024 13:33
@richiejp
Contributor Author

I added a test for files with only \r\n line endings, which reproduces the bug. The CI failure appears to be unrelated.

@liquidaty
Owner

Great! Thank you.

zsv_cum_scanned_length returns one less character than what would be needed when the parser starts on \r\n

That is expected; the row handler is called immediately when the line end is detected, without waiting to process the next byte in the case of a 2-byte newline. csvindex uses zsv_peek() to handle this, which returns the next byte in a performant way (from the buffer if available, or, in the unlikely case that the two-byte line end was split exactly at the end of a buffer, from the next stream read).

@liquidaty
Owner

I find it interesting that the buffer size has no other impact, but I'm not sure whether I prefer this version because it complicates the code slightly to save <2MB during indexing, however I really don't feel strongly about that.

One of the zsv design goals is to always minimize the memory footprint, in case anyone ever, for example, wants to run a zillion concurrent instances. Who knows, maybe in the future we might want to try to parallelize indexing: for example, this article describes a feasible way to parallelize without an initial sequential parse, but I haven't considered it because it is still not 100% accurate. But I could imagine a super-optimized (probably over-optimized) approach where the parallelized pass is used first, and then a background thread does a sanity check on its results (which can probably be basically instantaneous, as it only needs to check that the ending in-quote status of each chunk parse is equivalent to what the next chunk parse guessed as the correct status), and in the rare case of an error, handles it via some retry or slow path.

We may also want to reconsider the internal index API to make it easily usable by other utilities, for example by changing the main-thread update from being hardcoded in the indexer to being a progress callback passed as part of the indexer options. Since that is a built-in zsv option, it may consolidate the related indexer code as well, and provide a more versatile API (in which case the lock handling would only need to exist in the caller code, which also seems simpler). But that can definitely be a different PR/feature, and the end goal might not be worth pursuing if it turns out that real-world utility is not compelling enough. I suspect it might be compelling enough, however, if only for searching/filtering.

@liquidaty liquidaty merged commit 417e27d into liquidaty:main Dec 11, 2024
13 checks passed
@liquidaty
Owner

@richiejp when I build/run locally, none of the tests using scripts/test-expect.sh pass; they all fail with a message similar to:

Incorrect output:
cat: /path/to/src/zsv/tmp/test-sheet-2-indexed.out: No such file or directory

I'm not sure how the CI is able to pass, but even so, at some point if the tests fail for valid reasons, we'll need the ability to examine and manually replicate what they are doing. For that purpose:

  1. Can the content of scripts/test-expect.sh instead be moved into the Makefile, or otherwise made more transparent?

  2. I can't easily tell from looking at the script (or the scripts it calls) what it is doing or why. To the extent any steps are left inside a script, can more commenting, reorganization of the scripts, and/or other steps be taken to make it more transparent what the script is doing and why? At the least, it would be immensely helpful to have some sort of VERBOSE option that prints out all the commands it runs before its final comparison test.
