
Create index inside writer and transformation #319

Merged
merged 13 commits into from
Dec 11, 2024

Conversation

richiejp
Contributor

@richiejp richiejp commented Dec 3, 2024

This creates the index while writing the file, saving an extra pass. In addition, it allows the index to be read while a file is still being indexed or transformed.

TODO:

  • Fix the extension example build where libzsv symbols are missing
  • Fix a bug where the cursor cannot reach the bottom row after a filter
  • Allow the index to be used before transformation or indexing is complete (if feasible)
  • Fix CI failures
  • Fix search on filter results being disallowed
  • Decide whether some flags are now redundant and can be removed

@richiejp richiejp force-pushed the transformation-index branch 3 times, most recently from 2f2858d to e0f2173 Compare December 3, 2024 17:11
@richiejp richiejp marked this pull request as ready for review December 3, 2024 17:12
@richiejp richiejp force-pushed the transformation-index branch 4 times, most recently from 56a724c to 9e11dde Compare December 4, 2024 15:53
@richiejp
Contributor Author

richiejp commented Dec 5, 2024

@liquidaty this is done from my POV, I'll continue work in another branch on top of this.

EDIT: I added a retry function for testing because timeouts kept causing issues.

@richiejp richiejp force-pushed the transformation-index branch 3 times, most recently from 14e1533 to f9eb48f Compare December 5, 2024 14:22
@liquidaty
Owner

Thank you @richiejp. Would it be possible to set up some sort of simple benchmark test to measure the impact of this PR on index creation speed and/or other targeted metrics? This will be useful not only for understanding performance in general, but also for sanity-checking that the changes have the intended/expected effect.

@richiejp richiejp force-pushed the transformation-index branch 2 times, most recently from 8e891c5 to dcb7935 Compare December 6, 2024 15:15
@richiejp
Contributor Author

richiejp commented Dec 6, 2024

Yes, I think it should be possible, but it's a little more difficult than I expected. So far I have added a new expect function to all of the sheet tests (this dramatically sped up the tests) and took measurements of each stage, then output the timings to a CSV file like this:

> cli pretty tmp/timings-2024-12-06-15-15-18.csv
Test         |Stage       |Time
test-sheet-2 |indexed     |0.04
test-sheet-2 |            |0.22
test-sheet-3 |indexed     |0.04
test-sheet-3 |            |0.15
test-sheet-4 |indexed     |0.04
test-sheet-4 |            |0.19
test-sheet-5 |            |0.04
test-sheet-6 |indexed     |0.04
test-sheet-6 |            |
test-sheet-7 |indexed     |0.04
test-sheet-7 |            |0.01
test-sheet-8 |indexed     |0.04
test-sheet-8 |filtered    |0.42
test-sheet-8 |            |0.01
test-sheet-9 |indexed     |0.04
test-sheet-9 |            |0.24
test-sheet-10|indexed     |0.04
test-sheet-10|filtered    |0.42
test-sheet-10|            |0.01
test-sheet-11|indexed     |0.05
test-sheet-11|            |0.27
test-sheet-12|indexed     |0.04
test-sheet-12|filtered-el |0.23
test-sheet-12|            |0.23
test-sheet-13|            |0.05
test-sheet-14|help        |0.05
test-sheet-14|            |0.25

This is not very accurate, but if the indexing performance regressed by an order of magnitude we would see the indexing time increase across all tests. Possibly we could check this data in to Git and look for regressions.

Probably the best way to improve this for the least amount of effort would be to repeatedly open a file in the same sheet session and index it. Completely isolating the index code into a benchmark is tempting, but the full system performance is what the user sees. Another option would be to add a command line option to the CLI to output timings for various operations, or to instrument it.

I'll think about it a bit more and probably add an extra test with repeated indexing unless you have some feedback.

@richiejp richiejp force-pushed the transformation-index branch from dcb7935 to f334c84 Compare December 6, 2024 16:03
@iamazeem
Collaborator

iamazeem commented Dec 6, 2024

@richiejp: The last three workflows are still running for the ci-musl job.
Workflows: https://github.com/liquidaty/zsv/actions
Looks like an infinite loop.
Apparently, it's something to do with the test-expect.sh and/or test-retry-capture-cmp.sh scripts. 🤔
The ci-musl job runs in an alpine:latest container whose default shell is ash, not sh.
I haven't looked into it much, though. You might want to cancel the workflows.

@liquidaty:
A timeout could be added to avoid such accidentally long-running workflows, e.g. 30 minutes max.
What do you think?
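For reference, GitHub Actions supports a per-job timeout via `timeout-minutes`. A minimal sketch of what that could look like for this job (the job and container details are assumptions based on the discussion above):

```yaml
jobs:
  ci-musl:
    runs-on: ubuntu-latest
    timeout-minutes: 30   # cancel the job automatically if it hangs
    container: alpine:latest
```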

@liquidaty
Owner

Thank you, yes please add a timeout... start with 15 mins?

@richiejp
Contributor Author

richiejp commented Dec 6, 2024

its default shell is ash, not sh.

Thanks @iamazeem, that could be it; there are differences in the timeout command across platforms that I have already run into.

BTW, I can't cancel the job.

@liquidaty
Owner

@richiejp two thoughts come to mind. Apologies if I created a lot more work by not clarifying earlier.

  1. Can this be simplified and clearly demonstrated by just manually running 1 test, whereby two different versions are built (one before changes, one after), and then their performance is compared on a very large file, over say 10 runs?
  2. A separate concern is whether the tests should be part of the automated test suite. Of course, it is nice if they can be, but for this type of test I think it's acceptable if they are not, given that it requires two different builds. Even if it's just a one-time test manually run prior to the code merge that is not easily replicable later (maybe because the before-change build is no longer supported), I think that could still be acceptable.

What do you think?

@richiejp
Contributor Author

richiejp commented Dec 6, 2024

@liquidaty makes sense!

Sorry, I could have asked for clarification, but on the other hand, once I fix the infinite loop issue and test 6, this eliminates the slow testing and the manual tuning of sleep times, which have been an issue for me while testing changes to sheet. So I think the time will be recouped fairly quickly.

@iamazeem
Collaborator

iamazeem commented Dec 6, 2024

@richiejp: Canceled the last three workflows.
15m timeout added with 90d7bc1 for CI jobs.
Pushed to main.
Please sync this PR. Thanks!

CC: @liquidaty

@richiejp richiejp force-pushed the transformation-index branch from f334c84 to 5e4994a Compare December 6, 2024 19:11
@richiejp
Contributor Author

richiejp commented Dec 6, 2024

@iamazeem thanks, rebased on main.

@iamazeem
Collaborator

iamazeem commented Dec 6, 2024

@richiejp: For scripts, there's a shellcheck report on the workflow run summary page.
For the latest workflow run, here it is: https://github.com/liquidaty/zsv/actions/runs/12204702991/attempts/1#summary-34050376445.
Currently, the CI pipeline just reports the issues rather than failing on them (making it fail is a TODO for the future).
Please fix the shell script issues if it's decided to push those scripts.
VSCode has a shellcheck extension; I'm not sure about other editors.
The online version (https://www.shellcheck.net/) may also be used to identify and fix issues.

@richiejp richiejp force-pushed the transformation-index branch from 5e4994a to 17e9948 Compare December 9, 2024 09:35
This creates the index while writing the file, saving an extra pass
Helps with missing symbol errors
Allows the index to be used in a thread-safe way, so that entries can
be read in the main thread while more are being added in the worker thread.
This also reduces test time dramatically on faster hardware.
@richiejp richiejp force-pushed the transformation-index branch 3 times, most recently from 190868c to a91ccd4 Compare December 9, 2024 10:59
Allows us to wait for indexing to finish during testing without
needing to sleep to avoid matching the initial status.
For some reason timing timeout works but not timing out time
@richiejp richiejp force-pushed the transformation-index branch from a91ccd4 to 3c686f5 Compare December 9, 2024 11:42
UI buffer frees this on closing and the indexing thread will use it so
it can't point to a stack variable or memory that gets overwritten
after the file is opened.
@richiejp richiejp force-pushed the transformation-index branch from a5d9b7d to ee8efe5 Compare December 9, 2024 13:14
@richiejp
Contributor Author

richiejp commented Dec 9, 2024

I added a benchmark test and ran it on a 13GB file; it takes approximately 36-38 seconds both with and without these changes on my dev laptop. For comparison, csvindex takes 11 seconds on my machine. @liquidaty

The test-timestamp test failed on Mac. I haven't had a chance to look yet, but I don't think my changes should affect this feature, so it's possibly a random failure. @CobbCoding1

@liquidaty
Owner

approx between 36-38 seconds with and without these changes on my dev laptop. For comparison csvindex takes 11 seconds on my machine

Great work, that is very helpful. While it suggests this PR is within expectations (i.e. no worse than before), it does beg the question of why the original implementation is taking > 3x longer than the (single-threaded) csvindex run (which in theory should be a tad slower since it also writes the index to file). Any thoughts as to what is driving that and how it might be modified to match csvindex performance?

@richiejp
Contributor Author

Thanks, it turns out that increasing the read buffer size to 2MB reduces the time to ~13 seconds. This appears to be the optimal buffer size on my laptop, and I imagine it's because it reduces how often we read from the file and take locks to communicate progress with the main thread.

I'd suggest the library default should be 2MB, or it could dynamically resize the buffer after reading from the stream a couple of times.

@liquidaty
Owner

Thank you @richiejp, that is interesting.

The original csvindex code only uses the default buffer size (256k), and increasing the buffer size with other zsv operations (e.g. zsv count bigfile.csv --buffsize 2097152) does not have nearly the same performance impact (though there is a little, maybe ~2%). Given that, it seems possible that the root performance issue is primarily the locks, not the buffer size (though increasing the latter decreases the frequency of the former). If that is the case, the same impact as increasing the buffer size, but without any memory increase, could be achieved by limiting the locking inside build_memory_index() to only occur after each [2MB] chunk has been scanned (e.g. using zsv_scanned_length()).

On my machine, the "count" operation scans about 1GB per second. Since the index update is only for user interaction, and assuming the average machine we want to target is 4x slower and an update every quarter second is well within acceptable limits, we would only need to update every 32MB or 64MB.

If you haven't already explored this route, would you mind doing so?

@liquidaty
Owner

@richiejp I'm finding a bug that appears to be related to the index implementation: running sheet on the attached file will cause indexer errors, as it appears to be miscalculating the line ending on/after line 4401. I've tried using the same file with csvindex and it appears to work correctly (using e.g. csvindex slice), so maybe the cause can be identified by comparing their implementations. Possibly it is related to invalid utf8 in the input data around byte 155825, and while that is unfortunate, it shouldn't break the indexer. Could you take a look?
converted.csv

This reduces the indexing time by 3-4x on a 13GB file with a Ryzen 7 6800U.
@richiejp richiejp force-pushed the transformation-index branch from 6e66cc2 to d8cc28b Compare December 11, 2024 09:24
@richiejp
Contributor Author

assuming the average machine we want to target is 4x slower and an update every quarter second is probably well within acceptable limits, then we could only have to update every 32MB or 64MB.

Yes, that's a good point. I have just pushed a version which leaves the buffer size the same but only updates the UI every 32MB. Performance is about the same.

I find it interesting that the buffer size has no other impact. I'm not sure whether I prefer this version, because it complicates the code slightly to save <2MB during indexing; however, I really don't feel strongly about that.

Possibly it is related to invalid utf8 in the input data around byte 155825, and while that is unfortunate, it shouldn't break the indexer

My first thought is that there is a difference in how 2-byte line endings are handled and maybe it needs to handle invalid utf8 just before a line ending. I'll take a look, thanks!

@richiejp
Contributor Author

Actually, it appears to be related to line endings. Removing the non-utf8 bytes has no effect, but replacing '\r\n' with '\n' stops the error. Looking at seek_and_check_newline, it appears it never seeks on a two-byte ending, which is my mistake.

zsv_cum_scanned_length returns one character less than what is needed when the parser starts on \r\n, so the second call to seek_and_check_newline fails. I'm not sure why that is or whether it is expected. In any case, the bug should be fixed with the commit I added.

Also add a test on a file with only \r\n line endings as the mixed
line endings test did not reproduce this bug.
@richiejp richiejp force-pushed the transformation-index branch from 7f56aaf to 5dd0e1a Compare December 11, 2024 13:33
@richiejp
Contributor Author

I added a test for files with only \r\n line endings, which reproduces the bug. The CI failure appears to be unrelated.

@liquidaty
Owner

Great! Thank you.

zsv_cum_scanned_length returns one less character than what would be needed when the parser starts on \r\n

That is expected; the row handler is called immediately when the line end is detected, without waiting to process the next byte in the case of a 2-byte newline. csvindex uses zsv_peek() to handle this, which returns the next byte in a performant way (from the buffer if available, or, in the unlikely case that the two-byte line end was split exactly at the end of a buffer, from the next stream read).

@liquidaty
Owner

I find it interesting that the buffer size has no other impact, but I'm not sure whether I prefer this version because it complicates the code slightly to save <2MB during indexing, however I really don't feel strongly about that.

One of the zsv design goals is to always minimize the memory footprint, in case anyone ever, for example, wants to run a zillion concurrent instances. Who knows, maybe in the future we might want to try to parallelize indexing: for example, this article describes a feasible way to parallelize without an initial sequential parse, but I haven't considered it because it is still not 100% accurate. But I could imagine a super-optimized (probably over-optimized) approach where the parallelized pass is used first, and then a background thread does a sanity check on its results (which can probably be basically instantaneous, as it only needs to check that the ending in-quote status of each chunk parse is equivalent to what the next chunk parse guessed as the correct status), and in the rare case of an error, handles it via some retry or slow path.

We may also want to reconsider the internal index API to make it easily usable by other utilities, for example by changing the main-thread update from being hardcoded in the indexer to being a progress callback passed as part of the indexer options. Since that is a built-in zsv option, it may consolidate the related indexer code as well, and provide a more versatile API (in which case the lock handling would only need to exist in the caller code, which also seems simpler). But that can definitely be a different PR/feature, and the end goal might not be worth pursuing if it turns out that real-world utility is not compelling enough. I suspect it might be compelling enough, however, if only for searching/filtering.

@liquidaty liquidaty merged commit 417e27d into liquidaty:main Dec 11, 2024
13 checks passed
@liquidaty
Owner

@richiejp when I build/run locally, none of the tests using scripts/test-expect.sh pass; they all fail with a message similar to:

Incorrect output:
cat: /path/to/src/zsv/tmp/test-sheet-2-indexed.out: No such file or directory

I'm not sure how the CI is able to pass, but even so, at some point if the tests fail for valid reasons, we'll need the ability to examine and manually replicate what they are doing. For that purpose:

  1. Can the content of scripts/test-expect.sh instead be moved into the Makefile, or otherwise made more transparent?

  2. I can't easily tell from looking at the script (or the scripts it calls) what it is doing or why. To the extent any steps are left inside a script, can more commenting, reorganization of the scripts, and/or other steps be taken to make it more transparent what the script is doing and why? At the least, it would be immensely helpful to have some sort of VERBOSE option that prints out all the commands it runs before its final comparison test.
