Improve concurrency with streams #330
Conversation
Previously we collected all inputs in one vector before checking the links, which is not ideal. Especially when reading many inputs (e.g. by using a glob pattern), this could cause issues like running out of file handles. By moving to streams we avoid that scenario. This is also the first step towards improving performance for many inputs.
I'm not 100% sure how the rework of the progressbar changes should be handled.
@TimoFreiberg your changes look good. I wonder how to proceed. Shall I just merge the changes from your branch into this PR, or shall we continue working on your fork?
Yes, the progressbar changes can be added here, too.
I think merging my changes into your branch is fine. I don't have access to
a computer this weekend so feel free to continue on your own :)
Because we can't know the number of links without blocking.

To stay as close as possible to the pre-stream behaviour, we want to stop processing as soon as an Err value appears in the stream. This is easiest when the stream is consumed in the main thread. Previously, the stream was consumed in a tokio task and the main thread waited for responses. Now, a tokio task waits for responses (and displays them / registers response stats) and the main thread sends links to the `ClientPool`. To ensure that the main thread waits for all responses to have arrived before finishing the `ProgressBar` and printing the stats, it waits for the `show_results_task` to finish.
Alright, I have a proposal for a minimal change in the tokio tasks here: 98cdfba
This is great. I'll merge your changes into my branch when I find a few minutes. 👍
@TimoFreiberg I merged all the changes from your branch.
Yeah, I'm pretty sure that's caused by my changes 😬. Re-organizing tasks and adding a join is pretty dangerous, I guess. I think I can debug it tomorrow.

Hmm, weird. It looks a lot like the Stream returned from …
Yeah, that might be the case!
Oh, the progressbar covered the last line of output. When testing, this worked for me:

```rust
futures::StreamExt::for_each_concurrent(
    ReceiverStream::new(recv_req),
    max_concurrency,
    |req| async {
        let resp = client.check(req).await.unwrap();
        send_resp.send(resp).await.unwrap();
    },
)
.await;
```

What do you think?
Worked nicely on my machine as well. Thanks for fixing that!
Great idea, and in the meantime that was added. I also merged a couple of commits for basic directory support. I want to land that together with streams, because it's way easier and more elegant to implement on top of them.
@mre since you're extending the API, which command line should I replace this with? I assume that "directory is now an input" means I can add it as a positional argument, and that the absence of globs may affect performance.
While this is a great improvement, and lychee generally seems to be in the same ballpark as hyperlink (somewhere between muffet/liche and hyperlink), for filesystem-only checking I think there are bigger areas of improvement for lychee than anything related to parallelism. See this flamegraph, an interactive SVG with embedded JS. Two big blocks stand out.

Note that I really have no clue which thread that flamegraph comes from; I hope all of them are represented. I sort of just ran …
I tested our two sites; here are the execution times (release => stream branch):

So yep, it's faster 👍. As can be seen, the number of checked links was also reduced somewhat, while the checked inputs didn't change. I know a cross-file cache wasn't implemented, but the cache was changed from a link cache to a request cache, if I see it right? Could it be related to that, i.e. some links which were seen as distinct before are now recognized as identical requests after processing?
If we can somehow get https://users.rust-lang.org/t/async-stream-with-item-result/68066 resolved, then I can push my local changes, which give me a cool 2x improvement.
@untitaker this should also work now (not tested). Note that this will include markdown files as well at the moment. I still need to add an override option for that. (Something like …)
About HTML parsing: at least for some feature requests it is required, either on the inputs or even on the target HTML document (#259, #185). But it can probably be done conditionally, only when an option/flag requires it.
I don't think it's required for anchors, but yes: if lychee doesn't consider itself to be feature complete, moving to a tokenizer limits future features too much.
What is meant is to also check whether the anchor the URL points to exists in the target document.
I understand, but that requires parsing more documents, not any particular kind of HTML parsing that a tokenizer wouldn't support. You need to find the start tag that contains the right id. Any start tag is fine; you don't even need to find the right one, and you don't need to understand the hierarchy of elements, for example.
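The point that no element hierarchy is needed can be sketched with a deliberately naive scan. This is not a real HTML tokenizer (it ignores escaping, unquoted attributes, and script/style contents); it only shows that looking at start tags in isolation is enough:

```rust
// Toy scan: walk over start tags and report whether any tag carries the
// given id. No knowledge of element nesting is required.
fn has_anchor(html: &str, id: &str) -> bool {
    let needle_dq = format!("id=\"{id}\"");
    let needle_sq = format!("id='{id}'");
    html.split('<').any(|chunk| {
        // Look only at the tag contents, i.e. everything before '>'.
        let tag = &chunk[..chunk.find('>').unwrap_or(chunk.len())];
        tag.contains(&needle_dq) || tag.contains(&needle_sq)
    })
}

fn main() {
    let html = r#"<html><body><h2 id="usage">Usage</h2></body></html>"#;
    println!("{}", has_anchor(html, "usage")); // prints "true"
}
```

A real implementation would use a proper tokenizer for correctness, but the per-tag structure of the work stays the same.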
Okay I think we're done here.
Using jwalk for directory traversal:
So in summary we're around 35% to 50% faster with the stream-based version. I honestly expected a bit more, but at least the architecture is more promising now, and with many files the cores get fully used from the beginning. This way the startup time is much shorter, and it shouldn't run out of memory and get stuck anymore. I'm planning to merge this tomorrow, so this is the "final comment period" for the changeset. Everything else that isn't covered yet (optimizing the extractor, caching) will be handled by a separate PR. Thanks all for your support!
🥳 🥳 🥳 |
The goal of this effort is to avoid blocking the runtime as much as possible. This is a multi-step approach:

- Change `input::get_contents` from returning `Vec<InputContent>` to `Stream<Item=Result<InputContent>>`.
- Change `Collector::collect_links` from returning `Result<HashSet<Request>>` to `Stream<Item=Result<Request>>`. (1)
- Add back the connection pool (#355): remove `ClientPool`, as reqwest already has a built-in pool.
- Clean up `main.rs` by removing mpsc channels and iterating over streams directly.
- Optional: move the link cache out of the collector and use a XOR filter instead of a `HashSet`; it uses less memory. See #193, #21 (comment).
@TimoFreiberg fyi