Improve concurrency with streams #330
Conversation
Previously we collected all inputs in one vector before checking the links, which is not ideal. Especially when reading many inputs (e.g. by using a glob pattern), this could cause issues like running out of file handles. By moving to streams we avoid that scenario. This is also the first step towards improving performance for many inputs.
I'm not 100% sure how the rework of the progressbar changes should be handled.
@TimoFreiberg your changes look good. I wonder how to proceed. Shall I just merge the changes from your branch into this PR, or shall we continue working on your fork?
Yes, the progressbar changes can be added here, too.
I think merging my changes into your branch is fine. I don't have access to
a computer this weekend so feel free to continue on your own :)
Because we can't know the number of links without blocking.

To stay as close as possible to the pre-stream behaviour, we want to stop processing as soon as an Err value appears in the stream. This is easiest when the stream is consumed in the main thread. Previously, the stream was consumed in a tokio task and the main thread waited for responses. Now, a tokio task waits for responses (and displays them / registers response stats) and the main thread sends links to the `ClientPool`. To ensure that the main thread waits for all responses to have arrived before finishing the `ProgressBar` and printing the stats, it waits for the `show_results_task` to finish.
Alright, I have a proposal for a minimal change in the tokio tasks here: 98cdfba
This is great. I'll merge your changes into my branch when I find a few minutes. 👍
@TimoFreiberg I merged all the changes from your branch.
Yeah, I'm pretty sure that's caused by my changes 😬. Re-organizing tasks and adding a join is pretty dangerous, I guess. I think I can debug it tomorrow.

Hmm, weird. It looks a lot like the Stream returned from …
Yeah, that might be the case!
Oh, the progressbar covered the last line of output. When testing, this worked for me:

```rust
futures::StreamExt::for_each_concurrent(
    ReceiverStream::new(recv_req),
    max_concurrency,
    |req| async {
        let resp = client.check(req).await.unwrap();
        send_resp.send(resp).await.unwrap();
    },
)
.await;
```

What do you think?
Worked nicely on my machine as well. Thanks for fixing that!
Great idea, and in the meantime that was added. I also merged a couple of commits for basic directory support. I want to land that together with streams, because it's way easier and more elegant to implement on top of them.
@mre since you're extending the API, which command line should I replace this with? I assume that "directory is now an input" means I can add it as a positional argument, and that the absence of globs may affect performance.
While this is a great improvement, and lychee generally seems to be in the same ballpark as hyperlink (somewhere between muffet/liche and hyperlink), for filesystem-only checking I think there are bigger areas of improvement for lychee than anything related to parallelism. See this flamegraph, an interactive SVG with embedded JS. Two big blocks stand out.

Note that I really have no clue which thread that flamegraph comes from; I hope all of them are represented. I sort of just ran …
I tested our two sites; here are the execution times (release => stream branch):

So yep, it's faster 👍. As can be seen, the number of checked links was also reduced somewhat, while the checked inputs didn't change. I know a cross-file cache wasn't implemented, but the cache was changed from a link cache to a request cache, if I see it right? Could it be related to that, i.e. some links which were seen as distinct before are now recognized as identical requests after processing?
If we can somehow get https://users.rust-lang.org/t/async-stream-with-item-result/68066 resolved, then I can push my local changes, which give me a cool 2x improvement.
@untitaker this should also work now (not tested). Note that this will include markdown files as well at the moment. I still need to add an override option for that. (Something like …)
About HTML parsing: at least for some feature requests it is required, either on the inputs or even on the target HTML document (#259, #185). But it can probably be done conditionally, only when an option/flag requires it.
I don't think it's required for anchors, but yes: if lychee doesn't consider itself to be feature complete, moving to a tokenizer limits future features too much.
What is meant is to also check whether the anchor the URL points to exists in the target document.
I understand, but that requires parsing more documents, not any particular kind of HTML parsing that a tokenizer wouldn't support. You need to find the start tag that contains the right id. Any start tag is fine; you don't even need to find the right one, and you don't need to understand the hierarchy of elements, for example.
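The point that no element hierarchy is needed can be sketched with a deliberately naive scan. This is not a real HTML tokenizer (it ignores escaping, unquoted attributes, and script/style contents); it only shows that looking at start tags in isolation is enough:

```rust
// Toy scan: walk over start tags and report whether any tag carries the
// given id. No knowledge of element nesting is required.
fn has_anchor(html: &str, id: &str) -> bool {
    let needle_dq = format!("id=\"{id}\"");
    let needle_sq = format!("id='{id}'");
    html.split('<').any(|chunk| {
        // Look only at the tag contents, i.e. everything before '>'.
        let tag = &chunk[..chunk.find('>').unwrap_or(chunk.len())];
        tag.contains(&needle_dq) || tag.contains(&needle_sq)
    })
}

fn main() {
    let html = r#"<html><body><h2 id="usage">Usage</h2></body></html>"#;
    println!("{}", has_anchor(html, "usage")); // prints "true"
}
```

A real implementation would use a proper tokenizer for correctness, but the per-tag structure of the work stays the same.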
Okay I think we're done here.
Using jwalk for directory traversal:
So in summary we're around 35% to 50% faster with the stream-based version. I honestly expected a bit more, but at least the architecture is more promising now, and with many files the cores get fully used from the beginning. This way the startup time is much shorter, and it shouldn't run out of memory and get stuck anymore. I'm planning to merge this tomorrow, so this is the "final comment period" for the changeset. Everything else that isn't covered yet (optimizing the extractor, caching) will be handled by a separate PR. Thanks all for your support!
🥳 🥳 🥳 |
The goal of this effort is to avoid blocking the runtime as much as possible. This is a multi-step approach:

- Change `input::get_contents` from returning `Vec<InputContent>` to `Stream<Item=Result<InputContent>>`.
- Change `Collector::collect_links` from returning `Result<HashSet<Request>>` to `Stream<Item=Result<Request>>`. (1)
- Add back the connection pool (#355): remove `ClientPool`, as reqwest already has a built-in pool.
- Clean up `main.rs` by removing mpsc channels and iterating over streams directly.
- Optional: move the link cache out of the collector and use a XOR filter instead of a `HashSet`; it uses less memory. See #193, #21 (comment).
@TimoFreiberg fyi