-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite casync-http to use curl multi #208
base: main
Are you sure you want to change the base?
Rewrite casync-http to use curl multi #208
Conversation
f7cf33e
to
ebc2243
Compare
struct CaChunkDownloader { | ||
CaRemote *remote; | ||
CURLM *multi; | ||
Queue *ready; /* CURL handles waiting to be used */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we know the size of the queues already, it would be more efficient to have static arrays, and avoid a lot of malloc/free. I can rework this.
} | ||
|
||
static int ca_chunk_downloader_step(CaChunkDownloader *dl) { | ||
int r; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this function is tricky. This is basically what happens during a loop iteration.
See the call ca_chunk_downloader_remote_step()
below? If you move it to the end, then you break the testsuite, and a bunch of casync list
commands hang forever. I don't really know why. But it's not sure you will reproduce it. It's interesting to see that if I run the testuite without parallel tests (ie. MESON_TESTTHREADS=1 ninja test
), then everything works fine.
So there's some subtleties going here that I didn't fully understand, and I ended up ordering things in this function by trial and failure. In other words, this needs thorough review: this function, the functions it calls, and basically the communication with the casync remote.
Ping? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, so if i get this right, then in this PR we'll synchronously run the curl logic, then synchronously the ca_remote logic and then the curl logic again and so on. Ideally both logics would run asynchronously in the same poll() loop, i.e. so that we get the fds to wait for out of curl and out of caremote (the latter is easy since it's only two) and pass them to the same poll() call. Can we make that happen with libcurl? I don't know curl well enough I must admit, and the reason I did my original http client the way i did (with its synchronous behaviour) was laziness. But if we fix this for multiple getters maybe we can fix that too?
Also, I am not sure why we need a queue for the requests? note that CaRemote already implements one anyway?
QueueItem *next; | ||
}; | ||
|
||
typedef struct Queue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, instead of introducing a new, local queue implementation here: let's just copy list.h from systemd's tree (i.e. https://github.com/systemd/systemd/blob/master/src/basic/list.h) and use that? it's an embedded list which means we need fewer allocations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in commit 48f37dd and fdd4689. See my temporary branch mr/casync-http-curl-multi.
I will push-force latter.
ebc2243
to
3c1f551
Compare
Not sure I understand you, so let me detail a bit how things work. There's actually only one poll for all events (curl and casync altogether). This happens in Line 999 in 3c1f551
The pros of If you want to peak into curl's implementation for
So it's an easy way to poll on both casync and curl fds, basically. The downside, maybe, is that when we're out of this function, we don't know which fds were triggered, all we can know is how many fds were triggered (which doesn't help much). That's why a "loop iteration" involves running both curl logic and casync logic, in case something happened (we're not sure). That could be done differently, we could do the poll ourself instead of using
So we need to keep tracks of the active requests somehow. Active requests being either easy handles added to the curl multi, or either chunks downloaded, and waiting to be sent to the remote. I implemented a queue mechanism because it seemed to be suitable, and having 3 queues for the 3 states possibles (ready, in progress and completed) seemed to make things easy. For example, by keeping a bit of stats inside the queues ( For example while testing locally (ie. super fast localhost network), I can see that the average size of the OTOH, while testing with a remote server, obviously things are different, and this time it's the average size of the Of course, this is not the only implementation possible, and instead of having 3 queues, we could have a "state" enum, and for each chunk we could keep track of the state it's in. No more queue, just a static array of chunks. But I'm not sure it would make the code easier to read and maintain. Performance wise, it don't think there's any cons in having these 3 queues. It's small amount of data (
This part I don't really understand. Can I make use of that on the casync-http side? |
(Note that I fixed all the details you mentioned above. Since it was trivial enough I hit "Resolve" button myself, but truth to be told, I'm not familiar with reviewing stuff on github, and maybe you prefer to hit "Resolve" yourself, so please don't hesitate to tell me how it should be done). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to have some tests for this. It is far from trivial ;) I agree we should switch to one of the "real" implementations, but I'm not sure which one is appropriate. By nghttpd do you mean https://www.nghttp2.org/? That seems to be dead (website from 2015) and is not packaged for Fedora, which would make things difficult for us.
I think it'd be easier to review this patchset if some parts were split out. E.g. the part to move arg_protocol
enum higher could be separated, and it would just make things easier to review. I suggest some other parts to split out inline.
Like @poettering suggested, we want to pull in the list.h implementation from systemd. This will remove a large chunk of this patch too.
As for the general approach, I think it's reasonable. According to the documentation, the curl multi interface supports both doing poll internally and integrating into an external event loop. I think it's reasonable to start with the approach you picked. We have the option to switch to an external loop later on if necessary.
src/casync-http.c
Outdated
|
||
c = curl_easy_setopt(handle, CURLOPT_PROTOCOLS, | ||
arg_protocol == PROTOCOL_FTP ? CURLPROTO_FTP : | ||
arg_protocol == PROTOCOL_SFTP? CURLPROTO_SFTP: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe PROTOCOL_SFTP ? CURLPROTO_SFTP :
? Squishing it together like this looks off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in #219
src/casync-http.c
Outdated
} | ||
} | ||
|
||
/* (void) curl_easy_setopt(handle, CURLOPT_SSL_VERIFYPEER, false); */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to make this into a commandline option, as you suggest. It's useful for debugging independently of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in #219
src/casync-http.c
Outdated
|
||
/* (void) curl_easy_setopt(handle, CURLOPT_SSL_VERIFYPEER, false); */ | ||
|
||
/* (void) curl_easy_setopt(handle, CURLOPT_VERBOSE, 1L); */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't do it, as -v
is already an option to enable casync verbosity. Maybe -vv
to enable curl verbose as well?
Since there's already #194 on the same topic, I would batch that altogether. I have some WIP regarding this issue, but I wanted to wait until this PR is completed and merged before finishing it and opening a PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#217 propagates both options --log-level
and -v
from casync
to casync-http
. The cURL verbosity is enabled for log-level notice
and higher (i.e. on debug
).
src/casync-http.c
Outdated
do { \ | ||
if (ENABLE_LOG_TRACE) \ | ||
log_debug("[%d] " fmt, (int) getpid(), ##__VA_ARGS__); \ | ||
} while (false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log_trace
should go in log.h.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gportay will follow up on that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in commit de2ba61. See my temporary branch mr/casync-http-curl-multi.
I will push-force latter.
src/casync-http.c
Outdated
queue_push(dl->inprogress, handle); | ||
|
||
/* We know there must be something to do, since we just added something. */ | ||
c = curl_multi_perform(dl->multi, &running_handles); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems strange to call curl_multi_perform
in a loop. It should handle all handles at once, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gportay will follow up on that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@keszybz: indeed curl_multi_perform
handles all the handles.
This function handles transfers on all the added handles that need attention in an non-blocking fashion.
I will call the function once, outside the for
loop.
src/casync-http.c
Outdated
for (;;) { | ||
if (quit) { | ||
log_info("Got exit signal, quitting"); | ||
r = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just do return 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gportay ^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in commit fdd4689. See my temporary branch mr/casync-http-curl-multi.
I will push-force latter.
src/casync-http.c
Outdated
} | ||
|
||
return c; | ||
return r; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and remove the this line completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gportay ^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in commit fdd4689. See my temporary branch mr/casync-http-curl-multi.
I will push-force latter.
@@ -180,7 +1151,7 @@ static size_t write_index(const void *buffer, size_t size, size_t nmemb, void *u | |||
|
|||
r = ca_remote_put_index(rr, buffer, product); | |||
if (r < 0) { | |||
log_error("Failed to put index: %m"); | |||
log_error_errno(r, "Failed to put index: %m"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This patch would be much easier to read if those fixes (which are independent) were split out into a separate patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in #219
@@ -623,9 +1393,6 @@ static int run(int argc, char *argv[]) { | |||
r = process_remote(rr, PROCESS_UNTIL_FINISHED); | |||
|
|||
finish: | |||
if (curl) | |||
curl_easy_cleanup(curl); | |||
|
|||
return r; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this cleanup block is not necessary anymore. It would be nice to drop the label, and simply use return
instead of goto
everywhere above. Could be done as a separate commit if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in #219
src/casync-http.c
Outdated
static curl_off_t arg_rate_limit_bps = 0; | ||
static bool arg_verbose = false; | ||
static curl_off_t arg_rate_limit_bps = 0; | ||
static uint64_t arg_max_active_chunks = MAX_ACTIVE_CHUNKS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint64_t
seems a bit over the top. Maybe just make this unsigned
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in #219
Signed-off-by: Arnaud Rebillout <[email protected]>
This is to prepare the next commit, where we will use the protocol enum for more than just the protocol passed in arguments, and the arg_ prefix won't make sense anymore. Signed-off-by: Arnaud Rebillout <[email protected]>
This commits brings in two helpers: - protocol_str() to convert an enum protocol to a string, which is useful mainly for logs. - protocol_status_ok() as a unique place to check if the protocol status that we get from libcurl means OK or KO. Signed-off-by: Arnaud Rebillout <[email protected]>
Signed-off-by: Arnaud Rebillout <[email protected]>
It seems to me that the condition PROCESS_UNTIL_WRITTEN is reached when there's no more data to write, hence ca_remote_has_unwritten() returns 0. And it also seems that this is could be copy/paste mistake, as all the code above is similar, but the condition matches the function we call, ie: - PROCESS_UNTIL_CAN_PUT_CHUNK > ca_remote_can_put_chunk - PROCESS_UNTIL_CAN_PUT_INDEX > ca_remote_can_put_index - PROCESS_UNTIL_CAN_PUT_ARCHIVE > ca_remote_can_put_archive - PROCESS_UNTIL_HAVE_REQUEST > ca_remote_has_pending_requests But here, the function returns the opposite of what we want: - PROCESS_UNTIL_WRITTEN > ca_remote_has_unwritten Note that I didn't observe any bug due to that, and the test suite succeeds before and after this patch. Signed-off-by: Arnaud Rebillout <[email protected]>
While working on this code, I stumbled on cases where casync got stuck because we forgot to call PROCESS_UNTIL_WRITTEN here. Well, TBH as long as we download chunks, we're fine because we end up calling PROCESS_UNTIL_WRITTEN afterwards. But in any case, it seems more correct to sync after downloading these files, and it doesn't hurt. Signed-off-by: Arnaud Rebillout <[email protected]>
The way we use curl handle is that we create it once, and then re-use it again and again, as it's more efficient than re-allocating a new one for each request. By looking at the code closely, it turns out that the setup of the curl handle needs to be done only once, then afterwards we only need to change the URL in order to re-use the handle. So this commit brings two helper functions to reflect that: - make_curl_easy_handle() does the init work and set all the options for the handle. - configure_curl_easy_handle() does the things that are needed in order to re-use the handle. In effect, it only sets the URL. Additionally, this commit introduces curl_easy_cleanupp, in order to use automatic pointers. Signed-off-by: Arnaud Rebillout <[email protected]>
These macros aim to make setting curl options easier. CURL_SETOPT_EASY() sets the option, and on failure it outputs a generic error message with the name of the option that failed, and returns -EIO. The CURL_SETOPT_EASY_CANFAIL() variant does not return, it only outputs an error message. Signed-off-by: Arnaud Rebillout <[email protected]>
This removes the need for the 'finish' label, hence a bunch of goto go away. Signed-off-by: Arnaud Rebillout <[email protected]>
Signed-off-by: Arnaud Rebillout <[email protected]>
Hi and sorry for being that long, lately I've been under water due to other projects. I split all the work that is not really related to multi http request, but more refactoring, into #219, which can be merged to master if it's good enough. It's mostly shuffling code, nothing polemic I think, but I'm sure it will benefit from a round of review / rework. Due to other tasks ongoing on my side, I won't be able to work further on that, hence let me introduce @gportay who will keep the good job going. I'm still around and will help him catching up. Thanks for the collaboration and I hope to see u again through more casync PR! |
Signed-off-by: Arnaud Rebillout <[email protected]>
The goal is to make the run() function more readable, and only outline the major steps, while the bulk of the work is left to other functions. Signed-off-by: Arnaud Rebillout <[email protected]>
This long option does not use a short option (see the optstring in call of getopt_long). Use a constant instead, as it is done in casync-tool.c.
This can be useful for testing, if ever we do HTTP2/SSL with a local, untrusted server. Signed-off-by: Arnaud Rebillout <[email protected]>
I think nginx is best, as it's real in the sense that it powers the Internet for real, and it's also easy enough to set up for local tests. The only downside I see is that it requires root, and in Debian we build packages and run test suites without root. I don't know about fedora. I also proposed nghttp2 because it's a dependency of curl (under the form of libnghttp2), so it's already around when we build casync. I'm not familiar with the project though. It seems that it's maintained: https://github.com/nghttp2/nghttp2/graphs/code-frequency.
Done in #219
@gportay will take care of that.
Yes indeed. It's very easy with curl, as you basically just call |
This reverts commit 328f13d.
The file is taken from systemd[1]. Note: The inclusion of the macro header is removed. [1]: https://raw.githubusercontent.com/systemd/systemd/0d7f7c2fde377d9bf618d16aa230757f956f8191/src/basic/list.h
This commit brings parallel downloads to casync-http, by using the cURL multi interface, and more precisely the "select()" flavour. For details see https://curl.haxx.se/libcurl/c/libcurl-multi.html. The cURL library has two ways to achieve parallel downloads: for HTTP/1, it can open parallel connections. The maximum number of parallel connections is user-defined, through MAX_HOST_CONNECTIONS and MAX_TOTAL_CONNECTIONS. for HTTP/2, it can attempt to multiplex on a single connection. The maximum number of parallel downloads in this case is negociated between the client and the server (we talk about number of streams in the HTTP/2 jargon). Note that libcurl used to do pipelining over HTTP/1.1, but this is no longer supported since 7.62.0, and casync-http doesn't use it anyway) Signed-off-by: Arnaud Rebillout <[email protected]> Signed-off-by: Gaël PORTAY <[email protected]>
Currently, casync-http handles the option --log-level but casync does not propagate its verbosity to its protocol helpers. This commit propagates the --log-level to the protocol helpers as it is done for option --rate-limit-bps. Now, the debug messages from the helper casync-http are printed to stderr if the --log-level is specified to casync.
Enable CURL verbosity for the notice and higher log levels (ie. debug). When CURLOPT_VERBOSE is enabled, libcurl outputs traces to stderr, such as headers and a lot of internal information. Now, the libcurl verbose informations are dumped to stdout when the options --log-level notice or debug are specified to casync.
This commit introduces a new option --max-active-chunks=<MAX> that limits the number of simultaneous chunk transfers from the remote. The MAX number is the sum of: 1. the number of chunks added to the cURL multi interface, and 2. chunks downloaded and waiting to be sent to the remote. It limits the number of chunks stored in memory that are ready to be sent to the casync process; this limit the memory usage in the situation where the network is faster than the pipe communication between the helper and casync.
This commit introduces a new option --max-host-connections=<MAX> that limits the number of connections to a single host. See https://curl.haxx.se/libcurl/c/CURLMOPT_MAX_HOST_CONNECTIONS.html.
Currently, casync-http handles the option --trust-peer but casync does not propagate it to its protocol helpers. This commit propagates the --trust-peer the protocol helpers as it is done for option --rate-limit-bps.
The macro log_trace is imported from systemd.
3c1f551
to
d8748c9
Compare
I have push-force a new version. It is based on #219, with more atomic commits. |
@poettering, @keszybz is there any chance you have time to make a review on all PR? I assume you are very busy these days :) Thanks anyway. |
This commit brings parallel downloads to casync-http, by using the curl
multi interface, and more precisely the "select()" flavour. For details
see https://curl.haxx.se/libcurl/c/libcurl-multi.html.
Libcurl has two ways to achieve parallel downloads:
parallel connections is user-defined, through
MAX_HOST_CONNECTIONS
and
MAX_TOTAL_CONNECTIONS
.maximum number of parallel downloads in this case is negociated
between the client and the server (we talk about number of streams in
the HTTP/2 jargon).
longer supported since 7.62.0, and casync-http doesn't use it anyway)
Accordingly, this commit also introduces two new command-line arguments
to better control the behavior of parallel downloads:
--max-active-chunks
is the sum of 1. the number of chunks in thecurl multi and 2. chunks downloaded and waiting to be sent to the
remote. It allows to limit the number of chunks waiting in RAM, in
case we download faster than we can send to remote. It also gives a
limit for the maximum number of concurrent downloads.
--max-host-connections
is for the case where libcurl opens parallelconnections to the server. In all likelihood it's only used for HTTP1.
We probably want a large number for max-active-chunks, to ensure we
don't starve the libcurl multi handle, but at the same time we probably
don't want to open too many connections in parallel, and that's why
max-host-connections is a much lower number. It seems to be a sensible
default, according to my understanding so far. User might want to adjust
these number for their specific use-case.
Note that the command-line argument
--rate-limit-bps
doesn't make muchsense anymore, since it's set for each chunk, but now chunks are
downloaded in parallel, and we don't really know how many downloads are
actually happening in parallel. And from Daniel Steinberg:
So we might want to completely remove this option, or rework it somehow.
Note that this commit removes the wrapper
robust_curl_easy_perform()
introduced in 328f13d. Quick reminder: this wrapper was used to sleep
and retry on
CURLE_COULDNT_CONNECT
, and allowed to workaround whatseemed to be a misbehavior of the Python Simple HTTP Server. Now that we
do parallel downloads, we can't apply this workaround "as is", we can't
just sleep. So I removed the wrapper. The issue is still present and
reproducible though, but I just assume it's a server issue, not a casync
issue.
Regarding unit tests
This commit also opens interesting questions regarding the unit tests.
For now we're using the Python Simple HTTP Server, which can only serve
requests sequentially. It doesn't allow to really test parallelism. It's
not really representative of real-life scenario where, I assume, chunks
are served by a production server such as apache or nginx. Additionally,
I think it would be best to run the test for both a HTTP/1 server and a
HTTP/2 server.
One possibility is to use nginx, it's easy enough to run it. nginx can
serve HTTP/2 only if TLS is enabled though, and out of the box
casync-http will fail if it can't recognize the certificate. So we might
want to add a command-line argument to trust any random certificate.
Additionally, nginx requires root, maybe not very suitable for a test
suite, however there might be some workarounds?
Another possibility is to run nghttpd. This option is light on
dependencies, in the sense that libcurl already relies on libnghttp2,
however I didn't succeed in using this server yet.
TODOs
There are a few todos left, mainly:
arguments in casync-http.