Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to fix deadlock #124

Merged
merged 1 commit into from
Mar 2, 2018
Merged

Conversation

samoconnor
Copy link
Contributor

This call:
nb_available(::SSLContext)->mbedtls_ssl_read->f_recv->readbytes!()
can block holding the datalock and can stop calls write(::SSLContext) calls
(in other tasks) from getting the lock and writing. If a request is not
writen, the response may never come, and the read may block forever.

The forum post below says that the zero-size mbedtls_ssl_read call
used to find out how many bytes are available must be done in
non-blocking mode. This commit addes a nonblocking flag to try to
implement that.

This seems to work to avoid a deadlock in HTTP.jl where:

  • A request header has been written.
  • An async task has been schedueled to write the body.
  • The main task starts reading the response (in some cases the
    response headers arrive before the request body is sent).

The expected behaviour is that the reading task will soon block to wait
for data, thereby yeilding to the async task, which will finish writing
the body. Before this commit, it would get stuck.

See: https://esp32.com/viewtopic.php?t=1101#p4945
"After the correct implementation of non-blocking sockets, the things
seem to be working fine"

This call:
nb_available(::SSLContext)->mbedtls_ssl_read->f_recv->readbytes!()
can block holding the datalock and can stop calls write(::SSLContext) calls
(in other tasks) from getting the lock and writing. If a request is not
writen, the response may never come, and the read may block forever.

The forum post below says that the zero-size mbedtls_ssl_read call
used to find out how many bytes are available must be done in
non-blocking mode. This commit addes a nonblocking flag to try to
implement that.

This seems to work to avoid a deadlock in HTTP.jl where:
 - A request header has been written.
 - An async task has been schedueled to write the body.
 - The main task starts reading the response (in some cases the
   response headers arrive before the request body is sent).

The expected behaviour is that the reading task will soon block to wait
for data, thereby yeilding to the async task, which will finish writing
the body. Before this commit, it would get stuck.

See: https://esp32.com/viewtopic.php?t=1101#p4945
    "After the correct implementation of non-blocking sockets, the things
    seem to be working fine"
samoconnor added a commit to JuliaWeb/HTTP.jl that referenced this pull request Mar 2, 2018
- Add a txcount field to the Request struct
- Increment txcount after the body is written
- Modify RetryRequest.isrecoverable to allow retry of non-idempotent
  requests if txcount is zero.
- Modify StreamRequest to Wait for other pipelined reads to complete
  before sending a non-idempotent request body. This should decrease
  the chance of sending a POST body that ends up failing and can't be
  retried.
- Remove yeild() after @async writebody. This should increase the chance
  that the startread() realises that the connection is dead before the
  body is written.
  Depends on: JuliaLang/MbedTLS.jl#124
@samoconnor
Copy link
Contributor Author

@Keno @quinnj @malmaud

@codecov-io
Copy link

codecov-io commented Mar 2, 2018

Codecov Report

Merging #124 into master will decrease coverage by 3.74%.
The diff coverage is 75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #124      +/-   ##
==========================================
- Coverage   68.63%   64.88%   -3.75%     
==========================================
  Files          10       10              
  Lines         424      430       +6     
==========================================
- Hits          291      279      -12     
- Misses        133      151      +18
Impacted Files Coverage Δ
src/ssl.jl 66.11% <75%> (-5.19%) ⬇️
src/error.jl 44.44% <0%> (-11.12%) ⬇️
src/x509_crt.jl 63.15% <0%> (-10.53%) ⬇️
src/entropy.jl 78.94% <0%> (-10.53%) ⬇️
src/ctr_drbg.jl 66.66% <0%> (-9.53%) ⬇️
src/pk.jl 72.34% <0%> (-4.26%) ⬇️
src/cipher.jl 60.86% <0%> (-1.45%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cec4191...a67c763. Read the comment docs.

@quinnj
Copy link
Member

quinnj commented Mar 2, 2018

in some cases the response headers arrive before the request body is sent

Is that just in 100-continue situations or are there other cases where this can happen?

ccall((:mbedtls_ssl_read, MBED_TLS),
Cint, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t), ctx.data, C_NULL, 0)
ctx.nonblocking = true
try
Copy link
Member

@quinnj quinnj Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this try-finally really needed? Can ccalls actually throw errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure.
But it's never good for a block of code to appear to restore the old state, but have a lurking possibility of exiting early.
(& It isn't a try/catch, its a try/finally)
I'm happy to leave a FIXME in the code suggesting that someone could take it out in the future if everything is working nicely...

@samoconnor
Copy link
Contributor Author

samoconnor commented Mar 2, 2018

in some cases the response headers arrive before the request body is sent

Is that just in 100-continue situations or are there other cases where this can happen?

No, it is quite common for the server to immediately respond with error response headers if it can tell it isn't happy based on the request headers. e.g. unsupported request, or bad credentials.

@samoconnor
Copy link
Contributor Author

In fact there is a specific part in the RFC about how if the server says Connection: Close you must close before sending the remainder of the body. There is a test for this in test/loopback.jl

@@ -112,19 +114,23 @@ end
function f_send(c_ctx, c_msg, sz)
jl_ctx = unsafe_pointer_to_objref(c_ctx)
jl_msg = unsafe_wrap(Array, c_msg, sz)
return Cint(write(jl_ctx, jl_msg))
return Cint(write(jl_ctx.bio, jl_msg))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this right? we're calling write on the TCPSocket now instead of on the SSLContext?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the void* passed to the library was the socket.
Now it is the SSLContext. This is needed so we can see the nonblocking flag in the callback.
So, the .bio is needed now to get back to the socket.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that makes sense. 👍

return Cint(n)
end

function set_bio!(ssl_ctx::SSLContext, jl_ctx::T) where {T<:IO}
ssl_ctx.bio = jl_ctx
set_bio!(ssl_ctx, pointer_from_objref(jl_ctx), c_send[], c_recv[])
set_bio!(ssl_ctx, pointer_from_objref(ssl_ctx), c_send[], c_recv[])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what effect does this change end up having?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

@quinnj quinnj merged commit 1835661 into JuliaLang:master Mar 2, 2018
@quinnj
Copy link
Member

quinnj commented Mar 2, 2018

Looks good. Thanks @samoconnor !

quinnj pushed a commit to JuliaWeb/HTTP.jl that referenced this pull request Mar 2, 2018
- Add a txcount field to the Request struct
- Increment txcount after the body is written
- Modify RetryRequest.isrecoverable to allow retry of non-idempotent
  requests if txcount is zero.
- Modify StreamRequest to Wait for other pipelined reads to complete
  before sending a non-idempotent request body. This should decrease
  the chance of sending a POST body that ends up failing and can't be
  retried.
- Remove yeild() after @async writebody. This should increase the chance
  that the startread() realises that the connection is dead before the
  body is written.
  Depends on: JuliaLang/MbedTLS.jl#124
@samoconnor
Copy link
Contributor Author

Maybe more generally shouldn't be waiting for exactly sz bytes

yeah, I actually had code that did just that, but I thought it was best to keep the change small for now.

@vtjnash, @quinnj, see: #125

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants