Skip to content

Prevent socket specs from hanging the main fiber#9437

Open
jhass wants to merge 1 commit intocrystal-lang:masterfrom
jhass:spec_hang
Open

Prevent socket specs from hanging the main fiber#9437
jhass wants to merge 1 commit intocrystal-lang:masterfrom
jhass:spec_hang

Conversation

@jhass
Copy link
Member

@jhass jhass commented Jun 6, 2020

No description provided.

@jhass jhass force-pushed the spec_hang branch 2 times, most recently from 5b06c4f to 0b063c4 Compare June 6, 2020 10:32
@jhass jhass changed the title Prevent OpenSSL server specs from hanging the tests when the client fails to connect Prevent socket specs from hanging the main fiber Jun 6, 2020
@straight-shoota
Copy link
Member

I don't like that the helpers remove the actual method call from the direct spec code. The call to accept/accept? should be explicit, especially when that's part of what's being tested.
So I'd suggest to make the helper yield:

socket = with_timeout { server.accept }

This would mean we only need a single helper method.
If the server.close is moved outside, it can be even more generic and useful for other specs, too. And I think it should be moved outside because the code that starts the server should make sure to close it.

It would also be great if the helper handled exceptions raised in the spawned fiber.

@jhass
Copy link
Member Author

jhass commented Jun 6, 2020

It would also be great if the helper handled exceptions raised in the spawned fiber.

Handled how? Basically no spec code does that right now.

If the server.close is moved outside, it can be even more generic and useful for other specs

Well, it's just there to prevent FD leaks for specs that don't use open. Moving it behind the assertion looses its point. I guess we don't really care.

@straight-shoota
Copy link
Member

Rescue the exception, send it to a channel, add that channel to the select and raise it in the main fiber.

Spec helpers run_server and spawn_and_check handle exceptions.

@jhass jhass requested a review from straight-shoota June 6, 2020 14:08
Copy link
Member

@straight-shoota straight-shoota left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks awesome!

@jhass jhass added kind:bug A bug in the code. Does not apply to documentation, specs, etc. and removed kind:refactor labels Jun 6, 2020
@jhass
Copy link
Member Author

jhass commented Jun 6, 2020

This is more than a refactor, it actually fixes the specs hanging themselves in some build environments or error conditions rather. I wouldn't be too surprised if the random preview_mt specs hangs we see are even caused by this (not saying it fixes them properly for MT, just that it might no longer hang).

@jhass jhass mentioned this pull request Jun 18, 2020
5 tasks
@waj
Copy link
Member

waj commented Jun 23, 2020

I was hesitant about adding more complexity to the specs by adding extra fiber plus channels to handle accept timeouts, since this should be already possible using the read_timeout (note also that failing specs would probably keep the socket open and still accepting connections in the background). But it turns out that's not working and I found why:

wait_readable rescue nil

That rescue nil is avoiding the timeout exception to raise and keep looping forever. If I just remove the rescue at least one other spec fails. So I'll send another PR to fix that in a better way. After that I think with_timeout could just set/unset the read_timeout property.

@jhass
Copy link
Member Author

jhass commented Jun 23, 2020

Mmh, what I liked here is that with_timeout is completely agnostic, it doesn't need to know anything about the thing that's blocking. Which sets a good precedent against having any blocking operation on the main fiber in specs. This is updating accept's only because there are a lot of them in the current specs on the main fiber and were what hid the underlying failure conditions in #9438 and hang the specs.

I could imagine potentially a Channel.receive for example on the main fiber still lurking around somewhere.

@waj
Copy link
Member

waj commented Jun 23, 2020

Yup, I understand, but I don't want to mix channels and selects where it's not needed. Let's keep things simple as possible. I know a hanging spec is hard to debug, but it's even harder if we mix things that often causes trouble in multithreaded environments.

@waj
Copy link
Member

waj commented Jun 23, 2020

I mean, I understand what you're trying to accomplish, but in that case should we add a wait_timeout on every spec just in case? I wish we had an analogous api to Go's context so spec would just setup a deadline for each case. After 1.0 maybe 😉

@jhass
Copy link
Member Author

jhass commented Jun 23, 2020

I wouldn't be very opposed to that! Of course it should be possible to set the actual timeout value or disable it entirely if you know what you're doing. Also probably to set the default value per context. But capping a spec to something like 10 seconds per default sounds entirely sensible to me.

@waj
Copy link
Member

waj commented Jun 23, 2020

Yup, totally doable, but that would work very differently than creating fibers and channels, and work more tightly with the runtime instead.

@jhass
Copy link
Member Author

jhass commented Jun 23, 2020

Are you sure? it { loop { } } still would need to trigger the timeout, at least in MT mode.

@waj
Copy link
Member

waj commented Jun 23, 2020

No, because even in MT the fibers work in cooperative mode. In a tight loop there is no place where the timeout exception could occur. The same happened to Go until version 1.14 where the goroutines became fully preemptible. Although I don't know the details about how is that implemented. We could eventually do something similar in Crystal.

Note that if you just use your implementation of with_timeout with that loop, the spec might raise an error and keep running (in MT), but the loop will still be running and fully locking the thread where it lives.

@straight-shoota
Copy link
Member

So, what do we do about this?
Maybe we could guard the actual with_timeout implementation by a compile time flag. It would be disabled by default, but you can enable it if you need to debug a hanging spec. Not sure that's gonna be that useful though. And it might be misleading because the spec code suggest there is a timeout but then it's actually deactivated by default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind:bug A bug in the code. Does not apply to documentation, specs, etc. kind:specs topic:stdlib:networking

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants