output of println get all mixed up from coroutines #10615

Closed
ssagaert opened this issue Mar 24, 2015 · 14 comments
Labels
io Involving the I/O subsystem: libuv, read, write, etc.

Comments

@ssagaert

Scenario: a web page with multiple iframes => multiple HTTP requests, handled via HttpServer, which serves the requests via coroutines.

When doing a simple println of stack traces, single lines (even values from a single var!) get mixed up by the multiple coroutines. I know it's normal that lines from the different iframe requests will get intermingled, but not within the lines themselves, which makes for very hard debugging...

example:

"key not found: 5"3415785505"4
" in
getDescription in at getDescription/media/sf_Documents/julia_workspace/VMRecommender/src/Rec3.jl at :/media/sf_Documents/julia_workspace/VMRecommender/src/Rec3.jl190:
190 in
webRecFrame in at webRecFrame/media/sf_Documents/julia_workspace/VMRecommender/src/Rec3.jl at :/media/sf_Documents/julia_workspace/VMRecommender/src/Rec3.jl1501:
1501 in
anonymous in at anonymousno file at :no file266:
266 in
on_message_complete in at on_message_complete/home/steven/.julia/v0.3/HttpServer/src/HttpServer.jl at :/home/steven/.julia/v0.3/HttpServer/src/HttpServer.jl303:
303 in
on_message_complete in at on_message_complete/home/steven/.julia/v0.3/HttpServer/src/RequestParser.jl at :/home/steven/.julia/v0.3/HttpServer/src/RequestParser.jl99:MethodError99(MethodErrorResponse(,Response(,nothing(,nothing),))
)ERROR: Response has no method matching Response(::Nothing)
in on_message_complete at /home/steven/.julia/v0.3/HttpServer/src/HttpServer.jl:305
in on_message_complete at /home/steven/.julia/v0.3/HttpServer/src/RequestParser.jl:99

ERROR: Response has no method matching Response(::Nothing)
in on_message_complete at /home/steven/.julia/v0.3/HttpServer/src/HttpServer.jl:305
in on_message_complete at /home/steven/.julia/v0.3/HttpServer/src/RequestParser.jl:99

@amitmurthy
Contributor

This commit - #10521 - a few hours back has a solution for println only. Can you do a rebuild and see if it closes your issue?

@ihnorton added the io label on Mar 24, 2015
@tkelman
Contributor

tkelman commented Mar 25, 2015

Judging by .julia/v0.3, it looks like @ssagaert is using releases, so he would need to test on a nightly to check whether the recently added locking helps.

@ssagaert
Author

Is this fix on the 0.4 branch or 0.3? I'm on 0.3.5 and my code is not going to run unchanged on 0.4, I think (due to changes in syntax for Dict etc.). I don't want to migrate to 0.4 at this point.

Also, I've never built julia from source, so… I'd prefer you do a small test 😊


@tkelman
Contributor

tkelman commented Mar 27, 2015

That makes sense. Can you provide example code that causes this and runs on 0.3?

@ssagaert
Author

That will be difficult, I think.

The code is for a recommendation system I'm writing for my company, so I can't just 'open source' it.

The code is quite large and specific to our databases. Without the databases you won't be able to run the code anyway.

Your best bet is to write a small program that does the following:

serves an HTML page with more than 1 iframe (=> multiple simultaneous HTTP requests)

processes the HTTP requests via HttpServer.jl

when processing a request, looks up data in a Dict with a missing key and printlns the stacktrace of that exception

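For reference, a minimal reproducer along those lines might look like the sketch below. It follows HttpServer.jl's documented handler API; the port, paths, Dict contents and status code are made up for illustration.

using HttpServer

# A mostly empty table so that lookups from the iframe requests raise KeyError.
const data = Dict{Int, String}()
data[1] = "one"

handler = HttpHandler() do req::Request, res::Response
    if req.resource == "/"
        # One page with two iframes => two near-simultaneous requests.
        Response("""<html><body>
                    <iframe src="/frame/1"></iframe>
                    <iframe src="/frame/2"></iframe>
                    </body></html>""")
    else
        try
            Response(data[5])                 # key 5 is missing => KeyError
        catch err
            # These prints come from different coroutines and can interleave.
            println(err)
            Base.show_backtrace(STDOUT, catch_backtrace())
            Response(500)
        end
    end
end

run(Server(handler), 8000)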

@amitmurthy
Contributor

On 0.3, a workaround for you would be to

  • copy julia/base/util.jl, Lines 272 to 328 in 80e2cd5:

    # Advisory reentrant lock
    type ReentrantLock
        locked_by::Nullable{Task}
        cond_wait::Condition
        reentrancy_cnt::Int
        ReentrantLock() = new(nothing, Condition(), 0)
    end
    # Lock object during function execution. Recursive calls by the same task is OK.
    const adv_locks_map = WeakKeyDict{Any, ReentrantLock}()
    function lock(f::Function, o::Any)
        rl = get(adv_locks_map, o, nothing)
        if rl == nothing
            rl = ReentrantLock()
            adv_locks_map[o] = rl
        end
        lock(rl)
        try
            f(o)
        finally
            unlock(o)
        end
    end
    function lock(rl::ReentrantLock)
        t = current_task()
        while true
            if rl.reentrancy_cnt == 0
                rl.locked_by = t
                rl.reentrancy_cnt = 1
                return
            elseif t == get(rl.locked_by)
                rl.reentrancy_cnt += 1
                return
            end
            wait(rl.cond_wait)
        end
    end
    function unlock(o::Any)
        rl = adv_locks_map[o]
        unlock(rl)
    end
    function unlock(rl::ReentrantLock)
        rl.reentrancy_cnt = rl.reentrancy_cnt - 1
        if rl.reentrancy_cnt == 0
            rl.locked_by = nothing
            notify(rl.cond_wait)
        elseif rl.reentrancy_cnt < 0
            AssertionError("unlock count must match lock count")
        end
        rl
    end

    into your codebase
  • wrap the debug printlns that are getting all mixed up like this:
lock(STDOUT) do io
  println(io, stacktrace)
end
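
Applied inside a request handler's catch block, the wrapped print might look like this (a sketch; data and the looked-up key are placeholders for whatever raised the error):

data = Dict{Int, String}()
try
    println(data[5])                        # KeyError
catch err
    lock(STDOUT) do io                      # io here is just STDOUT
        showerror(io, err)
        Base.show_backtrace(io, catch_backtrace())
        println(io)
    end
end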

Backporting this feature will take some time, given the performance regression detected due to the lock.

@amitmurthy
Contributor

In the code block, you will need to replace Nullable with an appropriate Union and change lines using that field as well.
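
Concretely, on 0.3 that replacement might look like the following sketch (0.3 still writes unions as Union(...) and calls the type of nothing Nothing):

# Hold the owning Task directly, or nothing, instead of a Nullable{Task}.
type ReentrantLock
    locked_by::Union(Task, Nothing)
    cond_wait::Condition
    reentrancy_cnt::Int
    ReentrantLock() = new(nothing, Condition(), 0)
end

# ...and the one line that unwrapped the Nullable,
#     elseif t == get(rl.locked_by)
# becomes a plain comparison:
#     elseif t === rl.locked_by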

@ssagaert
Author

Hi,
Concerning the performance regression: in the current design you have a global lock, so when you have a lot of threads competing for that lock it becomes a bottleneck. Have you thought about something like some sort of 'event-driven IO'?
E.g. have a multi-queue thread pool (one queue per thread plus work stealing) where print events/messages are dropped by different threads and then printed asynchronously in the background. Or maybe you can avoid a Julia-level thread pool by using OS-level event-driven/non-blocking IO (see e.g. java.nio; Python also has non-blocking IO now).
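
As a rough sketch of that queue idea with the current task-based runtime (one scheduler thread), a single background task could own STDOUT while other tasks only enqueue whole messages; the names below are made up:

# All tasks push complete messages onto a queue; one task drains it,
# so pieces of different messages can never interleave mid-line.
const print_queue = Any[]
const print_cond  = Condition()

function queued_println(args...)
    push!(print_queue, string(args..., "\n"))
    notify(print_cond)
end

@async while true
    while isempty(print_queue)
        wait(print_cond)
    end
    write(STDOUT, shift!(print_queue))
end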


@amitmurthy
Contributor

The lock is per object. Also, there is currently one thread of execution, with multiple tasks (i.e. coroutines) executing concurrently. The IO subsystem, based on libuv, is event-driven IO. The mixed output from multiple tasks happens because the IO subsystem yields to other tasks during any IO by a task.
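
A tiny illustration of that yielding (a sketch, not code from this thread): each task below emits one logical line as several write calls, and because a write on a libuv stream may yield, pieces from the two tasks can come out interleaved.

@sync for id in ("A", "B")
    @async begin
        for part in ("first", "second", "third")
            write(STDOUT, string(id, "-", part, " "))   # each write may yield
        end
        write(STDOUT, "\n")
    end
end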

@ssagaert
Author

Ok. Thanks for the clarification.


@ssfrr
Contributor

ssfrr commented Sep 5, 2015

This is a test case on recent master (fcb38c1). I'd expect the println calls to be atomic because of the locking introduced in #10521 and #10679, but it looks like the newlines at the end are being interleaved.

received = IOBuffer()

@sync begin
    server = listen(10001)
    @async begin
        sock = accept(server)
        println("Connected")
        while isopen(sock)
            data = readline(sock)
            write(received, data)
        end
        close(server)
    end
    # yield so we hit the server accept
    yield()

    busy = false
    sender = connect(10001)
    @async println(sender, "1234567890")
    @async println(sender, "abcdefgabc")
    # sleep so all our messages get through
    sleep(0.1)
    close(sender)
end

println("total received data:")
print(takebuf_string(received))
println("-----")

I would expect to see

Connected
total received data:
1234567890
abcdefgabc
-----

But instead I see

Connected
total received data:
1234567890abcdefgabc

-----

So it looks like the second message is coming before the newline from the first one. I don't see anything wrong in the locking code though. @amitmurthy any ideas?

@ssfrr
Contributor

ssfrr commented Sep 5, 2015

Strangely enough, wrapping the println calls in an additional shared lock/unlock fixes the problem.

I just replaced the pair of @async lines above with

    lk = ReentrantLock()
    @async begin
        lock(lk)
        println(sender, "1234567890")
        unlock(lk)
    end
    @async begin
        lock(lk)
        println(sender, "abcdefgabc")
        unlock(lk)
    end

Maybe there's something weird going on where the tasks are locking different locks? Just a guess.

@amitmurthy
Contributor

334a2be#diff-ac5d350530574f3f82c860d1b963bc0a is causing this. We again need to add appropriate locks for the specific IO types.

cc : @JeffBezanson

@amitmurthy
Contributor

#12978 will fix this on 0.4. Closing this since we are not porting the change to 0.3.
