Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrent program crashes 1.12 nightly build (1.10 is fine) #54720

Closed
levy opened this issue Jun 6, 2024 · 5 comments · Fixed by #55336
Closed

Concurrent program crashes 1.12 nightly build (1.10 is fine) #54720

levy opened this issue Jun 6, 2024 · 5 comments · Fixed by #55336
Labels
atomics bug Indicates an unexpected problem or unintended behavior compiler:codegen Generation of LLVM IR and native code multithreading Base.Threads and related functionality regression 1.12 Regression in the 1.12 release
Milestone

Comments

@levy
Copy link

levy commented Jun 6, 2024

I attached the code that reproduces the crash. I'm sorry for the extension but couldn't attach a Julia source file.

You may need to give it a few tries though. It should crash in a few seconds, but sometimes it just hangs. I have a 16 core machine, so that may also affect the outcome.

See discussion for some details.

I installed Julia with juliaup.

julia> versioninfo()
Julia Version 1.12.0-DEV.670
Commit 5cb1107cce1 (2024-06-06 03:00 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 4800HS with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-17.0.6 (ORCJIT, znver2)
Threads: 16 default, 0 interactive, 16 GC (on 16 virtual cores)
Environment:
  JULIA_NUM_THREADS = 4

crash.jl.txt

image

@oscardssmith
Copy link
Member

oscardssmith commented Jun 6, 2024

minimized slightly. I can also confirm that this doesn't occur on 1.11.


using .Threads

mutable struct AtomicMarked{T}
  @atomic p::Pair{T, Bool}
  AtomicMarked{T}(value::T, marked::Bool) where T <: Any = begin
    return new{T}(Pair{T, Bool}(value, marked))
  end
end

mutable struct Node{T}
  value::T
  next::AtomicMarked{Union{Node{T}, Nothing}}
  Node{T}(value::T) where T <: Any = begin
    return new{T}(value, AtomicMarked{Union{Node{T}, Nothing}}(nothing, false))
  end
end

mutable struct LockFreeSortedLinkedList{T}
  head::AtomicMarked{Union{Node{T}, Nothing}} # sentinel
  tail::AtomicMarked{Union{Node{T}, Nothing}} # sentinel
  LockFreeSortedLinkedList{T}() where T <: Any = begin
    head = AtomicMarked{Union{Node{T}, Nothing}}(Node{T}(T(:smallest)), false)
    tail = AtomicMarked{Union{Node{T}, Nothing}}(Node{T}(T(:largest)), false)
    r = @atomicreplace head.p.first.next.p Pair{Union{Node{T}, Nothing}, Bool}(nothing, false) => Pair{Union{Node{T}, Nothing}, Bool}(tail.p.first, false)
    @assert(r.success)
    return new{T}(head, tail)
  end
end

function Base.isempty(list::LockFreeSortedLinkedList{T}) where T <: Any
  find(list.head.p.first, list.tail.p.first.value)
  return list.head.p.first.next.p.first == list.tail.p.first
end

function find(head::Node{T}, element::T) where T <: Any
  pred = nothing
  curr = nothing
  succ = nothing
  while true
    pred = head
    curr = pred.next.p.first
    while true
      (succ, marked) = curr.next.p
      snip = true
      while marked
        r = @atomicreplace pred.next.p Pair{Union{Node{T}, Nothing}, Bool}(curr, false) => Pair{Union{Node{T}, Nothing}, Bool}(succ, false)
        snip = r.success
        if !snip
          break
        end
        curr = succ
        (succ, marked) = curr.next.p
      end
      if !snip
        break
      end
      if curr.value >= element
        return (pred, curr)
      end
      pred = curr
      curr = succ
    end
  end
end

function Base.push!(list::LockFreeSortedLinkedList{T}, element::T) where T <: Any
  while true
    (pred, curr) = find(list.head.p.first, element)
    if (curr.value == element)
      return false
    else
      node = Node{T}(element)
      node.next = AtomicMarked{Union{Node{T}, Nothing}}(curr, false)
      r = @atomicreplace pred.next.p Pair{Union{Node{T}, Nothing}, Bool}(curr, false) => Pair{Union{Node{T}, Nothing}, Bool}(node, false)
      if r.success
        return true
      end
    end
  end
end

function Base.pop!(list::LockFreeSortedLinkedList{T}) where T <: Any
  while true
    pred = list.head.p.first
    curr = pred.next.p.first
    succ = curr.next.p.first
    r = @atomicreplace curr.next.p Pair{Union{Node{T}, Nothing}, Bool}(succ, false) => Pair{Union{Node{T}, Nothing}, Bool}(succ, true)
    if r.success
      continue
    end
    @atomicreplace pred.next.p Pair{Union{Node{T}, Nothing}, Bool}(curr, false) => Pair{Union{Node{T}, Nothing}, Bool}(succ, false)
    return curr.value
  end
end

function remove!(list::LockFreeSortedLinkedList{T}, element::T) where T <: Any
  while true
    (pred, curr) = find(list.head.p.first, element)
    if curr.value != element
      return false
    else
      succ = curr.next.p.first
      snip = @atomicreplace curr.next.p Pair{Union{Node{T}, Nothing}, Bool}(succ, false) => Pair{Union{Node{T}, Nothing}, Bool}(succ, true)
      if !snip.success
        continue
      end
      @atomicreplace pred.next.p Pair{Union{Node{T}, Nothing}, Bool}(curr, false) => Pair{Union{Node{T}, Nothing}, Bool}(succ, false)
      return true
    end
  end
end

function Base.empty!(list::LockFreeSortedLinkedList{T}) where T <: Any
  while !isempty(list)
    pop!(list)
  end
end

function search(f, list::LockFreeSortedLinkedList{T}) where T <: Any
  curr = list.head.p.first.next.p.first
  while curr != list.tail.p.first
    if f(curr.value)
      return curr.value
    end
    curr = curr.next.p.first
  end
end

mutable struct ParallelEvent
  id::Int64
  time::Float64
  action::Function
  executable::Bool
  @atomic selectedForExecution::Bool
  executing::Bool
  delayMatrixIndex::Union{Nothing, Int64}
  ParallelEvent(symbol) = begin
    if symbol == :smallest
      return new(-1000000, -Inf64, event -> nothing, false, false, false, nothing)
    elseif symbol == :largest
      return new(+1000000, +Inf64, event -> nothing, false, false, false, nothing)
    end
  end
  ParallelEvent(id, time, action, delayMatrixIndex) = return new(id, time, action, false, false, false, delayMatrixIndex)
end

function Base.isless(a::ParallelEvent, b::ParallelEvent)
  return a.time < b.time || (a.time == b.time && a.id < b.id)
end

function Base.isequal(a::ParallelEvent, b::ParallelEvent)
  return a.time == b.time && a.id == b.id
end

mutable struct ParallelSimulation{T}
  network::Any
  futureEvents::T
  terminated::Bool
  @atomic eventId::Int64
  delayMatrix::Matrix{Float64}
  ParallelSimulation{T}(network, futureEvents, delayMatrix) where T <: Any = new{T}(network, futureEvents, false, 0, delayMatrix)
end

function markEventExecutable(simulation, event)
  @assert !event.executable
  event.executable = true
end

function executeEvent(simulation, event)
  @assert event.executable
  @assert event.selectedForExecution
  @assert !event.executing
  @assert !simulation.terminated
  event.executing = true
  event.action(event)
  event.executing = false
end

function runSimulation(simulation::ParallelSimulation)
  tasks = []
  for _ in 1:Threads.nthreads() / 2
    task = @spawn runExecutableEvents(simulation)
    push!(tasks, task)
  end
  task = @spawn runDetermineExecutableEvents(simulation)
  push!(tasks, task)
  for task in tasks
    wait(task)
  end
end

function runDetermineExecutableEvents(simulation::ParallelSimulation)
    while !simulation.terminated
      determineExecutableEvents(simulation)
    end
end

function hasInterferingPreviousEvent(simulation::ParallelSimulation, event::ParallelEvent, previousEvents)
  for previousEvent in previousEvents
    minimumDelay = simulation.delayMatrix[previousEvent.delayMatrixIndex, event.delayMatrixIndex]
    if previousEvent.time + minimumDelay <= event.time
      return true
    end
  end
  return false
end

function determineExecutableEvents(simulation::ParallelSimulation)
  i = 1
  previousEvents = []
  search(event -> begin
    if i == 1
      if !event.executable
        markEventExecutable(simulation, event)
      end
      if isnothing(event.delayMatrixIndex)
        return true
      end
    elseif isnothing(event.delayMatrixIndex)
      return true
    elseif !event.executable && !hasInterferingPreviousEvent(simulation, event, previousEvents)
      markEventExecutable(simulation, event)
    end
    i += 1
    push!(previousEvents, event)
    return false
  end, simulation.futureEvents)
end

function runExecutableEvents(simulation::ParallelSimulation)
    while !simulation.terminated
      event = selectExecutableEvent(simulation)
      if !isnothing(event)
        executeEvent(simulation, event)
        remove!(simulation.futureEvents, event)
      end
    end
end

function selectExecutableEvent(simulation::ParallelSimulation)
  search(event -> begin
    if event.executable && !event.selectedForExecution
      r = @atomicreplace event.selectedForExecution false => true
      return r.success
    else
      return false
    end
  end, simulation.futureEvents)
end

function scheduleAt(simulation::ParallelSimulation, time, context, action)
  id = @atomic simulation.eventId += 1
  push!(simulation.futureEvents, ParallelEvent(id, time, action, context.index))
end

function terminateSimulationAt(simulation::ParallelSimulation, time)
  id = @atomic simulation.eventId += 1
  event = ParallelEvent(id, time, event -> begin
    simulation.terminated = true
    empty!(simulation.futureEvents)
  end, nothing)
  push!(simulation.futureEvents, event)
end

mutable struct NetworkNode
  index::Int64
  connections::Vector
  difficulty::UInt64
end

struct Connection
  source::NetworkNode
  destination::NetworkNode
  delay::Float64
  datarate::Float64
end

mutable struct Network
  networkNodes::Vector{NetworkNode}
  connections::Vector{Connection}
end

function addConnection(networkNode1, networkNode2, delay, datarate)
  c1 = Connection(networkNode1, networkNode2, delay, datarate)
  c2 = Connection(networkNode2, networkNode1, delay, datarate)
  push!(networkNode1.connections, c1)
  push!(networkNode2.connections, c2)
  return [c1, c2]
end

function computeDelayMatrix(network)
  networkNodes = network.networkNodes
  connections = network.connections
  nodeCount = length(networkNodes)
  delayMatrix = fill(Inf, nodeCount, nodeCount)
  for i in 1:nodeCount
    delayMatrix[i, i] = 0.0
  end
  for conn in connections
    i = conn.source.index
    j = conn.destination.index
    delayMatrix[i, j] = min(delayMatrix[i, j], conn.delay)
  end
  # Apply the Floyd-Warshall algorithm
  for k in 1:nodeCount
    for i in 1:nodeCount
      for j in 1:nodeCount
        if delayMatrix[i, j] > delayMatrix[i, k] + delayMatrix[k, j]
          delayMatrix[i, j] = delayMatrix[i, k] + delayMatrix[k, j]
        end
      end
    end
  end
  return delayMatrix
end

fib(n) = n < 2 ? 1 : fib(n - 1) + fib(n - 2)

function sendPacket(simulation, event, networkNode, connection)
  destination = connection.destination
  delay = connection.delay + rand(1:100) / connection.datarate
  fib(networkNode.difficulty)
  scheduleAt(simulation, event.time + delay, destination, event -> handlePacketReceived(simulation, event, destination))
end

function handlePacketReceived(simulation, event, networkNode)
  connection = rand(networkNode.connections)
  sendPacket(simulation, event, networkNode, connection)
end

function buildNetwork(nodeCount, connectionCount, difficulty)
  networkNodes = []
  for i in 1:nodeCount
    push!(networkNodes, NetworkNode(i, [], difficulty))
  end
  connections = []
  for i in 1:nodeCount
    for _ in 1:connectionCount
      j = rand(filter!(x -> x != i, Vector(1:nodeCount)))
      append!(connections, addConnection(networkNodes[i], networkNodes[j], rand(1:10), rand(1:100)))
    end
  end
  return Network(networkNodes, connections)
end

function crash(;timeLimit=100, nodeCount=10, connectionCount=3, packetCount=2, difficulty=30)
  network = buildNetwork(nodeCount, connectionCount, difficulty)
  futureEvents = LockFreeSortedLinkedList{ParallelEvent}()
  simulation = ParallelSimulation{LockFreeSortedLinkedList{ParallelEvent}}(network, futureEvents, computeDelayMatrix(network))
  for i in 1:packetCount
    networkNode = rand(network.networkNodes)
    scheduleAt(simulation, 0, networkNode, event -> handlePacketReceived(simulation, event, networkNode))
  end
  terminateSimulationAt(simulation, timeLimit)
  runSimulation(simulation)
end
crash()

@oscardssmith oscardssmith added bug Indicates an unexpected problem or unintended behavior multithreading Base.Threads and related functionality atomics labels Jun 6, 2024
@vchuravy vchuravy added this to the 1.12 milestone Jun 6, 2024
@oscardssmith oscardssmith added the regression 1.12 Regression in the 1.12 release label Jun 6, 2024
@sgaure
Copy link

sgaure commented Jul 31, 2024

Bisected to 39f141d. (Oscar's variant, but with for _ in 1:5; crash(); end and -t 3).

However, that's the segfault. Hangs can be provoked even in 1.7 by calling crash() in a longer loop with more threads. It might be separate issues.

@oscardssmith
Copy link
Member

oscardssmith commented Jul 31, 2024

I wonder if the problem is a fault of the original code. if there is a datarace it might be constructing a bool with an illegal value...

@vtjnash
Copy link
Member

vtjnash commented Jul 31, 2024

My first attempt at looking at this hangs here because of a badly implemented spinlock / Channel:

getproperty at ./Base.jl:49 [inlined]
search at /home/vtjnash/julia/zipyard/I54720.jl:121 [inlined]
selectExecutableEvent at /home/vtjnash/julia/zipyard/I54720.jl:241 [inlined]
runExecutableEvents at /home/vtjnash/julia/zipyard/I54720.jl:232

You need a Condition variable there (or a least a GC.safepoint()) to ensure it can make forward progress without being expected to hang

@vtjnash
Copy link
Member

vtjnash commented Jul 31, 2024

As for the segfault, it looks like codegen is failing to emit a write-multi-barrier in

julia> code_llvm(push!, (LockFreeSortedLinkedList{Vector{Any}}, Vector{Any}), optimize=false)

which just has an empty block

xchg_wb:                                          ; preds = %done_xchg
    br label %done_xchg_wb

resulting in a gc of the values in the linked list before they are used

@vtjnash vtjnash added compiler:codegen Generation of LLVM IR and native code and removed bisect wanted labels Jul 31, 2024
vtjnash added a commit that referenced this issue Aug 1, 2024
Due to limitations in the LLVM implementation, we are forced to emit
fairly bad code here. But we need to make sure it is still correct with
respect to GC rooting.

Fixes #54720
vtjnash added a commit that referenced this issue Aug 1, 2024
Due to limitations in the LLVM implementation, we are forced to emit
fairly bad code here. But we need to make sure it is still correct with
respect to GC rooting.

Fixes #54720
vtjnash added a commit that referenced this issue Aug 3, 2024
Due to limitations in the LLVM implementation, we are forced to emit
fairly bad code here. But we need to make sure it is still correct with
respect to GC rooting.

The PR 50833c8 also changed the meaning
of haspadding without changing all of the existing users to use the new
equivalent flag (haspadding || !isbitsegal), incurring additional
breakage here as well and needing more tests for that.

Fixes #54720
vtjnash added a commit that referenced this issue Aug 3, 2024
Due to limitations in the LLVM implementation, we are forced to emit
fairly bad code here. But we need to make sure it is still correct with
respect to GC rooting.

The PR 50833c8 also changed the meaning
of haspadding without changing all of the existing users to use the new
equivalent flag (haspadding || !isbitsegal), incurring additional
breakage here as well and needing more tests for that.

Fixes #54720
vtjnash added a commit that referenced this issue Aug 3, 2024
Due to limitations in the LLVM implementation, we are forced to emit
fairly bad code here. But we need to make sure it is still correct with
respect to GC rooting.

The PR 50833c8 also changed the meaning
of haspadding without changing all of the existing users to use the new
equivalent flag (haspadding || !isbitsegal), incurring additional
breakage here as well and needing more tests for that.

Fixes #54720
vtjnash added a commit that referenced this issue Aug 7, 2024
Due to limitations in the LLVM implementation, we are forced to emit
fairly bad code here. But we need to make sure it is still correct with
respect to GC rooting.

Fixes #54720
lazarusA pushed a commit to lazarusA/julia that referenced this issue Aug 17, 2024
Due to limitations in the LLVM implementation, we are forced to emit
fairly bad code here. But we need to make sure it is still correct with
respect to GC rooting.

The PR 50833c8 also changed the meaning
of haspadding without changing all of the existing users to use the new
equivalent flag (haspadding || !isbitsegal), incurring additional
breakage here as well and needing more tests for that.

Fixes JuliaLang#54720
vtjnash added a commit that referenced this issue Aug 19, 2024
Due to limitations in the LLVM implementation, we are forced to emit
fairly bad code here. But we need to make sure it is still correct with
respect to GC rooting.

The PR 50833c8 also changed the meaning
of haspadding without changing all of the existing users to use the new
equivalent flag (haspadding || !isbitsegal), incurring additional
breakage here as well and needing more tests for that.

Fixes #54720

(cherry picked from commit 8bfef8f)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
atomics bug Indicates an unexpected problem or unintended behavior compiler:codegen Generation of LLVM IR and native code multithreading Base.Threads and related functionality regression 1.12 Regression in the 1.12 release
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants