
Use IController in implicit restart test for deterministic behavior #2839

Closed

ChrisRackauckas-Claude wants to merge 1 commit into trixi-framework:main from ChrisRackauckas-Claude:fix-implicit-restart-controller

Conversation

@ChrisRackauckas-Claude

Summary

The implicit sparse Jacobian restart test (elixir_advection_implicit_sparse_jacobian_restart.jl) broke with OrdinaryDiffEqCore v3.10 because of a new first-step acceleration feature (SciML/OrdinaryDiffEq.jl#3101).

The test was already fragile for two reasons:

  1. PIController has memory (qold) — the previous error estimate ratio isn't saved/restored in checkpoints, so the restarted controller state differs from the continuous run. It only worked because the problem is trivial enough that the error estimate was near zero, making qmax the binding constraint and qold irrelevant.
  2. First-step acceleration — OrdinaryDiffEqCore v3.10 added qmax_first_step=10000 (matching Sundials CVODE behavior), which allows 10000x step growth on the first step. Since a restarted integrator has success_iter=0, it gets this acceleration, diverging from the continuous run.
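The two failure modes above can be illustrated with a toy step-size controller. This is a hedged sketch, not the OrdinaryDiffEq implementation: the exponents follow the classic Gustafsson/Hairer PI form, and `beta1`, `beta2`, `k`, and the error values are made up for illustration.

```python
# Minimal sketch (NOT OrdinaryDiffEq code): why a PI controller's proposed
# step growth depends on state (qold) that a checkpoint may not save,
# while an I controller's does not.

def i_factor(err, k=2):
    # Memoryless I control: the growth factor is a pure function of the
    # current scaled error estimate.
    return err ** (-1.0 / (k + 1))

def pi_factor(err, err_old, beta1=0.7, beta2=-0.4, k=2):
    # PI control: the growth factor also involves the previous error
    # estimate, i.e. controller memory.
    return err ** (-beta1 / (k + 1)) * err_old ** (beta2 / (k + 1))

err_now = 0.5                  # same current error estimate in both runs
err_prev_continuous = 0.25     # memory carried by the uninterrupted run
err_prev_restarted = 1.0       # fresh integrator: memory reset to a default

# I control: restart and continuation propose identical step growth.
assert i_factor(err_now) == i_factor(err_now)

# PI control: the restarted run proposes a different factor than the
# continuous run, so the step-size sequences diverge after a restart.
assert pi_factor(err_now, err_prev_restarted) != pi_factor(err_now, err_prev_continuous)
```

Because the I controller's factor depends only on the current error estimate, a restarted integrator with the same state and dt reproduces the continuous run's step sizes; a PI controller cannot, unless qold is also checkpointed.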

Fix

Switch the restart elixir's solver to use NewIController(alg, qmax=10, qmax_first_step=10):

  • IController is memoryless — no dependency on previous error estimates, so restart produces deterministic step sizes
  • qmax_first_step = qmax — disables first-step acceleration so the restart behaves identically to a continuation

Reference values updated to match the new controller.
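The effect of qmax_first_step can likewise be sketched as a simple clamp on the proposed growth factor. This is illustrative only; the parameter names mirror the PR description, not actual OrdinaryDiffEqCore internals.

```python
# Illustrative sketch: the proposed step-growth factor q is clamped by
# qmax, except on the first step (success_iter == 0), where
# qmax_first_step applies instead.

def clamp_growth(q, success_iter, qmax=10.0, qmax_first_step=10.0):
    limit = qmax_first_step if success_iter == 0 else qmax
    return min(q, limit)

q_proposed = 500.0  # a near-zero error estimate proposes huge growth

# With a Sundials-style qmax_first_step=10000, a restarted integrator
# (success_iter == 0) accepts the full growth, while the continuous run,
# already past its first step, is clamped to qmax:
assert clamp_growth(q_proposed, success_iter=0, qmax_first_step=10000.0) == 500.0
assert clamp_growth(q_proposed, success_iter=7) == 10.0

# Setting qmax_first_step == qmax makes restart and continuation agree:
assert clamp_growth(q_proposed, success_iter=0) == clamp_growth(q_proposed, success_iter=7)
```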

Test plan

  • Both restart test variants (with and without colorvec) pass locally with updated reference values
  • CI passes

🤖 Generated with Claude Code

Co-Authored-By: Chris Rackauckas accounts@chrisrackauckas.com

The implicit sparse Jacobian restart test was fragile because:
1. The default PIController has memory (qold) that isn't saved/restored
   in checkpoints, so restart step sizes depend on controller state
2. OrdinaryDiffEqCore v3.10 added first-step acceleration (qmax=10000)
   which changed the restart trajectory since fresh integrators have
   success_iter=0

Switch to NewIController with qmax_first_step=qmax=10 which:
- Has no memory of previous error estimates (memoryless I-control)
- Disables first-step acceleration so restarts behave identically
  to continuation

Reference values updated accordingly.

Fixes the test regression from SciML/OrdinaryDiffEq.jl#3101.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
@github-actions
Contributor

github-actions bot commented Mar 1, 2026

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

  • The PR has a single goal that is clear from the PR title and/or description.
  • All code changes represent a single set of modifications that logically belong together.
  • No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

  • The code can be understood easily.
  • Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
  • There are no redundancies that can be removed by simple modularization/refactoring.
  • There are no leftover debug statements or commented code sections.
  • The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

  • New functions and types are documented with a docstring or top-level comment.
  • Relevant publications are referenced in docstrings (see example for formatting).
  • Inline comments are used to document longer or unusual code sections.
  • Comments describe intent ("why?") and not just functionality ("what?").
  • If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.

Testing

  • The PR passes all tests.
  • New or modified lines of code are covered by tests.
  • New or modified tests run in less than 10 seconds.

Performance

  • There are no type instabilities or memory allocations in performance-critical parts.
  • If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

  • The correctness of the code was verified using appropriate tests.
  • If new equations/methods are added, a convergence test has been run and the results
    are posted in the PR.

Created with ❤️ by the Trixi.jl community.

@codecov

codecov bot commented Mar 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.13%. Comparing base (8ce8682) to head (a54d28c).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2839   +/-   ##
=======================================
  Coverage   97.13%   97.13%           
=======================================
  Files         596      596           
  Lines       45962    45962           
=======================================
  Hits        44642    44642           
  Misses       1320     1320           
| Flag | Coverage | Δ |
| --- | --- | --- |
| unittests | 97.13% <ø> | (ø) |


@JoshuaLampert
Member

Thanks! So this fixes the failing test from the restart example, but there is also another failing test, which does not use the restart functionality, see https://github.com/trixi-framework/Trixi.jl/actions/runs/22542181717/job/65300052174?pr=2839#step:7:8507. This test runs this example file: https://github.com/trixi-framework/Trixi.jl/blob/8ce86828611783942b2a4518df908f1cb434a218/examples/p4est_2d_dgsem/elixir_euler_supersonic_cylinder_scO2.jl.

@ChrisRackauckas-Claude
Author

Investigation: elixir_euler_supersonic_cylinder_scO2.jl test failure

This failure is not caused by this PR — it's a pre-existing issue on main caused by OrdinaryDiffEqCore v3.10 introducing qmax_first_step=10000 (matching Sundials CVODE behavior), which allows 10000x step growth on the first step. This changes the adaptive stepping trajectory, which triggers different AMR decisions, producing a different (but equally valid) numerical solution.

Root cause confirmation

Running the elixir at tspan=(0.0, 0.001) with the test's default tolerances (abstol=1e-6, reltol=1e-3):

| OrdinaryDiffEqCore | L2[rho] | Matches test reference? |
| --- | --- | --- |
| v3.9 (old, no first-step acceleration) | 0.02931403129205... | Yes, to ~1e-16 |
| v3.11 (new, with qmax_first_step=10000) | 0.02922... | No, ~0.3% different |

The old ODECore reproduces the hardcoded reference values to machine precision. The new ODECore gives a different trajectory through the AMR decisions, producing ~0.3–8% differences in L2/Linf norms.

The real problem: the test tolerance is absurdly tight

The test asserts rtol=1.5e-8 on the solution norms, but the time integration error at the default tolerances is ~1–3%. I ran a convergence study sweeping abstol=reltol from 1e-3 to 1e-8 to measure actual solution accuracy. Using the tightest tolerance (1e-8) as a reference:

Old OrdinaryDiffEqCore (v3.9):

| tol | L2[rho] | Error vs ref | Steps |
| --- | --- | --- | --- |
| 1e-3 | 0.02228 | 21.5% | 15 |
| 1e-4 | 0.02765 | 2.6% | 54 |
| 1e-5 | 0.02926 | 3.1% | 141 |
| 1e-6 (test default) | 0.02875 | 1.3% | 399 |
| 1e-7 | 0.02840 | 0.06% | 940 |
| 1e-8 (ref) | 0.02838 | n/a | 2204 |

New OrdinaryDiffEqCore (v3.11):

| tol | L2[rho] | Error vs ref | Steps |
| --- | --- | --- | --- |
| 1e-3 | 0.02291 | 19.3% | 14 |
| 1e-4 | 0.02747 | 3.2% | 49 |
| 1e-5 | 0.02901 | 2.2% | 131 |
| 1e-6 (test default) | 0.02922 | 3.0% | 385 |
| 1e-7 | 0.02838 | 0.03% | 910 |
| 1e-8 (ref) | 0.02838 | n/a | 2268 |

Both old and new ODECore converge to the same solution (L2[rho] ≈ 0.02838) at tight tolerances, confirming the new ODECore is correct. The ~1–3% time integration error at tol=1e-6 means the test is asserting 8+ digits of agreement (rtol=1.5e-8) on a solution that only has ~2 digits of accuracy. Any change to the stepping trajectory (like qmax_first_step) that stays within the time integration error is equally valid.
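As a sanity check on the magnitudes involved, the drift between the two L2[rho] values in the root-cause table can be compared directly against the test's rtol:

```python
# Quick arithmetic check of the claim above, using the L2[rho] values from
# the root-cause table (old v3.9 vs new v3.11 ODECore at the default 1e-6):
import math

l2_old = 0.02931403129205   # v3.9, matches the hardcoded reference
l2_new = 0.02922            # v3.11, truncated value from the table

rel_diff = abs(l2_old - l2_new) / l2_old   # observed drift, roughly 3e-3
test_rtol = 1.5e-8                         # what the test asserts

# The observed drift sits several orders of magnitude above the test rtol,
# yet within the ~1-3% time integration error at these solver tolerances.
orders_above = math.log10(rel_diff / test_rtol)
print(f"rel_diff = {rel_diff:.2e}, {orders_above:.1f} orders above test rtol")
```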

Recommendation

The test reference values should either be updated for the new ODECore, or (better) the test rtol should be relaxed to something consistent with the actual solution accuracy — e.g., rtol=0.05 (5%).

Reproduction script

convergence_study.jl:
# Convergence study: run the supersonic cylinder scO2 elixir at different tolerances

using Pkg
Pkg.activate(mktempdir())
Pkg.develop(path="Trixi.jl")
Pkg.add("OrdinaryDiffEqCore")  # latest (with first-step acceleration)
Pkg.add("OrdinaryDiffEqSSPRK")

using Trixi
using OrdinaryDiffEqSSPRK
using Printf

elixir = joinpath(@__DIR__, "elixir_scO2_tol.jl")

tolerances = [1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8]

# Store results to a global that survives world age changes
const RESULTS = Dict{Float64, Any}()

for tol in tolerances
    @printf("\n\n>>> Running with abstol=%.0e, reltol=%.0e\n", tol, tol)

    @eval begin
        global MY_ABSTOL = $tol
        global MY_RELTOL = $tol
        trixi_include($elixir, tspan=(0.0, 0.001))
        l2_vals = analysis_callback(sol).l2
        linf_vals = analysis_callback(sol).linf
        RESULTS[$tol] = (l2=copy(l2_vals), linf=copy(linf_vals),
                         nsteps=sol.stats.naccept, ndofs=length(sol.u[end]) ÷ 4)
    end
end

println("\n\n" * "="^120)
println("CONVERGENCE STUDY RESULTS")
println("="^120)
println()
@printf("%-10s  %-10s  %-6s  %-22s  %-22s  %-22s  %-22s\n",
        "tol", "#DOFs", "steps", "L2[rho]", "L2[rho_v1]", "L2[rho_v2]", "L2[rho_e]")
println("-"^120)
for tol in tolerances
    r = RESULTS[tol]
    @printf("%.0e    %-10d  %-6d  %.15e  %.15e  %.15e  %.15e\n",
            tol, r.ndofs, r.nsteps, r.l2[1], r.l2[2], r.l2[3], r.l2[4])
end
elixir_scO2_tol.jl — modified elixir that reads MY_ABSTOL/MY_RELTOL globals:
# Channel flow around a cylinder at Mach 3
# (Modified version that uses global MY_ABSTOL, MY_RELTOL for convergence studies)

using OrdinaryDiffEqSSPRK
using Trixi

equations = CompressibleEulerEquations2D(1.4)

@inline function initial_condition_mach3_flow(x, t, equations::CompressibleEulerEquations2D)
    rho_freestream = 1.4
    v1 = 3.0
    v2 = 0.0
    p_freestream = 1.0
    prim = SVector(rho_freestream, v1, v2, p_freestream)
    return prim2cons(prim, equations)
end

initial_condition = initial_condition_mach3_flow

@inline function boundary_condition_supersonic_inflow(u_inner,
                                                      normal_direction::AbstractVector,
                                                      x, t, surface_flux_function,
                                                      equations::CompressibleEulerEquations2D)
    u_boundary = initial_condition_mach3_flow(x, t, equations)
    return flux(u_boundary, normal_direction, equations)
end

@inline function boundary_condition_outflow(u_inner, normal_direction::AbstractVector, x, t,
                                            surface_flux_function,
                                            equations::CompressibleEulerEquations2D)
    return flux(u_inner, normal_direction, equations)
end

boundary_conditions = (; Bottom = boundary_condition_slip_wall,
                       Circle = boundary_condition_slip_wall,
                       Top = boundary_condition_slip_wall,
                       Right = boundary_condition_outflow,
                       Left = boundary_condition_supersonic_inflow)

volume_flux = flux_ranocha_turbo
surface_flux = flux_lax_friedrichs

polydeg = 3
basis = LobattoLegendreBasis(polydeg)
shock_indicator = IndicatorHennemannGassner(equations, basis,
                                            alpha_max = 0.5,
                                            alpha_min = 0.001,
                                            alpha_smooth = true,
                                            variable = density_pressure)
volume_integral = VolumeIntegralShockCapturingRRG(basis, shock_indicator;
                                                  volume_flux_dg = volume_flux,
                                                  volume_flux_fv = surface_flux,
                                                  slope_limiter = minmod)

solver = DGSEM(polydeg = polydeg, surface_flux = surface_flux,
               volume_integral = volume_integral)

mesh_file = Trixi.download("https://gist.githubusercontent.com/andrewwinters5000/a08f78f6b185b63c3baeff911a63f628/raw/addac716ea0541f588b9d2bd3f92f643eb27b88f/abaqus_cylinder_in_channel.inp",
                           joinpath(@__DIR__, "abaqus_cylinder_in_channel.inp"))

mesh = P4estMesh{2}(mesh_file)

semi = SemidiscretizationHyperbolic(mesh, equations, initial_condition, solver;
                                    boundary_conditions = boundary_conditions)

tspan = (0.0, 2.0)
ode = semidiscretize(semi, tspan)

summary_callback = SummaryCallback()
analysis_interval = 1000
analysis_callback = AnalysisCallback(semi, interval = analysis_interval)
alive_callback = AliveCallback(analysis_interval = analysis_interval)
save_solution = SaveSolutionCallback(interval = 5000,
                                     save_initial_solution = false,
                                     save_final_solution = true,
                                     solution_variables = cons2prim)

amr_indicator = IndicatorLöhner(semi, variable = Trixi.density)
amr_controller = ControllerThreeLevel(semi, amr_indicator,
                                      base_level = 0,
                                      med_level = 3, med_threshold = 0.05,
                                      max_level = 5, max_threshold = 0.1)
amr_callback = AMRCallback(semi, amr_controller,
                           interval = 2,
                           adapt_initial_condition = true,
                           adapt_initial_condition_only_refine = true)

callbacks = CallbackSet(summary_callback,
                        analysis_callback, alive_callback,
                        save_solution,
                        amr_callback)

stage_limiter! = PositivityPreservingLimiterZhangShu(thresholds = (5.0e-7, 1.0e-6),
                                                     variables = (pressure, Trixi.density))

dt0 = 1e-8
sol = solve(ode, SSPRK43(stage_limiter! = stage_limiter!, thread = Trixi.True());
            dt = dt0, save_everystep = false,
            abstol = MY_ABSTOL, reltol = MY_RELTOL,
            callback = callbacks, maxiters = 1_000_000);

@ChrisRackauckas

That result is pretty self-explanatory. It's very clear the test as written isn't very meaningful, because it expects 6 orders of magnitude lower error than what you're actually getting from the solution. Which of the following options would you prefer?

  1. Make the test tolerance something reasonable based on the true error?
  2. Decrease the solver tolerances so that the chosen test tolerance works?
  3. Update the reference solution to the new value (i.e. treat it as a regression test, not a correctness test)

It can be reasonable to have a test like this that asks "did anything change?", but the test as written isn't asking "are things correct?"; it's asking "did anything change?". Those are two very different questions, and here is a case where they give different answers. I think it's totally fine to have regression tests to check whether anything changed unexpectedly (OrdinaryDiffEq.jl has a few of them), but a drift in a non-correctness regression test is not an indicator that something is wrong; it is a sign that further analysis is needed. And here the further analysis was very straightforward and clear, because it's not borderline: the test is asking for 6 orders of magnitude tighter agreement than what you're actually getting against the true solution, so obviously that's going to drift even for small changes in CPU choice, SIMD, threading, etc.

I'm happy to help isolate these things, because yes, these kinds of regression tests are good for double-checking. But please do the double check before jumping to the conclusion that something is wrong and the world is on fire. What we can see from this is that, once it was looked into, there wasn't anything wrong: both failures were test issues, and test issues that led to better code in your repo!

@JoshuaLampert
Member

I never said the new version in OrdinaryDiffEqCore.jl is wrong and that our previous reference value is the "true" solution. As you say, we have these tests to see if something changes. If the changes were on the order of the time integration tolerances, we would not complain (in fact this is sometimes the case, and we did not complain), but the difference is not something like 1e-3 or smaller, but rather 0.5:

   Evaluated: isapprox(7.397166897133932, 6.8750812520355655; atol = 1.1102230246251565e-13, rtol = 1.4901161193847656e-8)
julia> 7.397166897133932 - 6.8750812520355655
0.5220856450983664

julia> 7.397166897133932/6.8750812520355655
1.0759388327147097

I have never before seen such a big change that was explained solely by solver tolerances. But if you say this is still within the tolerance, then it's fine and we will adapt our tests. I just didn't expect it when I saw this huge difference in the test result.
Regarding your three suggestions, we will probably go with 2. or 3., because 1. does not seem appropriate for this case. We will not set, e.g., an absolute tolerance of more than 0.5; that would only mean we miss differences we would like to know about.

@ChrisRackauckas-Claude
Author

Investigation: Can lowering solver tolerance achieve rtol=1.5e-8 reproducibility?

Following up on the suggestion to go with option 2 (decrease solver tolerances), I ran two experiments to test whether this is viable.

Experiment 1: Convergence study (single ODECore, sweeping tolerances)

Running the elixir at tspan=(0.0, 0.001) with abstol=reltol=tol on the current ODECore v3.11, comparing consecutive tolerance levels:

| tol | Steps | L2[rho] | Max rel_diff vs next tighter tol |
| --- | --- | --- | --- |
| 1e-6 | 385 | 0.029223159337780365 | 1.9e-2 |
| 1e-7 | 899 | 0.02866922858136196 | 2.9e-3 |
| 5e-8 | 1217 | 0.028587446274307562 | 3.6e-3 |
| 2e-8 | 1749 | 0.028485133731793903 | 4.6e-4 |
| 1e-8 | 2278 | 0.028471921125559973 | 1.7e-3 |
| 5e-9 | 2867 | 0.028423664475640067 | 7.6e-4 |
| 2e-9 | 4097 | 0.02844530272496351 | 8.5e-4 |
| 1e-9 | 5187 | 0.028421004274159126 | n/a |

None of the consecutive pairs meet rtol=1.5e-8. The norms oscillate rather than converge smoothly because AMR decisions are discrete (refine-or-not), so small changes in the solution can trigger different mesh topologies.

Experiment 2: Direct comparison of old vs new stepping behavior

This is the key test. At each solver tolerance, I compared qmax_first_step=6 (simulating old ODECore behavior) vs qmax_first_step=10000 (new behavior) using NewPIController on the same ODECore v3.11:

| Solver tol | Steps (old/new) | Max rel_diff (L2) | Max rel_diff (Linf) |
| --- | --- | --- | --- |
| 1e-6 | 402 / 392 | 3.6e-2 (rho_v2) | 8.0e-2 (rho) |
| 1e-7 | 904 / 897 | 1.3e-3 (rho_v2) | 1.2e-2 (rho) |
| 1e-8 | 2278 / 2269 | 4.5e-4 (rho_v2) | 4.8e-4 (rho_e) |
| 1e-9 | 5202 / 5215 | 7.3e-4 (rho_v1) | 5.8e-3 (rho) |

Even at tol=1e-9 (5200 steps, ~13x slower than the current 385), the max relative difference is 5.8e-3 — still 5 orders of magnitude above the test rtol of 1.5e-8.

Going from tol=1e-8 to tol=1e-9, several Linf components actually increased their disagreement (Linf[rho] went from 5.6e-6 to 5.8e-3). Different tolerances cause different cells to cross the AMR threshold, producing fundamentally different meshes.

Conclusion

Option 2 (lower solver tolerance to hit test rtol=1.5e-8) is not viable for this AMR problem. The discrete nature of AMR decisions creates O(1e-3) jumps in solution norms that cannot be eliminated by tightening the time integrator tolerance. No solver tolerance will make this test robust to stepping algorithm changes at rtol=1.5e-8.

Viable alternatives:

  • Option 3 (update reference values): Update the hardcoded L2/Linf values for the new ODECore. Future stepping changes may require another update.
  • Relax the test rtol: e.g. rtol=0.05 matches the actual solution accuracy at the default solver tolerances.
  • Hybrid: Tighten solver to abstol=reltol=1e-8 (6x slower, ~2278 steps), update reference values, and set test rtol=1e-3. This gives a meaningful correctness check (values within 0.1% of converged solution) while being robust to minor stepping changes.

@ranocha
Member

Let us please not use an IController (or NewIController), since it will not satisfy our requirements of step size control stability.
The tests here are regression tests. We do not want them to use huge tolerances, since we would otherwise not catch errors we would like to know about, as Joshua wrote.
We should likely update our reference values for the new implementation after checking whether the full simulation results are reasonable.

@DanielDoehring
Member

Xref #2838

@JoshuaLampert
Member

Superseded by #2838.

