-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
staticdata: fix assert from partially disabled native code #53439
Conversation
@@ -3768,8 +3768,11 @@ JL_DLLEXPORT jl_value_t *jl_restore_package_image_from_file(const char *fname, j | |||
|
|||
jl_image_t pkgimage = jl_init_processor_pkgimg(pkgimg_handle); | |||
|
|||
if (ignore_native){ | |||
memset(&pkgimage.fptrs, 0, sizeof(pkgimage.fptrs)); | |||
if (ignore_native) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just remove ignore_native
and move this instead into _tryrequire_from_serialized
here
Line 1805 in a96726b
# then load the file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do cleanup of ignore_native
later, but this is the minimal change for backporting.
// The easiest way to do that is to disable it in all of them. | ||
jl_options.use_sysimage_native_code = JL_OPTIONS_USE_SYSIMAGE_NATIVE_CODE_NO; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't let me hold this up, but I don't quite understand what this does and how setting a sysimage flag relates to pkgimage usage.. Will this still allow pkgimages to be used by any packages, most importantly stdlibs like Pkg which are dead slow without pkgimages..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This essentially turns off native code for all images loaded after this moment (including the current one)
In #53373 this option was fixed for pkgimages (where when the sysimage is loaded without native code, pkgimages can use native code either)
IIUC we load the sysimg and all dependency with native code enabled and once we see the first cache file with ignore_native all subsequent cache files can't use native code, since they might point to images that were loaded without native code.
We could be less discriminate if we could prove that there is no overlap in the cached code.
As an example of a slightly problematic issue it is now order dependent if we use native code or not.
using Test # with native code
using MyPackage # disables native code
But
using MyPackage # disables native code
using Test # without native code
Since we can't know that a code instance is not duplicated between MyPackage and Test, and it depends on which one we load first.
So this generally means that we might ignore the native code of test dependencies, but we will use the native code of actual dependencies.
I'm seeing a 4–5x regression in package CI times from this PR. I CI ran on nightly in 1:27, 1:58, 2:01 on Ubuntu, Mac, Windows nightly Julia Version 1.12.0-DEV.63 Commit 923fe2d (2024-02-23 08:19 UTC) The next commit to my repo (which was an NFC commit) ran on Julia Version 1.12.0-DEV.64 Commit 6cbed31 (2024-02-23 16:16 UTC) in 7:22, 7:35, 7:53. All subsequent nightly CI runs on my repo are slow, and all non-nightly CI runs are fast (both before and after that commit). |
Moving the main package load to the end of the loads doesn't fix this regression. Are we sure that this doesn't propagate to deps also? |
Pkg.jl tests all timed out when running on this commit Julia Version 1.12.0-DEV.64 Commit 6cbed31 (2024-02-23 16:16 UTC) https://github.com/JuliaLang/Pkg.jl/actions/runs/8023442598/job/21919915633 Despite succeeding on the same package content in the PR's checks before merge on the commit Julia Version 1.12.0-DEV.63 Commit 923fe2d (2024-02-23 08:19 UTC) https://github.com/JuliaLang/Pkg.jl/actions/runs/8022722320/job/21917673112 This is neither spurious nor confined to a single package. |
yes, we are pretty sure it cannot affects things in the past |
That said, some packages spawn a lot of subprocesses, and this probably could affect those |
I can reproduce the performance regression (15 seconds -> 6 mins) locally when deleting all non-aqua tests. Here is a stack frame from interrupting it (Chairmarks) pkg> test
Testing Chairmarks
Status `/tmp/jl_ytpeXk/Project.toml`
[4c88cf16] Aqua v0.8.4
[0ca39b1e] Chairmarks v0.2.1 `~/.julia/dev/Chairmarks`
[10745b16] Statistics v1.11.1
[44cfe95a] Pkg v1.11.0
[de0858da] Printf v1.11.0
[8dfed614] Test v1.11.0
Status `/tmp/jl_ytpeXk/Manifest.toml`
[4c88cf16] Aqua v0.8.4
[0ca39b1e] Chairmarks v0.2.1 `~/.julia/dev/Chairmarks`
[34da2185] Compat v4.13.0
[10745b16] Statistics v1.11.1
[0dad84c5] ArgTools v1.1.2
[56f22d72] Artifacts v1.11.0
[2a0f44e3] Base64 v1.11.0
[ade2ca70] Dates v1.11.0
[f43a241f] Downloads v1.6.0
[7b1f6079] FileWatching v1.11.0
[b77e0a4c] InteractiveUtils v1.11.0
[b27032c2] LibCURL v0.6.4
[76f85450] LibGit2 v1.11.0
[8f399da3] Libdl v1.11.0
[37e2e46d] LinearAlgebra v1.11.0
[56ddb016] Logging v1.11.0
[d6f4376e] Markdown v1.11.0
[ca575930] NetworkOptions v1.2.0
[44cfe95a] Pkg v1.11.0
[de0858da] Printf v1.11.0
[9a3f8284] Random v1.11.0
[ea8e919c] SHA v0.7.0
[9e88b42a] Serialization v1.11.0
[f489334b] StyledStrings v1.11.0
[fa267f1f] TOML v1.0.3
[a4e569a6] Tar v1.10.0
[8dfed614] Test v1.11.0
[cf7118a7] UUIDs v1.11.0
[4ec0a83e] Unicode v1.11.0
[e66e0078] CompilerSupportLibraries_jll v1.1.0+0
[deac9b47] LibCURL_jll v8.6.0+0
[e37daf67] LibGit2_jll v1.7.1+0
[29816b5a] LibSSH2_jll v1.11.0+1
[c8ffd9c3] MbedTLS_jll v2.28.6+0
[14a3606d] MozillaCACerts_jll v2023.1.10
[4536629a] OpenBLAS_jll v0.3.26+2
[83775a58] Zlib_jll v1.2.13+1
[8e850b90] libblastrampoline_jll v5.8.0+1
[8e850ede] nghttp2_jll v1.59.0+0
[3f19e933] p7zip_jll v17.4.0+2
^C┌ Warning: Process I/O error
│ exception =
│ InterruptException:
│ Stacktrace:
│ [1] sigatomic_end
│ @ ./c.jl:146 [inlined]
│ [2] uv_write(s::Base.TTY, p::Ptr{UInt8}, n::UInt64)
│ @ Base ./stream.jl:1065
│ [3] unsafe_write(s::Base.TTY, p::Ptr{UInt8}, n::UInt64)
│ @ Base ./stream.jl:1141
│ [4] unsafe_write
│ @ ./io.jl:801 [inlined]
│ [5] unsafe_write(s::Base.TTY, p::Base.RefValue{UInt8}, n::Int64)
│ @ Base ./io.jl:799
│ [6] write
│ @ ./io.jl:805 [inlined]
│ [7] write(s::Base.TTY, b::UInt8)
│ @ Base ./stream.jl:1179
│ [8] write(io::Pkg.UnstableIO, b::UInt8)
│ @ Pkg ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/src/Pkg.jl:46
│ [9] unsafe_write
│ @ ./io.jl:318 [inlined]
│ [10] unsafe_write
│ @ ./io.jl:801 [inlined]
│ [11] write
│ @ ./io.jl:835 [inlined]
│ [12] write(to::Pkg.UnstableIO, from::Base.PipeEndpoint)
│ @ Base ./io.jl:883
│ [13] (::Base.var"#879#880"{Base.PipeEndpoint, Pkg.UnstableIO, Base.PipeEndpoint})()
│ @ Base ./process.jl:313
└ @ Base process.jl:315
ERROR: TaskFailedException
Stacktrace:
[1] wait(t::Task)
@ Base ./task.jl:363
[2] wait(x::Base.Process, syncd::Bool)
@ Base ./process.jl:705
[3] wait(x::Base.Process)
@ Base ./process.jl:687
[4] subprocess_handler(cmd::Cmd, ctx::Pkg.Types.Context, sandbox_ctx::Pkg.Types.Context, error_msg::String)
@ Pkg.Operations ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/src/Operations.jl:2077
[5] (::Pkg.Operations.var"#132#136"{Bool, Cmd, Cmd, Nothing, Pkg.Types.Context, Vector{Tuple{String, Base.Process}}, String, Pkg.Types.PackageSpec})()
@ Pkg.Operations ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/src/Operations.jl:2008
[6] withenv(::Pkg.Operations.var"#132#136"{Bool, Cmd, Cmd, Nothing, Pkg.Types.Context, Vector{Tuple{String, Base.Process}}, String, Pkg.Types.PackageSpec}, ::Pair{String, String}, ::Vararg{Pair{String}})
@ Base ./env.jl:265
[7] (::Pkg.Operations.var"#119#124"{String, Bool, Bool, Bool, Pkg.Operations.var"#132#136"{Bool, Cmd, Cmd, Nothing, Pkg.Types.Context, Vector{Tuple{String, Base.Process}}, String, Pkg.Types.PackageSpec}, Pkg.Types.PackageSpec})()
@ Pkg.Operations ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/src/Operations.jl:1869
[8] with_temp_env(fn::Pkg.Operations.var"#119#124"{String, Bool, Bool, Bool, Pkg.Operations.var"#132#136"{Bool, Cmd, Cmd, Nothing, Pkg.Types.Context, Vector{Tuple{String, Base.Process}}, String, Pkg.Types.PackageSpec}, Pkg.Types.PackageSpec}, temp_env::String)
@ Pkg.Operations ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/src/Operations.jl:1741
[9] (::Pkg.Operations.var"#117#122"{Dict{String, Any}, Bool, Bool, Bool, Pkg.Operations.var"#132#136"{Bool, Cmd, Cmd, Nothing, Pkg.Types.Context, Vector{Tuple{String, Base.Process}}, String, Pkg.Types.PackageSpec}, Pkg.Types.Context, Pkg.Types.PackageSpec, String, String})(tmp::String)
@ Pkg.Operations ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/src/Operations.jl:1839
[10] mktempdir(fn::Pkg.Operations.var"#117#122"{Dict{String, Any}, Bool, Bool, Bool, Pkg.Operations.var"#132#136"{Bool, Cmd, Cmd, Nothing, Pkg.Types.Context, Vector{Tuple{String, Base.Process}}, String, Pkg.Types.PackageSpec}, Pkg.Types.Context, Pkg.Types.PackageSpec, String, String}, parent::String; prefix::String)
@ Base.Filesystem ./file.jl:819
[11] mktempdir(fn::Function, parent::String)
@ Base.Filesystem ./file.jl:815
[12] mktempdir
@ ./file.jl:815 [inlined]
[13] #sandbox#116
@ ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/src/Operations.jl:1787 [inlined]
[14] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, julia_args::Cmd, test_args::Cmd, test_fn::Nothing, force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool)
@ Pkg.Operations ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/src/Operations.jl:2000
[15] test
@ ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/src/Operations.jl:1944 [inlined]
[16] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, test_fn::Nothing, julia_args::Cmd, test_args::Cmd, force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool, kwargs::@Kwargs{io::Pkg.UnstableIO})
@ Pkg.API ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/src/API.jl:443
[17] test(pkgs::Vector{Pkg.Types.PackageSpec}; io::Pkg.UnstableIO, kwargs::@Kwargs{})
@ Pkg.API ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/src/API.jl:158
[18] test(pkgs::Vector{Pkg.Types.PackageSpec})
@ Pkg.API ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/src/API.jl:147
[19] do_cmd(command::Pkg.REPLMode.Command, io::Base.TTY)
@ Pkg.REPLMode ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/src/REPLMode/REPLMode.jl:407
[20] do_cmds(commands::Vector{Pkg.REPLMode.Command}, io::Base.TTY)
@ Pkg.REPLMode ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/src/REPLMode/REPLMode.jl:393
[21] do_cmds(repl::REPL.LineEditREPL, commands::String)
@ REPLExt ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/ext/REPLExt/REPLExt.jl:92
[22] (::REPLExt.var"#31#34"{REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::IOBuffer, ok::Bool)
@ REPLExt ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/ext/REPLExt/REPLExt.jl:122
[23] #invokelatest#2
@ ./essentials.jl:1030 [inlined]
[24] invokelatest
@ ./essentials.jl:1027 [inlined]
[25] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
@ REPL.LineEdit ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/REPL/src/LineEdit.jl:2745
[26] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
@ REPL ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/REPL/src/REPL.jl:1408
[27] (::REPL.var"#75#81"{REPL.LineEditREPL, REPL.REPLBackendRef})()
@ REPL ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/REPL/src/REPL.jl:458
nested task error: InterruptException:
Stacktrace:
[1] sigatomic_end
@ ./c.jl:146 [inlined]
[2] uv_write(s::Base.TTY, p::Ptr{UInt8}, n::UInt64)
@ Base ./stream.jl:1065
[3] unsafe_write(s::Base.TTY, p::Ptr{UInt8}, n::UInt64)
@ Base ./stream.jl:1141
[4] unsafe_write
@ ./io.jl:801 [inlined]
[5] unsafe_write(s::Base.TTY, p::Base.RefValue{UInt8}, n::Int64)
@ Base ./io.jl:799
[6] write
@ ./io.jl:805 [inlined]
[7] write(s::Base.TTY, b::UInt8)
@ Base ./stream.jl:1179
[8] write(io::Pkg.UnstableIO, b::UInt8)
@ Pkg ~/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Pkg/src/Pkg.jl:46
[9] unsafe_write
@ ./io.jl:318 [inlined]
[10] unsafe_write
@ ./io.jl:801 [inlined]
[11] write
@ ./io.jl:835 [inlined]
[12] write(to::Pkg.UnstableIO, from::Base.PipeEndpoint)
@ Base ./io.jl:883
[13] (::Base.var"#879#880"{Base.PipeEndpoint, Pkg.UnstableIO, Base.PipeEndpoint})()
@ Base ./process.jl:313 |
Seems we can't just not propagate |
But we are propagating the coverage/alloc tracking option.. so we should be able to and let the subprocess figure out where native code should be allowed, like the parent process did? |
Yes, we could instead set this as an internal flag rather than changing jl_options so that it doesn't propagate |
I don't spawn any subprocesses in my CI, but I still see the regression going from Note that both runs are the exact same commit, just on a different branch since I rebased inbetween the two runs. The actual tests go from ~1m to ~6m.
|
One issue is likely that a common test pattern is
And that after The crux is that this is a correctness fix and we have observed crashes without it. We definitely shouldn't propagate the taint to launched processes, but currently we don't track enough information to continue loading native code caches without the risk of crashing after skipping on native cache for code coverage purposes. At worst the time is back to what it was before #52123 (slightly less since we don't need to regenerate files unnecessarily anymore) |
Does this also apply for anything other than LLVM? My package doesn't use it; is there a new guidance that I should do
now? Seems a bit odd, especially if there's substantial overlap between the dependency chains. For my particular example, the only addition for test-only dependencies is Aqua.jl and Statistics (though I guess I can do without it). The package in question is (soon-to-be-registered) Supposition.jl. |
Okay, but that still has to be weighed against the impact of this change. It's not reasonable to regress testing time by many x to fix a bug that has barely been seen in the ecosystem (only crash is CI with rr runner?). |
When we use options like code coverage, we can't use the native code present in the cache file since it is not instrumented. PR #52123 introduced the capability of skipping the native code during loading, but created the issue that subsequent packages could have an explicit or implicit dependency on the native code. PR #53439 tainted the current process by setting `use_sysimage_native_code`, but this flag is propagated to subprocesses and lead to a regression in test time. Move this to a process local flag to avoid the regression. In the future we might be able to change the calling convention for cross-image calls to `invoke(ci::CodeInstance, args...)` instead of `ci.fptr(args...)` to handle native code not being present.
We could argue about correctness vs performance or we could simply fix it
LLVM was just an example here.
The dependency chain of I think there is a better long-term fix, but not one that will make it in time for 1.11 and doesn't carry a larger risk with it. |
Nightly is currently significantly slower than 1.0, 1.6, and 1.10. I don't recall that being the case any time recently, so this is a regression of grater magnitude than any improvements in #52123 |
@LilithHafner my comment was with respect to @Seelengrab observation. The slowdowns you are seeing should be fixed by #53457 if they are not we need to do a deeper root-cause analysis. |
When we use options like code coverage, we can't use the native code present in the cache file since it is not instrumented. PR #52123 introduced the capability of skipping the native code during loading, but created the issue that subsequent packages could have an explicit or implicit dependency on the native code. PR #53439 tainted the current process by setting `use_sysimage_native_code`, but this flag is propagated to subprocesses and lead to a regression in test time. Move this to a process local flag to avoid the regression. In the future we might be able to change the calling convention for cross-image calls to `invoke(ci::CodeInstance, args...)` instead of `ci.fptr(args...)` to handle native code not being present. --------- Co-authored-by: Jameson Nash <[email protected]>
This should fix the assertion failure that has been plaguing the Pkg tests, as discussed in #52123 (comment) (cherry picked from commit 6cbed31)
I'm removing this from backports since even though it is more correct it is causing unexpected effects. Can be added back when everything is resolved. |
It's resolved from my end with #53457, which even improved overall CI time by 25%! Thank you very much for the quick turnaround on this! |
This should fix the assertion failure that has been plaguing the Pkg tests, as discussed in #52123 (comment) (cherry picked from commit 6cbed31)
When we use options like code coverage, we can't use the native code present in the cache file since it is not instrumented. PR #52123 introduced the capability of skipping the native code during loading, but created the issue that subsequent packages could have an explicit or implicit dependency on the native code. PR #53439 tainted the current process by setting `use_sysimage_native_code`, but this flag is propagated to subprocesses and lead to a regression in test time. Move this to a process local flag to avoid the regression. In the future we might be able to change the calling convention for cross-image calls to `invoke(ci::CodeInstance, args...)` instead of `ci.fptr(args...)` to handle native code not being present. --------- Co-authored-by: Jameson Nash <[email protected]> (cherry picked from commit b8a0a39)
Ok, thanks for checking, let's add those to then. |
Backported PRs: - [x] #53205 <!-- Profile: add notes to `print()` docs --> - [x] #53170 <!-- Remove outdated discussion about externally changing module bindings --> - [x] #53228 <!-- SubArray: avoid invalid elimination of singleton indices --> - [x] #51361 <!-- code_warntype docs: more neutral reference to @code_warntype --> - [x] #50480 <!-- Document --heap-size-hint in Command-line Interface --> - [x] #53301 <!-- Fix typo in `Sys.total_memory` docstring. --> - [x] #53354 <!-- fix code coverage bug in tail position and `else` --> - [x] #53388 <!-- Fix documentation: thread pool of main thread --> - [x] #53429 <!-- Subtype: skip slow-path in `local_∀_∃_subtype` if inputs contain no ∃ typevar. --> - [x] #53437 <!-- Add debug variant of loader_trampolines.o --> Need manual backport: - [ ] #52505 <!-- fix alignment of emit_unbox_store copy --> - [ ] #53373 <!-- fix sysimage-native-code=no option with pkgimages --> - [ ] #53439 <!-- staticdata: fix assert from partially disabled native code --> Contains multiple commits, manual intervention needed: - [ ] #52913 <!-- Added docstring for Artifacts.jl --> - [ ] #53218 <!-- Fix interpreter_exec.jl test --> Non-merged PRs with backport label: - [ ] #53424 <!-- yet more atomics & cache-line fixes on work-stealing queue --> - [ ] #53125 <!-- coverage: count coverage where explicitly requested by inference only --> - [ ] #52694 <!-- Reinstate similar for AbstractQ for backward compatibility --> - [ ] #51479 <!-- prevent code loading from lookin in the versioned environment when building Julia -->
…#53439) This should fix the assertion failure that has been plaguing the Pkg tests, as discussed in JuliaLang#52123 (comment)
When we use options like code coverage, we can't use the native code present in the cache file since it is not instrumented. PR JuliaLang#52123 introduced the capability of skipping the native code during loading, but created the issue that subsequent packages could have an explicit or implicit dependency on the native code. PR JuliaLang#53439 tainted the current process by setting `use_sysimage_native_code`, but this flag is propagated to subprocesses and lead to a regression in test time. Move this to a process local flag to avoid the regression. In the future we might be able to change the calling convention for cross-image calls to `invoke(ci::CodeInstance, args...)` instead of `ci.fptr(args...)` to handle native code not being present. --------- Co-authored-by: Jameson Nash <[email protected]>
…#53439) This should fix the assertion failure that has been plaguing the Pkg tests, as discussed in JuliaLang#52123 (comment)
When we use options like code coverage, we can't use the native code present in the cache file since it is not instrumented. PR JuliaLang#52123 introduced the capability of skipping the native code during loading, but created the issue that subsequent packages could have an explicit or implicit dependency on the native code. PR JuliaLang#53439 tainted the current process by setting `use_sysimage_native_code`, but this flag is propagated to subprocesses and lead to a regression in test time. Move this to a process local flag to avoid the regression. In the future we might be able to change the calling convention for cross-image calls to `invoke(ci::CodeInstance, args...)` instead of `ci.fptr(args...)` to handle native code not being present. --------- Co-authored-by: Jameson Nash <[email protected]>
As @vchuravy pointed out in #52123 (comment), this should fix the assertion failure that has been plaguing the Pkg tests