Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make any_gc_flag threadsafe #38200

Closed

Conversation

jonas-schulze
Copy link
Contributor

I don't know the intrinsics of Julia's garbage collection, but maybe it's the remoterefs to chan running out of scope on the worker processes that cause the failures:

If the corresponding closures are executed on threads 1 and 2, and both issue the finalizers for (their local copy of) chan to run, there are to threads executing send_del_client and thus calling notify(any_gc_flag). Is my guess correct?

Ref #38134
Ref JuliaLang/Distributed.jl#73

cc @vtjnash @vchuravy

@jonas-schulze
Copy link
Contributor Author

The failure is unrelated.

@vchuravy vchuravy requested a review from JeffBezanson October 27, 2020 19:33
@vchuravy
Copy link
Member

Also #38201 to get a better error for this case.

@jonas-schulze
Copy link
Contributor Author

Should that one be merged first, so that we can verify that it's the finalizer I suspected, or is this change worth having anyway?

@vtjnash
Copy link
Member

vtjnash commented Oct 28, 2020

finalizers can't acquire locks, as this would block forward progress of the application:

      From worker 8:	ERROR: LoadError: LoadError: Some tests did not pass: 60 passed, 0 failed, 1 errored, 0 broken.
      From worker 8:	in expression starting at C:\buildbot\worker-tabularasa\tester_win32\build\share\julia\stdlib\v1.6\Distributed\test\threads.jl:28
      From worker 8:	in expression starting at C:\buildbot\worker-tabularasa\tester_win32\build\share\julia\stdlib\v1.6\Distributed\test\distributed_exec.jl:1714
      From worker 56:	error in running finalizer: Distributed.ProcessExitedException(worker_id=55)
      From worker 56:	worker_from_id at C:\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.6\Distributed\src\cluster.jl:1074
      From worker 56:	worker_from_id at C:\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.6\Distributed\src\cluster.jl:1071 [inlined]
      From worker 56:	send_del_client at C:\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.6\Distributed\src\remotecall.jl:262
      From worker 56:	finalize_ref at C:\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.6\Distributed\src\remotecall.jl:94
      From worker 56:	jl_fptr_args at /cygdrive/c/buildbot/worker/package_win32/build/src\gf.c:1960
      From worker 56:	_jl_invoke at /cygdrive/c/buildbot/worker/package_win32/build/src\gf.c:2191 [inlined]
      From worker 56:	jl_apply_generic at /cygdrive/c/buildbot/worker/package_win32/build/src\gf.c:2373
      From worker 56:	jl_apply at /cygdrive/c/buildbot/worker/package_win32/build/src\julia.h:1691 [inlined]
      From worker 56:	run_finalizer at /cygdrive/c/buildbot/worker/package_win32/build/src\gc.c:276
      From worker 56:	jl_gc_run_finalizers_in_list at /cygdrive/c/buildbot/worker/package_win32/build/src\gc.c:363
      From worker 56:	run_finalizers at /cygdrive/c/buildbot/worker/package_win32/build/src\gc.c:391 [inlined]
      From worker 56:	run_finalizers at /cygdrive/c/buildbot/worker/package_win32/build/src\gc.c:370 [inlined]
      From worker 56:	jl_gc_run_all_finalizers at /cygdrive/c/buildbot/worker/package_win32/build/src\gc.c:428
      From worker 56:	jl_atexit_hook at /cygdrive/c/buildbot/worker/package_win32/build/src\init.c:239
      From worker 56:	jl_exit at /cygdrive/c/buildbot/worker/package_win32/build/src\jl_uv.c:632
      From worker 56:	exit at .\initdefs.jl:28 [inlined]
      From worker 56:	exit at .\initdefs.jl:29
      From worker 56:	_jfptr_exit_34444.clone_1 at C:\buildbot\worker-tabularasa\tester_win32\build\lib\julia\sys.dll (unknown line)
      From worker 56:	_jl_invoke at /cygdrive/c/buildbot/worker/package_win32/build/src\gf.c:2172 [inlined]
      From worker 56:	jl_apply_generic at /cygdrive/c/buildbot/worker/package_win32/build/src\gf.c:2373
      From worker 56:	jl_apply at /cygdrive/c/buildbot/worker/package_win32/build/src\julia.h:1691 [inlined]
      From worker 56:	do_apply at /cygdrive/c/buildbot/worker/package_win32/build/src\builtins.c:672
      From worker 56:	jl_f__apply_iterate at /cygdrive/c/buildbot/worker/package_win32/build/src\builtins.c:680
      From worker 56:	JuliaLang/julia#114 at C:\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.6\Distributed\src\process_messages.jl:299
      From worker 56:	run_work_thunk at C:\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.6\Distributed\src\process_messages.jl:63
      From worker 56:	JuliaLang/julia#113 at .\task.jl:395
      From worker 56:	unknown function (ip: 1e9a0993)
      From worker 56:	_jl_invoke at /cygdrive/c/buildbot/worker/package_win32/build/src\gf.c:2172 [inlined]
      From worker 56:	jl_apply_generic at /cygdrive/c/buildbot/worker/package_win32/build/src\gf.c:2373
      From worker 56:	jl_apply at /cygdrive/c/buildbot/worker/package_win32/build/src\julia.h:1691 [inlined]
      From worker 56:	start_task at /cygdrive/c/buildbot/worker/package_win32/build/src\task.c:821

from https://build.julialang.org/#/builders/61/builds/4993/steps/5/logs/stdio

@jonas-schulze
Copy link
Contributor Author

From which task/thread are finalizers being run? If they run on the thread that just lost the last reference to object due to be finalized, as I guessed and asked about in #38200 (comment), they must be synchronized, i.e. maybe acquire a lock.

@jonas-schulze
Copy link
Contributor Author

The failure in worker_from_id (which is called even before notify(any_gc_flag) in case that will become the next problem) suggests that the worker has been / is being deregister_workered while also running send_del_client. How is that possible? I don't understand the complete architecture of Distributed to tell all the moving pieces.

@jonas-schulze
Copy link
Contributor Author

Test Summary:                       |     Pass  Error  Broken     Total
    Downloads                       |       97      1                98

@jonas-schulze
Copy link
Contributor Author

Most of this is included in #38405, closing this then.

@vchuravy vchuravy reopened this Nov 22, 2020
@vchuravy vchuravy closed this Nov 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants