Skip to content

Conversation

@mtelvers
Copy link
Member

The rsync folder result-tmp grows uncontrollably, eventually leading to the worker stalling on with the following error:

2023-01-15 13:22.58       obuilder [INFO] Pruning 0 items (of 100 requested)
2023-01-15 13:22.58         worker [INFO] OBuilder partition: 7% free after pruning 0 items
2023-01-15 13:22.58         worker [WARNING] Out of space, but nothing left to prune! (will wait and then retry)

This is caused by the following two considitons:

  1. when a job is cancelled, the macOS sandbox does not delete the result-tmp folder, and
  2. when a job fails, the rsync backend does not delete the result-tmp.

Furthermore, under normal circumstances, the fuse file system is unmounted between jobs. This is essential for the Docker health check to be able to find the Docker binary which is installed in /usr/local. However, when a job is cancelled, fuse is not unmounted; therefore, we see the health check fail if it occurs immediately after a cancelled job.

Copy link
Contributor

@MisterDA MisterDA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Could you squash the commits when you'll think this is ready for merging?

Copy link
Member

@tmcgilchrist tmcgilchrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor style suggestions but otherwise the change looks good.
We need an osx rsync stress test to validate we don't leak disk space.

if Lwt.is_sleeping proc then (
match !proc_id with
if Lwt.is_sleeping proc then
let _ = match !proc_id with
Copy link
Contributor

@MisterDA MisterDA Jan 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand exactly what's happening here, but it looks wrong as you're not binding the Lwt promise, or am I reading that wrong?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the binding and changed the function a bit as well.

@tmcgilchrist
Copy link
Member

tmcgilchrist commented Jan 18, 2023 via email

@mtelvers
Copy link
Member Author

kill_all_descendants invokes multiple kill -9 pid commands. The sleep was an attempt to allow the processes time to actually terminate and be cleared up by the operating system before trying to delete any files they had open.

@MisterDA
Copy link
Contributor

MisterDA commented Jan 23, 2023

The LWT sleep looks curious to me. Does the kill_all_descendants function not wait for the command to terminate?

kill_all_descendants invokes multiple kill -9 pid commands. The sleep was an attempt to allow the processes time to actually terminate and be cleared up by the operating system before trying to delete any files they had open.

I've looked in depth, and I'm convinced that we're correctly waiting until all kill have finished, but maybe macOS does still require some time to catch up.

)
else Lwt.return_unit (* Process has already finished *)
else Lwt.return_unit >>= fun () ->
post_cancellation ~result_tmp t
Copy link
Contributor

@MisterDA MisterDA Jan 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is that the indentation of post_cancellation is confusing: it is actually part of the else block here. If you want to execute it regardless of the Lwt.is_sleeping proc, you have to enclose the if … else … block in a begin … end (or parenthesis) block.
I think that's what you want.

open Lwt.Infix

let () =
  let x = ref true in
  Lwt_main.run begin
      if !x then
        Lwt_io.print "hello"
      else
        Lwt_io.print "world"
        >>= fun () ->
        Lwt_io.print "else branch"
    end
    
(* prints "hello" *)

let () =
  let x = ref true in
  Lwt_main.run begin
      begin if !x then
        Lwt_io.print "hello"
      else
        Lwt_io.print "world"
      end >>= fun () ->
      Lwt_io.print " common"
    end

(* prints "hello common" *)    

You could configure your editor to use ocp-indent (since we're not applying ocamlformat here).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have an updated version with that modification which I haven't pushed yet.

post_build ~result_dir ~home_dir t >>= fun () ->
Lwt.return r
in
let () = Lwt.on_failure cancelled (fun _ -> let _ = unmount_fuse t in () )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The returned promise most likely needs to be passed to Lwt.async to be executed by Lwt, just like the function passed to Lwt.on_termination below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that on_failure can be removed as on_termination covers both cases.

@mtelvers
Copy link
Member Author

Thanks @MisterDA. Right now, the issue I am seeing is that when a job is cancelled, a new job is started immediately, but new job appears to run concurrently with the on_termination code. For a normal worker, this isn't a problem because jobs can run concurrently, but on the Mac, the new job is starting as the old one is still finishing both using the same home directory.

@MisterDA
Copy link
Contributor

What's the state of this PR now? Are you satisfied with your changes?

@MisterDA MisterDA force-pushed the master branch 2 times, most recently from 89e84ac to ccd3f2f Compare February 3, 2023 12:39
MisterDA added a commit that referenced this pull request Feb 9, 2023
MisterDA added a commit that referenced this pull request Feb 9, 2023
@MisterDA
Copy link
Contributor

MisterDA commented Feb 9, 2023

Merged as of 66f5164, thanks!

@MisterDA MisterDA closed this Feb 9, 2023
MisterDA added a commit to ocurrent/ocluster that referenced this pull request Feb 10, 2023
- Updates to address rsync and sandbox issues (ocurrent/obuilder#139)
- Minor updates.
MisterDA added a commit to ocurrent/ocluster that referenced this pull request Feb 10, 2023
- Updates to address rsync and sandbox issues (ocurrent/obuilder#139)
- Minor updates.
kit-ty-kate pushed a commit to ocaml/opam-repository that referenced this pull request Feb 20, 2023
CHANGES:

- Updates to address rsync and sandbox issues.
  (@mtelvers ocurrent/obuilder#139, reviewed by @tmcgilchrist and @MisterDA)
- Add an obuilder clean command to clean all build results.
  (@MisterDA ocurrent/obuilder#140, reviewed by @tmcgilchrist)
- Make rsync-mode mandatory when using rsync store.
  (@tmcgilchrist ocurrent/obuilder#132, reviewed by @kit-ty-kate and @MisterDA)
benmandrew pushed a commit to benmandrew/ocluster that referenced this pull request Mar 6, 2023
- Updates to address rsync and sandbox issues (ocurrent/obuilder#139)
- Minor updates.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants