Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition or deadlock in Close() #72

Open
jpatters opened this issue Mar 31, 2019 · 6 comments
Open

Race condition or deadlock in Close() #72

jpatters opened this issue Mar 31, 2019 · 6 comments

Comments

@jpatters
Copy link

I am having a pretty terrible time trying to debug this. Things seem to get hung up in multiple places inside of Close().
There are times when it fails to obtain a lock and just gets stuck there forever. And there are times when it can't seem to send on w.close. Any help would be appreciated.

I've added some logging to Close() like so.

func (w *Watcher) Close() {
	fmt.Println("=== starting close")
	w.mu.Lock()
	fmt.Println("=== have lock in close")
	if !w.running {
		w.mu.Unlock()
		return
	}
	w.running = false
	w.files = make(map[string]os.FileInfo)
	w.names = make(map[string]bool)
	w.mu.Unlock()
	// Send a close signal to the Start method.
	fmt.Println("=== sending close")
	w.close <- struct{}{}
	fmt.Println("=== returning from close")
}

In some instances, the last message I get is === starting close. And in some instances it is === sending close. And then other times it works just fine.

Here is the code I am using. Let me know if you need to see more.

func listenForChanges(ctx context.Context, w *watcher.Watcher) {
	w.AddRecursive(outputDirectory())
	go func() {
		haveChange := false
		for {
			select {
			case <-ctx.Done():
				logger.Println("timed out")
				w.Close()
				return
			case err := <-w.Error:
				logger.Println(err)
			case event := <-w.Event:
				logger.Println("found change", event.Name())
				haveChange = true
			case <-w.Closed:
				return
			case <-time.After(200 * time.Millisecond):
				if haveChange {
					logger.Println("we have enough changes")
					w.Close() // This is the line I am calling it from most often when it fails
					return
				}
			}
		}
	}()
}

It is probably worth noting that the directory I am watching contains about 6000 files.

Thanks in advance.

@jpatters
Copy link
Author

This seems to happen when a file event comes in after Close() has been called

@jpatters
Copy link
Author

There is at least one deadlock going on here. When Close() is called, if events are still being published, Close() is trying to obtain a lock (https://github.com/radovskyb/watcher/blob/master/watcher.go#L702) but an event as been published to w.Event (https://github.com/radovskyb/watcher/blob/master/watcher.go#L595) which means that it has the lock.
At the same time there is nothing reading from w.Event because we are still trying to complete Close(). So Close() can never get the lock because nothing is reading from w.Event. And nothing is reading from w.Event because we are trying to Close().
I am unsure if this is caused by my code or something that should be handled by watcher.

I was able to work around it by dumping the 'extra' events around the call to Close() like so.

case <-time.After(200 * time.Millisecond):
  if haveChange {
    logger.Println("we have enough changes")
    quit := make(chan bool)
    go func() {
      for {
        select {
        case event := <-w.Event:
          logger.Println("extra event:", event.Name())
        case <-quit:
          return
        }
      }
    }()
    w.Close()
    quit <- true
    return
  }

It feels like this is something that watcher should handle. If I have time I'll see if I can fix it in a more elegant way inside of watcher, and submit a pr.

@ykumar-rb
Copy link

Even I am facing same issue with my application. Application hangs and unable to process new event .I believe this should be handled inside watcher package .

@tul
Copy link

tul commented Jan 10, 2020

It's not documented, but essentially in the current implementation it's not supported to call Close() from your event polling go routine, or in any way that causes your event handling routine to stop draining w.Event. I think the only safe way to do it without changing the code would be to run Close() in a go routine then have your event handler continue pulling events but as soon as you get a read from w.Closed you can safely exit.
I played around with fixing this earlier today actually, it wasn't such a big change but I discarded it in the end as I didn't really need it. Essentially I changed the places that were writing to w.Event to be in a select which was also reading from w.close. I also changed Close() to do close(w.close) so it could be read multiple times.

@tul
Copy link

tul commented Jan 12, 2020

Regarding allowing Close() to be called from the polling go routine, I just looked at the code again now. There is a second issue which is that retrieveFileList() holds the mutex for the duration of the iteration, which will prevent Close() from being able to execute. So more work will be needed beyond what I outlined above.

@techfg
Copy link

techfg commented Oct 3, 2024

@jpatters @ykumar-rb @tul - Thanks for your research and ideas on this issue, helped me arrive at a solution for my use case.

In short, I have a situation where I need to be able to gracefully "leave" the watcher making sure all go routines complete in the following situations:

  1. When watcher sends an error (w.Error)
  2. When user cancels (e.g., Ctrl+C)
  3. When Start fails - Note that given the design of watcher, you must start your listener go routine prior to calling start and if Start fails, that go routine will remain because watcher won't trigger w.Wait nor will it signal Closed and calling Close is a no-op because the watcher isn't running.
  4. If a panic occurs (e.g., within the listener)

Here is the solution I came up with for anyone that might stumble upon this in the future - feedback welcomed!

Note

In my case, I have the maximum number of events set to one (1) (w.SetMaxEvents(1)). I believe the code below should work regardless, but I've only tested max events 1 scenario.

func listenForEvents(w *watcher.Watcher, abort <-chan struct{}) <-chan error {
	term := make(chan error, 1)
	done := make(chan error, 1)
	cancel := make(chan os.Signal, 1)
	quit := make(chan struct{}, 1)
	signal.Notify(cancel, os.Interrupt, syscall.SIGTERM, syscall.SIGTSTP)

	go func() {
		defer func() {
			// Note that if a panic happened there is nothing we can do at this point
			close(term)
			close(done)
			close(cancel)
			close(quit)
		}()

		// must drain the queue in a different go routine than the one calling Close()
		// we drain here instead of in the go routine below because in a panic situation
		// our listening loop will have exited unexpectedly so handle both expected
		// and unexpected scenarios in one place
		drainQueue := func() {
			go func() {
				for {
					select {
					case wErr := <-w.Error:
						logger.Debugf("Skipping processing of watcher error because in the processing of closing: %v", wErr)
					case event := <-w.Event:
						logger.Debugf("Skipping processing of watcher %v event for file %v because in the processing of closing", event.Op, event.Path)
					case <-w.Closed:
						logger.Debug("Skipping processing of watcher closed because in the processing of closing")
					case <-quit:
						return
					}
				}
			}()
		}

		wErr := <-term
		drainQueue()
		w.Close()
		quit <- struct{}{}
		done <- wErr
	}()

	go func() {
		term <- func() (err error) {
			defer func() {
				if recovered := recover(); recovered != nil {
					err = fmt.Errorf("panic occurred: %v", recovered)
					return
				}
			}()

			for {
				select {
				case event := <-w.Event:
					logger.Infof("event received: %v", event.Op)
				case wErr := <-w.Error:
					return wErr
				case <-w.Closed:
					return nil
				case <-cancel:
					return nil
				case <-abort:
					return nil
				}
			}
		}()
	}()

	return done
}

func watchFiles() error {
	w := watcher.New()

	// configure watcher....

	abort := make(chan struct{}, 1)
	defer close(abort)

	done := listenForEvents(w, abort)
	if err := w.Start(time.Millisecond * 100); err != nil {
		return err
	}

	if err := <-done; err != nil {
		return err
	}

	return nil
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants