
Find runs by glob-ing levels of subdirectories #1087

Merged · 4 commits · Apr 4, 2018

Conversation

chihuahua (Member) commented Mar 31, 2018

Previously, TensorBoard found runs within the log directory by
performing a tf.gfile.Walk across directories. That turned out to be
very slow for users with expansive log directories (with many
subdirectories and files) because tf.gfile.Walk makes many calls at
the Python level to C++ logic for reading data from disk (as opposed to
performing much of that logic directly at the C++ level). Disk reads can
be very costly for certain file systems, such as remote and distributed ones.

We find runs (subdirectories) much more quickly by iteratively glob-ing
deeper and deeper directory levels. For a certain internal user, this
reduced the time it took to find all runs from 20 minutes to 20 seconds.
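To make the technique concrete, here is a minimal sketch of iterative glob-ing (the function name is illustrative; the PR's actual implementation lives in io_wrapper and handles more details):

```python
import os

import tensorflow as tf


def _list_files_via_globbing(top):
  """Yields all file paths under `top`, globbing one level deeper each
  iteration: top/*, then top/*/*, then top/*/*/*, and so on."""
  pattern = top
  while True:
    pattern = os.path.join(pattern, '*')
    matches = tf.gfile.Glob(pattern)
    if not matches:
      # Nothing at this depth, so nothing can exist any deeper. Terminate.
      return
    for path in matches:
      if not tf.gfile.IsDirectory(path):
        yield path
```

Each level costs a single Glob call, so on file systems where globbing is one round trip, the total number of remote calls is bounded by the depth of the directory tree rather than the number of directories.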

We already have tests that check whether the multiplexer correctly finds
subdirectories that are runs, so this pull request adds no further
tests. We preserve that correctness and just speed up the logic. This
change may result in runs being found in a different order, but that is
fine; TensorBoard documents no guarantees about the order in which runs
are discovered.

Thank you to @nfelt for this idea and the fruitful brainstorming
session.

@@ -432,17 +471,3 @@ def GetAccumulator(self, run):
"""
with self._accumulators_mutex:
return self._accumulators[run]


def GetLogdirSubdirectories(path):
Contributor:

Hmm, would it be more general to just replace ListRecursively() with the iterative glob technique, rather than leaving it using Walk(), and then get rid of both copies of GetLogdirSubdirectories()? That way we don't have to duplicate the code in both multiplexers.

I think it doesn't really add overhead to return all the filenames and then do the filtering in a single pass later (in GetLogdirSubdirectories(), like we do now) as long as we keep ListRecursively() as a generator the way it is now. We could even make GetLogdirSubdirectories() itself a generator too so that it doesn't block on finishing all the globbing, by maintaining a dict of subdirs and only yielding a subdir for a given file when it's not yet in the dict.

Happy to send a PR myself for this alternative if that would be easiest, just let me know.
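A minimal sketch of that generator idea (the helper name is hypothetical; it assumes the (subdir, files) tuples from ListRecursively() have been flattened into file paths, and uses a 'tfevents' substring check as a stand-in for the real event-file predicate):

```python
import os


def _logdir_subdirectories(file_paths):
  """Yields each subdirectory the first time an event file is seen in it,
  so callers need not wait for the full recursive listing to finish."""
  seen = set()
  for path in file_paths:  # e.g. flattened output of ListRecursively()
    if 'tfevents' not in os.path.basename(path):
      continue  # Not an event file, so it cannot qualify its directory.
    subdir = os.path.dirname(path)
    if subdir not in seen:
      seen.add(subdir)
      yield subdir
```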

Member Author:

+1. See updates.

nfelt (Contributor) commented Mar 31, 2018

I've done a bit more thinking and research here and this might be a bit more complicated than we initially thought. My original assumption was that a lot of the performance benefit of the glob approach had to do with reducing back-and-forth across the Python/C++ boundary. But this is belied by the fact that the existing gfile.Walk()-based implementation actually is fine on local disk - it executes quickly even for very large logdirs.

My new interpretation is that the real speedup comes from reducing the number of round trips from local C++ code to a remote filesystem and back. Globbing achieves a huge reduction in round trips when the remote file system implements TF's FileSystem.GetMatchingPaths() in the equivalent of a single possibly-streaming RPC, which is true of CNS (an internal-only filesystem) and happens to also be essentially true for GCS (link to code). However, this isn't generally true for other remote TF FileSystem implementations or even for other internal Google filesystems. In those cases, each level of the glob actually does the equivalent of a full gfile.Walk(): check if dir; if so, list it; recurse on results (it should be a depth-bounded walk, but the FileSystem default implementation doesn't actually do any such bounding...). So for those remote FileSystems we're now doing D+1 times as much work and sending D+1 times as many RPCs as the existing approach.

Given that issue, I think we should probably make this behavior conditioned on the underlying filesystem implementation, and then enable it only when we know the filesystem is well-suited to it. The "correct" way to do that would arguably be adding another C++-level TF FileSystem method for this, but for now we could also just add a heuristic within our codebase that checks the known path prefixes.

Note that if this interpretation of the speedup is correct, then writing our own C++ code won't necessarily add much additional speedup, because there the real limitation is what RPCs the remote filesystem exposes. With GCS, if we had access to the underlying RPC interface we could implement recursive listing in a single (up to pagination) RPC, so it could be worthwhile. But with CNS, the C++ code doesn't have access to any RPC that's more efficient for our purposes here than the one used by glob, and unlike GCS that one is limited to returning one directory level at a time, so there's not much overhead we could remove without literally writing new RPCs for CNS.

chihuahua (Member Author) commented Mar 31, 2018

Thank you, Nick! That makes sense to me. +1 to how the calls to walk take big latency hits from individual disk accesses to CNS. PTAL

These updates take into account the nuance of how the glob-ing based method makes listing files much faster for certain file systems (that implement analogs of GetMatchingPaths), but might have a different effect for other file systems. Based on internal documentation for Glob patterns, the glob-ing based method could raise errors for certain file systems within Google because the method could use patterns with multiple *s.

These new changes move much of the logic into io_wrapper so that it can be generalized, tested, and placed in a file that organizationally makes sense. To emphasize the difference between glob-ing for all files and performing the recursive walk, we assign starkly different names to the GlobAndListFiles and WalkAndListFilesPerDirectory methods.
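A sketch of the dispatch being described, under assumptions: the two traversal callables stand in for GlobAndListFiles and WalkAndListFilesPerDirectory and are both assumed to yield flat file paths here, and the gs:// prefix check stands in for the real per-filesystem heuristics:

```python
import os


def _get_run_subdirectories(path, glob_and_list_files, walk_and_list_files):
  """Returns subdirectories of `path` containing at least one event file,
  choosing the traversal strategy based on the filesystem."""
  if path.startswith('gs://'):
    # This filesystem implements an efficient GetMatchingPaths analog,
    # so globbing costs roughly one RPC per directory level.
    file_paths = glob_and_list_files(path)
  else:
    # Generic filesystems: globbing could be slower (or raise errors on
    # patterns with multiple *s), so fall back to the recursive walk.
    file_paths = walk_and_list_files(path)
  subdirs = set()
  for file_path in file_paths:
    if 'tfevents' in os.path.basename(file_path):
      subdirs.add(os.path.dirname(file_path))
  return subdirs
```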

'GlobAndListFiles: %d files glob-ed at level %d', len(glob), level)

if not glob:
# This subdirectory level lacks files. Terminate.
Contributor:

FWIW, I thought of another optional optimization. Don't have to do it in this PR but maybe add a TODO(nickfelt) to implement at some point? The idea is that if at any point the glob returns files that are all in a single directory (i.e. len(set(map(os.path.dirname, glob))) == 1) then we can replace the current globbing path with that directory as the literal prefix. We can think of this as basically saying that once the globbing is descending down a single branch in the directory tree, we can focus on that specific branch rather than still globbing through all the layers. This should improve efficiency in cases where a single subdir is significantly deeper than the rest of the log directory subdirs.
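A sketch of that pruning idea folded into the iterative glob loop (illustrative names, not the PR's actual code):

```python
import os

import tensorflow as tf


def _glob_with_pruning(top):
  """Yields all entries under `top`, globbing one level deeper per
  iteration; once every match at a level shares a single parent
  directory, that directory becomes the literal glob prefix, so later
  levels skip the already-exhausted shallower branches."""
  pattern = top
  while True:
    pattern = os.path.join(pattern, '*')
    matches = tf.gfile.Glob(pattern)
    if not matches:
      return
    for path in matches:
      yield path
    parents = set(os.path.dirname(p) for p in matches)
    if len(parents) == 1:
      # All matches descend from one branch; glob from there directly.
      # (The real change would also escape glob characters in this prefix.)
      pattern = os.path.join(parents.pop(), '*')
```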

Member Author:

Great idea! Indeed, this should improve efficiency for that scenario! Done.

Contributor:

Awesome, thanks for implementing this! It'd probably be good to have a test that exercises this, just to be sure it produces the same listing as the non-optimized version. Could be done by just having a structure like:

  • root
    • dir1
      • fileA
    • dir2
      • fileB
      • dir2-child
        • fileC
        • dir2-grandchild1
          • fileD
        • dir2-grandchild2
          • fileE

After listing root/*/*/*, which should produce /root/dir2/dir2-child/fileC and /root/dir2/dir2-child/dir2-grandchild1 and 2, the "pruning" behavior should kick in, and the next listing should be /root/dir2/dir2-child/*/*, which should produce /root/dir2/dir2-child/dir2-grandchild1/fileD and dir2-grandchild2/fileE.

Member Author:

+1, I made the deep directory structure test this case. Specifically, see quuz/garply/corge and quuz/garply/grault.

@@ -64,32 +86,30 @@ def testListDirectoryAbsolute(self):
(os.path.join(temp_dir, f) for f in expected_files),
io_wrapper.ListDirectoryAbsolute(temp_dir))

def testListRecursivelyForNestedFiles(self):
def testGlobAndListFiles(self):
Contributor:

Maybe also have a test for this on an empty directory, or an otherwise non-deep directory?

Alternatively, we could rearrange the globbing method a little so that it returns the exact same (subdir, files) tuples that the Walk()-based version does, and then just test the glob method by checking that it returns the exact same data (up to ordering) as the tried-and-true walk method.

Member Author:

> test the glob method by checking that it returns the exact same data (up to ordering) as the tried-and-true walk method.

Awesome idea. The two methods now return data in the same format, but their tests are slightly different since GlobAndListFiles excludes subdirectories from listings.
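A sketch of such an equivalence check, assuming a tf.test.TestCase context, the test module's existing imports (os, tf, io_wrapper), and that both methods yield (subdir, files) tuples as discussed:

```python
def testGlobbingListsSameFilesAsWalking(self):
  temp_dir = self.get_temp_dir()
  self._CreateDeepDirectoryStructure(temp_dir)
  walk_files = set()
  for _, file_paths in io_wrapper.WalkAndListFilesPerDirectory(temp_dir):
    walk_files.update(file_paths)
  glob_files = set()
  for _, file_paths in io_wrapper.GlobAndListFiles(temp_dir):
    # Filter directory entries; harmless if the glob listing already
    # excludes directories.
    glob_files.update(
        path for path in file_paths if not tf.gfile.IsDirectory(path))
  self.assertEqual(walk_files, glob_files)
```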

self.assertEqual(1, len(subdirectory_entries))

entry = subdirectory_entries[0]
self.assertEqual(empty_dir, entry[0])
self.assertItemsEqual((), entry[1])

def _CreateDeepDirectoryStructure(self, top_directory):
"""Creates a reasonable deep structure of subdirectories with files.
Contributor:

This structure could probably be a little more complicated to fully exercise all the possible edge cases. Some things that might be good to include:

  • Subdir that is non-empty but has no event files (should be ignored)
  • Subdir that has no event files, but has a subdir that has event files (first level should be ignored, second level should be included)
  • "Sibling" subdirs that both have event files (i.e. two subdirs at same depth)

Member Author:

+1, we now cover those cases. Comments now clarify them.

Yields:
The absolute path of each file.
"""
current_glob_string = top
Contributor:

Optional, but we may want to escape the original directory from globbing just in case somebody has an actual directory with glob-aware characters in it (fairly unlikely, but possible). In Python 3.4+ there's glob.escape(), but unfortunately nothing in Python 2.7. However, it's a short function, so we could just backport it:
https://github.com/python/cpython/blob/3.6/Lib/glob.py#L161

Feel free to also just leave this as a TODO(nickfelt).
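A sketch of such a backport, following the linked CPython implementation (helper name illustrative):

```python
import os
import re

_GLOB_MAGIC_CHECK = re.compile('([*?[])')


def _escape_glob_characters(pathname):
  """Backport of Python 3.4+ glob.escape() for Python 2.7."""
  # Escaping wraps each of the metacharacters *, ?, and [ in square
  # brackets, so e.g. 'logs/ba*' becomes 'logs/ba[*]' and matches only
  # the literal path. Metacharacters are inert in the drive portion.
  drive, remainder = os.path.splitdrive(pathname)
  return drive + _GLOB_MAGIC_CHECK.sub(r'[\1]', remainder)
```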

Member Author:

Thanks for noting that detail! Done.

"""Recursively lists all files within the directory.

This method does so by glob-ing deeper and deeper directories, ie
/foo/*, foo/*/*, foo/*/*/* and so on until all files are listed. All file
Contributor:

nit: remove initial slash from "/foo/*" to match the other examples

Member Author:

Done.

path: The path to a directory under which to find subdirectories.

Returns:
A tuple of absolute paths of all subdirectories each with at least 1 events
Contributor:

Probably worth mentioning that the order of the subdirectories returned is unspecified, since the two different methods return the directories in different orders.

Also, this could be done as a generator to match the old version of this function, with a little more work around the unique-ifying logic for the glob implementation. Do you think that would be useful?

Member Author:

Sounds great! ListRecursivelyViaGlobbing and ListRecursivelyViaWalking now yield the same types of tuples. One difference is that the lists of absolute paths returned by ListRecursivelyViaGlobbing include subdirectories.

def ListDirectoryAbsolute(directory):
"""Yields all files in the given directory. The paths are absolute."""
return (os.path.join(directory, path)
for path in tf.gfile.ListDirectory(directory))


def ListRecursively(top):
Contributor:

Hmm, it could be good to keep the word "recursive" in the new function names just to signal that they're listing the entire directory tree. It's clear in the docstring but not so much from the function name itself.

I wonder if it might make sense to call these ListRecursivelyViaGlobbing() and ListRecursivelyViaWalking()? While the implementations are different, functionally they are basically interchangeable aside from slightly different output structures (which actually could be made the same by just doing per-subdir grouping in the globbing function before yielding).

chihuahua (Member Author) commented Apr 3, 2018

Those names sound great! They indicate recursion. Furthermore, the methods (almost) return the same values now. ListRecursivelyViaGlobbing raises an exception when the directory does not exist.

@@ -193,7 +193,12 @@ def Fake(*args, **kwargs):

FakeFactory.has_been_called = False

for stub_name in ['ListDirectoryAbsolute', 'ListRecursively']:
stub_names = [
'GlobAndListFiles',
Contributor:

Needs update?

Member Author:

Done.

pairs = collections.defaultdict(list)
for file_path in glob:
pairs[os.path.dirname(file_path)].append(file_path)
for dir_name, file_paths in pairs.items():
Contributor:

Might as well use six.iteritems() to keep memory usage down, since we're yielding each one individually anyway.
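For reference, a sketch of the grouping generator with that change applied (helper name illustrative):

```python
import collections
import os

import six


def _group_files_by_directory(file_paths):
  """Groups one glob level's matches by parent directory, yielding
  (directory, files) pairs one at a time."""
  pairs = collections.defaultdict(list)
  for file_path in file_paths:
    pairs[os.path.dirname(file_path)].append(file_path)
  # six.iteritems iterates lazily on Python 2 rather than building a list.
  for dir_name, grouped_paths in six.iteritems(pairs):
    yield (dir_name, grouped_paths)
```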

Member Author:

Done.

if IsGCSPath(path) or IsCnsPath(path):
# Glob-ing for files can be significantly faster than recursively
# walking through directories for some file systems.
# Use a dict comprehension to unique-ify subdirectories.
Contributor:

This line of comment is stale

Member Author:

Done.

'bar/quux',
# A directory that lacks events files, but contains 2 subdirectories
# with events files (first level should be ignored, second level should
# be included). corge and grault are thus subling events files.
Contributor:

typo: sibling

Member Author:

Done.

# be included). corge and grault are thus subling events files.
'quuz/corge',
'quuz/grault',
# A directory with a glob character in its name.
Contributor:

Good idea, but I think this would only have been affected by the lack of escaping if it was passed in as the original argument to ListRecursivelyViaGlobbing()? I.e. I don't think the test would fail if we didn't have escaping.

So maybe worth having a separate test for this that creates a "ba*" subdirectory of the tmpdir and then passes that as the directory to list. Also note that the glob "ba*" will actually match a literal "ba*" directory so by itself that doesn't indicate the issue; to test that it's escaped we'd need to have a sibling directory "bar" and check that we only recursively list "ba*" and not "bar". (An easier test for escaping might be to test that a literal path like "foo[bar]" is matched, since the glob interpretation only matches "foob" "fooa" or "foor" but not literal "foo[bar]".)
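A sketch of the suggested escaping test, assuming a tf.test.TestCase context and the test module's existing imports (os, tf, io_wrapper):

```python
def testGlobbingEscapesTopDirectory(self):
  temp_dir = self.get_temp_dir()
  # A literal 'ba*' directory alongside a sibling 'bar' directory that an
  # unescaped glob pattern 'ba*' would also match.
  star_dir = os.path.join(temp_dir, 'ba*')
  sibling_dir = os.path.join(temp_dir, 'bar')
  tf.gfile.MakeDirs(star_dir)
  tf.gfile.MakeDirs(sibling_dir)
  open(os.path.join(star_dir, 'a.tfevents.1'), 'w').close()
  open(os.path.join(sibling_dir, 'b.tfevents.1'), 'w').close()
  listed = set()
  for _, file_paths in io_wrapper.ListRecursivelyViaGlobbing(star_dir):
    listed.update(file_paths)
  # Only the literal 'ba*' directory should be listed, not 'bar'.
  self.assertIn(os.path.join(star_dir, 'a.tfevents.1'), listed)
  self.assertNotIn(os.path.join(sibling_dir, 'b.tfevents.1'), listed)
```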

Member Author:

+1. Thanks for the catch. We now add a test that globs a directory containing a * in its name. The test checks that we only recursively list "ba*" and not "bar" - the other method would work too!

self._CreateDeepDirectoryStructure(temp_dir)
expected = [
['', [
'events.out.tfevents.1473720381.meep.com',
Contributor:

Totally optional, but since this test is only about generic recursive directory listing, these names wouldn't have to be event file names. They could just be "a" "b" "c" etc. In that case, they might fit all on one line and be a bit easier to read.

Member Author:

Yeah, the events files are hard to tell apart at a glance. How about we use a minimal semblance of an events file, i.e. a.tfevents.1, even though the method contract does not care about file type ...

"""
for dir_path, _, filenames in tf.gfile.Walk(top):
yield (dir_path, (os.path.join(dir_path, filename)
for filename in filenames))


def GetLogdirSubdirectories(path):
Contributor:

Should there be a test for GetLogdirSubdirectories() itself? It can probably be lighter weight since it's not testing the full recursive walk algorithm, but something that checks the basic logic of returning only subdirs that contain at least one events file seems good to have.
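A lightweight sketch of such a test, assuming a tf.test.TestCase context and the test module's existing imports (os, tf, io_wrapper):

```python
def testGetLogdirSubdirectories(self):
  temp_dir = self.get_temp_dir()
  # 'run1' holds an events file; 'not_a_run' holds only an unrelated file.
  run_dir = os.path.join(temp_dir, 'run1')
  not_a_run_dir = os.path.join(temp_dir, 'not_a_run')
  tf.gfile.MakeDirs(run_dir)
  tf.gfile.MakeDirs(not_a_run_dir)
  open(os.path.join(run_dir, 'a.tfevents.1'), 'w').close()
  open(os.path.join(not_a_run_dir, 'notes.txt'), 'w').close()
  # Only the subdirectory with at least one events file should be returned.
  self.assertItemsEqual(
      [run_dir], io_wrapper.GetLogdirSubdirectories(temp_dir))
```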

Member Author:

Yes! Done.


googlebot commented:
So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of the commit author(s) and merge this pull request when appropriate.

Previously, TensorBoard found runs within the log directory by
performing a `tf.gfile.Walk` across directories. That turned out to be
very slow for users with expansive log directories (with many
subdirectories and files) because `tf.gfile.Walk` makes many calls at
the Python level to C++ logic for reading data from disk (as opposed to
performing much of that logic directly at the C++ level). Disk reads can
be very costly for certain file systems, such as remote and distributed ones.

For those file systems, we find runs (subdirectories) much more quickly
by iteratively glob-ing deeper and deeper directory levels. For a
certain internal user, this reduced the time it took to find all runs
from 20 minutes to 20 seconds.

For other file systems, we preserve the previous behavior of performing
a recursive walk across directories. For certain file systems, using the
glob-based method to find runs might actually be slower because those
file systems do not implement analogs of TensorFlow's method for
matching files on disk via a pattern, resulting in costly disk
accesses.

This change also re-organizes code to make that logic possible. We move
code out of the 2 multiplexers into the io_wrapper helper module. I am
generally not a huge fan of large helper modules, but in this case, it
seems warranted because we otherwise have entire multiplexers and event
accumulators that duplicate the logic. We also generalize the logic for
listing files via globbing into a `GlobAndListFiles` method.