Skip to content

Conversation

@fsaintjacques
Copy link
Contributor

@fsaintjacques fsaintjacques commented Sep 18, 2019

Construct tree structure from std::vectorfs::FileStats, following the path directory hierarchy.

@fsaintjacques fsaintjacques force-pushed the ARROW-6606-path-tree branch 2 times, most recently from aff3032 to bbb9f3c Compare September 18, 2019 20:11
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole design looks a bit complicated. I would expect a simple DirectoryTree. Non-directories need not be tree entries (but if you want a file to point to its DirectoryTree parent, perhaps you can have a simple struct FileTreeEntry { DirectoryTree*, FileStats }).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to give this a less generic name, e.g. StatsForFile.

Copy link
Contributor Author

@fsaintjacques fsaintjacques Sep 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is to write succinct test. That's why I kept it short.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of computing path depth, you can probably sort by string length...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can even sort the file stats by string length in a work queue (without distinguishing directories up front) and build up the tree like that. I don't think you need to build directories first explicitly, everything will be done naturally in graph order.

@fsaintjacques
Copy link
Contributor Author

I iterated a couple of time over the design. I initially implemented the tree with Directory only node
but it made the Visit awkward to use. Here's the current code (in another branch) that walks the PathTree to consume files.

DataFragmentIterator FileSystemBasedDataSource::GetFragments(
    std::shared_ptr<ScanOptions> options) {
  std::vector<std::unique_ptr<fs::FileStats>> files;

  auto visitor = [&files](const fs::FileStats& stats) {
    if (stats.IsFile()) {
      files.emplace_back(new fs::FileStats(stats));
    }
  };
  // The matcher ensures that directories (and their descendants) are not
  // visited.
  auto matcher = [this, options](const fs::FileStats& stats) {
    return this->PartitionMatches(stats, options);
  };

  for (auto tree : forest_) {
    tree->Visit(visitor, matcher);
  }

  auto file_it = MakeVectorIterator(std::move(files));
  auto file_to_fragment = [options, this](std::unique_ptr<fs::FileStats> stats,
                                          std::shared_ptr<DataFragment>* out) {
    std::unique_ptr<DataFragment> fragment;
    FileSource src(stats->path(), filesystem_);
    RETURN_NOT_OK(format_->MakeFragment(src, options, &fragment));
    *out = std::move(fragment);
    return Status::OK();
  };

  return MakeMaybeMapIterator(file_to_fragment, std::move(file_it));
}

@pitrou
Copy link
Member

pitrou commented Sep 19, 2019

You could have a VisitFiles and/or a VisitFilesRecursive.

@fsaintjacques
Copy link
Contributor Author

Do you prefer I rewrite it that way?

@pitrou
Copy link
Member

pitrou commented Sep 19, 2019

You mean as a tree of directories? Yes.

@fsaintjacques fsaintjacques force-pushed the ARROW-6606-path-tree branch 3 times, most recently from cf9762d to d609ef7 Compare September 23, 2019 13:51
@fsaintjacques
Copy link
Contributor Author

@pitrou, I'd prefer that we keep it that way for the sake of simplifying consumption, creation and testing. The downside of this version is that the type allow invalid tree as per what is found in a real file system. If the user calls Make(), there can't be any "non-directory" in non-leaves nodes. We could make this class namespace "internal" so users can't call the constructor and build invalid trees. The constructor are useful for simplifying unit tests.

Minus this point, the other comments were addressed.

@pitrou
Copy link
Member

pitrou commented Sep 23, 2019

(woops, sorry, wrong PR :-))

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, just a couple more questions :-)

@fsaintjacques
Copy link
Contributor Author

Comments were addressed and I also added missing test for the Visit facility.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you!

@pitrou
Copy link
Member

pitrou commented Sep 26, 2019

Hmm, this needs to be rebased, sorry.

Construct tree structure from std::vector<fs::FileStats>, following the
path directory hierarchy.
@fsaintjacques
Copy link
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants