
[REVIEW] Add debug/info/trace logging and fix multithreaded replay bench #529

Merged · 36 commits · Sep 4, 2020

Conversation

@harrism (Member) commented Aug 31, 2020

This PR adds debug logging using spdlog so that it is accessible from anywhere in RMM, and provides convenience macros for various levels of logging.

Adds CMake configuration to set the default logging level to OFF, and allows specifying an spdlog level using -DLOGGING_LEVEL=<level> on the cmake command line, where level is one of [TRACE, DEBUG, INFO, WARNING, ERROR, CRITICAL].

Changing the level changes which logging messages are compiled in. However, since spdlog's default runtime logging level is INFO, if you need lower-level logging the application must also set spdlog's runtime logging level using rmm::logger().set_level() (e.g. to spdlog::level::trace).
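For example, a minimal sketch of how an application might enable trace output at runtime (the include path is an assumption; rmm::logger() and the spdlog level API are as described above):

```cpp
#include <rmm/logger.hpp>  // assumed header location for rmm::logger()

int main()
{
  // With -DLOGGING_LEVEL=TRACE compiled in, trace messages are still
  // filtered out until the runtime level is lowered as well:
  rmm::logger().set_level(spdlog::level::trace);
  // ... RMM calls made after this point emit trace-level log records ...
}
```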

By default the log goes to the file rmm_log.txt, unless a different filename is specified in the environment variable RMM_DEBUG_LOG_FILENAME.

Also replaces calls to assert with a new macro, RMM_LOGGING_ASSERT, which logs the reason for the assertion failure (only active in debug builds, just like assert).
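A hypothetical sketch of what such a macro could look like (the PR's actual implementation may differ; RMM_LOG_CRITICAL stands in for whichever logging macro the PR provides):

```cpp
#include <cassert>

#ifndef NDEBUG
// Debug build: log the failed expression, then abort as assert would.
#define RMM_LOGGING_ASSERT(_expr)                     \
  do {                                                \
    if (!(_expr)) {                                   \
      RMM_LOG_CRITICAL("Assertion failed: " #_expr);  \
      assert(false);                                  \
    }                                                 \
  } while (0)
#else
// Release build: compiles away entirely, just like assert.
#define RMM_LOGGING_ASSERT(_expr) (void)0
#endif
```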

This PR also fixes replay of multithreaded logs, ensuring events are replayed in the order they actually occurred so that deallocations cannot precede their corresponding allocations (a simplified sketch of the ordering approach follows below).
Fixes #431.
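A heavily simplified sketch of one way to enforce global replay order across threads (not the PR's exact code; the event fields and gating scheme here are illustrative):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

struct event {
  std::size_t global_index;  // position of this event in the original log
  // ... allocation / deallocation details ...
};

std::atomic<std::size_t> next_index{0};

// Each replay thread waits until the global counter reaches its next
// event, so a free can never run before its matching allocation.
void replay_thread(std::vector<event> const& my_events)
{
  for (auto const& e : my_events) {
    while (next_index.load(std::memory_order_acquire) != e.global_index) {
      std::this_thread::yield();
    }
    // ... perform the allocation or deallocation here ...
    next_index.fetch_add(1, std::memory_order_release);
  }
}
```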

TODO: Add logging in more memory resources. We can add further logging in follow-up PRs to keep this one manageable.

CC @jlowe @rongou @abellina

@harrism harrism added the 2 - In Progress Currently a work in progress label Aug 31, 2020
@harrism harrism requested a review from a team as a code owner August 31, 2020 04:45
@harrism harrism self-assigned this Aug 31, 2020
@GPUtester (Contributor) commented:

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

```cpp
 *
 * Does not copy the mutex or the map
 */
replay_benchmark(replay_benchmark const& other)
```
Contributor:

Does this need a copy ctor?

@harrism (Member Author):

Yes because of the mutex. Won't compile without it.

@jrhemstad (Contributor) commented Sep 1, 2020:

I don't understand. Where is replay_benchmark being copied such that it needs a copy ctor? std::mutex is neither copyable nor movable, so I understand why the default copy ctor would fail to compile if used, but I wouldn't think replay_benchmark should ever need to be copied. Instead I'd explicitly delete replay_benchmark's copy and move ctors.

@harrism (Member Author):

template <class Lam> benchmark::RegisterBenchmark(const char*, Lam&&) creates an instance of this class, which invokes the copy ctor of the Lam that is passed to it.

```cpp
#ifdef BENCHMARK_HAS_CXX11
template <class Lambda>
class LambdaBenchmark : public Benchmark {
 public:
  virtual void Run(State& st) { lambda_(st); }

 private:
  template <class OLambda>
  LambdaBenchmark(const char* name, OLambda&& lam)
      : Benchmark(name), lambda_(std::forward<OLambda>(lam)) {}

  LambdaBenchmark(LambdaBenchmark const&) = delete;

 private:
  template <class Lam>
  friend Benchmark* ::benchmark::RegisterBenchmark(const char*, Lam&&);

  Lambda lambda_;
};
#endif
```

https://github.com/google/benchmark/blob/4475ff6b8a7a4077d7492b76ef5278a3dc53a2e4/include/benchmark/benchmark.h#L1017-L1036

@harrism (Member Author):

My bad, it invokes the move ctor. I have now replaced this copy ctor with a move ctor and explicitly deleted the copy ctor.
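A sketch of what that resolution could look like (member names here are illustrative, not the PR's exact code):

```cpp
// Movable but not copyable: the std::mutex member is re-created rather
// than moved (mutexes are neither copyable nor movable).
replay_benchmark(replay_benchmark&& other) noexcept
  : factory_{std::move(other.factory_)},
    events_{std::move(other.events_)},
    allocation_map_{std::move(other.allocation_map_)}
{
  // map_mutex_ is default-constructed; there is nothing to transfer.
}

replay_benchmark(replay_benchmark const&) = delete;
replay_benchmark& operator=(replay_benchmark const&) = delete;
```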

Comment on lines 146 to 151

```cpp
void SetUp(const ::benchmark::State& state) override
{
  if (state.thread_index == 0) {
    rmm::logger().log(spdlog::level::info, "------ Start of Benchmark -----");
    mr_ = factory_();
  }
```
Contributor:
Why not just put this logic in the replay_benchmark ctor? That way you don't have to do the check against the thread_index. Only one instance of the fixture object is constructed and shared by all threads.

@harrism (Member Author) commented Sep 1, 2020:

The problem is that the replay_benchmark ctor is called when you register the benchmark, and the dtor is not called until after ALL the benchmarks have run. So you end up constructing all the MRs you are benchmarking first, then running the benchmarks, then destroying them. If you have a pool_mr and a binning_mr, then you effectively use up all the GPU memory before any benchmark runs, and the result is OOM during the first benchmark of one of those MRs. This is the whole reason for all the shenanigans with the fixture.

Google benchmark is not well designed for this case. It took me the better part of a day of reading their undocumented code to figure out a workaround.

@harrism (Member Author):

Also each benchmark may be run multiple times, and you don't want to reuse the pool (or time creation of the pool) across runs.
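For reference, a sketch of the counterpart TearDown under the same scheme (assuming mr_ is an owning smart pointer; not necessarily the PR's exact code):

```cpp
void TearDown(const ::benchmark::State& state) override
{
  if (state.thread_index == 0) {
    // Destroy the memory resource after each benchmark run so pools are
    // not reused (or their construction timed) across runs.
    mr_.reset();
  }
}
```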

@harrism (Member Author) commented Sep 1, 2020

Thanks @jrhemstad I addressed all of your feedback (except the benchmark construction timing).

@harrism harrism changed the title [WIP] Add debug/info/trace logging and fix multithreaded replay bench [REVIEW] Add debug/info/trace logging and fix multithreaded replay bench Sep 1, 2020
@harrism harrism added 3 - Ready for review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Sep 1, 2020
@harrism harrism added the 5 - Merge After Dependencies Depends on another PR: do not merge out of order label Sep 1, 2020
@harrism (Member Author) commented Sep 1, 2020

Please do not merge. There appears to be a problem with the fmt lib included in spdlog and libcudf compilation.

@harrism (Member Author) commented Sep 2, 2020

This PR introduces a libcudf build error that I have root-caused to the fmt lib: fmtlib/fmt#1850. I have a fix (https://github.com/fmtlib/fmt/pull/1852), but that needs to be incorporated into spdlog: gabime/spdlog#1661

So I think I will need to patch the file (a 2-character fix) locally as part of the present PR rather than wait...

Update: patch added.

@harrism (Member Author) commented Sep 4, 2020

Turns out there's no bug in fmt; the bug is in our copy of libcudacxx (already fixed in the mainline libcudacxx included in the CUDA toolkit). I put a better (simpler) workaround into this PR, which can be removed once libcudacxx is available on GitHub so we can depend on it.

```cpp
    default_log_filename(), true  // truncate file
  )}
{
  logger_.set_pattern("[%l][%t][%H:%M:%S:%f] %v");
```
@cwharris (Contributor) commented Sep 4, 2020:

Just a thought: using a different format, we could sort lines in the log file to effectively group by thread id, which might be useful? This has the additional advantage of visually aligning the timestamp and thread id (the log level string is variable width).

[23:59:58:056789][1232][debug]

Suggested change:

```diff
- logger_.set_pattern("[%l][%t][%H:%M:%S:%f] %v");
+ logger_.set_pattern("[%t][%H:%M:%S:%f][%l] %v");
```

edit: updated log pattern

@harrism (Member Author):

I don't understand why you can't sort by thread ID with the existing format?

@cwharris (Contributor):

First, my suggestion is formatted incorrectly. The format I had in mind is:

[%t][%H:%M:%S:%f][%l] %v

It would be nice to open a log file in a text editor and "sort by lines ascending" to group the logs by thread. For that to work, the thread id would need to be printed first. The second part would need to be the timestamp, otherwise the logs would get out of order, even within a specific thread id. Therefore, the log level would need to be last. Since everything prior to the log level is fixed width, you also end up with a nice visual alignment.

[1232][23:59:58:056789][debug]

@harrism (Member Author):

In my experience grouping by thread has not been useful. You need to see what is happening in time order due to multiple threads using a resource. That said, I guess this change won't hurt anything. I'll fix the visual alignment with an explicit width for the level.
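For example (the exact width flag is illustrative; spdlog's pattern flags support width/alignment padding such as %-8l, and "critical" is the longest level name at 8 characters):

```cpp
// Pads the level name to a fixed width so the columns line up visually:
logger_.set_pattern("[%t][%H:%M:%S:%f][%-8l] %v");
```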

@harrism (Member Author):

Done.

@cwharris (Contributor) left a comment:

lgtm.

fyi, the log format suggestion was just an idea. I don't know how useful it would be in practice. 👍

@harrism (Member Author) commented Sep 4, 2020

> lgtm.
>
> fyi, the log format suggestion was just an idea. I don't know how useful it would be in practice. +1

Thanks for the review!

Closes #431: [BUG] Multi-threaded replay benchmark broken when an allocation is freed on a different thread