This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Refactoring of Pooled Storage Manager classes (#18582)
* Refactoring of Pooled Storage Manager classes

* Adding test for new functionality

* Fixing compilation problems which appear for MXNET_USE_CUDA=0

* Fixing compilation problems for WINDOWS and ANDROID

* Fixing compilation problems which appear for WINDOWS and __APPLE__

* Fixing lint problems

* test_dataloader_context(): Bypassing custom_dev_id pinned mem test on system with GPUs < 2.

* Fixing compilation for Android. Elimination of unused includes.

* Fixing problems with CPUPinned Storage Manager which appear when MXNET_USE_CUDA = 0

* Removing test_bucketing.py

* Improving the CPU_Pinned Pooled Storage Manager case.

* Fixing lint problem

* Moved the GPU profiling command calls into the mutex-protected area

* Fixing lint problem

* Improved reporting regarding the Storage Manager used.

* Fixing lint problem

* Trigger CI

* Removing some comments, as suggested by @szha

* Trigger CI

* Trigger CI

Co-authored-by: andreii <[email protected]>
andrei5055 and drivanov committed Jul 16, 2020
1 parent 2abf0b8 commit 3ef00b8
Showing 12 changed files with 732 additions and 478 deletions.
67 changes: 53 additions & 14 deletions docs/static_site/src/pages/api/faq/env_var.md
@@ -85,28 +85,67 @@ $env:MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
- Setting this to a small number can save GPU memory. It will also likely decrease the level of parallelism, which is usually acceptable.
- MXNet internally uses graph coloring algorithm to [optimize memory consumption]({{'/api/architecture/note_memory'|relative_url}}).
- This parameter is also used to get number of matching colors in graph and in turn how much parallelism one can get in each GPU. Color based match usually costs more memory but also enables more parallelism.
* MXNET_GPU_MEM_POOL_TYPE
- Values: String ```(default=Naive)```
- The type of GPU memory pool.
- Choices:
- *Naive*: A simple memory pool that allocates memory of the requested size and caches memory buffers when the memory is released. The size of each memory chunk is defined by rounding the requested size up to the nearest multiple of MXNET_GPU_MEM_POOL_PAGE_SIZE (or of MXNET_GPU_MEM_LARGE_ALLOC_ROUND_SIZE, when the result of rounding to MXNET_GPU_MEM_POOL_PAGE_SIZE is bigger than MXNET_GPU_MEM_LARGE_ALLOC_ROUND_SIZE); memory of the rounded size is then allocated.
- *Round*: A memory pool that tries to round the requested memory size up to the nearest power of 2. When the rounded number is bigger than 2**MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF, the *Naive* rounding algorithm is used. Caching and allocating buffered memory works in the same way as in the naive memory pool.
- *Unpooled*: No memory pool is used.
* MXNET_GPU_MEM_POOL_RESERVE
- Values: Int ```(default=5)```
- The percentage of GPU memory to reserve for things other than the GPU array, such as kernel launch or cudnn handle space.
- The value is used only by the GPU memory pool. If it is not possible to allocate new memory AND still save this reserve, the memory pool will free the cached memory.
- If you see a strange out-of-memory error from the kernel launch, after multiple iterations, try setting this to a larger value.

* MXNET_GPU_MEM_LARGE_ALLOC_ROUND_SIZE
- Values: Int ```(default=2097152)```
- When the rounded size of memory allocations calculated by the pool of *Naive* type is larger than this threshold, it will be rounded up to a multiple of this value.
- The default was chosen to minimize global memory fragmentation within the GPU driver. Set this to 1 to disable.
* MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF
- Values: Int ```(default=24)```
- The cutoff threshold used by the *Round* strategy. Let's denote the threshold as T. If the memory size is smaller than `2 ** T` (by default, `2 ** 24` = 16MB), it is rounded up to the smallest `2 ** n` that is larger than the requested memory size; if the memory size is larger than `2 ** T`, it is rounded up to the next `k * 2 ** T`.
* MXNET_CPU_MEM_POOL_TYPE
- Values: String ```(default=Naive)```
- The type of CPU memory pool.
- Choices:
- *Naive*: A simple memory pool that allocates memory of the requested size and caches memory buffers when the memory is released. The size of each memory chunk is defined by rounding the requested size up to the nearest multiple of MXNET_CPU_MEM_POOL_PAGE_SIZE (or of MXNET_CPU_MEM_LARGE_ALLOC_ROUND_SIZE, when the result of rounding to MXNET_CPU_MEM_POOL_PAGE_SIZE is bigger than MXNET_CPU_MEM_LARGE_ALLOC_ROUND_SIZE); memory of the rounded size is then allocated.
- *Round*: A memory pool that tries to round the requested memory size up to the nearest power of 2. When the rounded number is bigger than 2**MXNET_CPU_MEM_POOL_ROUND_LINEAR_CUTOFF, the *Naive* rounding algorithm is used. Caching and allocating buffered memory works in the same way as in the naive memory pool.
- *Unpooled*: No memory pool is used.
* MXNET_CPU_MEM_POOL_RESERVE
- Values: Int ```(default=5)```
- The percentage of CPU memory to reserve for things other than the CPU array.
- The value is used only by the CPU memory pool. If it is not possible to allocate new memory AND still save this reserve, the memory pool will free the cached memory.
- If you see a strange out-of-memory error after multiple iterations, try setting this to a larger value.
* MXNET_CPU_MEM_LARGE_ALLOC_ROUND_SIZE
- Values: Int ```(default=2097152)```
- When the rounded size of memory allocations calculated by the pool of *Naive* type is larger than this threshold, it will be rounded up to a multiple of this value.
- Set this to 1 to disable.
* MXNET_CPU_MEM_POOL_ROUND_LINEAR_CUTOFF
- Values: Int ```(default=24)```
- The cutoff threshold that decides the rounding strategy. Let's denote the threshold as T. If the memory size is smaller than `2 ** T` (by default, it's 2 ** 24 = 16MB), it rounds to the smallest `2 ** n` that is larger than the requested memory size; if the memory size is larger than `2 ** T`, it rounds to the next k * 2 ** T.

* MXNET_CPU_PINNED_MEM_POOL_TYPE
- Values: String ```(default=Naive)```
- The type of CPU_PINNED memory pool.
- Choices:
- *Naive*: A simple memory pool that allocates memory of the requested size and caches memory buffers when the memory is released. The size of each memory chunk is defined by rounding the requested size up to the nearest multiple of MXNET_CPU_PINNED_MEM_POOL_PAGE_SIZE (or of MXNET_CPU_PINNED_MEM_LARGE_ALLOC_ROUND_SIZE, when the result of rounding to MXNET_CPU_PINNED_MEM_POOL_PAGE_SIZE is bigger than MXNET_CPU_PINNED_MEM_LARGE_ALLOC_ROUND_SIZE); memory of the rounded size is then allocated.
- *Round*: A memory pool that tries to round the requested memory size up to the nearest power of 2. When the rounded number is bigger than 2**MXNET_CPU_PINNED_MEM_POOL_ROUND_LINEAR_CUTOFF, the *Naive* rounding algorithm is used. Caching and allocating buffered memory works in the same way as in the naive memory pool.
- *Unpooled*: No memory pool is used.
* MXNET_CPU_PINNED_MEM_POOL_RESERVE
- Values: Int ```(default=5)```
- The percentage of CPU_PINNED memory to reserve for things other than the CPU_PINNED array.
- The value is used only by the CPU_PINNED memory pool. If it is not possible to allocate new memory AND still save this reserve, the memory pool will free the cached memory.
- If you see a strange out-of-memory error after multiple iterations, try setting this to a larger value.
* MXNET_CPU_PINNED_MEM_LARGE_ALLOC_ROUND_SIZE
- Values: Int ```(default=2097152)```
- When the rounded size of memory allocations calculated by the pool of *Naive* type is larger than this threshold, it will be rounded up to a multiple of this value.
- The default was chosen to minimize global memory fragmentation within the GPU driver. Set this to 1 to disable.
* MXNET_CPU_PINNED_MEM_POOL_ROUND_LINEAR_CUTOFF
- Values: Int ```(default=24)```
- The cutoff threshold used by the *Round* strategy. Let's denote the threshold as T. If the memory size is smaller than `2 ** T` (by default, `2 ** 24` = 16MB), it is rounded up to the smallest `2 ** n` that is larger than the requested memory size; if the memory size is larger than `2 ** T`, it is rounded up to the next `k * 2 ** T`.
* MXNET_USE_NAIVE_STORAGE_MANAGERS
- Values: Int ```(default=0)```
- When the value is not 0, no memory pools are used for any of the three memory types: GPU, CPU, CPU_PINNED.

## Engine Type

* MXNET_ENGINE_TYPE
20 changes: 13 additions & 7 deletions src/profiler/storage_profiler.h
@@ -162,16 +162,17 @@ class GpuDeviceStorageProfiler {
   }
 }

+  inline void OnFree(void *dptr) {
+    // In case of bug which tries to free first
+    if (gpu_mem_alloc_entries_.find(dptr) != gpu_mem_alloc_entries_.end())
+      gpu_mem_alloc_entries_.erase(dptr);
+  }
+
   void OnFree(const Storage::Handle &handle) {
     if (handle.size > 0) {
       profiler::Profiler *prof = profiler::Profiler::Get();
-      if (prof->IsProfiling(profiler::Profiler::kMemory)) {
-        // In case of bug which tries to free first
-        if (gpu_mem_alloc_entries_.find(handle.dptr) !=
-            gpu_mem_alloc_entries_.end()) {
-          gpu_mem_alloc_entries_.erase(handle.dptr);
-        }
-      }
+      if (prof->IsProfiling(profiler::Profiler::kMemory))
+        OnFree(handle.dptr);
     }
   }

@@ -195,6 +196,11 @@
   /*! \brief dump the allocation entries to file */
   void DumpProfile() const;

+  bool inline IsProfiling() const {
+    profiler::Profiler *prof = profiler::Profiler::Get();
+    return prof->IsProfiling(profiler::Profiler::kMemory);
+  }
+
  private:
   std::string filename_prefix_ = "gpu_memory_profile";
   /*! \brief Dynamically-sized dictionary of memory profile counters */
11 changes: 2 additions & 9 deletions src/storage/cpu_device_storage.h
@@ -25,9 +25,6 @@
 #ifndef MXNET_STORAGE_CPU_DEVICE_STORAGE_H_
 #define MXNET_STORAGE_CPU_DEVICE_STORAGE_H_

-#include <dmlc/logging.h>
-#include <cstdlib>
-#include <new>
 #include "mxnet/base.h"

 namespace mxnet {
@@ -63,15 +60,11 @@ class CPUDeviceStorage {
 };  // class CPUDeviceStorage

 inline void CPUDeviceStorage::Alloc(Storage::Handle* handle) {
-  handle->dptr = nullptr;
-  const size_t size = handle->size;
-  if (size == 0) return;
-
 #if _MSC_VER
-  handle->dptr = _aligned_malloc(size, alignment_);
+  handle->dptr = _aligned_malloc(handle->size, alignment_);
   if (handle->dptr == nullptr) LOG(FATAL) << "Failed to allocate CPU Memory";
 #else
-  int ret = posix_memalign(&handle->dptr, alignment_, size);
+  int ret = posix_memalign(&handle->dptr, alignment_, handle->size);
   if (ret != 0) LOG(FATAL) << "Failed to allocate CPU Memory";
 #endif
 }
16 changes: 1 addition & 15 deletions src/storage/cpu_shared_storage_manager.h
@@ -25,23 +25,14 @@
#ifndef _WIN32
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#else
#include <Windows.h>
#include <process.h>
#endif // _WIN32

#include <unordered_map>
#include <vector>
#include <atomic>
#include <iostream>
#include <mutex>
#include <new>
#include <string>
#include <limits>

#include "./storage_manager.h"

namespace mxnet {
@@ -115,11 +106,6 @@ class CPUSharedStorageManager final : public StorageManager {
 };  // class CPUSharedStorageManager

 void CPUSharedStorageManager::Alloc(Storage::Handle* handle) {
-  if (handle->size == 0) {
-    handle->dptr = nullptr;
-    return;
-  }
-
   std::lock_guard<std::recursive_mutex> lock(mutex_);
   std::uniform_int_distribution<> dis(0, std::numeric_limits<int>::max());
   int fid = -1;
@@ -140,7 +126,7 @@ void CPUSharedStorageManager::Alloc(Storage::Handle* handle) {
     map_handle = CreateFileMapping(INVALID_HANDLE_VALUE,
                                    nullptr, PAGE_READWRITE, 0, size, filename.c_str());
     if ((error = GetLastError()) == ERROR_SUCCESS) {
-      break;;
+      break;
     }
   }
} else {
37 changes: 5 additions & 32 deletions src/storage/gpu_device_storage.h
@@ -25,14 +25,8 @@
 #ifndef MXNET_STORAGE_GPU_DEVICE_STORAGE_H_
 #define MXNET_STORAGE_GPU_DEVICE_STORAGE_H_
+#if MXNET_USE_CUDA

-#include "mxnet/base.h"
-#include "mxnet/storage.h"
-#include "../common/cuda_utils.h"
-#include "../profiler/storage_profiler.h"
-#if MXNET_USE_CUDA
-#include <cuda_runtime.h>
-#endif  // MXNET_USE_CUDA
-#include <new>
+#include "mxnet/storage.h"

 namespace mxnet {
 namespace storage {
@@ -55,46 +49,25 @@ class GPUDeviceStorage {
 };  // class GPUDeviceStorage

 inline void GPUDeviceStorage::Alloc(Storage::Handle* handle) {
-  handle->dptr = nullptr;
-  const size_t size = handle->size;
-  if (size == 0) return;

-#if MXNET_USE_CUDA
   mxnet::common::cuda::DeviceStore device_store(handle->ctx.real_dev_id(), true);
 #if MXNET_USE_NCCL
   std::lock_guard<std::mutex> l(Storage::Get()->GetMutex(Context::kGPU));
 #endif  // MXNET_USE_NCCL
-  cudaError_t e = cudaMalloc(&handle->dptr, size);
-  if (e != cudaSuccess && e != cudaErrorCudartUnloading) {
-    LOG(FATAL) << "CUDA: " << cudaGetErrorString(e);
-  }
-  // record the allocation event in the memory profiler
-  profiler::GpuDeviceStorageProfiler::Get()->OnAlloc(*handle, size, false);
-#else  // MXNET_USE_CUDA
-  LOG(FATAL) << "Please compile with CUDA enabled";
-#endif  // MXNET_USE_CUDA
+  CUDA_CALL(cudaMalloc(&handle->dptr, handle->size));
+  profiler::GpuDeviceStorageProfiler::Get()->OnAlloc(*handle, handle->size, false);
 }

 inline void GPUDeviceStorage::Free(Storage::Handle handle) {
-#if MXNET_USE_CUDA
   mxnet::common::cuda::DeviceStore device_store(handle.ctx.real_dev_id(), true);
 #if MXNET_USE_NCCL
   std::lock_guard<std::mutex> l(Storage::Get()->GetMutex(Context::kGPU));
 #endif  // MXNET_USE_NCCL
-  // throw special exception for caller to catch.
-  cudaError_t err = cudaFree(handle.dptr);
-  // ignore unloading error, as memory has already been recycled
-  if (err != cudaSuccess && err != cudaErrorCudartUnloading) {
-    LOG(FATAL) << "CUDA: " << cudaGetErrorString(err);
-  }
-  // record the deallocation event in the memory profiler
+  CUDA_CALL(cudaFree(handle.dptr))
   profiler::GpuDeviceStorageProfiler::Get()->OnFree(handle);
-#else  // MXNET_USE_CUDA
-  LOG(FATAL) << "Please compile with CUDA enabled";
-#endif  // MXNET_USE_CUDA
 }

} // namespace storage
} // namespace mxnet

+#endif  // MXNET_USE_CUDA
#endif // MXNET_STORAGE_GPU_DEVICE_STORAGE_H_
1 change: 0 additions & 1 deletion src/storage/naive_storage_manager.h
@@ -26,7 +26,6 @@
 #define MXNET_STORAGE_NAIVE_STORAGE_MANAGER_H_

 #include "storage_manager.h"
-#include "mxnet/base.h"

 namespace mxnet {
 namespace storage {
20 changes: 2 additions & 18 deletions src/storage/pinned_memory_storage.h
@@ -26,11 +26,7 @@
 #define MXNET_STORAGE_PINNED_MEMORY_STORAGE_H_
 #if MXNET_USE_CUDA

-#include <dmlc/logging.h>
-#include "mxnet/base.h"
 #include "mxnet/storage.h"
-#include "../common/cuda_utils.h"
-#include "../profiler/storage_profiler.h"

 namespace mxnet {
 namespace storage {
@@ -51,32 +47,20 @@ class PinnedMemoryStorage {
 };

 inline void PinnedMemoryStorage::Alloc(Storage::Handle* handle) {
-  handle->dptr = nullptr;
-  const size_t size = handle->size;
-  if (size == 0) return;

 #if MXNET_USE_NCCL
   std::lock_guard<std::mutex> lock(Storage::Get()->GetMutex(Context::kGPU));
 #endif
   mxnet::common::cuda::DeviceStore device_store(handle->ctx.real_dev_id(), true);
   // make the memory available across all devices
-  CUDA_CALL(cudaHostAlloc(&handle->dptr, size, cudaHostAllocPortable));
-  // record the allocation event in the memory profiler
-  profiler::GpuDeviceStorageProfiler::Get()->OnAlloc(*handle, size, false);
+  CUDA_CALL(cudaHostAlloc(&handle->dptr, handle->size, cudaHostAllocPortable));
 }

 inline void PinnedMemoryStorage::Free(Storage::Handle handle) {
 #if MXNET_USE_NCCL
   std::lock_guard<std::mutex> lock(Storage::Get()->GetMutex(Context::kGPU));
 #endif
   mxnet::common::cuda::DeviceStore device_store(handle.ctx.real_dev_id(), true);
-  cudaError_t err = cudaFreeHost(handle.dptr);
-  // ignore unloading error, as memory has already been recycled
-  if (err != cudaSuccess && err != cudaErrorCudartUnloading) {
-    LOG(FATAL) << "CUDA: " << cudaGetErrorString(err);
-  }
-  // record the deallocation event in the memory profiler
-  profiler::GpuDeviceStorageProfiler::Get()->OnFree(handle);
+  CUDA_CALL(cudaFreeHost(handle.dptr));
 }

} // namespace storage