This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Refactoring of Pooled Storage Manager classes (#18582)
* Refactoring of Pooled Storage Manager classes

* Adding test for new functionality

* Fixing compilation problems which appear for MXNET_USE_CUDA=0

* Fixing compilation problems for WINDOWS and ANDROID

* Fixing compilation problems which appear for WINDOWS and __APPLE__

* Fixing lint problems

* test_dataloader_context(): Bypassing custom_dev_id pinned mem test on system with GPUs < 2.

* Fixing compilation for Android. Elimination of unused includes.

* Fixing problems with CPUPinned Storage Manager which appear when MXNET_USE_CUDA = 0

* Removing test_bucketing.py

* Improving the CPU_Pinned Pooled Storage Manager case.

* Fixing lint problem

* Moved the GPU profiling command calls into the mutex-protected area

* Fixing lint problem

* Improved reporting regarding the Storage Manager used.

* Fixing lint problem

* Trigger CI

* Removing some comments, as suggested by @szha

* Trigger CI

* Trigger CI

Co-authored-by: andreii <[email protected]>
andrei5055 and drivanov committed Jul 16, 2020
1 parent 2abf0b8 commit 3ef00b8
Showing 12 changed files with 732 additions and 478 deletions.
67 changes: 53 additions & 14 deletions docs/static_site/src/pages/api/faq/env_var.md
@@ -85,28 +85,67 @@ $env:MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
- Setting this to a small number can save GPU memory. It will also likely decrease the level of parallelism, which is usually acceptable.
- MXNet internally uses graph coloring algorithm to [optimize memory consumption]({{'/api/architecture/note_memory'|relative_url}}).
- This parameter is also used to get number of matching colors in graph and in turn how much parallelism one can get in each GPU. Color based match usually costs more memory but also enables more parallelism.
* MXNET_GPU_MEM_POOL_TYPE
- Values: String ```(default=Naive)```
- The type of GPU memory pool.
- Choices:
- *Naive*: A simple memory pool that allocates memory of the requested size and caches memory buffers when the memory is released. The size of each memory chunk is defined by rounding the requested size up to the nearest multiple of MXNET_GPU_MEM_POOL_PAGE_SIZE (or of MXNET_GPU_MEM_LARGE_ALLOC_ROUND_SIZE, when the result of rounding to MXNET_GPU_MEM_POOL_PAGE_SIZE is bigger than MXNET_GPU_MEM_LARGE_ALLOC_ROUND_SIZE); memory of the rounded size is then allocated.
- *Round*: A memory pool that tries to round the requested memory size up to the nearest power of 2. When the rounded number is bigger than 2**MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF, the *Naive* rounding algorithm is used. Caching and allocating buffered memory works in the same way as in the naive memory pool.
- *Unpooled*: No memory pool is used.
* MXNET_GPU_MEM_POOL_RESERVE
- Values: Int ```(default=5)```
- The percentage of GPU memory to reserve for things other than the GPU array, such as kernel launch or cudnn handle space.
- The value is used only by the GPU memory pool. If it is not possible to allocate new memory AND still save this reserve, the memory pool will free the cached memory.
- If you see a strange out-of-memory error from the kernel launch, after multiple iterations, try setting this to a larger value.

* MXNET_GPU_MEM_LARGE_ALLOC_ROUND_SIZE
- Values: Int ```(default=2097152)```
- When the rounded size of memory allocations calculated by the pool of *Naive* type is larger than this threshold, it will be rounded up to a multiple of this value.
- The default was chosen to minimize global memory fragmentation within the GPU driver. Set this to 1 to disable.
* MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF
- Values: Int ```(default=24)```
- The cutoff threshold used by the *Round* strategy. Let's denote the threshold as T. If the memory size is smaller than `2 ** T` (by default, `2 ** 24` = 16MB), it is rounded up to the smallest `2 ** n` that is larger than the requested memory size; if the memory size is larger than `2 ** T`, it is rounded up to the next `k * 2 ** T`.
* MXNET_CPU_MEM_POOL_TYPE
- Values: String ```(default=Naive)```
- The type of CPU memory pool.
- Choices:
- *Naive*: A simple memory pool that allocates memory of the requested size and caches memory buffers when the memory is released. The size of each memory chunk is defined by rounding the requested size up to the nearest multiple of MXNET_CPU_MEM_POOL_PAGE_SIZE (or of MXNET_CPU_MEM_LARGE_ALLOC_ROUND_SIZE, when the result of rounding to MXNET_CPU_MEM_POOL_PAGE_SIZE is bigger than MXNET_CPU_MEM_LARGE_ALLOC_ROUND_SIZE); memory of the rounded size is then allocated.
- *Round*: A memory pool that tries to round the requested memory size up to the nearest power of 2. When the rounded number is bigger than 2**MXNET_CPU_MEM_POOL_ROUND_LINEAR_CUTOFF, the *Naive* rounding algorithm is used. Caching and allocating buffered memory works in the same way as in the naive memory pool.
- *Unpooled*: No memory pool is used.
* MXNET_CPU_MEM_POOL_RESERVE
- Values: Int ```(default=5)```
- The percentage of CPU memory to reserve for things other than the CPU array.
- The value is used only by the CPU memory pool. If it is not possible to allocate new memory AND still save this reserve, the memory pool will free the cached memory.
- If you see a strange out-of-memory error after multiple iterations, try setting this to a larger value.
* MXNET_CPU_MEM_LARGE_ALLOC_ROUND_SIZE
- Values: Int ```(default=2097152)```
- When the rounded size of memory allocations calculated by the pool of *Naive* type is larger than this threshold, it will be rounded up to a multiple of this value.
- Set this to 1 to disable.
* MXNET_CPU_MEM_POOL_ROUND_LINEAR_CUTOFF
- Values: Int ```(default=24)```
- The cutoff threshold that decides the rounding strategy. Let's denote the threshold as T. If the memory size is smaller than `2 ** T` (by default, it's 2 ** 24 = 16MB), it rounds to the smallest `2 ** n` that is larger than the requested memory size; if the memory size is larger than `2 ** T`, it rounds to the next k * 2 ** T.

* MXNET_CPU_PINNED_MEM_POOL_TYPE
- Values: String ```(default=Naive)```
- The type of CPU_PINNED memory pool.
- Choices:
- *Naive*: A simple memory pool that allocates memory of the requested size and caches memory buffers when the memory is released. The size of each memory chunk is defined by rounding the requested size up to the nearest multiple of MXNET_CPU_PINNED_MEM_POOL_PAGE_SIZE (or of MXNET_CPU_PINNED_MEM_LARGE_ALLOC_ROUND_SIZE, when the result of rounding to MXNET_CPU_PINNED_MEM_POOL_PAGE_SIZE is bigger than MXNET_CPU_PINNED_MEM_LARGE_ALLOC_ROUND_SIZE); memory of the rounded size is then allocated.
- *Round*: A memory pool that tries to round the requested memory size up to the nearest power of 2. When the rounded number is bigger than 2**MXNET_CPU_PINNED_MEM_POOL_ROUND_LINEAR_CUTOFF, the *Naive* rounding algorithm is used. Caching and allocating buffered memory works in the same way as in the naive memory pool.
- *Unpooled*: No memory pool is used.
* MXNET_CPU_PINNED_MEM_POOL_RESERVE
- Values: Int ```(default=5)```
- The percentage of CPU_PINNED memory to reserve for things other than the CPU_PINNED array.
- The value is used only by the CPU_PINNED memory pool. If it is not possible to allocate new memory AND still save this reserve, the memory pool will free the cached memory.
- If you see a strange out-of-memory error after multiple iterations, try setting this to a larger value.
* MXNET_CPU_PINNED_MEM_LARGE_ALLOC_ROUND_SIZE
- Values: Int ```(default=2097152)```
- When the rounded size of memory allocations calculated by the pool of *Naive* type is larger than this threshold, it will be rounded up to a multiple of this value.
- The default was chosen to minimize global memory fragmentation within the GPU driver. Set this to 1 to disable.
* MXNET_CPU_PINNED_MEM_POOL_ROUND_LINEAR_CUTOFF
- Values: Int ```(default=24)```
- The cutoff threshold used by the *Round* strategy. Let's denote the threshold as T. If the memory size is smaller than `2 ** T` (by default, `2 ** 24` = 16MB), it is rounded up to the smallest `2 ** n` that is larger than the requested memory size; if the memory size is larger than `2 ** T`, it is rounded up to the next `k * 2 ** T`.
* MXNET_USE_NAIVE_STORAGE_MANAGERS
- Values: Int ```(default=0)```
- When the value is not 0, no memory pools are used for any of the three memory types: GPU, CPU, CPU_PINNED.

## Engine Type

* MXNET_ENGINE_TYPE
20 changes: 13 additions & 7 deletions src/profiler/storage_profiler.h
@@ -162,16 +162,17 @@ class GpuDeviceStorageProfiler {
   }
 }

+  inline void OnFree(void *dptr) {
+    // In case of bug which tries to free first
+    if (gpu_mem_alloc_entries_.find(dptr) != gpu_mem_alloc_entries_.end())
+      gpu_mem_alloc_entries_.erase(dptr);
+  }
+
   void OnFree(const Storage::Handle &handle) {
     if (handle.size > 0) {
       profiler::Profiler *prof = profiler::Profiler::Get();
-      if (prof->IsProfiling(profiler::Profiler::kMemory)) {
-        // In case of bug which tries to free first
-        if (gpu_mem_alloc_entries_.find(handle.dptr) !=
-            gpu_mem_alloc_entries_.end()) {
-          gpu_mem_alloc_entries_.erase(handle.dptr);
-        }
-      }
+      if (prof->IsProfiling(profiler::Profiler::kMemory))
+        OnFree(handle.dptr);
     }
   }

@@ -195,6 +196,11 @@
   /*! \brief dump the allocation entries to file */
   void DumpProfile() const;

+  bool inline IsProfiling() const {
+    profiler::Profiler *prof = profiler::Profiler::Get();
+    return prof->IsProfiling(profiler::Profiler::kMemory);
+  }
+
  private:
   std::string filename_prefix_ = "gpu_memory_profile";
   /*! \brief Dynamically-sized dictionary of memory profile counters */
11 changes: 2 additions & 9 deletions src/storage/cpu_device_storage.h
@@ -25,9 +25,6 @@
 #ifndef MXNET_STORAGE_CPU_DEVICE_STORAGE_H_
 #define MXNET_STORAGE_CPU_DEVICE_STORAGE_H_

-#include <dmlc/logging.h>
-#include <cstdlib>
-#include <new>
 #include "mxnet/base.h"

 namespace mxnet {
@@ -63,15 +60,11 @@ class CPUDeviceStorage {
 };  // class CPUDeviceStorage

 inline void CPUDeviceStorage::Alloc(Storage::Handle* handle) {
-  handle->dptr = nullptr;
-  const size_t size = handle->size;
-  if (size == 0) return;
-
 #if _MSC_VER
-  handle->dptr = _aligned_malloc(size, alignment_);
+  handle->dptr = _aligned_malloc(handle->size, alignment_);
   if (handle->dptr == nullptr) LOG(FATAL) << "Failed to allocate CPU Memory";
 #else
-  int ret = posix_memalign(&handle->dptr, alignment_, size);
+  int ret = posix_memalign(&handle->dptr, alignment_, handle->size);
   if (ret != 0) LOG(FATAL) << "Failed to allocate CPU Memory";
 #endif
 }
16 changes: 1 addition & 15 deletions src/storage/cpu_shared_storage_manager.h
@@ -25,23 +25,14 @@
#ifndef _WIN32
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#else
#include <Windows.h>
#include <process.h>
#endif // _WIN32

#include <unordered_map>
#include <vector>
#include <atomic>
#include <iostream>
#include <mutex>
#include <new>
#include <string>
#include <limits>

#include "./storage_manager.h"

namespace mxnet {
@@ -115,11 +106,6 @@ class CPUSharedStorageManager final : public StorageManager {
 };  // class CPUSharedStorageManager

 void CPUSharedStorageManager::Alloc(Storage::Handle* handle) {
-  if (handle->size == 0) {
-    handle->dptr = nullptr;
-    return;
-  }
-
   std::lock_guard<std::recursive_mutex> lock(mutex_);
   std::uniform_int_distribution<> dis(0, std::numeric_limits<int>::max());
   int fid = -1;
@@ -140,7 +126,7 @@ void CPUSharedStorageManager::Alloc(Storage::Handle* handle) {
     map_handle = CreateFileMapping(INVALID_HANDLE_VALUE,
                                    nullptr, PAGE_READWRITE, 0, size, filename.c_str());
     if ((error = GetLastError()) == ERROR_SUCCESS) {
-      break;;
+      break;
     }
   }
} else {
37 changes: 5 additions & 32 deletions src/storage/gpu_device_storage.h
@@ -25,14 +25,8 @@
 #ifndef MXNET_STORAGE_GPU_DEVICE_STORAGE_H_
 #define MXNET_STORAGE_GPU_DEVICE_STORAGE_H_
+#if MXNET_USE_CUDA

-#include "mxnet/base.h"
-#include "mxnet/storage.h"
-#include "../common/cuda_utils.h"
-#include "../profiler/storage_profiler.h"
-#if MXNET_USE_CUDA
-#include <cuda_runtime.h>
-#endif  // MXNET_USE_CUDA
-#include <new>
+#include "mxnet/storage.h"

 namespace mxnet {
 namespace storage {
@@ -55,46 +49,25 @@ class GPUDeviceStorage {
 };  // class GPUDeviceStorage

 inline void GPUDeviceStorage::Alloc(Storage::Handle* handle) {
-  handle->dptr = nullptr;
-  const size_t size = handle->size;
-  if (size == 0) return;

-#if MXNET_USE_CUDA
   mxnet::common::cuda::DeviceStore device_store(handle->ctx.real_dev_id(), true);
 #if MXNET_USE_NCCL
   std::lock_guard<std::mutex> l(Storage::Get()->GetMutex(Context::kGPU));
 #endif  // MXNET_USE_NCCL
-  cudaError_t e = cudaMalloc(&handle->dptr, size);
-  if (e != cudaSuccess && e != cudaErrorCudartUnloading) {
-    LOG(FATAL) << "CUDA: " << cudaGetErrorString(e);
-  }
-  // record the allocation event in the memory profiler
-  profiler::GpuDeviceStorageProfiler::Get()->OnAlloc(*handle, size, false);
-#else  // MXNET_USE_CUDA
-  LOG(FATAL) << "Please compile with CUDA enabled";
-#endif  // MXNET_USE_CUDA
+  CUDA_CALL(cudaMalloc(&handle->dptr, handle->size));
+  profiler::GpuDeviceStorageProfiler::Get()->OnAlloc(*handle, handle->size, false);
 }

 inline void GPUDeviceStorage::Free(Storage::Handle handle) {
-#if MXNET_USE_CUDA
   mxnet::common::cuda::DeviceStore device_store(handle.ctx.real_dev_id(), true);
 #if MXNET_USE_NCCL
   std::lock_guard<std::mutex> l(Storage::Get()->GetMutex(Context::kGPU));
 #endif  // MXNET_USE_NCCL
-  // throw special exception for caller to catch.
-  cudaError_t err = cudaFree(handle.dptr);
-  // ignore unloading error, as memory has already been recycled
-  if (err != cudaSuccess && err != cudaErrorCudartUnloading) {
-    LOG(FATAL) << "CUDA: " << cudaGetErrorString(err);
-  }
-  // record the deallocation event in the memory profiler
+  CUDA_CALL(cudaFree(handle.dptr))
   profiler::GpuDeviceStorageProfiler::Get()->OnFree(handle);
-#else  // MXNET_USE_CUDA
-  LOG(FATAL) << "Please compile with CUDA enabled";
-#endif  // MXNET_USE_CUDA
 }

} // namespace storage
} // namespace mxnet

+#endif  // MXNET_USE_CUDA
#endif // MXNET_STORAGE_GPU_DEVICE_STORAGE_H_
1 change: 0 additions & 1 deletion src/storage/naive_storage_manager.h
@@ -26,7 +26,6 @@
 #define MXNET_STORAGE_NAIVE_STORAGE_MANAGER_H_

 #include "storage_manager.h"
-#include "mxnet/base.h"

 namespace mxnet {
 namespace storage {
20 changes: 2 additions & 18 deletions src/storage/pinned_memory_storage.h
@@ -26,11 +26,7 @@
 #define MXNET_STORAGE_PINNED_MEMORY_STORAGE_H_
 #if MXNET_USE_CUDA

-#include <dmlc/logging.h>
-#include "mxnet/base.h"
 #include "mxnet/storage.h"
-#include "../common/cuda_utils.h"
-#include "../profiler/storage_profiler.h"

 namespace mxnet {
 namespace storage {
@@ -51,32 +47,20 @@ class PinnedMemoryStorage {
 };

 inline void PinnedMemoryStorage::Alloc(Storage::Handle* handle) {
-  handle->dptr = nullptr;
-  const size_t size = handle->size;
-  if (size == 0) return;

 #if MXNET_USE_NCCL
   std::lock_guard<std::mutex> lock(Storage::Get()->GetMutex(Context::kGPU));
 #endif
   mxnet::common::cuda::DeviceStore device_store(handle->ctx.real_dev_id(), true);
   // make the memory available across all devices
-  CUDA_CALL(cudaHostAlloc(&handle->dptr, size, cudaHostAllocPortable));
-  // record the allocation event in the memory profiler
-  profiler::GpuDeviceStorageProfiler::Get()->OnAlloc(*handle, size, false);
+  CUDA_CALL(cudaHostAlloc(&handle->dptr, handle->size, cudaHostAllocPortable));
 }

 inline void PinnedMemoryStorage::Free(Storage::Handle handle) {
 #if MXNET_USE_NCCL
   std::lock_guard<std::mutex> lock(Storage::Get()->GetMutex(Context::kGPU));
 #endif
   mxnet::common::cuda::DeviceStore device_store(handle.ctx.real_dev_id(), true);
-  cudaError_t err = cudaFreeHost(handle.dptr);
-  // ignore unloading error, as memory has already been recycled
-  if (err != cudaSuccess && err != cudaErrorCudartUnloading) {
-    LOG(FATAL) << "CUDA: " << cudaGetErrorString(err);
-  }
-  // record the deallocation event in the memory profiler
-  profiler::GpuDeviceStorageProfiler::Get()->OnFree(handle);
+  CUDA_CALL(cudaFreeHost(handle.dptr));
 }

} // namespace storage