Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
47 commits
Select commit Hold shift + click to select a range
bd4ce81
Update
Jokeren Jun 24, 2024
854d9ed
Update
Jokeren Jul 1, 2024
7a0903b
Update
Jokeren Jul 3, 2024
cd1f0cf
Update
Jokeren Jul 3, 2024
6409fed
Update
Jokeren Jul 3, 2024
25fbd18
Update
Jokeren Jul 3, 2024
b9b5543
Update
Jokeren Jul 3, 2024
3c4dd19
Update
Jokeren Jul 3, 2024
e5a4fd7
Update
Jokeren Jul 3, 2024
6c2be4f
Update
Jokeren Jul 3, 2024
2fef490
Update
Jokeren Jul 3, 2024
3adb522
Update
Jokeren Jul 3, 2024
cdcd24e
Update
Jokeren Jul 3, 2024
9a3161b
Update
Jokeren Jul 3, 2024
3f87312
Update
Jokeren Jul 3, 2024
7e9e0ce
Update
Jokeren Jul 4, 2024
33815a3
Update
Jokeren Jul 4, 2024
d283ce0
Update
Jokeren Jul 4, 2024
860ce1a
Update
Jokeren Jul 4, 2024
1075d1b
Update
Jokeren Jul 4, 2024
2a746a0
Update
Jokeren Jul 4, 2024
20cadc7
Update
Jokeren Jul 4, 2024
da2a9e7
Update
Jokeren Jul 7, 2024
20a3e10
Update
Jokeren Jul 7, 2024
0c96768
Update
Jokeren Jul 7, 2024
52840f9
Update
Jokeren Jul 7, 2024
fd5406a
Update
Jokeren Jul 7, 2024
ec7c96b
Add backend
Jokeren Jul 29, 2024
13c5e99
Update
Jokeren Jul 29, 2024
2c91e94
Merge branch 'main' into keren/cupti-pc-samples
Jokeren Jul 29, 2024
e2d88a8
Move files
Jokeren Jul 29, 2024
8872da0
Enable pc sampling with timing
Jokeren Jul 29, 2024
a5b6e2a
Merge branch 'main' into keren/cupti-pc-samples
Jokeren Sep 8, 2024
9aea49e
Update
Jokeren Sep 8, 2024
1f3adc0
Update
Jokeren Sep 8, 2024
c2d2fcf
Update
Jokeren Sep 9, 2024
e9e5cd5
Update
Jokeren Sep 9, 2024
8494e4b
Fix samples
Jokeren Sep 9, 2024
8fbb751
Update
Jokeren Sep 23, 2024
557d423
Update readme
Jokeren Sep 23, 2024
f8c043b
Update readme
Jokeren Sep 23, 2024
b291993
Merge branch 'main' into keren/cupti-pc-samples
Jokeren Sep 24, 2024
26aab6d
Update
Jokeren Sep 30, 2024
995647e
Update
Jokeren Sep 30, 2024
e761c7a
Update
Jokeren Sep 30, 2024
b9487c8
Merge branch 'keren/cupti-pc-samples' of github.com:openai/triton int…
Jokeren Sep 30, 2024
aee678a
Update
Jokeren Sep 30, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .github/workflows/integration-tests.yml
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ env:
TRITON_BUILD_WITH_CLANG_LLD: "TRUE"
TRITON_USE_ASSERT_ENABLED_LLVM: "TRUE"
TRITON_DISABLE_LINE_INFO: 1
PROTON_SKIP_PC_SAMPLING_TEST: 1
jobs:
Runner-Preparation:
runs-on: ubuntu-latest
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/integration-tests.yml.in
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ env:
TRITON_BUILD_WITH_CLANG_LLD: "TRUE"
TRITON_USE_ASSERT_ENABLED_LLVM: "TRUE"
TRITON_DISABLE_LINE_INFO: 1

PROTON_SKIP_PC_SAMPLING_TEST: 1

jobs:
Runner-Preparation:
Expand Down
35 changes: 32 additions & 3 deletions third_party/proton/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -119,7 +119,7 @@ flops64: float # The number of 64-bit floating-point operations
bytes: int # The number of bytes expected to be transferred
```

### Command Line
### Command line

Proton can be used as a command-line tool to profile Python scripts and Pytest tests.
The following examples demonstrate how to use Proton command-line.
Expand Down Expand Up @@ -149,6 +149,22 @@ More options can be found by running the following command.
proton-viewer -h
```

### Instruction sampling (experimental)

Proton supports instruction sampling on NVIDIA GPUs.
Please note that this is an experimental feature and may not work on all GPUs.
You may experience ~20x end-to-end overhead when using instruction sampling, although the overhead for each individual GPU kernel is negligible.
The overhead is mostly caused by data transfer and processing on the CPU.
Additionally, the proton-viewer options `-i <regex> -d <depth> -t <threshold>` can be helpful for filtering out GPU kernels that are not of interest.
The following example demonstrates how to use instruction sampling:

```python
import triton.profiler as proton


proton.start(name="profile_name", context="shadow", backend="cupti_pcsampling")
```

## Proton *vs* nsys

- Runtime overhead (up to 1.5x)
Expand All @@ -173,11 +189,24 @@ Proton is designed to be portable and can be used on AMD GPUs. nsys only support

Proton can register hooks to analyze the metadata of triton kernels, while nsys cannot. **Note** that the hooks do add additional overhead to proton.

## Known Issues
## Proton *vs* ncu

Similar to the comparison between Proton and Nsight Systems (Nsys), Proton has a lower profiling overhead than Nsight Compute (NCU). We also plan to support instruction sampling on AMD GPUs.
However, Nsight Compute supports the collection of more detailed metrics than Proton, such as memory access patterns, memory transactions, and other instruction-level metrics.
In contrast, Proton only supports instruction sampling and is designed to be lightweight and portable.

## Known issues

- CUDA Graph
- CUDA graph

`hooks` cannot be used to accurately accumulate the number of FLOPs in CUDA graph mode profiling because kernels are captured and launched separately; metrics are not accumulated when kernels are launched in graph mode. This issue can be circumvented by using `scope` to supply FLOPs.

If profiling is initiated after CUDA graph capturing, there may be minor memory leak issues.
This is because the number of kernels in a graph instance (i.e., `cuGraphExec`) is unknown, preventing the deletion of mappings between the kernel ID and the graph ID.

- Instruction sampling

If you encounter permission related problems when using instruction sampling, you can lookup this [page](https://developer.nvidia.com/nvidia-development-tools-solutions-err_nvgpuctrperm-permission-issue-performance-counters) for help.

The overhead of instruction sampling on NVIDIA GPUs is about 20x using Proton because we haven't enabled continuous sampling yet.
Continuous sampling can allow for more runtime optimizations, but it makes it more challenging to attribute performance data back to the GPU kernels because: (1) it enables profiling of concurrent kernels, (2) it doesn't allow profiling of time and instruction samples simultaneously, and (3) it works best if we have a separate thread dedicated to attributing instruction samples to the GPU kernels
76 changes: 73 additions & 3 deletions third_party/proton/csrc/include/Data/Metric.h
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@

namespace proton {

enum class MetricKind { Flexible, Kernel, Count };
enum class MetricKind { Flexible, Kernel, PCSampling, Count };

using MetricValueType = std::variant<uint64_t, int64_t, double, std::string>;

Expand Down Expand Up @@ -143,8 +143,78 @@ class KernelMetric : public Metric {
const static inline bool AGGREGABLE[kernelMetricKind::Count] = {
false, false, true, true, false, false};
const static inline std::string VALUE_NAMES[kernelMetricKind::Count] = {
"StartTime (ns)", "EndTime (ns)", "Count",
"Time (ns)", "DeviceId", "DeviceType",
"start_time (ns)", "end_time (ns)", "count",
"time (ns)", "device_id", "device_type",
};
};

class PCSamplingMetric : public Metric {
public:
enum PCSamplingMetricKind : int {
NumSamples,
NumStalledSamples,
StalledBranchResolving,
StalledNoInstruction,
StalledShortScoreboard,
StalledWait,
StalledLongScoreboard,
StalledTexThrottle,
StalledBarrier,
StalledMembar,
StalledIMCMiss,
StalledMIOThrottle,
StalledMathPipeThrottle,
StalledDrain,
StalledLGThrottle,
StalledNotSelected,
StalledMisc,
StalledDispatchStall,
StalledSleeping,
StalledSelected,
Count,
};

PCSamplingMetric()
: Metric(MetricKind::PCSampling, PCSamplingMetricKind::Count) {}

PCSamplingMetric(PCSamplingMetricKind kind, uint64_t samples,
uint64_t stalledSamples)
: PCSamplingMetric() {
this->values[kind] = stalledSamples;
this->values[PCSamplingMetricKind::NumSamples] = samples;
this->values[PCSamplingMetricKind::NumStalledSamples] = stalledSamples;
}

virtual const std::string getName() const { return "PCSamplingMetric"; }

virtual const std::string getValueName(int valueId) const {
return VALUE_NAMES[valueId];
}

virtual bool isAggregable(int valueId) const { return true; }

private:
const static inline std::string VALUE_NAMES[PCSamplingMetricKind::Count] = {
"num_samples",
"num_stalled_samples",
"stalled_branch_resolving",
"stalled_no_instruction",
"stalled_short_scoreboard",
"stalled_wait",
"stalled_long_scoreboard",
"stalled_tex_throttle",
"stalled_barrier",
"stalled_membar",
"stalled_imc_miss",
"stalled_mio_throttle",
"stalled_math_pipe_throttle",
"stalled_drain",
"stalled_lg_throttle",
"stalled_not_Selected",
"stalled_misc",
"stalled_dispatch_stall",
"stalled_sleeping",
"stalled_selected",
};
};

Expand Down
10 changes: 5 additions & 5 deletions third_party/proton/csrc/include/Driver/Dispatch.h
Original file line number Diff line number Diff line change
Expand Up @@ -63,17 +63,17 @@ template <typename ExternLib> class Dispatch {
*lib = dlopen(name, RTLD_NOLOAD);
}
if (*lib == nullptr) {
// If not found, try to load it from the default path
// If not found, try to load it from LD_LIBRARY_PATH
*lib = dlopen(name, RTLD_LOCAL | RTLD_LAZY);
}
if (*lib == nullptr) {
// If still not found, try to load it from the default path
auto dir = std::string(ExternLib::defaultDir);
if (dir.length() > 0) {
auto fullPath = dir + "/" + name;
*lib = dlopen(fullPath.c_str(), RTLD_LOCAL | RTLD_LAZY);
}
}
if (*lib == nullptr) {
// If still not found, try to load it from LD_LIBRARY_PATH
*lib = dlopen(name, RTLD_LOCAL | RTLD_LAZY);
}
if (*lib == nullptr) {
throw std::runtime_error("Could not find `" + std::string(name) +
"`. Make sure it is in your "
Expand Down
40 changes: 40 additions & 0 deletions third_party/proton/csrc/include/Driver/GPU/CuptiApi.h
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,17 @@
#define PROTON_DRIVER_GPU_CUPTI_H_

#include "cupti.h"
#include "cupti_pcsampling.h"

namespace proton {

namespace cupti {

template <bool CheckSuccess> CUptiResult getVersion(uint32_t *version);

template <bool CheckSuccess>
CUptiResult getContextId(CUcontext context, uint32_t *pCtxId);

template <bool CheckSuccess>
CUptiResult activityRegisterCallbacks(
CUpti_BuffersCallbackRequestFunc funcBufferRequested,
Expand Down Expand Up @@ -66,6 +72,40 @@ CUptiResult getGraphExecId(CUgraphExec graph, uint32_t *pId);
template <bool CheckSuccess>
CUptiResult getGraphId(CUgraph graph, uint32_t *pId);

template <bool CheckSuccess>
CUptiResult getCubinCrc(CUpti_GetCubinCrcParams *pParams);

template <bool CheckSuccess>
CUptiResult
getSassToSourceCorrelation(CUpti_GetSassToSourceCorrelationParams *pParams);

template <bool CheckSuccess>
CUptiResult
pcSamplingGetNumStallReasons(CUpti_PCSamplingGetNumStallReasonsParams *pParams);

template <bool CheckSuccess>
CUptiResult
pcSamplingGetStallReasons(CUpti_PCSamplingGetStallReasonsParams *pParams);

template <bool CheckSuccess>
CUptiResult pcSamplingSetConfigurationAttribute(
CUpti_PCSamplingConfigurationInfoParams *pParams);

template <bool CheckSuccess>
CUptiResult pcSamplingEnable(CUpti_PCSamplingEnableParams *pParams);

template <bool CheckSuccess>
CUptiResult pcSamplingDisable(CUpti_PCSamplingDisableParams *pParams);

template <bool CheckSuccess>
CUptiResult pcSamplingGetData(CUpti_PCSamplingGetDataParams *pParams);

template <bool CheckSuccess>
CUptiResult pcSamplingStart(CUpti_PCSamplingStartParams *pParams);

template <bool CheckSuccess>
CUptiResult pcSamplingStop(CUpti_PCSamplingStopParams *pParams);

} // namespace cupti

} // namespace proton
Expand Down
141 changes: 141 additions & 0 deletions third_party/proton/csrc/include/Profiler/Cupti/CuptiPCSampling.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,141 @@
#ifndef PROTON_PROFILER_CUPTI_PC_SAMPLING_H_
#define PROTON_PROFILER_CUPTI_PC_SAMPLING_H_

#include "CuptiProfiler.h"
#include "Driver/GPU/CudaApi.h"
#include "Driver/GPU/CuptiApi.h"
#include "Utility/Map.h"
#include "Utility/Singleton.h"
#include <atomic>
#include <mutex>

namespace proton {

struct CubinData {
size_t cubinCrc;
const char *cubin;
size_t cubinSize;

struct LineInfoKey {
uint32_t functionIndex;
uint64_t pcOffset;

bool operator<(const LineInfoKey &other) const {
return functionIndex < other.functionIndex ||
(functionIndex == other.functionIndex &&
pcOffset < other.pcOffset);
}
};

struct LineInfoValue {
uint32_t lineNumber{};
const std::string functionName{};
const std::string dirName{};
const std::string fileName{};

LineInfoValue() = default;

LineInfoValue(uint32_t lineNumber, const std::string &functionName,
const std::string &dirName, const std::string &fileName)
: lineNumber(lineNumber), functionName(functionName), dirName(dirName),
fileName(fileName) {}
};

std::map<LineInfoKey, LineInfoValue> lineInfo;
};

struct ConfigureData {
ConfigureData() = default;

~ConfigureData() {
if (stallReasonNames) {
for (size_t i = 0; i < numStallReasons; i++) {
if (stallReasonNames[i])
std::free(stallReasonNames[i]);
}
std::free(stallReasonNames);
}
if (stallReasonIndices)
std::free(stallReasonIndices);
if (pcSamplingData.pPcData) {
for (size_t i = 0; i < numValidStallReasons; ++i) {
std::free(pcSamplingData.pPcData[i].stallReason);
}
std::free(pcSamplingData.pPcData);
}
}

void initialize(CUcontext context);

CUpti_PCSamplingConfigurationInfo configureStallReasons();
CUpti_PCSamplingConfigurationInfo configureSamplingPeriod();
CUpti_PCSamplingConfigurationInfo configureSamplingBuffer();
CUpti_PCSamplingConfigurationInfo configureScratchBuffer();
CUpti_PCSamplingConfigurationInfo configureHardwareBufferSize();
CUpti_PCSamplingConfigurationInfo configureStartStopControl();
CUpti_PCSamplingConfigurationInfo configureCollectionMode();

// The amount of data reserved on the GPU
static constexpr size_t HardwareBufferSize = 128 * 1024 * 1024;
// The amount of data copied from the hardware buffer each time
static constexpr size_t ScratchBufferSize = 16 * 1024 * 1024;
// The number of PCs copied from the scratch buffer each time
static constexpr size_t DataBufferPCCount = 1024;
// The sampling period in cycles = 2^frequency
static constexpr uint32_t DefaultFrequency = 10;

CUcontext context{};
uint32_t contextId;
uint32_t numStallReasons{};
uint32_t numValidStallReasons{};
char **stallReasonNames{};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this needs to be a raw pointer of chars, as opposed to just a vector of string, because of the CUPTI interfaces?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's specified in CUpti_PCSamplingGetStallReasonsParams

uint32_t *stallReasonIndices{};
std::map<size_t, size_t> stallReasonIndexToMetricIndex{};
std::set<size_t> notIssuedStallReasonIndices{};
CUpti_PCSamplingData pcSamplingData{};
// The memory storing configuration information has to be kept alive during
// the profiling session
std::vector<CUpti_PCSamplingConfigurationInfo> configurationInfos;
};

class CuptiPCSampling : public Singleton<CuptiPCSampling> {

public:
CuptiPCSampling() = default;
virtual ~CuptiPCSampling() = default;

void initialize(CUcontext context);

void start(CUcontext context);

void stop(CUcontext context, uint64_t externId, bool isAPI);

void finalize(CUcontext context);

void loadModule(const char *cubin, size_t cubinSize);

void unloadModule(const char *cubin, size_t cubinSize);

private:
ConfigureData *getConfigureData(uint32_t contextId);

CubinData *getCubinData(uint64_t cubinCrc);

void processPCSamplingData(ConfigureData *configureData, uint64_t externId,
bool isAPI);

ThreadSafeMap<uint32_t, ConfigureData> contextIdToConfigureData;
// In case the same cubin is loaded multiple times, we need to keep track of
// all of them
ThreadSafeMap<size_t, std::pair<CubinData, /*count=*/size_t>>
cubinCrcToCubinData;
ThreadSafeSet<uint32_t> contextInitialized;

std::atomic<bool> pcSamplingStarted{false};
std::mutex pcSamplingMutex{};
std::mutex contextMutex{};
};

} // namespace proton

#endif // PROTON_PROFILER_CUPTI_PC_SAMPLING_H_
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
#ifndef PROTON_PROFILER_CUPTI_PROFILER_H_
#define PROTON_PROFILER_CUPTI_PROFILER_H_

#include "GPUProfiler.h"
#include "Profiler/GPUProfiler.h"

namespace proton {

Expand Down
Loading