
[Vulkan] Vulkan runtime reimplementation (stream approach) #3849

Merged
merged 1 commit into apache:master from the vulkan-reuse-command-buffer branch on Sep 13, 2019

Conversation


@ajtulloch ajtulloch commented Aug 28, 2019

In the Vulkan API, it is generally recommended to launch multiple compute kernels on a shared command buffer, rather than having a 1:1 relationship between compute kernels and command buffers. See http://on-demand.gputechconf.com/gtc/2016/events/vulkanday/High_Performance_Vulkan.pdf, https://devblogs.nvidia.com/vulkan-dos-donts/, etc. The existing TVM Vulkan runtime uses one command buffer per kernel, which can lead to significant overhead for smaller kernels (on the order of half a millisecond on some of the devices I looked at).
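For illustration, this is roughly what batching looks like at the raw Vulkan API level (not code from this PR; resource creation, barriers, and error checking omitted, and all handles assumed to already exist):

#include <vulkan/vulkan.h>

#include <cstdint>
#include <vector>

// Record several compute dispatches into ONE command buffer and pay the
// submit/fence cost once, instead of submitting once per kernel.
void RecordAndSubmitBatch(VkDevice device, VkQueue queue, VkCommandBuffer cmd,
                          const std::vector<VkPipeline>& pipelines,
                          VkFence fence) {
  VkCommandBufferBeginInfo begin{};
  begin.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
  begin.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
  vkBeginCommandBuffer(cmd, &begin);
  for (VkPipeline pipeline : pipelines) {
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
    // ... bind or push descriptors for this kernel ...
    vkCmdDispatch(cmd, /*groupCountX=*/64, /*groupCountY=*/1, /*groupCountZ=*/1);
    // ... vkCmdPipelineBarrier between dependent dispatches ...
  }
  vkEndCommandBuffer(cmd);

  VkSubmitInfo submit{};
  submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
  submit.commandBufferCount = 1;
  submit.pCommandBuffers = &cmd;
  vkQueueSubmit(queue, 1, &submit, fence);  // one submit for all N kernels
  vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
}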

An alternative is an abstraction similar to a CUDA stream: commands are recorded onto a command buffer, and at synchronization points the command buffer is submitted to the queue and we wait on the fence. This is non-trivially more efficient and is similar to the approach taken by ncnn, which has some useful ideas that apply here. In particular, it's quite convenient to depend on the KHR push descriptors extension, but that dependency could be removed without too much pain, similar to how ncnn does it.
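A minimal sketch of the stream idea, with made-up class and method names rather than the PR's actual interface:

#include <vulkan/vulkan.h>

#include <functional>
#include <utility>
#include <vector>

// Kernel launches are recorded as callbacks; nothing touches the GPU until a
// synchronization point.
class VulkanStreamSketch {
 public:
  // Defer a kernel launch. `record` writes vkCmdBindPipeline / descriptor
  // binding / vkCmdDispatch commands into the open command buffer.
  void Launch(std::function<void(VkCommandBuffer)> record) {
    pending_.push_back(std::move(record));
  }

  // Synchronization point: replay everything recorded so far into a single
  // command buffer. Submitting it once and waiting on a fence (as in the
  // sketch above) amortizes the per-submit cost across all kernels.
  void Synchronize(VkCommandBuffer cmd) {
    for (auto& record : pending_) record(cmd);
    pending_.clear();
    // ... vkEndCommandBuffer + single vkQueueSubmit + vkWaitForFences ...
  }

 private:
  std::vector<std::function<void(VkCommandBuffer)>> pending_;
};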

This code isn't production-ready, and it's not entirely clear how much interest there is in the Vulkan backend. I think it's quite promising and I was planning to spend some time on the codegen side, but the difficulty of getting reasonable numbers for small bandwidth-bound kernels was the motivation for working on this in the first place.

If there's interest we could probably figure out a way to merge this into the existing Vulkan runtime, perhaps gated by a feature flag?

Performance improves for simple pointwise kernels as expected, using a script like:

import os

import numpy as np

import tvm
import tvm.testing
from tvm import rpc
from tvm.contrib import ndk

# Cross-compiler used by ndk.create_shared when exporting the library.
os.environ['TVM_NDK_CC'] = os.path.expanduser(
    "~/opt/android-toolchain-arm64/bin/aarch64-linux-android-g++")

tx = tvm.thread_axis("threadIdx.x")
bx = tvm.thread_axis("blockIdx.x")
num_thread = 256

# Connect to an Android device with a Vulkan driver via the RPC tracker.
tracker = rpc.connect_tracker('localhost', 9090)
remote = tracker.request("android", priority=1,
                         session_timeout=6000)
ctx = remote.vulkan(0)


def check_vulkan(dtype, n):
    # Simple pointwise kernel: B[i] = A[i] + 1, vectorized by 4.
    A = tvm.placeholder((n,), name='A', dtype=dtype)
    B = tvm.compute((n,), lambda i: A[i] + tvm.const(1, A.dtype), name='B')
    s = tvm.create_schedule(B.op)
    xo, xi = s[B].split(B.op.axis[0], factor=num_thread * 4)
    s[B].bind(xo, bx)
    xi, vx = s[B].split(xi, factor=4)
    s[B].bind(xi, tx)
    s[B].vectorize(vx)
    f = tvm.build(
        s, [A, B],
        target="vulkan",
        target_host="llvm -target=arm64-linux-android")

    # Export, upload to the device, and load the module over RPC.
    fname = f"dev_lib_vulkan_{np.random.random()}.so"
    f.export_library(fname, ndk.create_shared)
    remote.upload(fname)
    f1 = remote.load_module(fname)

    # Check correctness, then time the kernel.
    a_np = np.random.uniform(size=(n,)).astype(dtype)
    a = tvm.nd.array(a_np, ctx)
    c = tvm.nd.empty((n,), B.dtype, ctx)
    f1(a, c)
    tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + 1)
    te = f1.time_evaluator(f.entry_name, ctx=ctx, min_repeat_ms=500, number=5)
    for _ in range(3):
        perf = te(a, c).mean
        print(f"N: {n}, t: {perf * 1.0e6:.2f}us, GFLOP/s: {n / perf / 1.0e9}")


for log_n in range(10, 20):
    check_vulkan("float32", 2 ** log_n)

(Benchmark results attached as images for Adreno 540 and Adreno 630.)

@ajtulloch

cc @jwfromm, @tqchen


tqchen commented Aug 29, 2019

Awesome, this is something that I overlooked and it is definitely the way to go. The main reason I took the current approach was that there is no clear way to do synchronization other than the fence, and the stream model is not natively supported by Vulkan.

The proposed approach (adding stream support via lazy execution) is definitely better, and we should just move toward this new runtime IMO. As long as we implement all the features, please feel free to remove the old Vulkan runtime.

@ajtulloch

@nihui would you be interested in taking a look at this PR and this approach? Are there more useful things in NCNN’s Vulkan backend that TVM should learn from?


nihui commented Sep 2, 2019

I also found that it took a lot of time to wait for each operator's result, so I put multiple operators into a single command buffer and applied lazy execution.
I'm not familiar with TVM, though, and the code in this PR looks quite different from ncnn, so I can't judge whether this PR takes the best approach, but it should be much better than the current TVM approach.

If you have any questions about the ncnn vulkan backend, feel free to post an issue to https://github.com/Tencent/ncnn/issues

@ajtulloch

Thanks @nihui, @tqchen, @jwfromm.

I generalized this approach to handle devices that don't support push descriptors (see the Stream::LaunchDeferred APIs) and devices that don't support the dedicated allocation APIs, and I removed the previous runtime implementation as @tqchen suggested. I also added some more tests, which (under Vulkan API validation) revealed some thread-safety bugs.
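A rough sketch of what the push-descriptor fallback can look like (hypothetical names and structure, not the PR's exact Stream::LaunchDeferred API; error handling omitted):

#include <vulkan/vulkan.h>

#include <cstdint>

// Use the push descriptors extension when available, otherwise write and bind
// a pre-allocated descriptor set.
void BindKernelArgs(VkDevice device, VkCommandBuffer cmd, VkPipelineLayout layout,
                    VkDescriptorSet descriptor_set,
                    const VkWriteDescriptorSet* writes, uint32_t num_writes,
                    PFN_vkCmdPushDescriptorSetKHR push_descriptors /* null if unsupported */) {
  if (push_descriptors != nullptr) {
    // Fast path: descriptor updates are recorded directly into the command
    // buffer; no per-launch descriptor set objects need to be managed.
    push_descriptors(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, layout,
                     /*set=*/0, num_writes, writes);
  } else {
    // Fallback: update the set on the host (each write's dstSet must point at
    // descriptor_set here), then bind it. The set must stay alive until the
    // command buffer finishes executing, which is why launches are deferred
    // until submission in this scheme.
    vkUpdateDescriptorSets(device, num_writes, writes, 0, nullptr);
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, layout,
                            /*firstSet=*/0, 1, &descriptor_set, 0, nullptr);
  }
}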

I think this is now closer to something acceptable, so I'll remove the RFC and WIP tags; this should be ready for review.

File/class structure is certainly up for debate. I'm fine with whatever folks propose.

@ajtulloch ajtulloch changed the title [RFC] [Vulkan] [WIP] Alternative Vulkan runtime implementation (stream approach) [Vulkan] Vulkan runtime reimplementation (stream approach) Sep 5, 2019
@ajtulloch ajtulloch force-pushed the vulkan-reuse-command-buffer branch 2 times, most recently from 24065ab to d200b51 Compare September 6, 2019 00:12
@tqchen
Copy link
Member

tqchen commented Sep 6, 2019

@jwfromm @yzhliu @masahi @kazum can you guys help to review this PR?

@ajtulloch ajtulloch force-pushed the vulkan-reuse-command-buffer branch 2 times, most recently from a85f3d2 to 595107c Compare September 10, 2019 20:06
@jwfromm jwfromm left a comment
This is really excellent work; I'm super impressed and excited to get better Vulkan performance merged into TVM. I like that you've now replaced all the old Vulkan runtime code. However, now that it's gone, I think you should remove all the "2"s from your file, class, and function naming. I've also added a few minor comments; everything else LGTM.

// Use SPIR-V v1.0. This needs to be kept in sync (or at least behind)
// `VkApplicationInfo.apiVersion` in `vulkan2.cc` to ensure Vulkan API
// validation passes.
header_.push_back(0x10000);
@jwfromm:
Why do we need to hard-code the version like this? Is it possible to use spv::Version both here and in vulkan2.cc and check that it's >= v1.0?

@ajtulloch (author):

My understanding is that the semantically correct thing to do is:

Outdated review threads (resolved): src/runtime/vulkan/README.md, tests/python/test_codegen_vulkan.py (three threads).
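For background on the version discussion above, a minimal illustrative sketch (not the PR's code; the MakeAppInfo helper is hypothetical) of how the two constants relate:

#include <vulkan/vulkan.h>

#include <cstdint>

// Word 1 of a SPIR-V module header encodes the version as 0x00MMmm00
// (major byte, then minor byte), so SPIR-V 1.0 is 0x00010000 -- the 0x10000
// literal pushed into `header_` in the quoted snippet.
constexpr uint32_t kSpirv10 = 0x00010000;

// A Vulkan 1.0 instance is only required to consume SPIR-V 1.0 modules, so
// the emitted SPIR-V version has to stay at (or behind) the instance's
// apiVersion for the validation layers to be satisfied.
inline VkApplicationInfo MakeAppInfo() {
  VkApplicationInfo info{};
  info.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
  info.apiVersion = VK_API_VERSION_1_0;  // i.e. VK_MAKE_VERSION(1, 0, 0)
  return info;
}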
@ajtulloch

Thanks @jwfromm, will follow up on your suggestions.

@ajtulloch

@jwfromm, I believe I've addressed your comments (except the one re: the correct version semantics at SPIR-V emission time and Vulkan instance instantiation time).

@tqchen tqchen merged commit 2536465 into apache:master Sep 13, 2019

tqchen commented Sep 13, 2019

Thanks @ajtulloch @jwfromm @nihui !

wweic pushed a commit to wweic/tvm that referenced this pull request Sep 16, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Sep 16, 2019