WIP, conv_bn fuse example using paddlefx #33

Draft · wants to merge 2 commits into main

Conversation

jzhang533
Contributor

This is ported from https://pytorch.org/tutorials/intermediate/fx_conv_bn_fuser.html, but there are still some critical issues to solve:

  • the fused model is slower than the unfused model, which is unexpected; paddlefx's Python code generation needs to be refactored/reimplemented
  • the result of the fused resnet18 differs slightly from the unfused resnet18
  • iterating over fx_model.graph.nodes causes an endless loop

See TODO in examples/conv_bn_fuse.py for details.
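The fusion itself boils down to the per-channel folding described in the PyTorch tutorial linked above: BN's scale and shift get absorbed into the preceding conv's weight and bias. A minimal stand-alone sketch of that math in NumPy (the name `fold_bn` is illustrative, not paddlefx API):

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm(gamma, beta, mean, var) into a preceding affine op y = w*x + b."""
    scale = gamma / np.sqrt(var + eps)  # per-channel rescale coming from BN
    return w * scale, (b - mean) * scale + beta

# Check per channel: folded weights reproduce conv-then-BN exactly.
rng = np.random.default_rng(0)
w, b = rng.normal(size=4), rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mean, var = rng.normal(size=4), rng.uniform(0.5, 2.0, size=4)
x = rng.normal(size=4)

y_ref = gamma * ((w * x + b) - mean) / np.sqrt(var + 1e-5) + beta
wf, bf = fold_bn(w, b, gamma, beta, mean, var)
assert np.allclose(wf * x + bf, y_ref)
```

Note that even when the original conv has no bias (`b = 0`), the folded bias `(0 - mean) * scale + beta` is generally nonzero, which is relevant to the performance discussion below.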

@SigureMo
Collaborator

SigureMo commented Apr 1, 2023

iterating over fx_model.graph.nodes causes an endless loop.

This is fixed now; the earlier code didn't account for erasing the current node while iterating 😂
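The failure mode is the classic mutate-while-iterating one. A minimal stand-alone sketch of the fix, with a plain Python list standing in for the graph's node list (the real paddlefx structure is more involved):

```python
# A stand-in for fx_model.graph.nodes.
nodes = ["conv1", "bn1", "relu", "conv2", "bn2"]

# Iterating over a snapshot (list(nodes)) keeps the iteration stable even
# when the loop body erases the current node from the underlying container.
for node in list(nodes):
    if node.startswith("bn"):
        nodes.remove(node)  # analogous to graph.erase_node(node)

print(nodes)  # -> ['conv1', 'relu', 'conv2']
```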

@jzhang533 jzhang533 mentioned this pull request Apr 6, 2023
@Asthestarsfalll
Collaborator

It doesn't seem to be a codegen issue. The conv layers in the original rn18 carry no bias, while fusion has to add a bias to each conv, and that is what causes the slowdown. Adding a bias to every conv in the original rn18 with the code below makes all three timings roughly the same:

# assumes: import paddle; import paddle.nn as nn; rn18 is the resnet18 model
for name, module in rn18.named_sublayers():
    if isinstance(module, nn.Conv2D):
        module.bias = paddle.zeros([module.weight.shape[0]])


@SigureMo
Collaborator

SigureMo commented Apr 9, 2023

It doesn't seem to be a codegen issue. The conv layers in the original rn18 carry no bias, while fusion has to add a bias to each conv, and that is what causes the slowdown.

Right, we found this earlier as well. We can add a special case like the following as an "optimization":

if paddle.allclose(conv_b_param, paddle.zeros_like(conv_b_param)):
    conv_b_param = None

For an untrained resnet18 this does speed things up, since the parameters taken from the BN layer are zeros and the added bias can be optimized away to None:

Fused time:  2.448992967605591
Unfused time:  2.5072388648986816
Traced time:  2.4950180053710938

But for a trained resnet18 (e.g. with pretrained=True), the added bias is essentially never all zeros, so this slows things down badly:

Fused time:  3.230083703994751
Unfused time:  2.48759388923645
Traced time:  2.4984631538391113

So this hardly counts as an optimization; we should find out why adding a bias slows things down so much.

@jzhang533
Contributor Author

jzhang533 commented Apr 10, 2023

So this hardly counts as an optimization; we should find out why adding a bias slows things down so much.

I think the cause is Paddle's Conv2D implementation: its performance differs a lot depending on whether a bias is present. Paddle's Conv2D first runs one C++ kernel for the pre-bias part, then a second kernel to add the bias. PyTorch's Conv2D doesn't have this problem because it dispatches to the same C++ kernel whether or not there is a bias.

For no-bias conv + batch_norm, the small gain fusion could bring is wiped out because the fused conv needs a bias, so it actually ends up slower.

Running the code below makes the difference easy to see:

import paddle
import paddle.nn.functional as F
import time

class MyNet(paddle.nn.Layer):
    def __init__(self, bias=False):
        super(MyNet, self).__init__()

        self.conv1 = paddle.nn.Conv2D(in_channels=3, out_channels=32, kernel_size=(3, 3), bias_attr=bias)

    def forward(self, x):
        x = self.conv1(x)
        return x

bias_model = MyNet(bias=True)
no_bias_model = MyNet(bias=False)

inp = paddle.rand((128, 3, 224, 224))

def benchmark(model, iters=1000):
    # warm up
    for _ in range(50):
        model(inp)

    # paddle.device.cuda.synchronize()  # uncomment when timing on GPU
    begin = time.time()
    for _ in range(iters):
        model(inp)

    # paddle.device.cuda.synchronize()  # uncomment when timing on GPU
    return time.time() - begin

print("no bias time: ", benchmark(no_bias_model))
print("bias time: ", benchmark(bias_model))

On a V100:

no bias time: 1.829711675643921
bias time: 3.8552277088165283

@jzhang533
Contributor Author

Given this issue in Paddle's Conv2D implementation, in the short term, to demonstrate paddlefx's fuse capability, maybe we can construct a network of conv (with bias) + bn, so that this PR can be merged first.

@Asthestarsfalll
Collaborator

Generally the conv before a bn isn't given a bias; maybe we could try fusing a network like RepVGG instead.
