Optimize Gelu operator for caffe2 export
Summary:
TIL ONNX->Caffe2 conversion is very memory inefficient: it creates an intermediate blob for every intermediate output. So the Gelu operator creates a lot of intermediate blobs, since its forward pass is a chain of elementwise math ops.
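To make that concrete (this snippet is not part of the commit), tracing the math version counts how many elementwise ops it lowers to; each op's output becomes its own blob after ONNX -> Caffe2 conversion. A minimal sketch:

import math
import torch

def gelu_math(x):
    return 0.5 * x * (
        1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * (x * x * x)))
    )

# Trace the approximation and count the aten ops in the resulting graph;
# each op's output would get its own blob in the converted Caffe2 net.
traced = torch.jit.trace(gelu_math, torch.randn(2, 2))
print(sum(1 for n in traced.graph.nodes() if n.kind().startswith("aten::")))  # ~9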

The fix is to use the Caffe2 Gelu operator directly, so all of that computation is captured in a single op.

https://pxl.cl/HzGf

Differential Revision: D16849396

fbshipit-source-id: 4903c614833ae4ad8a84c6eddc2382b2a24872f3
geof90 authored and facebook-github-bot committed Aug 16, 2019
1 parent a170dd4 commit 6d6f1da
Showing 1 changed file with 15 additions and 5 deletions.
20 changes: 15 additions & 5 deletions pytext/optimizer/activations.py
@@ -20,11 +20,21 @@ class GeLU(nn.Module):
     """
 
     def forward(self, x):
-        return (
-            0.5
-            * x
-            * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * (x * x * x))))
-        )
+        if torch.onnx.is_in_onnx_export():
+            # ONNX -> Caffe2 conversion will create an intermediate blob for
+            # each intermediate math output, which is very memory inefficient.
+            # We use the Gelu operator directly to reduce the memory footprint
+            # in the exported model.
+            return torch.ops._caffe2.Gelu(x, True)
+        else:
+            return (
+                0.5
+                * x
+                * (
+                    1
+                    + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * (x * x * x)))
+                )
+            )
 
 
 def get_activation(name):
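As a quick sanity check (also not part of the commit), the two code paths can be compared in eager mode, assuming a PyTorch build with the Caffe2 ops registered; the second argument to the fused op is taken here to select the same tanh approximation:

import math
import torch

x = torch.randn(2, 8)
# Tanh approximation used on the eager path (same formula as in the diff).
eager = 0.5 * x * (
    1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * (x * x * x)))
)
# Fused Caffe2 op used on the export path, called exactly as in the commit.
fused = torch.ops._caffe2.Gelu(x, True)
print(torch.allclose(eager, fused, atol=1e-6))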
