[Thrust] Increase static workspace size (#16937)

MasterJH5574 · web-flow · commit 63e0a0ff82eb · 2024-04-27T15:15:19.000-04:00
This PR increases the thrust workspace size, since in practice
we found that the current workspace size can still be insufficient.
Thrust sort may require larger workspace when the number of elements
being sorted is large (e.g., in Llama3 that is 128k).
diff --git a/python/tvm/relax/backend/dispatch_sort_scan.py b/python/tvm/relax/backend/dispatch_sort_scan.py
@@ -227,13 +227,13 @@ def estimate_thrust_workspace_size(self, call: relax.Call) -> int:
         int32_byte_per_elem = DataType("int32").bits // 8
         num_elem = reduce(mul, input_shape, 1)
         input_size = num_elem * input_byte_per_elem
-        # Most GPU algorithms take O(n) space or less, we choose 8N + 4MB as a safe estimation
+        # Most GPU algorithms take O(n) space or less, we choose 8N + 8MB as a safe estimation
         # for algorithm workspace.
         # The current thrust sort implementation may need extra int64 and int32 arrays
         # for temporary data, so we further add this part to the workspace.
         return (
             8 * input_size
-            + 4 * 1024 * 1024
+            + 8 * 1024 * 1024
             + num_elem * (int64_byte_per_elem + int32_byte_per_elem)
         )