Merged
5 changes: 5 additions & 0 deletions src/op/builtin.cc
@@ -368,5 +368,10 @@ TIR_DEFINE_TL_BUILTIN(warp_reduce_bitor)
.set_attr<TCallEffectKind>("TCallEffectKind",
Integer(CallEffectKind::kOpaque));

// __ldg(BufferLoad | Buffer, idx?) -> value
// Treat as a pure call that returns the loaded value.
TIR_DEFINE_TL_BUILTIN(__ldg).set_num_inputs(-1).set_attr<TCallEffectKind>(
"TCallEffectKind", Integer(CallEffectKind::kPure));

} // namespace tl
} // namespace tvm
18 changes: 18 additions & 0 deletions src/op/builtin.h
@@ -600,6 +600,24 @@ TVM_DLL const Op &warp_reduce_bitand();
*/
TVM_DLL const Op &warp_reduce_bitor();

/*!
* \brief tilelang intrinsic for CUDA read-only cache load (__ldg).
*
* This op allows users to explicitly request a non-coherent cached load
* from global memory on CUDA by emitting `__ldg(&ptr[idx])` for 32-bit
* element types on supported architectures. It provides a direct way to
* leverage the read-only data cache for performance-sensitive loads when
* the compiler cannot infer `const __restrict__` automatically.
*
* Usage from TVMScript:
* y[i] = T.__ldg(x[i])
*
* The op takes one argument preferred as a BufferLoad identifying the
* source element; alternatively, backends may support passing a Buffer and
* index expression.
*/
TVM_DLL const Op &__ldg();

Comment on lines +603 to +620
⚠️ Potential issue | 🟠 Major


Remove or correct the "32-bit element types" constraint in the documentation.

The doc claims __ldg is for "32-bit element types on supported architectures," but CUDA's native __ldg overloads actually support 8, 16, 32, and 64-bit scalar types (char, short, int, long long, float, double) plus vector variants (int2/4, uint2/4, float2/4, double2). More importantly, the CUDA codegen in src/target/codegen_cuda.cc does not validate the element type at all—it will emit __ldg for any dtype, including unsupported ones like fp16 or custom structs. Either enforce type validation against CUDA's actual supported types or update the doc to remove the "32-bit" restriction.


} // namespace tl
} // namespace tvm

17 changes: 17 additions & 0 deletions src/target/codegen_cuda.cc
@@ -2354,6 +2354,23 @@ void CodeGenTileLangCUDA::VisitExpr_(const CallNode *op, std::ostream &os) {
stream << ": \"l\"((void*)(" << global_buffer << "+" << global_addr
<< ")), \"r\"((int)" << guard << ")\n";
stream << ");\n";
} else if (op->op.same_as(tl::__ldg())) {
// Explicit read-only cached load. Preferred form: __ldg(BufferLoad(...)).
// Fallback form: __ldg(buffer, index)
const BufferLoadNode *bl = nullptr;
if (!op->args.empty()) {
bl = op->args[0].as<BufferLoadNode>();
}
if (bl == nullptr) {
LOG(FATAL) << "T.__ldg expects a BufferLoad as the first argument.";
}
const BufferNode *buffer = bl->buffer.get();
ICHECK_EQ(bl->indices.size(), 1)
<< "T.__ldg currently supports flattened 1D buffer accesses.";
PrimExpr base = bl->indices[0];
// Emit __ldg(&buffer_ref)
auto buffer_ref = this->GetBufferRef(op->dtype, buffer, base);
os << "__ldg(&(" << buffer_ref << "))";
Comment on lines +2357 to +2373
⚠️ Potential issue | 🟠 Major

Remove misleading "fallback form" comment and add dtype validation for __ldg emission.

The comment at lines 2358–2359 claims a fallback form __ldg(buffer, index) exists, but the C++ codegen only accepts BufferLoad and FATALs otherwise. Additionally, there is no dtype gating—any scalar dtype flows into __ldg(...) without validation. While CUDA __ldg officially supports 8/16/32/64-bit scalar types and certain vector types, the current code emits calls for arbitrary dtypes, risking NVCC compile failures.

Update the comment to reflect the actual implementation, replace LOG(FATAL) with ICHECK for consistency, and add dtype validation:

 } else if (op->op.same_as(tl::__ldg())) {
-    // Explicit read-only cached load. Preferred form: __ldg(BufferLoad(...)).
-    // Fallback form: __ldg(buffer, index)
+    // Explicit read-only cached load. Expects a flattened 1D BufferLoad.
     const BufferLoadNode *bl = nullptr;
     if (!op->args.empty()) {
       bl = op->args[0].as<BufferLoadNode>();
     }
-    if (bl == nullptr) {
-      LOG(FATAL) << "T.__ldg expects a BufferLoad as the first argument.";
-    }
+    ICHECK(bl) << "T.__ldg expects a BufferLoad as the first argument.";
     const BufferNode *buffer = bl->buffer.get();
     ICHECK_EQ(bl->indices.size(), 1)
         << "T.__ldg currently supports flattened 1D buffer accesses.";
+    // Validate dtype: baseline support for 32/64-bit scalars.
+    ICHECK(op->dtype.is_scalar())
+        << "T.__ldg currently supports scalar element loads only, but got " << op->dtype;
+    ICHECK(op->dtype.bits() == 32 || op->dtype.bits() == 64)
+        << "T.__ldg only supports 32/64-bit scalar types for now, but got " << op->dtype;
     PrimExpr base = bl->indices[0];
     // Emit __ldg(&buffer_ref)
     auto buffer_ref = this->GetBufferRef(op->dtype, buffer, base);
     os << "__ldg(&(" << buffer_ref << "))";

} else if (op->op.same_as(builtin::reinterpret())) {
DataType tgt_dtype = op->dtype;
DataType src_dtype = op->args[0]->dtype;
10 changes: 10 additions & 0 deletions src/target/codegen_hip.cc
@@ -828,6 +828,16 @@ void CodeGenTileLangHIP::VisitExpr_(const CallNode *op, std::ostream &os) {
} else if (op->op.same_as(tl::pack_b16())) {
os << "__pack_half2(" << this->PrintExpr(op->args[0]) << ", "
<< this->PrintExpr(op->args[1]) << ")";
} else if (op->op.same_as(tl::__ldg())) {
// HIP fallback: regular load
const BufferLoadNode *bl = op->args[0].as<BufferLoadNode>();
ICHECK(bl) << "T.__ldg expects a BufferLoad as the first argument.";
ICHECK_EQ(bl->indices.size(), 1)
<< "T.__ldg currently supports flattened 1D buffer accesses.";
const BufferNode *buffer = bl->buffer.get();
PrimExpr base = bl->indices[0];
auto buffer_ref = this->GetBufferRef(op->dtype, buffer, base);
os << buffer_ref;
} else if (op->op.same_as(builtin::tvm_fill_fragment())) {
need_mma_h_ = true;
ICHECK_EQ(op->args.size(), 6U);
testing/python/language/test_tilelang_language_intrinsics_codegen.py
@@ -0,0 +1,30 @@
import tilelang
import tilelang.language as T
import tilelang.testing


@tilelang.testing.requires_cuda
def test_language_ldg_codegen():
    N = 128

    @T.prim_func
    def main(
            x: T.Tensor((N,), "float32"),
            y: T.Tensor((N,), "float32"),
    ):
        with T.Kernel(N, threads=32) as pid:
            # Explicitly request read-only cache load for x[pid]
            y[pid] = T.__ldg(x[pid]) + 1.0

    # Compile for CUDA and retrieve generated CUDA source
    kernel = tilelang.compile(main, out_idx=[1], target="cuda")
    src = kernel.get_kernel_source()
    print(src)
    # Assert that codegen uses __ldg on CUDA backend
    # We look for the intrinsic call with address-of argument
    assert "__ldg(" in src, "Expected __ldg call in generated CUDA source"
    assert "__ldg(&" in src or "__ldg(&(" in src, "Expected address-of form in __ldg call"

Comment on lines 6 to 27
⚠️ Potential issue | 🟡 Minor

Avoid unconditional print(src) in tests (and consider a slightly tighter assertion).
Printing generated sources can bloat CI logs; keep it behind a debug flag or remove.

-    print(src)
+    # print(src)  # uncomment for debugging


if __name__ == "__main__":
tilelang.testing.main()
1 change: 1 addition & 0 deletions tilelang/language/__init__.py
@@ -96,6 +96,7 @@
)
from .logical import any_of, all_of # noqa: F401
from .builtin import * # noqa: F401
from .builtin import __ldg as __ldg # noqa: F401
⚠️ Potential issue | 🟡 Minor

Keep the explicit import, but drop the unused noqa.
Because __ldg starts with _, it won’t come from the star import; the explicit import is correct. Ruff indicates the # noqa: F401 is unused.

-from .builtin import __ldg as __ldg  # noqa: F401
+from .builtin import __ldg as __ldg
🧰 Tools
🪛 Ruff (0.14.8)

99-99: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)



from .utils import index_to_coordinates # noqa: F401

29 changes: 29 additions & 0 deletions tilelang/language/builtin.py
@@ -59,6 +59,35 @@ def create_list_of_mbarrier(*args: Any) -> Call:
raise TypeError("create_list_of_mbarrier expects a list or one or more arguments.")


def __ldg(load_or_buf: BufferLoad | tir.Buffer, index: PrimExpr | int | None = None) -> PrimExpr:
    """Explicitly load via CUDA read-only data cache.

    Prefer calling with a BufferLoad: `T.__ldg(x[i])` emits `__ldg(&x[i])` on CUDA.
    On non-CUDA backends, falls back to a regular load.

    Args:
        load_or_buf: A `BufferLoad` like `x[i]`, or a `Buffer`.
        index: Optional index when passing a `Buffer` directly.

    Returns:
        PrimExpr: The loaded value.
    """
    if isinstance(load_or_buf, BufferLoad):
        dtype = load_or_buf.dtype
        return tir.call_intrin(str(dtype), tir.op.Op.get("tl.__ldg"), load_or_buf)
    if isinstance(load_or_buf, tir.Buffer):
        if index is None:
            raise ValueError("T.__ldg(Buffer, index) requires an index when passing a Buffer.")
        idx = index
        if isinstance(index, (list, tuple)):
            if len(index) != 1:
                raise ValueError("T.__ldg currently supports 1D flattened indices.")
            idx = index[0]
        bl = BufferLoad(load_or_buf, [idx])
        return tir.call_intrin(str(load_or_buf.dtype), tir.op.Op.get("tl.__ldg"), bl)
    raise TypeError("T.__ldg expects a BufferLoad or a Buffer.")

Comment on lines +62 to +89
⚠️ Potential issue | 🟠 Major

Normalize index to PrimExpr in __ldg(Buffer, index) to avoid type surprises.
Right now idx may stay a Python int and be passed into BufferLoad(...) without convert(...).

 def __ldg(load_or_buf: BufferLoad | tir.Buffer, index: PrimExpr | int | None = None) -> PrimExpr:
@@
     if isinstance(load_or_buf, tir.Buffer):
         if index is None:
             raise ValueError("T.__ldg(Buffer, index) requires an index when passing a Buffer.")
         idx = index
         if isinstance(index, (list, tuple)):
             if len(index) != 1:
                 raise ValueError("T.__ldg currently supports 1D flattened indices.")
             idx = index[0]
-        bl = BufferLoad(load_or_buf, [idx])
+        bl = BufferLoad(load_or_buf, [convert(idx)])
         return tir.call_intrin(str(load_or_buf.dtype), tir.op.Op.get("tl.__ldg"), bl)
     raise TypeError("T.__ldg expects a BufferLoad or a Buffer.")
🧰 Tools
🪛 Ruff (0.14.8)

80-80: Avoid specifying long messages outside the exception class

(TRY003)


84-84: Avoid specifying long messages outside the exception class

(TRY003)


88-88: Avoid specifying long messages outside the exception class

(TRY003)



def get_mbarrier(*args):
"""Retrieve a memory barrier operation.
