[BACKEND] Add preferred cluster fallback #9957

Open: lezcano wants to merge 1 commit into main from fallback_cluster

@lezcano lezcano commented Apr 8, 2026

When running multiCTA kernels with numCTAs > 2, we are leaving some
perf on the table, as the GPU may not be able to use every single SM. This is
because SMs are grouped into TPCs (pairs of SMs), which are in turn grouped into
GPCs (sets of TPCs). Every SM is part of a TPC, but not every TPC is
part of a GPC of size 8, so if we launch a kernel with numCTAs == 16
where each CTA takes a full SM, it can only run on GPCs of size
at least 8, leaving the SMs in smaller GPCs unused.

To account for this, NVIDIA exposes an API

CU_LAUNCH_ATTRIBUTE_PREFERRED_CLUSTER_DIMENSION

starting on SM100.

When invoking this API, we tell the GPU that we want to execute the
kernel with numCTAs if possible and, if not, to use a fallback number of
CTAs and run it on fewer SMs.

Simplest use case to keep in mind for this feature:

for:
  a = tma_load a_desc, multicast=true
  b = tma_load b_desc, multicast=true
  acc += tcgen05_mma(a, b)
tma_store acc

In this PR, we add a pass that checks for cross-CTA data movement that
would break when the kernel runs with the fallback cluster size. If all
the ops in the kernel are safe, then we apply the optimisation.

Then we change the lowerings with the following invariants:

nvgpu::ClusterCTAIdOp returns the global ctaId

In other words, this will always return the same number regardless of
whether the kernel was split or not.
For example, if we have a grid of 2 CGAs with 4 CTAs each and the second
launch is split into 2 launches, then we'll get

Launch 00 4CTAs: ClusterCTAIdOp: 0-3
Launch 10 2CTAs: ClusterCTAIdOp: 0-1
Launch 11 2CTAs: ClusterCTAIdOp: 2-3

Therefore, this is the id that should be used to compute addresses in global
memory.

NVVM::ClusterId gives the relative cta_id wrt. the launched CGA size

This depends on the runtime launch size of the program. In the
example above:

Launch 00 4CTAs: NVVM::ClusterId: 0-3
Launch 10 2CTAs: NVVM::ClusterId: 0-1
Launch 11 2CTAs: NVVM::ClusterId: 0-1

This should be used to generate masks for multicast, for example, or to
compute predicates.

Under this model the pid is invariant under the splits, so the program
semantics under this transformation don't vary.
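
The two id tables above can be reproduced with a small host-side model. This is only an illustrative sketch, not the lowering in the PR: it assumes CTAs are numbered consecutively along X, that the program id is `ctaid // num_ctas` and the global cluster CTA id is `ctaid % num_ctas` (as discussed later in the thread), and the function and field names are made up for the example.

```python
def simulate(num_ctas, launch_sizes):
    """Model the ids seen by each CTA.

    num_ctas: logical cluster size requested by the kernel.
    launch_sizes: sizes of the clusters actually launched, in order.
    """
    rows = []
    ctaid = 0
    for size in launch_sizes:
        for rank in range(size):
            rows.append({
                "pid": ctaid // num_ctas,             # invariant under splits
                "cluster_cta_id": ctaid % num_ctas,   # global id, for addressing
                "launched_rank": rank,                # relative id, for masks/predicates
            })
            ctaid += 1
    return rows


# Grid of 2 CGAs with 4 CTAs each; the second one is split into 2 launches.
rows = simulate(4, [4, 2, 2])
global_ids = [r["cluster_cta_id"] for r in rows]
relative_ids = [r["launched_rank"] for r in rows]
pids = [r["pid"] for r in rows]
```

In this model only `launched_rank` changes when a launch is split, which is the property the description relies on.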

@lezcano lezcano requested a review from ptillet as a code owner April 8, 2026 13:45

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 91e45fd1da

Comment thread third_party/nvidia/backend/compiler.py
Comment on lines +91 to +93
if (isa<ttng::AsyncTMAReduceOp, ttng::AsyncTMAGatherOp,
        ttng::AsyncTMAScatterOp>(op))
  return unsupported();

P1: Reject cluster sync ops before enabling fallback

The safety walk only rejects a narrow set of ops and allows ttng.cluster_arrive/wait/barrier to pass through. With preferred-fallback enabled, a full-cluster barrier can be weakened into per-fallback-cluster barriers (e.g., 8→2 CTAs), changing synchronization semantics and potentially causing wrong results or hangs. These cluster sync ops should be treated as unsupported for fallback.

Contributor Author

@lezcano lezcano Apr 8, 2026

I don't think this is the case, as we prove via the layout of the ops of the program that we don't do any cross-CTA work, so in these programs you should never need these ops...

Contributor Author

Reviewers, can you have a look at this one?

Collaborator

@lezcano does a cluster sync imply a cluster membar (similar to what happens at the CTA level)? If so then a kernel could rely on cluster synchronisation if it uses global memory as a scratchpad for example, even if there are no explicit cross-CTA layout conversions.

Contributor Author

Ah, right. I was also wondering about global atomics, whether we should ban those as well. cc @peterbell10

Contributor

@peterbell10 peterbell10 Apr 8, 2026

Not just atomics, but any access to global memory could be used to cross CTA boundaries, e.g. each CTA stores to its own scratch, then does a cluster barrier and reads from its neighbours.
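
The scratchpad pattern described here can be modelled on the host to see why it breaks under a split. This is a deliberately simplified sketch: launches are assumed to run one after another (on real hardware this would be a race, not a deterministic failure), the "neighbour" is arbitrarily chosen as the CTA halfway across the logical cluster, and all names are hypothetical.

```python
def run(num_ctas, launch_sizes):
    """Each CTA writes its scratch slot, hits a cluster barrier, then reads a neighbour."""
    scratch = [None] * num_ctas
    seen = [None] * num_ctas
    first = 0
    for size in launch_sizes:
        members = range(first, first + size)
        for cta in members:
            scratch[cta] = cta * 10           # phase 1: store to own scratch
        # Cluster barrier: it only synchronises the CTAs of this launch.
        for cta in members:
            neighbour = (cta + num_ctas // 2) % num_ctas
            seen[cta] = scratch[neighbour]    # phase 2: read the neighbour's scratch
        first += size
    return seen


full = run(4, [4])        # one 4-CTA cluster
split = run(4, [2, 2])    # fallback: two 2-CTA clusters
```

With the full cluster every CTA sees its neighbour's value; after the split, CTAs 0 and 1 read slots that have not been written yet, even though no cross-CTA layout conversion appears in the program.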

Contributor

@peterbell10 peterbell10 left a comment

I haven't looked at the code, but I still can't see how this could possibly be correct. Can you give an example of a program where running it on fewer CTAs still gives you the same result?

Contributor Author

lezcano commented Apr 8, 2026

03-matmul-multicta.py is an example of such a program. In general, the prototypical program that you would want to support would be a matmul with TMA multicast. Omitting all the necessary synchronisation, something of the form:

for:
  a = tma_load a_desc, multicast=true
  b = tma_load b_desc, multicast=true
  acc += tcgen05_mma(a, b)
tma_store acc

Contributor

peterbell10 commented Apr 8, 2026

IIUC you're redefining tl.program_id from being the cluster id to being ctaid / num_ctas. This seems quite strange. If the user writes a program that doesn't rely on the entire program id running within a cluster, then why request a larger num_ctas at all? We could equally just run the program with a smaller cluster size. Both are bad though, because they break all the assumptions that the user has about a cluster being scheduled together.

My understanding is that programs which use this fallback will usually have different code paths depending on the available cluster size, and dispatch to them at runtime. Without that, I don't really see the point of this.

@lezcano lezcano requested a review from peterbell10 April 8, 2026 15:21
Contributor

@peterbell10 peterbell10 left a comment

Discussed with Mario offline. I think the semantics do make sense and are useful. Still have a few concerns about the details though.

Comment thread lib/Dialect/TritonNvidiaGPU/Transforms/PreferredClusterFallback.cpp Outdated
Comment thread third_party/nvidia/backend/driver.c Outdated
Comment on lines 544 to 552
// Convert ctaid to clusterid, which is the real program id
// Note that all cluster CTAs are distributed in the X dim
if (op.getDim() == ProgramIDDim::X) {
  auto numCTAs = ttg::lookupNumCTAs(op);
  if (numCTAs > 1) {
    TritonLLVMOpBuilder b(loc, rewriter);
    result = b.sdiv(result, b.i32_val(numCTAs));
  }
}
Contributor

I'm not quite convinced this works with CLC. Since the cta in cluster id is implemented as ctaid % numCTAs, I think this implies we will process the work for cancelled_ctaid // numCTAs + ctaid % numCTAs, where ctaid doesn't come from the clc result, so may be wrong.
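
One way to read this concern is as a mismatch between two sources for the ids. The sketch below is only an interpretation of the comment, with hypothetical function names: the program id is derived from the cancelled ctaid returned by CLC, while the CTA-in-cluster id is still derived from the CTA's own ctaid.

```python
def processed_work(own_ctaid, cancelled_ctaid, num_ctas):
    """What the lowering under discussion would compute after a CLC steal."""
    pid = cancelled_ctaid // num_ctas      # comes from the CLC result
    cta_in_cluster = own_ctaid % num_ctas  # does NOT come from the CLC result
    return pid, cta_in_cluster


def intended_work(cancelled_ctaid, num_ctas):
    """What the stolen work item actually corresponds to."""
    return cancelled_ctaid // num_ctas, cancelled_ctaid % num_ctas


# A CTA with ctaid 6 (slot 2 of its logical 4-CTA cluster) steals ctaid 8.
got = processed_work(own_ctaid=6, cancelled_ctaid=8, num_ctas=4)
want = intended_work(cancelled_ctaid=8, num_ctas=4)
```

In this example the two disagree on the slot within the cluster, which is the kind of mismatch the comment warns about.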

Contributor Author

Yeah, you are right. And actually solving the issue is a bit tricky with the current design, so I am just disabling it for now. It's not clear to me how to best fix it without having full control of the scheduler tbh.

Comment thread third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/SPMDOpToLLVM.cpp Outdated
Comment thread third_party/nvidia/lib/NVGPUToLLVM/NVGPUToLLVMPass.cpp Outdated
Contributor Author

lezcano commented Apr 9, 2026

review addressed.

Contributor Author

lezcano commented Apr 9, 2026

I'm still running some benchmarks to see how they look.

@lezcano lezcano marked this pull request as draft April 9, 2026 14:48
Contributor Author

lezcano commented Apr 9, 2026

While running the benchmarks, I realised that there is a big issue with the approach. The user in gluon can inspect the layouts and perform constexpr computations using them. There is no way we can know whether these computations are invariant under layout changes.
In particular, I hit this in the function tcgen05_mma_barrier_count which computes statically the number of commits of tcgen05_commit.

On the other hand, hacking around that and lowering the barrier_init via a dynamic value, I get the following results:

num_ctas  mode                fallback  reqncta  ms      TFLOP/s  vs fixed
--------  ------------------  --------  -------  ------  -------  --------
2         fixed_cluster       0         yes      0.6474  1698.33  baseline
2         preferred-mode      0         yes      0.6467  1700.18  +0.1%

4         fixed_cluster       0         yes      0.7091  1550.58  baseline
4         preferred_fallback  2         no       0.6862  1602.27  +3.3%

8         fixed_cluster       0         yes      0.7808  1408.21  baseline
8         preferred_fallback  2         no       0.6818  1612.71  +14.5%

16        fixed_cluster       0         yes      0.8208  1339.60  baseline
16        preferred_fallback  2         no       0.6803  1616.28  +20.6%

This is not autotuned; I'm using an optimal config for 2CTAs. So this shows that this method makes multicast viable.

If we wanted, we could hide this method behind a kernel flag that could be turned on by the user. It would be straightforward to generalise consan to support this mode, I believe.
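
For readers who want to check the "vs fixed" column, it can be recomputed from the reported timings. This is a small sketch under the assumption that the gain is the ratio of fixed to preferred runtime; since the published milliseconds are rounded, the result is only expected to agree with the table to about a tenth of a percent.

```python
# (num_ctas, fixed_cluster ms, preferred ms, reported gain in percent)
rows = [
    (2, 0.6474, 0.6467, 0.1),
    (4, 0.7091, 0.6862, 3.3),
    (8, 0.7808, 0.6818, 14.5),
    (16, 0.8208, 0.6803, 20.6),
]


def gain_percent(fixed_ms, preferred_ms):
    """Throughput gain of the preferred launch over the fixed-cluster launch."""
    return (fixed_ms / preferred_ms - 1.0) * 100.0


gains = {n: gain_percent(fixed, pref) for n, fixed, pref, _ in rows}
for n, _, _, reported in rows:
    print(f"num_ctas={n}: computed {gains[n]:.2f}%, reported {reported:+.1f}%")
```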

@ThomasRaoux
Collaborator

Very interesting analysis and results. I'm also a bit concerned about changing the semantics without the user knowing it. But exposing a flag that lets the user opt in, provided we can clearly describe the rules that need to be followed and have a good sanitizer, sounds like a reasonable solution.

That being said, I would wait to merge this until we find use cases where it brings enough of a performance boost to justify the extra language feature.

Contributor Author

lezcano commented Apr 10, 2026

Actually, I think that the issue is not as problematic as I initially thought.

The issue I found does not fall under "we are using different layouts"; there is not much that you can get wrong there, as the gluon API is rather narrow.
It falls under "the cross-CTA op we support in this case is multicast", which I had forgotten about.

For multicast ops in this mode, we naturally need to support this in the LLVM lowering, whereas before we were handling this part in the frontend. As such, the natural fix is to push this to an mbarrier_init-like op that lowers to the right count. If mbarrier_init accepted a dynamic initialisation we could reuse mbarrier_init together with a helper, but that's a different fix. For now, I have just made InitMmaBarrierOp be treated the same way as InitMBarrierOp across the codebase and that's that.

With this, I think this patch is safe to land really, as we cover all the necessary infra to support the only cross-CTA op we support in this mode, TMA multicast.

@lezcano lezcano marked this pull request as ready for review April 10, 2026 08:12

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 693e8db006

Comment thread lib/Dialect/TritonNvidiaGPU/Transforms/PreferredClusterFallback.cpp
Contributor

@peterbell10 peterbell10 left a comment

I agree with Thomas that it would be good to show that the multicast setup gives a significant perf increase first, before introducing new complexity. I'm concerned that this will leak into user code and give surprising bugs.

mma_bar = mbarrier.allocate_mbarrier()
mma_bar_count: ttgl.constexpr = blackwell.tcgen05_mma_barrier_count([smemA, smemB], True)
mbarrier.init(mma_bar, count=mma_bar_count)
mbarrier.init_tcgen05_mma(mma_bar, [smemA, smemB])
Contributor

I feel like this represents a big abstraction leak. If the user calculates and sets the arrival count manually, they shouldn't get unexpected hangs.

Also, one could imagine a use case where you need to arrive on the same mbarrier from multiple different sources (mma, tma, manual arrive, etc.), whereas this limits you to only a single mma.

Contributor Author

Not sure if you can mix arrives and tcgen05_commits on the same barrier, but if you can, there's no reason why this abstraction shouldn't allow it.

And sure, we need to use this op and not the other for this specific pattern, but exactly the same happens with things like having to pass the descriptors to tcgen05_commit or the multicast flag to mma. These patterns are tricky and there is only so much you can represent natively at a language level...

Comment thread lib/Dialect/TritonNvidiaGPU/Transforms/PreferredClusterFallback.cpp Outdated
return_compiled=True,
)

assert compiled.metadata.preferred_cluster_fallback_ctas == 0
Contributor

I guess we're now missing a test that exercises this PR positively?

Contributor Author

yes, sorry, didn't add it after realising this one was off

Contributor Author

lezcano commented Apr 10, 2026

I'll write the mma kernel generic on the CTA layout and benchmark it, to see how the multicast configs look like vs the 2CTA ones.

@lezcano lezcano force-pushed the fallback_cluster branch 2 times, most recently from fa9405b to 3267a5c Compare April 14, 2026 19:27
Contributor Author

lezcano commented Apr 15, 2026

I have tried to find some shapes for which the best multicast config is noticeably better than simply a 2CTA config, but I haven't found anything. The configs are mostly competitive with each other, within ~10-20 TFLOPS. With this in mind, I'd say we put this on the back burner and revisit it when Rubin comes, or when we have more multiCTA kernels to try it on.

cc @Mogball this might still be worth trying in your kernels just in case.

lezcano added a commit that referenced this pull request May 4, 2026
The helper we had was missing the necessary `two_ctas` flag to compute
the number of arrivals correctly for an mma that goes into a multicast
TMA. I changed the API all across.

I think it would be much better if we just had a TTNG_InitMmaBarrierOp
as proposed in #9957, as this
would let us get actual perf with multicast and would make the API
much cleaner, but for now we just get this helper.

What's nice is that this bug was found using the multicta consan :D
@lezcano lezcano force-pushed the fallback_cluster branch from 3267a5c to 7184511 Compare May 7, 2026 14:01
When running multiCTA kernels with `numCTAs > 2`, we are leaving some
perf on the table, as the GPU may not be able to use every single SM. This is
because SMs are grouped into TPCs (pairs of SMs), which are in turn grouped into
GPCs (sets of TPCs). Every SM is part of a TPC, but not every TPC is
part of a GPC of size 8, so if we launch a kernel with `numCTAs == 16`
where each CTA takes a full SM, it can only run on GPCs of size
at least 8, leaving the SMs in smaller GPCs unused.

To account for this, NVIDIA exposes an API
```
CU_LAUNCH_ATTRIBUTE_PREFERRED_CLUSTER_DIMENSION
```
starting on SM100.

When invoking this API, we tell the GPU that we want to execute the
kernel with `numCTAs` if possible and, if not, to use a fallback number of
CTAs and run it on fewer SMs.

In this PR, we add a pass that checks for cross-CTA data movement that
would break when the kernel runs with the fallback cluster size. If all
the ops in the kernel are safe, then we apply the optimisation.

Then we change the lowerings with the following invariants:

`nvgpu::ClusterCTAIdOp` returns the global ctaId

In other words, this will always return the same number regardless of
whether the kernel was split or not.
For example, if we have a grid of 2 CGAs with 4 CTAs each and the second
launch is split into 2 launches, then we'll get
Launch 00 4CTAs: ClusterCTAIdOp: 0-3
Launch 10 2CTAs: ClusterCTAIdOp: 0-1
Launch 11 2CTAs: ClusterCTAIdOp: 2-3

Therefore, this is the id that should be used to compute addresses in global
memory.

`NVVM::ClusterId` gives the relative cta_id wrt. the launched CGA size

This depends on the runtime launch size of the program. In the
example above:
Launch 00 4CTAs: NVVM::ClusterId: 0-3
Launch 10 2CTAs: NVVM::ClusterId: 0-1
Launch 11 2CTAs: NVVM::ClusterId: 0-1

This should be used to generate masks for multicast, for example, or to
compute predicates.

Under this model the pid is invariant under the splits, so the program
semantics under this transformation don't vary.

We also support CLC. All this is E2E tested in 03-matmul-multicta.py
@lezcano lezcano force-pushed the fallback_cluster branch from 7184511 to 5f1f03d Compare May 7, 2026 14:03
lezcano added a commit that referenced this pull request May 8, 2026
### Reviewers
This PR includes #10167 and
#10196 which I'll kill after
they are merged.

It is also separated in logical commits so that reviewing is simpler, as
I had to add a few optimisations to generate less code as it was taking
too long.

We left out an optimisation where we sliced the indices to avoid loading
the whole tensor from HBM when we know statically which rows we need to
load (e.g. if we just slice current_cta along a dimension). This should
help with compilation of num_ctas > 2 kernels as those take quite a bit
of time with the new model. This is not pressing as those programs are
not very performant without
#9957

### Idea
A CTA is modelled as a set of independent logical threads, as if we had
multiple
warp-specialised threads running in parallel.

Everything pretty much follows from that rule.

A multicast-layout barrier has one live barrier row, owned by the lead
CTA.
Every CTA in the barrier group may arrive / expect on that row, but only
the
lead CTA initializes, waits, and invalidates it.

This is the same model as several independent logical threads arriving
on one
barrier while only one logical thread waits on it.

### Shadow tables
The buffer and barrier tables are CTA-agnostic.
We add a CTA dimension to go with every buffer/barrier/thread/mask
dimension:
```text
buffers                 | tensor  | <B x i64>
barriers                | tensor  | <K x i64>
barrierStates           | scratch | <Cbar x K x i64>
waiting                 | scratch | <Cbar x K x Cthr x i32>
writeVisibility         | scratch | <Cbuf x B x Cmask x i64>
readVisibility          | scratch | <Cbuf x B x Cthr x T x Cmask x i64>
writeTracking           | scratch | <Cbuf x B x Cbar x K x i8>
readTracking            | scratch | <Cbuf x B x Cbar x K x Cmask x i64>
outstandingCommits      | scratch | <C x B x P x i8>
```

`Cbar`, `Cbuf`, `Cthr`, and `Cmask` are CTA dimensions qualifying
barriers, buffers, threads, and thread masks respectively. `C` in
`outstandingCommits` folds the CTA dim for `B` and `P`, as there is no
cross-CTA work in wgmma.
Each CTA dimension is placed immediately before the dimension it
qualifies. This keeps the multiCTA lift regular: a pre-existing
dimension at position `pos` moves to `2 * pos`.
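
The lift rule can be sketched on dimension names alone. This is an illustration rather than the consan implementation: the `lift` helper and the `"C" + name` naming are hypothetical, and the `2 * pos` relation is read here with positions counted from 1, which is what the shapes in the table suggest.

```python
def lift(dims):
    """Insert a CTA dimension immediately before each pre-existing dimension."""
    lifted = []
    for name in dims:
        lifted.append("C" + name)  # hypothetical name for the qualifying CTA dim
        lifted.append(name)
    return lifted


single_cta = ["B", "T", "mask"]
multi_cta = lift(single_cta)

# With 1-based positions, the dimension at `pos` ends up at `2 * pos`.
for pos, name in enumerate(single_cta, start=1):
    assert multi_cta[2 * pos - 1] == name  # 0-based index of 1-based position 2 * pos
```

`outstandingCommits` is the exception to this pattern, since it folds a single `C` dimension for both `B` and `P`.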

Aliasing happens per-CTA

### CTA Issuers and Receivers and their Representation

We need two pieces of CTA information:

1. Issuer predication: a 4-bit mask `m` such that the canonical issuer
of a
   group satisfies `(cta_id & m) == 0`.
2. Receiver sets: a runtime `cta_id`-dependent  `uint16_t` CTA bitmask 
(i.e., a `Value`) computed with the same lowering helpers used by the
    real implementation.

Example:
`mask == 0x1` in a 4-CTA kernel means that:
- CTA0 accesses CTA0 and CTA1
- CTA2 accesses CTA2 and CTA3
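
The example above can be reproduced with plain bit arithmetic. This is a host-side sketch with hypothetical helper names, assuming a group consists of the CTAs that differ from the issuer only in the bits set in `m`; the real receiver sets are computed by the lowering helpers, which are not modelled here.

```python
def issuers(num_ctas, m):
    """CTAs that act as canonical issuer of their group: (cta_id & m) == 0."""
    return [cta for cta in range(num_ctas) if (cta & m) == 0]


def group_bitmask(cta_id, m):
    """16-bit CTA bitmask of the group that cta_id belongs to."""
    base = cta_id & ~m
    bits = 0
    for sub in range(m + 1):
        if (sub & ~m) == 0:            # sub only uses bits that are set in m
            bits |= 1 << (base | sub)
    return bits & 0xFFFF


lead_ctas = issuers(4, 0x1)
receivers = {cta: group_bitmask(cta, 0x1) for cta in lead_ctas}
```

For `m == 0x1` in a 4-CTA kernel this gives issuers CTA0 and CTA2, with receiver bitmasks covering CTA0/CTA1 and CTA2/CTA3 respectively.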

### Cross-CTA memory effects

TMA and CLC multicast
  Their mask is the multicast group (all the CTAs in the case of CLC).

MMA / TMEMCopy, 2CTA
The lead CTA must observe all the inputs and write to all outputs of
both CTAs in the pair.
  In other words, the mask == 0x1.
The idea here is that even though CTA0 and CTA1 collaborate, since the
op is launched from CTA0
and its synchronisation is emitted from CTA0, we can model it as if CTA0
did all the work.

### Barrier semantics

TMA barriers and MMA completion barriers are dual.

```text
Operation                         Issuer                         Receiver
TMA, 1CTA, no multicast           cta_id                         cta_id
TMA, 2CTA, no multicast           cta_id                         cta_id & ~1
TMA, 1CTA, multicast              multicast-group leader         multicast-group
TMA, 2CTA, multicast              multicast-group leader         even cta_ids in the multicast-group

MMA, 1CTA, no multicast           cta_id                         0x0
MMA, 2CTA, no multicast           even CTA                       0x1
MMA, multicast                    see lowering                   broadcastBits_d
```

For TMA, barrier receivers are obtained by applying the barrier-leader
map to
the TMA data-receiver rows. The relevant lowering is
`third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/LoadStoreOpToLLVM.cpp`.

For MMA completion, the multicast receiver mask is

```text
broadcastBits_d = getBlockBroadcastMask(d) | (twoCTAs ? 0x1 : 0x0)
```

using the same logic as
`third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/DotOpToLLVM/MMAv5.cpp`.
Intuitively, an MMA completion waits for all CTAs that may write data
consumed
by the CTA performing the commit.
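
The non-multicast TMA rows and the MMA completion formula can be written down as two small functions. This is a sketch with hypothetical names: `block_broadcast_mask` stands in for the result of `getBlockBroadcastMask(d)`, whose computation is not modelled here.

```python
def tma_barrier_receiver(cta_id, two_ctas):
    """Non-multicast TMA: the barrier that receives the arrival.

    1CTA: the issuing CTA itself. 2CTA: the even CTA of the pair.
    """
    return cta_id & ~1 if two_ctas else cta_id


def mma_completion_mask(block_broadcast_mask, two_ctas):
    """broadcastBits_d = getBlockBroadcastMask(d) | (twoCTAs ? 0x1 : 0x0)."""
    return block_broadcast_mask | (0x1 if two_ctas else 0x0)


# Rows of the table that do not involve multicast.
mma_1cta = mma_completion_mask(0x0, two_ctas=False)
mma_2cta = mma_completion_mask(0x0, two_ctas=True)
```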

CLC is the 1-CTA TMA multicast special case whose multicast group
contains all
CTAs. The CLC layout makes this explicit through an all-zero
`cga_layout`.

For ordinary mbarrier ops, use the semantics in
`third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/BarrierOpToLLVM.cpp`:

- `init`, `wait`, and `invalidate` are predicated on the barrier leader
CTA.
- `expect` and `arrive` are executed by each CTA.
- All operations target the leader barrier address.
- Therefore only leader barriers are live; non-leader barriers behave as
if they didn't exist and should never be accessed. In particular,
non-leader CTAs do not block on waits.
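
The rules in this list can be captured by a toy model with a single live barrier owned by the leader. It is a simplified sketch, not the lowering in `BarrierOpToLLVM.cpp`: the class and method names are hypothetical, and mbarrier phases and `expect` byte counts are left out.

```python
class LeaderBarrier:
    """One live barrier, owned by the leader CTA of the group."""

    def __init__(self, leader):
        self.leader = leader
        self.pending = None  # arrivals still outstanding; None until initialised

    def init(self, cta_id, count):
        if cta_id == self.leader:       # predicated on the leader
            self.pending = count

    def arrive(self, cta_id):
        self.pending -= 1               # every CTA arrives on the leader's barrier

    def wait_blocks(self, cta_id):
        if cta_id != self.leader:       # non-leader CTAs never block on waits
            return False
        return self.pending > 0


bar = LeaderBarrier(leader=0)
for cta in range(2):
    bar.init(cta, count=2)              # only CTA0's init takes effect
blocked_before = (bar.wait_blocks(0), bar.wait_blocks(1))
for cta in range(2):
    bar.arrive(cta)
blocked_after = (bar.wait_blocks(0), bar.wait_blocks(1))
```

This is the same shape as several logical threads arriving on one barrier while a single thread waits on it.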

A non-relaxed cluster barrier publishes all generic-proxy inflight
events to all
CTAs.

### Testing
I ran this on `test_consan.py` and all the multicta tests / kernels that
we have, to make sure they all pass.